Re: Does solr supports indexing of files other than UTF-8

2011-01-28 Thread Yonik Seeley
On Thu, Jan 27, 2011 at 3:51 AM, prasad deshpande
prasad.deshpand...@gmail.com wrote:
 The size of docs can be huge - suppose there is an 800MB PDF file to index;
 I need to translate it to UTF-8 and then send this file to be indexed.

PDF is binary AFAIK... you shouldn't need to do any charset
translation before sending it to solr, or any other extraction
library.  If you're using solr-cell then it's the Tika component that
is responsible for pulling out the text in the right format.
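For reference, something like this should work against the example solr-cell
setup (assuming the stock /update/extract handler; the id and file name here
are made up):

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@somefile.pdf"

Tika detects the content type and extracts the text itself, so the PDF is
sent as-is.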

-Yonik
http://lucidimagination.com


Re: Searching for negative numbers very slow

2011-01-28 Thread Yonik Seeley
On Thu, Jan 27, 2011 at 6:32 PM, Simon Wistow si...@thegestalt.org wrote:
 If I do

        qt=dismax
    fq=uid:1

 (or any other positive number) then queries are as quick as normal - in
 the 20ms range.

 However, any of

        fq=uid:\-1

 or

    fq=uid:[* TO -1]

 or

    fq=uid:[-1 to -1]

 or

    fq=-uid:[0 TO *]

 then queries are incredibly slow - in the 9 *second* range.

That's odd - there should be nothing special about negative numbers.
Here are a couple of ideas:
  - if you have a really big index and querying by a negative number
is much more rare, it could just be that part of the index wasn't
cached by the OS and so the query needs to hit the disk.  This can
happen with any term and a really big index - nothing special for
negatives here.
  - if -1 is a really common value, it can be slower.  Is fq=uid:\-2 or
other negative numbers really slow also?

-Yonik
http://lucidimagination.com


Re: edismax vs dismax

2011-01-28 Thread Yonik Seeley
On Fri, Jan 28, 2011 at 3:00 PM, Thumuluri, Sai
sai.thumul...@verizonwireless.com wrote:
 I recently upgraded to Solr 1.4.1 from Solr 1.3 and with the upgrade
 used the edismax query parser. Here is my solrconfig.xml. When I search for
 "mw verification and payment information" - I get no results with
 defType set to edismax,

It's probably a bit of natural language query parsing in edismax...
- "and" is treated as AND (the lucene operator) in the appropriate
context (i.e. we won't treat it as an operator if it's at the start or end
of the query, etc)
- "or" is treated as OR in the appropriate context

The lowercaseOperators parameter can control this, so try setting
lowercaseOperators=false
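For example, the request might look like this (a sketch; qf etc. coming from
your handler defaults):

q=mw+verification+and+payment+information&defType=edismax&lowercaseOperators=false

You could also set lowercaseOperators as a default in the requestHandler.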

-Yonik
http://lucidimagination.com



 if I switch the deftype to dismax - I get the results I am looking for

 Can anyone explain, why this would be the case? I thought edismax is
 dismax and more.

 Thank you,

 For 1.4.1
 <requestHandler name="partitioned" class="solr.SearchHandler"
 default="true">
    <lst name="defaults">
     <str name="defType">dismax</str>
     <str name="echoParams">explicit</str>
     <float name="tie">0.01</float>
     <str name="qf">
        body^1.0 title^10.0 name^3.0 taxonomy_names^2.0 tags_h1^5.0
 tags_h2_h3^3.0 tags_h4_h5_h6^2.0 tags_inline^1.0
     </str>
     <str name="pf">
        body^10.0
     </str>
     <int name="ps">4</int>
     <str name="mm">
        2&lt;-25%
     </str>
     <str name="q.alt">*:*</str>

     <str name="hl">true</str>
     <str name="hl.fl">body</str>
     <int name="hl.snippets">3</int>
     <str name="hl.mergeContiguous">true</str>
   <!-- instructs Solr to return the field itself if no query terms are
        found -->
     <str name="f.body.hl.alternateField">body</str>
     <str name="f.body.hl.maxAlternateFieldLength">256</str>
     <!--<str name="f.body.hl.fragmenter">regex</str>--> <!-- defined
 below -->

 Sai Thumuluri


Re: Local param tag voodoo ?

2011-01-20 Thread Yonik Seeley
On Thu, Jan 20, 2011 at 4:59 AM, Xavier SCHEPLER
xavier.schep...@sciences-po.fr wrote:
 Ok,
 I tryed to use nested queries this way:
 wt=json&indent=true&fl=qFR&q=sarkozy 
 _query_:"{!tag=test}chirac"&facet=true&facet.field={!ex=test}studyDescriptionId
 It resulted in this error:
 facet_counts:{
  facet_queries:{},
  exception:java.lang.NullPointerException\n\tat


There's currently no way to exclude part of a query... the things you
tag must be a top level q or fq query.

But this has uncovered a bug - we don't handle the case when
everything is excluded (all q and fq).
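For comparison, the supported pattern is to tag a top-level fq and exclude it
when faceting, roughly:

q=sarkozy&fq={!tag=test}chirac&facet=true&facet.field={!ex=test}studyDescriptionId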

-Yonik
http://www.lucidimagination.com


Re: utf-8 tomcat and solr problem

2011-01-06 Thread Yonik Seeley
On Thu, Jan 6, 2011 at 2:23 AM, Julian Hille julian.hi...@netimpact.de wrote:
 Hi,

 if I search for a german umlaut like ä or ö I get something like weird
 conversions from latin to utf in the query response. The encoding of the result
 is ok, but not the "you queried for" part - there my ä is wrongly encoded.
 It seems like it had been interpreted from latin to utf-8.

 Solr is set to use utf-8 and tomcat got URIEncoding="UTF-8" in the connector,
 but that didn't change anything.

You can verify that the container is configured correctly via
example/exampledocs/test_utf8.sh

Another trick I sometimes use is to use the python response format
(wt=python) since that uses escapes for anything outside of ASCII and
then it's easy to see the actual unicode value that's being returned
in a response.
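For example (a sketch; the query value is just a percent-encoded ä):

curl "http://localhost:8983/solr/select?q=%C3%A4&wt=python"

If everything is configured correctly, the ä shows up in the python output as
the escape \u00e4 rather than as two mangled latin characters.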

-Yonik
http://www.lucidimagination.com


Re: Including Small Amounts of New Data in Searches (MultiSearcher ?)

2011-01-06 Thread Yonik Seeley
On Thu, Jan 6, 2011 at 12:37 PM, Stephen Boesch java...@gmail.com wrote:
 Solr/lucene newbie here ..

 We would like searches against a solr/lucene index to immediately be able to
 view data that was added.  I stress small amount of new data given that
 any significant amount would require excessive  latency.

There has been significant ongoing work in lucene-core for NRT (near real time).
We need to overhaul Solr's DirectUpdateHandler2 to take advantage of
all this work.
Mark Miller took a first crack at it (sharing a single IndexWriter,
letting lucene handle the concurrency issues, etc)
but if there's a JIRA issue, I'm having trouble finding it.

 Looking around, i'm wondering if the direction would be a MultiSearcher
 living on top of our standard directory-based IndexReader as well as a
 custom Searchable that handles the newest documents - and then combines the
 two results?

If you look at trunk, MultiSearcher has already gone away.

-Yonik
http://www.lucidimagination.com


Re: Will Result Grouping return documents that don't contain the specified group.field?

2011-01-06 Thread Yonik Seeley
On Thu, Jan 6, 2011 at 5:55 PM, Andy angelf...@yahoo.com wrote:
 So by default Solr will not return documents that don't contain the specified 
 group.field?

Solr will.  Documents without a value for that field should be grouped
under the null value.

-Yonik
http://www.lucidimagination.com


Re: Replication: the web application [/solr] .. likely to create a memory leak

2011-01-04 Thread Yonik Seeley
On Tue, Jan 4, 2011 at 9:34 AM, Robert Muir rcm...@gmail.com wrote:
    [junit] WARNING: test class left thread running:
 Thread[MultiThreadedHttpConnectionManager cleanup,5,main]

I suppose we should move MultiThreadedHttpConnectionManager to CoreContainer.

-Yonik
http://www.lucidimagination.com


Re: SpatialTierQueryParserPlugin Loading Error

2010-12-28 Thread Yonik Seeley
On Tue, Dec 28, 2010 at 8:54 PM, Adam Estrada estrada.a...@gmail.com wrote:
 I would gladly update this page if I could just get it working.
 http://wiki.apache.org/solr/SpatialSearch

Everything on that wiki page should work w/o patches on trunk.
I just ran through all of the examples, and everything seemed to be
working fine.

-Yonik
http://www.lucidimagination.com


Re: Map failed at getSearcher

2010-12-24 Thread Yonik Seeley
On Fri, Dec 24, 2010 at 10:23 AM, Robert Muir rcm...@gmail.com wrote:
 hmm, i think you are actually running out of virtual address space,
 even on 64-bit!

I don't know if there are any x86 processors that allow 64 bits of
address space yet.
AFAIK, they are mostly 48 bit.

 http://msdn.microsoft.com/en-us/library/aa366778(v=VS.85).aspx#memory_limits

 Apparently windows limits you to 8TB virtual address space
 (ridiculous), so i think you should try one of the following:
 * continue using mmap directory, but specify MMapDirectoryFactory
 yourself, and specify the maxChunkSize parameter. The default
 maxChunkSize is Integer.MAX_VALUE, but with a smaller one you might be
 able to work around fragmentation problems.

Hmmm, maybe we should default to a smaller value?  Perhaps something
like 1G wouldn't impact performance, but could help avoid OOM due to
fragmentation?
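Something like this is roughly what that would look like in solrconfig.xml (a
sketch only - check that your version's MMapDirectoryFactory actually accepts
the maxChunkSize init arg Robert mentions):

<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory">
  <!-- 1GB chunks instead of the Integer.MAX_VALUE default -->
  <int name="maxChunkSize">1073741824</int>
</directoryFactory>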

-Yonik
http://www.lucidimagination.com


Re: White space in facet values

2010-12-22 Thread Yonik Seeley
On Wed, Dec 22, 2010 at 9:53 AM, Dyer, James james.d...@ingrambook.com wrote:
 The phrase solution works, as does escaping the space with a backslash:  
 fq=Product:Electric\ Guitar ... actually a lot of characters need to be 
 escaped like this (ampersands and parentheses come to mind)...

One way to avoid escaping is to use the raw or term query parsers:

fq={!raw f=Product}Electric Guitar

In 4.0-dev, use {!term} since that will work with field types that
need to transform the external representation into the internal one
(like numeric fields need to do).
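For example, the equivalent term-parser form would be:

fq={!term f=Product}Electric Guitar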

http://wiki.apache.org/solr/SolrQuerySyntax

-Yonik
http://www.lucidimagination.com




 I assume you already have this indexed as string, not text...

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Andy [mailto:angelf...@yahoo.com]
 Sent: Wednesday, December 22, 2010 1:11 AM
 To: solr-user@lucene.apache.org
 Subject: White space in facet values

 How do I handle facet values that contain whitespace? Say I have a field 
 Product that I want to facet on. A value for Product could be Electric 
 Guitar. How should I handle the white space in Electric Guitar during 
 indexing? What about when I apply the constraint fq=Product:Electric Guitar?






Re: Faceting memory requirements

2010-12-21 Thread Yonik Seeley
On Tue, Dec 21, 2010 at 4:02 PM, Rok Rejc rokrej...@gmail.com wrote:
 Dear all,

 I have created an index with aprox. 1.1 billion of documents (around 500GB)
 running on Solr 1.4.1. (64 bit JVM).

 I want to enable faceted navigation on an int field, which contains around
 250 unique values.
 According to the wiki there are two methods:

 facet.method=fc which uses field cache. This method should use MaxDoc*4
 bytes of memory which is around: 4.1GB.

facet.method=fc uses the fieldcache, but it uses the StringIndex for
all field types currently, so
you need to add in space for the string representation of all the
unique values.  But this is only
250, so given the large number of docs, your estimate should still be close.

 facet.method=enum which creates a bitset for each unique value. This method
 should use NumberOfUniqueValues * SizeOfBitSet which is around 32GB.

A more efficient representation is used for a set when the set size is
less than maxDoc/64.
This set type uses an int per doc in the set, so should use roughly
the same amount of memory
as a numeric fieldcache entry.


 Are my calculations correct?

 My memory settings in Tomcat (windows) are:
 Initial memory pool: 4096 MB
 Maximum memory pool: 8192 MB (total 12GB in my test machine)

 I have tried to run a query
 (...facet=true&facet.field=PublisherId&facet.method=fc) but I am still
 getting OOM:

 HTTP Status 500 - Java heap space java.lang.OutOfMemoryError: Java heap
 space at
 org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:703)
 at
 org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
 at
 org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:692)
 at
 org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:350)
 at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:255)
 at
 org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:283)
 at
 org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:166)
 at
 org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72)
 at
 ...

 Any idea what am I doing wrong, or have I miscalculated the memory
 requirements?

Perhaps you are already sorting by another field or faceting on
another field that is causing a lot of memory to already be used, and
this pushes it over the edge?  Or perhaps the JVM simply can't find a
contiguous area of memory this large?
Line 703 is this, so it's failing to create the first array:
  final int[] retArray = new int[reader.maxDoc()];

Although the line after it is even more troublesome:
  String[] mterms = new String[reader.maxDoc()+1];

Although you only need an array of 250 to contain all the unique
terms, the FieldCacheImpl starts out with maxDoc.

I think trunk will be far better in this regard.  You should also try
facet.method=enum though.
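For example (a sketch of the enum form, reusing your field name):

...&facet=true&facet.field=PublisherId&facet.method=enum

With only ~250 unique values, a filterCache of at least ~250 entries would let
enum faceting cache all of the per-term sets.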

-Yonik
http://www.lucidimagination.com


Re: Why does Solr commit block indexing?

2010-12-17 Thread Yonik Seeley
On Fri, Dec 17, 2010 at 8:05 AM, Grant Ingersoll gsing...@apache.org wrote:
 I'm not sure if there is an issue open, but I know I've talked w/ Yonik about 
 this and a few other changes to the DirectUpdateHandler2 in the past.  It 
 does indeed need to be fixed.

It stems from the APIs that were available at the time in Lucene 1.4.
IIRC, Mark worked up a patch that avoided ever closing the reader I
think, and delegated more of the concurrency control to Lucene (since
it can handle it these days).  I think maybe there was just a problem
with rollback or something...

-Yonik
http://www.lucidimagination.com




 -Grant

 On Dec 17, 2010, at 7:04 AM, Renaud Delbru wrote:

 Hi Michael,

 thanks for your answer.
 Is the Solr team aware of the problem? Is there an issue open about 
 this, or ongoing work on it?

 Regards,
 --
 Renaud Delbru

 On 16/12/10 16:45, Michael McCandless wrote:
 Unfortunately, (I think?) Solr currently commits by closing the
 IndexWriter, which must wait for any running merges to complete, and
 then opening a new one.

 This is really rather silly because IndexWriter has had its own commit
 method (which does not block ongoing indexing nor merging) for quite
 some time now.

 I'm not sure why we haven't switched over already... there must be
 some trickiness involved.

 Mike

 On Thu, Dec 16, 2010 at 9:39 AM, Renaud Delbru renaud.del...@deri.org  
 wrote:
 Hi,

 See log at [1].
 We are using the latest snapshot of lucene_branch3.1. We have configured
 Solr to use the ConcurrentMergeScheduler:
 <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>

 When a commit() runs, it blocks indexing (all incoming update requests are
 blocked until the commit operation is finished) ... at the end of the log we
 notice a 4 minute gap during which none of the solr clients trying to add
 data receive any attention.
 This is a bit annoying as it leads to timeout exceptions on the client side.
 Here, the commit time is only 4 minutes, but it can be larger if there are
 merges of large segments.
 I thought Solr was able to handle commits and updates at the same time: the
 commit operation should be done in the background, and the server should still
 continue to receive update requests (maybe at a slower rate than normal).
 But it looks like that is not the case. Is this normal behaviour?

 [1] http://pastebin.com/KPkusyVb

 Regards
 --
 Renaud Delbru



 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem docs using Solr/Lucene:
 http://www.lucidimagination.com/search




Re: WARNING: re-index all trunk indices!

2010-12-17 Thread Yonik Seeley
On Fri, Dec 17, 2010 at 11:18 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 If you are using Lucene's trunk (nightly build) release, read on...

 I just committed a change (for LUCENE-2811) that changes the index
 format on trunk, thus breaking (w/ likely strange exceptions on
 reading the segments_N file) any trunk indices created in the past
 week or so.

For reference, the exception I got trying to start Solr with an older
index on Windows is below.

-Yonik
http://www.lucidimagination.com


SEVERE: java.lang.RuntimeException: java.io.IOException: read past EOF
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1095)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:587)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:660)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:412)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:294)
at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:243)
at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:86)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
at 
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
at 
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
at 
org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
at org.mortbay.jetty.Server.doStart(Server.java:224)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.mortbay.start.Main.invokeMain(Main.java:194)
at org.mortbay.start.Main.start(Main.java:534)
at org.mortbay.start.Main.start(Main.java:441)
at org.mortbay.start.Main.main(Main.java:119)
Caused by: java.io.IOException: read past EOF
at 
org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(MMapDirectory.java:242)
at 
org.apache.lucene.store.ChecksumIndexInput.readBytes(ChecksumIndexInput.java:48)
at org.apache.lucene.store.DataInput.readString(DataInput.java:121)
at 
org.apache.lucene.store.DataInput.readStringStringMap(DataInput.java:148)
at org.apache.lucene.index.SegmentInfo.<init>(SegmentInfo.java:192)
at 
org.apache.lucene.index.codecs.DefaultSegmentInfosReader.read(DefaultSegmentInfosReader.java:57)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:220)
at 
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:90)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:623)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:86)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:437)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
at 
org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1084)
... 31 more


Re: bulk commits

2010-12-16 Thread Yonik Seeley
On Thu, Dec 16, 2010 at 3:06 PM, Dennis Gearon gear...@sbcglobal.net wrote:
 That easy, huh? Heck, this gets better and better.

 BTW, how about escaping?

The CSV escaping?  It's configurable to allow for loading different
CSV dialects.

http://wiki.apache.org/solr/UpdateCSV

By default it uses double quote encapsulation, like excel would.
The bottom of the wiki page shows how to configure tab separators and
backslash escaping like MySQL produces by default.
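For example, loading a MySQL-style tab-separated, backslash-escaped dump might
look like this (the file path is made up):

curl "http://localhost:8983/solr/update/csv?commit=true&separator=%09&escape=%5C&stream.file=/tmp/dump.tsv"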

-Yonik
http://www.lucidimagination.com



  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a 
 better
 idea to learn from others’ mistakes, so you do not have to make them yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.



 - Original Message 
 From: Adam Estrada estrada.adam.gro...@gmail.com
 To: Dennis Gearon gear...@sbcglobal.net; solr-user@lucene.apache.org
 Sent: Thu, December 16, 2010 10:58:47 AM
 Subject: Re: bulk commits

 This is how I import a lot of data from a csv file. There are close to 100k
 records in there. Note that you can either pre-define the column names using
 the fieldnames param like I did here *or* include header=true which will
 automatically pick up the column header if your file has it.

 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&stream.file=C:\tmp\cities1000.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"

 This seems to load everything in to some kind of temporary location before
 it's actually committed. If something goes wrong there is a rollback feature
 that will undo anything that happened before the commit.

 As far as batching a bunch of files, I copied and pasted the following in to
 Cygwin and it worked just fine.

 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&stream.file=C:\tmp\cities1000.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xab.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xac.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xad.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xae.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xaf.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xag.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xah.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xai.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl 
 

Re: Memory use during merges (OOM)

2010-12-16 Thread Yonik Seeley
On Thu, Dec 16, 2010 at 5:51 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 If you are doing false deletions (calling .updateDocument when in fact
 the Term you are replacing cannot exist) it'd be best if possible to
 change the app to not call .updateDocument if you know the Term
 doesn't exist.

FWIW, if you're going to add a batch of documents you know aren't
already in the index,
you can use the overwrite=false parameter for that Solr update request.
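For example, in the XML update format that can be expressed on the add element
(a minimal sketch, with a made-up id):

<add overwrite="false">
  <doc><field name="id">1234</field></doc>
</add>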

-Yonik
http://www.lucidimagination.com


Re: Faceted Search Slows Down as index gets larger

2010-12-16 Thread Yonik Seeley
Another thing you can try is trunk.  This specific case has been
improved by an order of magnitude recently.
The case that has been sped up is initial population of the
filterCache, or when the filterCache can't hold all of the unique
values, or when faceting is configured to not use the filterCache much
of the time via facet.enum.cache.minDf.

-Yonik
http://www.lucidimagination.com

On Thu, Dec 16, 2010 at 6:39 PM, Furkan Kuru furkank...@gmail.com wrote:
 I am sorry for raising up this thread after 6 months.

 But we have still problems with faceted search on full-text fields.

 We try to get the most frequent words in a text field from documents created in the last hour.
 The faceted search takes too much time: even though the matching number of documents
 (created_at within 1 HOUR) stays constant (10-20K), the query gets slower as the
 total number of documents increases (now 20M). Solr throws exceptions
 and does not respond. We have to restart and delete old docs. (3G RAM) The index
 is around 2.2 GB.
 And we store the data in solr as well. The documents are small.

 $response = $solr->search('created_at:[NOW-'.$hours.'HOUR TO NOW]', 0, 1,
 array('facet' => 'true', 'facet.field' => $field, 'facet.mincount' => 1,
 'facet.method' => 'enum', 'facet.enum.cache.minDf' => 100));

 Yonik had suggested distributed search. But I am not sure if we set every
 configuration correctly. For example the solr caches if they are related
 with faceted searching.

 We use default values:

 <filterCache
   class="solr.FastLRUCache"
   size="512"
   initialSize="512"
   autowarmCount="0"/>


 <queryResultCache
   class="solr.LRUCache"
   size="512"
   initialSize="512"
   autowarmCount="0"/>



 Any help is appreciated.



 On Sun, Jun 6, 2010 at 8:54 PM, Yonik Seeley yo...@lucidimagination.com
 wrote:

 On Sun, Jun 6, 2010 at 1:12 PM, Furkan Kuru furkank...@gmail.com wrote:
  We try to provide real-time search. So the index is changing almost in
  every
  minute.
 
  We commit for every 100 documents received.
 
  The facet search is executed every 5 mins.

 OK, that's the problem - pretty much every facet search is rebuilding
 the facet cache, which takes most of the time (and facet.fc is more
 expensive than facet.enum in this regard).

 One strategy is to use distributed search... have some big cores that
 don't change often, and then small cores for the new stuff that
 changes rapidly.

 -Yonik
 http://www.lucidimagination.com



 --
 Furkan Kuru



Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Yonik Seeley
On Mon, Dec 13, 2010 at 8:47 PM, John Russell jjruss...@gmail.com wrote:
 Wow, you read my mind.  We are committing very frequently.  We are trying to
 get as close to realtime access to the stuff we put in as possible.  Our
 current commit time is... ahem every 4 seconds.

 Is that insane?

Not necessarily insane, but challenging ;-)
I'd start by setting maxWarmingSearchers to 1 in solrconfig.xml.  When
that is exceeded, a commit will fail (this just means a new searcher
won't be opened on that commit... the docs will be visible with the
next commit that does succeed.)
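That's a one-line change in solrconfig.xml:

<maxWarmingSearchers>1</maxWarmingSearchers>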

-Yonik
http://www.lucidimagination.com


Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Yonik Seeley
On Mon, Dec 13, 2010 at 9:27 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 Yonik, how will maxWarmingSearchers in this scenario effect replication?  If 
 a slave is pulling down new indexes so quickly that the warming searchers 
 would ordinarily pile up, but maxWarmingSearchers is set to 1 what 
 happens?

Like any other commits, this will limit the number of searchers
warming in the background to 1.  If a commit is called, and that tries
to open a new searcher while another is already warming, it will fail.
 The next commit that does succeed will have all the updates though.

Today, this maxWarmingSearchers check is done after the writer has
closed and before a new searcher is opened... so calling commit too
often won't affect searching, but it will currently affect indexing
speed (since the IndexWriter is constantly being closed/flushed).

-Yonik
http://www.lucidimagination.com


Re: Userdefined Field type - Faceting

2010-12-13 Thread Yonik Seeley
Perhaps try overriding indexedToReadable() also?

-Yonik
http://www.lucidimagination.com

On Mon, Dec 13, 2010 at 10:00 PM, Viswa S svis...@hotmail.com wrote:

 Hello,

 We implemented an IP-Addr field type which internally stores the IPs as a 
 hex-ed string (e.g. 192.2.103.29 will be stored as c002671d). My 
 toExternal and toInternal methods for the appropriate conversions seem to be 
 working well for query results, however when faceting on this field it 
 returns the raw strings. In other words the query response would have 
 192.2.103.29, but a facet on the field would return <int 
 name="c002671d">1</int>

 Why are these methods not used by the faceting component to convert the 
 resulting values?

 Thanks
 Viswa



Re: Shards + dismax - scoring process?

2010-12-11 Thread Yonik Seeley
On Sat, Dec 11, 2010 at 2:18 AM, bbarani bbar...@gmail.com wrote:
 Also, if I try to sort the query result from shards.. will sorting happens
 on the consolidated data or on each individual core data?

Both - to find the top 10 docs by any sort, the top 10 docs from each
shard are collected and then
sorted to find the top 10 out of those.

 I am just trying to figure out best possible way to implement distributed
 search without affecting the search relevancy.

The IDF part of the relevancy score is the only place where
distributed search scoring won't match up with non-distributed
scoring, because the document frequency used for the term is local to
every core instead of global.  If you distribute your documents fairly
randomly to the different shards, this won't matter.

There is a patch in the works to add global idf, but I think that even
when it's committed, it will default to off because of the higher cost
associated with it.

-Yonik
http://www.lucidimagination.com


Re: Map size must not be negative with spatial results + php serialized

2010-12-08 Thread Yonik Seeley
On Wed, Dec 8, 2010 at 9:45 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 I know, but since it's an Apache component throwing the exception, i'd figure
 someone just might know more about this.

That's fine - it could be a Solr bug too.
IMO, solr-user traffic just needs to be solr related and hopefully
useful to other users.

-Yonik
http://www.lucidimagination.com


Webcast: Better Search Results Faster with Apache Solr and LucidWorks Enterprise

2010-12-08 Thread Yonik Seeley
We're holding a free webinar about relevancy enhancements in our
commercial version of Solr.  Details below.

-Yonik
http://www.lucidimagination.com

-
Join us for a free technical webcast
Better Search Results Faster with Apache Solr and LucidWorks Enterprise
Thursday, December 16, 2010
11:00 AM PST / 2:00 PM EST / 20:00 CET


Click here to sign up
http://www.eventsvc.com/lucidimagination/121610?trk=AP


In the key dimensions of search relevancy and query-targeted results,
users have become accustomed to internet-search style facilities like
page-rank, user-driven feedback, auto-suggest and more. Even with the
power of Apache Lucene/Solr, building such features into your own
search application is easier said than done.


Now, with LucidWorks Enterprise, the search solution development
platform built on the Solr/Lucene open source technology, developing
killer search apps with these features and more is faster, simpler,
and more powerful than ever before!


Join Andrzej Bialecki, Lucene/Solr Committer and inventor of the Luke
index utility, for a hands-on technical workshop that details how
LucidWorks Enterprise puts powerful search and relevancy at your
fingertips -- at a fraction of the time and effort required to program
them yourself with native Apache Solr. Andrzej will discuss and
present how you can use LucidWorks Enterprise for:
* Click Scoring to automatically configure relevance for most popular results
* Simplified implementation of auto-complete and did-you-mean functionality
* Unsupervised feedback  to automatically provide relevance
improvement on every query


Click here to sign up
http://www.eventsvc.com/lucidimagination/121610?trk=AP


--
About the presenter:
Andrzej Bialecki is a committer of the Apache Lucene/Solr project, a
Lucene PMC member, and chairman of the Apache Nutch project. He is
also the author of Luke, the Lucene Index Toolbox. Andrzej
participates in many commercial projects that use Lucene/Solr, Nutch
and Hadoop to implement enterprise and vertical search.
--
Presented by Lucid Imagination, the commercial entity exclusively
dedicated to Apache Lucene/Solr open source search technology.
LucidWorks Enterprise, our search solution development platform, helps
you build better search applications more quickly and productively.
We also offer solutions including SLA-based support,
professional training, best practices consulting, free developer
downloads and free documentation.
Follow us on Twitter: twitter.com/LucidImagineer.
--
Apache Lucene and Apache Solr are trademarks of the Apache
Software Foundation.


Re: How to handle multivalued hierarchical facets?

2010-12-08 Thread Yonik Seeley
Hoss had a great webinar on faceting that also covered how you could
do hierarchical.
http://www.lucidimagination.com/solutions/webcasts/faceting
See taxonomy facets, about 28 minutes in.
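One common encoding along those lines (a rough sketch - the field name and
separator here are made up) is to index depth-prefixed path tokens into a
multiValued string field and facet with a prefix:

  location_path: 0/US, 1/US/CA, 2/US/CA/San Francisco, 1/US/MA, 2/US/MA/Boston
  ...&facet=true&facet.field=location_path&facet.prefix=1/US&facet.mincount=1

Selecting US>CA then just changes the prefix to 2/US/CA.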

-Yonik
http://www.lucidimagination.com

On Wed, Dec 8, 2010 at 5:28 PM, Andy angelf...@yahoo.com wrote:
 I have facets that are hierarchical. For example, Location can be represented 
 as this hierarchy:

 Country > State > City

 If each document can only have a single value for each of these facets, then 
 I can just use separate fields for each facet.

 But if multiple values are allowed, then that approach would not work. For 
 example if a document has 2 Location values:

 US > CA > San Francisco
 US > MA > Boston

 If I just put the values CA & MA in the field State, and San 
 Francisco & Boston in City, faceting would not work. Someone could 
 select CA and the value Boston would be displayed for the field City.

 How do I handle this use case?

 Thanks


Re: Changing a solr schema from non-stored to stored on the fly

2010-12-08 Thread Yonik Seeley
On Wed, Dec 8, 2010 at 6:07 PM, Kaktu Chakarabati jimmoe...@gmail.com wrote:
 Can I do this? i.e change that value in schema, and then incrementally
 re-index documents to populate it?
 would that work? what would be returned if at all for documents that werent
 re-indexed post-schema change?

Yes, this should work fine.
A document that was added with an unstored field will act exactly like
a document with that field missing.

-Yonik
http://www.lucidimagination.com


Re: Field Collapsing - sort by group count, get total groups

2010-12-07 Thread Yonik Seeley
On Tue, Dec 7, 2010 at 7:03 AM, ssetem sse...@googlemail.com wrote:
 I wondered if it is possible to sort groups by the total within the group,
 and to bring back total amount groups?

That is planned, but not currently implemented.
You can use faceting to get both totals and sort by highest total though.

Total number of groups is a different problem - we don't return it
because we don't know.
It will take a different algorithm (that's more memory intensive) to
find out the total number of groups.
If the number is unlikely to be too large, you could just return all
groups (or use faceting to do that more efficiently).

-Yonik
http://www.lucidimagination.com


Re: Field Collapsing - sort by group count, get total groups

2010-12-07 Thread Yonik Seeley
On Tue, Dec 7, 2010 at 9:07 AM, ssetem sse...@googlemail.com wrote:
 Thanks for the reply,

 How would I get the total amount of possible facets (non-zero)? I've searched
 around but have had no luck.

Only current way would be to request them all.

Just like field collapsing, this is a number we don't (generally)
have.  There are optimizations like short-circuiting on the docfreq
that would need to be disabled to generate that count.

-Yonik
http://www.lucidimagination.com


Re: autocommit commented out -- what is the default?

2010-12-04 Thread Yonik Seeley
On Sat, Dec 4, 2010 at 10:36 AM, Brian Whitman br...@echonest.com wrote:
 Hi, if you comment out the block in solrconfig.xml

 <!--
   <autoCommit>
      <maxDocs>1</maxDocs>
      <maxTime>60</maxTime>
    </autoCommit>
 -->

 Does this mean that (a) commits never happen automatically or (b) some
 default autocommit is applied?

Commented out means they never happen automatically (i.e., no default).
In general commitWithin is a better strategy to use... bulk updates
can use a large value (or no value w/ explicit commit at end) for
better indexing performance, while other updates can use a smaller
value depending on how soon the update needs to be visible.
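For example, in the XML update format commitWithin goes on the add element
(the value and id are just illustrative):

<add commitWithin="10000">
  <doc><field name="id">1</field></doc>
</add>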

-Yonik
http://www.lucidimagination.com


Re: ramBufferSizeMB not reflected in segment sizes in index

2010-12-02 Thread Yonik Seeley
On Wed, Dec 1, 2010 at 3:01 PM, Shawn Heisey s...@elyograg.org wrote:
 I have seen this.  In Solr 1.4.1, the .fdt, .fdx, and the .tv* files do not
 segment, but all the other files do.  I can't remember whether it behaves
 the same under 3.1, or whether it also creates these files in each segment.

Yep, that's the shared doc store (where stored fields go.. the
non-inverted part of the index), and it works like that in 3.x and
trunk too.
It's nice because when you merge segments, you don't have to re-copy
the docs (provided you're within a single indexing session).
There have been discussions about removing it in trunk though... we'll see.

-Yonik
http://www.lucidimagination.com


Re: ArrayIndexOutOfBoundsException for query with rows=0 and sort param

2010-11-30 Thread Yonik Seeley
On Tue, Nov 30, 2010 at 8:24 AM, Martin Grotzke
martin.grot...@googlemail.com wrote:
 Still I'm wondering, why this issue does not occur with the plain
 example solr setup with 2 indexed docs. Any explanation?

It's an old option you have in your solrconfig.xml that causes a
different code path to be followed in Solr:

   <!-- An optimization that attempts to use a filter to satisfy a search.
 If the requested sort does not include score, then the filterCache
 will be checked for a filter matching the query. If found, the filter
 will be used as the source of document ids, and then the sort will be
 applied to that. -->
<useFilterForSortedQuery>true</useFilterForSortedQuery>

Most apps would be better off commenting that out or setting it to
false.  It only makes sense when a high number of queries will be
duplicated, but with different sorts.



 But: why is your app doing this?  Ie, if numHits (rows) is 0, the only
 useful thing you can get is totalHits?

 Actually I don't know this (yet). Normally our search logic should
 optimize this and ignore a requested sorting with rows=0, but there
 seems to be a case that circumvents this - still figuring out.


 Still I think we should fix it in Lucene -- it's a nuisance to push
 such corner case checks up into the apps.  I'll open an issue...

 Just for the record, this is https://issues.apache.org/jira/browse/LUCENE-2785

 One question: as leaving out sorting leads to better performance, this
 should also be true for rows=0. Or is lucene/solr already that clever
 that it makes this optimization (ignoring sort) automatically?

Solr has always special-cased this and avoided sorting altogether
for the normal code path... but overlooked it when
useFilterForSortedQuery=true.

-Yonik
http://www.lucidimagination.com


Re: entire farm fails at the same time with OOM issues

2010-11-30 Thread Yonik Seeley
On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersen rober...@buy.com wrote:
 My question is this.  Why in the world would all of my slaves, after
 running fine for some days, suddenly all at the exact same minute
 experience OOM heap errors and go dead?

If there is no change in query traffic when this happens, then it's
due to what the index looks like.

My guess is a large index merge happened, which means that when the
searchers re-open on the new index, it requires more memory than
normal (much less can be shared with the previous index).

I'd try bumping the heap a little bit, and then optimizing once a day
during off-peak hours.
If you still get OOM errors, bump the heap a little more.

-Yonik
http://www.lucidimagination.com


Re: Preventing index segment corruption when windows crashes

2010-11-29 Thread Yonik Seeley
On Mon, Nov 29, 2010 at 10:46 AM, Peter Sturge peter.stu...@gmail.com wrote:
 If a Solr index is running at the time of a system halt, this can
 often corrupt a segments file, requiring the index to be -fix'ed by
 rewriting the offending file.

Really?  That shouldn't be possible (if you mean the index is truly
corrupt - i.e. you can't open it).

-Yonik
http://www.lucidimagination.com


Re: solr admin

2010-11-29 Thread Yonik Seeley
On Mon, Nov 29, 2010 at 8:02 PM, Ahmet Arslan iori...@yahoo.com wrote:
 in Solr admin (http://localhost:8180/services/admin/)
 I can specify something like:

 +category_id:200 +xxx:300

 but how can I specify a sort option?

 sort:category_id+asc

 There is an [FULL INTERFACE] /admin/form.jsp link but it does not have sort 
 option. It seems that you need to append it to your search url.

Heh - yeah... that's an old interface, from the times when sort was
specified along with the query.
Can someone provide a patch to add a way to specify the sort?

-Yonik
http://www.lucidimagination.com


Re: geospatial

2010-11-24 Thread Yonik Seeley
On Wed, Nov 24, 2010 at 2:41 PM, Dennis Gearon gear...@sbcglobal.net wrote:
 What is the recommended Solr version and/or plugin combination to get 
 geospatial
 search up and running the quickest and easiest?

It depends on what capabilities you need.
The current state of what is committed to trunk is reflected here:
http://wiki.apache.org/solr/SpatialSearch

-Yonik
http://www.lucidimagination.com


Re: Problem with synonyms

2010-11-22 Thread Yonik Seeley
On Sat, Nov 20, 2010 at 5:59 AM, sivaprasad sivaprasa...@echidnainc.com wrote:
 Even after expanding the synonyms also i am unable to get same results.

What you are trying to do should work with index-time synonym expansion.
Just make sure to remove the synonym filter at query time (or use a
synonym filter w/o multi-word synonyms).

What's the original text in the document you are trying to match?

-Yonik
http://www.lucidimagination.com


Re: Problem with synonyms

2010-11-22 Thread Yonik Seeley
On Mon, Nov 22, 2010 at 10:29 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Sat, Nov 20, 2010 at 5:59 AM, sivaprasad sivaprasa...@echidnainc.com 
 wrote:
 Even after expanding the synonyms also i am unable to get same results.

 What you are trying to do should work with index-time synonym expansion.
 Just make sure to remove the synonym filter at query time (or use a
 synonym filter w/o multi-word synonyms).

Actually, to be more precise, the current query-time restriction is
that you can't produce synonyms of different lengths.
Hence you could normalize High Definition TV to hdtv at both query
time and index time.

Optionally you can expand to both High Definition TV and hdtv at
index time (in which case you would normally turn off query time
synonym processing).
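In synonyms.txt terms, a rough sketch of the two options:

# normalize to a single token (usable at index and query time):
High Definition TV, HDTV => hdtv
# or an equivalence list with expand=true, applied at index time only:
hdtv, High Definition TV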

-Yonik
http://www.lucidimagination.com


Re: Must require quote with single word token query?

2010-11-19 Thread Yonik Seeley
On Tue, Nov 16, 2010 at 10:28 PM, Chamnap Chhorn
chamnapchh...@gmail.com wrote:
 I have one question related to single word token with dismax query. In order
 to be found I need to add the quote around the search query all the time.
 This is quite hard for me to do since it is part of full text search.

 Here is my solr query and field type definition (Solr 1.4):
     <fieldType name="text_keyword" class="solr.TextField"
  positionIncrementGap="100">
       <analyzer>
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.TrimFilterFactory"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true"
  words="stopwords.txt" enablePositionIncrements="true"/>
         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
  ignoreCase="true" expand="false"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
       </analyzer>
     </fieldType>

     <field name="keyphrase" type="text_keyword" indexed="true"
  stored="false" multiValued="true"/>

 With this query, q=smart%20mobile&qf=keyphrase&debugQuery=on&defType=dismax,
 solr returns nothing. However, with quotes around the search query, q="smart
 mobile"&qf=keyphrase&debugQuery=on&defType=dismax, the result is found.

 Is it a must to use quotes for a single-word-token field?

Yes, you must currently quote tokens if they contain whitespace -
otherwise the query parser first breaks on whitespace before doing
analysis on each part separately.

Using dismax is an odd choice if you are only querying on keyphrase though.
You might look at the field query parser - it is a basic single-field
single-value parser with no operators (hence no need to escape any
special characters).

q={!field f=keyphrase}smart%20mobile

or you can decompose it using param dereferencing (sometimes easier to
construct)

q={!field f=keyphrase v=$qq}&qq=smart%20mobile

-Yonik
http://www.lucidimagination.com


Re: Must require quote with single word token query?

2010-11-19 Thread Yonik Seeley
On Fri, Nov 19, 2010 at 9:41 PM, Chamnap Chhorn chamnapchh...@gmail.com wrote:
 Wow, I never knew this syntax before. What's it called?

I dubbed it local params since it adds local info to a parameter
(think extra metadata, like XML attributes on an element).

http://wiki.apache.org/solr/LocalParams

It's used mostly to invoke different query parsers, but it's also used
to add extra metadata to faceting commands too (and is required for
stuff like multi-select faceting):

http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams


-Yonik
http://www.lucidimagination.com


result grouping / field collapsing changes

2010-11-16 Thread Yonik Seeley
We've recently added randomized testing for result grouping that
resulted in finding + fixing a number of bugs.
If you've been using this feature, you should move to the latest
trunk version.

I've also added a section at the bottom of the wiki page to list
current limitations.
http://wiki.apache.org/solr/FieldCollapsing

-Yonik
http://www.lucidimagination.com


Re: hash uniqueKey generation?

2010-11-16 Thread Yonik Seeley
On Tue, Nov 16, 2010 at 5:31 AM, Dennis Gearon gear...@sbcglobal.net wrote:
 hashing is not 100% guaranteed to produce unique values.

But if you go to enough bits with a good hash function, you can get
the odds lower than the odds of something else changing the value like
cosmic rays flipping a bit on you.

-Yonik
http://www.lucidimagination.com


Re: hash uniqueKey generation?

2010-11-16 Thread Yonik Seeley
On Tue, Nov 16, 2010 at 9:05 PM, Dennis Gearon gear...@sbcglobal.net wrote:
 Read up on WikiPedia, but I believe that no Hash Function is much good above 
 50%
 of the address space it generates.

50% is way too high - collisions will happen before that.

But given that something like MD5 has 128 bits, that's 3.4e38, so even
a small fraction of that address space will work.  The probabilities
follow the birthday problem:
http://en.wikipedia.org/wiki/Birthday_problem

Using a 128 bit hash, you can hash 26B docs with a hash collision
probability of about 1e-18 (and yes, that is lower than the probability of
something else going wrong).

It also says: For comparison, 10^-18 to 10^-15 is the uncorrectable bit
error rate of a typical hard disk [2]. In theory, MD5, 128 bits,
should stay within that range until about 820 billion documents, even
if its possible outputs are many more.
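(Quick sanity check with the usual birthday approximation, p ~= n^2 / 2^(b+1):
with n = 2.6e10 docs and b = 128 bits, p ~= 6.8e20 / 6.8e38 ~= 1e-18.)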

-Yonik
http://www.lucidimagination.com


Re: Solr Negative query

2010-11-15 Thread Yonik Seeley
On Mon, Nov 15, 2010 at 12:42 AM, Viswa S svis...@hotmail.com wrote:

 Apologies for starting a new thread again, my mailing list subscription 
 didn't finalize till later than Yonik's response.

 Using Field1:Val1 AND (*:* NOT Field2:Val2) works, thanks.

 Does my original query, Field1:Value1 AND (NOT Field2:Val2), fall into the "need 
 the *:* trick if all of the clauses of a boolean query are negative" case?

Yes - the parens create a new boolean query, and all of its clauses
are negative.
The top level boolean query has that as a required clause, hence it
won't match anything because that sub-query won't match anything.

But, your original example without the parens should have worked.

-Yonik
http://www.lucidimagination.com


Re: Solr Negative query

2010-11-14 Thread Yonik Seeley
On Sun, Nov 14, 2010 at 4:17 AM, Leonardo Menezes
leonardo.menez...@googlemail.com wrote:
 try
 Field1:Val1 AND (*:* NOT Field2:Val2), that shoud work ok

That should be equivalent to Field1:Val1 -Field2:Val2
You only need the *:* trick if all of the clauses of a boolean query
are negative.

-Yonik
http://www.lucidimagination.com


Re: facetting when using field collapsing

2010-11-13 Thread Yonik Seeley
On Wed, Nov 10, 2010 at 9:12 AM, Lukas Kahwe Smith m...@pooteeweet.org wrote:
 The above wiki page seems to be out of date. Reading the comments in 
 https://issues.apache.org/jira/browse/SOLR-236 it seems like group should 
 be replaced with collapse.

The Wiki page is not expansive, but I've tried to make it easy for
people to get started, and make everything there correct.  If you can
point out what is incorrect, we can fix!

With regards to faceting, it works, but is unaffected by grouping
(i.e. facet counts will be the same as a non-grouped response).

-Yonik
http://www.lucidimagination.com


Re: facetting when using field collapsing

2010-11-13 Thread Yonik Seeley
On Sat, Nov 13, 2010 at 10:46 AM, Lukas Kahwe Smith m...@pooteeweet.org wrote:

 On 13.11.2010, at 10:30, Yonik Seeley wrote:

 On Wed, Nov 10, 2010 at 9:12 AM, Lukas Kahwe Smith m...@pooteeweet.org 
 wrote:
 The above wiki page seems to be out of date. Reading the comments in 
 https://issues.apache.org/jira/browse/SOLR-236 it seems like group should 
 be replaced with collapse.

 The Wiki page is not expansive, but I've tried to make it easy for
 people to get started, and make everything there correct.  If you can
 point out what is incorrect, we can fix!

 With regards to faceting, it works, but is unaffected by grouping
 (i.e. facet counts will be the same as a non-grouped response).


 The wiki page uses group, but in the ticket all examples always speak of 
 collapse. Which syntax is correct?

It's group - try out the examples on the wiki page.
JIRA tickets are for development, not documentation.

 Other than that the ticket also speaks of a few parameters not mentioned, 
 specifically if facetting should happen before or after group/collapse:
 collapse.facet=before|after

This currently doesn't exist in the committed code, hence the param is
not documented.
Grouping/collapsing currently has no effect on faceting (i.e. set
group=false and you will get a non grouped result with the exact same
facet counts).

-Yonik
http://www.lucidimagination.com


Re: IndexableBinaryStringTools (was FieldCache)

2010-11-13 Thread Yonik Seeley
On Sat, Nov 13, 2010 at 1:50 PM, Steven A Rowe sar...@syr.edu wrote:
 Looks to me like the returned value is in a Solr-internal form of XML 
 character escaping: \u0000 is represented as &#0; and \u0008 is represented 
 as &#8;.  (The escaping code is in 
 solr/src/java/org/apache/common/util/XML.java.)

Yep, there is no legal way to represent some unicode code points in XML.

 You can get the value back in its original binary form by unescaping the 
 /&#[0-9]+;/ format.  Here is a test illustrating this fix that I added to 
 SolrExampleTests, then ran from SolrExampleEmbeddedTest:

The problem here is that one might then unescape what was meant to be
a literal &#8;
One could come up with a full escaping mechanism over XML I suppose...
but I'm not sure it would be worth it.

-Yonik
http://www.lucidimagination.com


FAST ESP - Solr migration webinar

2010-11-11 Thread Yonik Seeley
We're holding a free webinar on migration from FAST to Solr.  Details below.

-Yonik
http://www.lucidimagination.com

=
Solr To The Rescue: Successful Migration From FAST ESP to Open Source
Search Based on Apache Solr

Thursday, Nov 18, 2010, 14:00 EST (19:00 GMT)
Hosted by SearchDataManagement.com

For anyone concerned about the future of their FAST ESP applications
since the purchase of Fast Search and Transfer by Microsoft in 2008,
this webinar will provide valuable insights on making the switch to
Solr.  A three-person rountable will discuss factors driving the need
for FAST ESP alternatives, differences between FAST and Solr, a
typical migration project lifecycle & methodology, complementary open
source tools, best practices, customer examples, and recommended next
steps.

The speakers for this webinar are:

Helge Legernes, Founding Partner & CTO of Findwise
Michael McIntosh, VP Search Solutions for TNR Global
Eric Gaumer, Chief Architect for ESR Technology.

For more information and to register, please go to:

http://SearchDataManagement.bitpipe.com/detail/RES/1288718603_527.html?asrc=CL_PRM_Lucid2
=


Re: solr 4.0 - pagination

2010-11-07 Thread Yonik Seeley
On Sun, Nov 7, 2010 at 10:55 AM, Papp Richard ccode...@gmail.com wrote:
  this is fantastic, but can you tell when it will be ready?

It already is ;-)  Grab the latest trunk or the latest nightly build.

-Yonik
http://www.lucidimagination.com


Re: solr 4.0 - pagination

2010-11-07 Thread Yonik Seeley
On Sun, Nov 7, 2010 at 2:45 PM, Papp Richard ccode...@gmail.com wrote:
 Hi Yonik,

  I've just tried the latest stable version from nightly build:
 apache-solr-4.0-2010-11-05_08-06-28.war

  I have some concerns however: I have 3 documents; 2 in the first group, 1
 in the 2nd group.

  1. I got for matches 3 - which is good, but I still don't know how many
 groups I have. (using start = 0, rows = 10)
  2. as far as I see the start / rows is working now, but the matches is
 returned incorrectly = it said matches = 3 instead of = 1, when I used
 start = 1, rows = 1

matches is the number of documents before grouping, so start/rows or
group.offset/group.limit will not affect this number.

  so can you help me, how to compute how many pages I'll have, because the
 matches can't be used for this.

Solr doesn't even know given the current algorithm, hence it can't
return that info.

The issue is that to calculate the total number of groups, we would
need to keep each group in memory (which could cause a big blowup if
there are tons of groups).  The current algorithm only keeps the top
10 groups (assuming rows=10) in memory at any one time, hence it has
no idea what the total number of groups is.

-Yonik
http://www.lucidimagination.com


Re: Negative or zero value for fieldNorm

2010-11-04 Thread Yonik Seeley
On Thu, Nov 4, 2010 at 8:04 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 The question remains, why does the title field return a fieldNorm=0 for many
 queries?

Because the index-time boost was set to 0 when the doc was indexed.  I
can't say how that happened... look to your indexing code.

 And a subquestion, does the luke request handler return boost values
 for documents? I know i get boost values for fields but i haven't seen boost
 values for documents.

The doc boost is just multiplied into each field boost and doesn't
have a separate representation in the index.
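
(As an illustration -- not taken from your setup -- a zero index-time boost usually enters through the XML update format, e.g.:

<add>
  <doc boost="0">
    <field name="title">some title</field>
  </doc>
</add>

or through boost="0" on an individual <field>; either value is multiplied into the single norm that gets stored for the field.)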

-Yonik
http://www.lucidimagination.com


Re: Negative or zero value for fieldNorm

2010-11-04 Thread Yonik Seeley
On Thu, Nov 4, 2010 at 9:51 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 I've done some testing with the example docs and it behaves similar when there
 is a zero doc boost. Luke, however, does not show me the index-time boosts.

Remember that the norm is a product of the length norm and the index
time boost... it's recorded as a single number in the index.

 Both document and field boosts are not visible in Luke's output. I've changed
 doc boost and field boosts for the mp500.xml document but all i ever see
 returned is boost=1.0. Is this correct?

Perhaps you still have omitNorms=true for the field you are querying?

-Yonik
http://www.lucidimagination.com


Re: Negative or zero value for fieldNorm

2010-11-03 Thread Yonik Seeley
Regarding Negative or zero value for fieldNorm, I don't see any
negative fieldNorms here... just very small positive ones?

Anyway the fieldNorm is the product of the lengthNorm and the
index-time boost of the field (which is itself the product of the
index time boost on the document and the index time boost of all
instances of that field).  Index time boosts default to 1 though, so
they have no effect unless something has explicitly set a boost.
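
(A quick worked example: with Lucene's default lengthNorm of 1/sqrt(numTerms), a 4-term title gets lengthNorm 1/sqrt(4) = 0.5, so with the default boost of 1.0 the fieldNorm is 0.5; with an index-time boost of 0 it becomes 0.5 * 0 = 0, which matches the fieldNorm=0 lines in the debug output quoted below.)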

-Yonik
http://www.lucidimagination.com



On Wed, Nov 3, 2010 at 2:30 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Hi all,

 I've got some puzzling issue here. During tests i noticed a document at the
 bottom of the results where it should not be. I query using DisMax on title
 and content field and have a boost on title using qf. Out of 30 results, only
 two documents also have the term in the title.

 Using debugQuery and fl=*,score i quickly noticed large negative maxScore of
 the complete resultset and a portion of the resultset where scores sum up to
 zero because of a product with 0 (fieldNorm).

 See below for debug output for a result with score = 0:

 0.0 = (MATCH) sum of:
  0.0 = (MATCH) max of:
    0.0 = (MATCH) weight(content:kunstgrasveld in 7), product of:
      0.75658196 = queryWeight(content:kunstgrasveld), product of:
        6.6516633 = idf(docFreq=33, maxDocs=9682)
        0.113743275 = queryNorm
      0.0 = (MATCH) fieldWeight(content:kunstgrasveld in 7), product of:
        2.236068 = tf(termFreq(content:kunstgrasveld)=5)
        6.6516633 = idf(docFreq=33, maxDocs=9682)
        0.0 = fieldNorm(field=content, doc=7)
    0.0 = (MATCH) fieldWeight(title:kunstgrasveld in 7), product of:
      1.0 = tf(termFreq(title:kunstgrasveld)=1)
      8.791729 = idf(docFreq=3, maxDocs=9682)
      0.0 = fieldNorm(field=title, doc=7)

 And one with a negative score:

 3.0716116E-4 = (MATCH) sum of:
  3.0716116E-4 = (MATCH) max of:
    3.0716116E-4 = (MATCH) weight(content:kunstgrasveld in 1462), product of:
      0.75658196 = queryWeight(content:kunstgrasveld), product of:
        6.6516633 = idf(docFreq=33, maxDocs=9682)
        0.113743275 = queryNorm
      4.059853E-4 = (MATCH) fieldWeight(content:kunstgrasveld in 1462), product
 of:
        1.0 = tf(termFreq(content:kunstgrasveld)=1)
        6.6516633 = idf(docFreq=33, maxDocs=9682)
        6.1035156E-5 = fieldNorm(field=content, doc=1462)

 There are no funky issues with term analysis for the text fieldType, in fact,
 the term passes through unchanged. I don't do omitNorms, i store termVectors
 etc.

 Because fieldNorm = fieldBoost / sqrt(numTermsForField) i suspect my input 
 from
 Nutch is messed up. A fieldNorm can never be = 0 for a normal positive boost
 and field boosts should not be zero or negative (correct me if i'm wrong). 
 But,
 since i can't yet figure out what field boosts Nutch sends to me i thought i'd
 drop by on this mailing list first.

 There are quite a few query terms that return with zero or negative scores and
 many that behave as i expect. I find it also a bit hard to comprehend why the
 docs with negative score rank higher in the result set than documents with
 zero score. Sorting defaults to score DESC,  but this is perhaps another
 issue.

 Anyway, the test runs on a Solr 1.4.1 instance with Java 6 under the hood.
 Help or directions are appreciated =)

 Cheers,

 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350



Re: blacklist docs by uniqueKey

2010-11-03 Thread Yonik Seeley
On Wed, Nov 3, 2010 at 3:05 PM, Erick Erickson erickerick...@gmail.com wrote:
 How dynamic is this list? Is it feasable to add a field to your docs like
 blacklisteddocs, and at editorial's discretion add values to that field
 like app1, app2?

 At that point you can just filter them out via a filter query...

Right, or a combination of the two approaches.
For a realtime approach, add the newest filters (say any filters added
that day) to a filter query, and roll those into a nightly reindex.
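
(As a sketch -- the field name and ids are made up -- the per-request filter could be as simple as:

fq=-blacklisted_apps:app1

or, for ids blacklisted since the last nightly reindex:

fq=-id:(101 OR 205 OR 317)

Solr handles such purely negative filter queries by implicitly matching all documents first.)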

-Yonik
http://www.lucidimagination.com


 Best
 Erick

 On Wed, Nov 3, 2010 at 2:40 PM, Ravi Kiran ravi.bhas...@gmail.com wrote:

 Hello,
        I have a single core servicing 3 different applications, one of the
 application doesn't want some specific docs to show up (driven by Editorial
 decision). Over a period of time the amount of blacklisted docs could grow,
 hence I do not want to restrict them in a query as the query could get
 extremely large. Is there a configuration option where we can blacklist ids
 (uniqueKey) from showing up in results.

 Is there anything similar to ElevationComponent that demotes docs? This
 could be ideal. I tried to look up and see if there was a boosting option
 in
 elevation component so that I could negatively boost certain docs but could
 not find any.

 Can anybody kindly point me in the right direction.

 Thanks

 Ravi Kiran Bhaskar




Re: Possible memory leaks with frequent replication

2010-11-02 Thread Yonik Seeley
On Tue, Nov 2, 2010 at 12:32 PM, Simon Wistow si...@thegestalt.org wrote:
 On Mon, Nov 01, 2010 at 05:42:51PM -0700, Lance Norskog said:
 You should query against the indexer. I'm impressed that you got 5s
 replication to work reliably.

 That's our current solution - I was just wondering if there was anything
 I was missing.

You could also try dialing down maxWarmingSearchers to 1 - that should
prevent multiple searchers warming at the same time and may be the
source of you running out of memory.
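
(In solrconfig.xml that's:

<maxWarmingSearchers>1</maxWarmingSearchers>

With that limit, a commit that arrives while another searcher is still warming will fail with an error instead of piling up more warming searchers.)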

-Yonik
http://www.lucidimagination.com


Re: big terms in UnInvertedField

2010-11-01 Thread Yonik Seeley
2010/11/1 Koji Sekiguchi k...@r.email.ne.jp:
 With solr example, using facet.field=text creates UnInvertedField
 for the text field in fieldValueCache. After that, I saw stats page
 and I was surprised at counters in *filterCache* were up:

 Do they cause of big words in UnInvertedField?

Yes.  big terms (defined as matching more than 5% of the index) are
not uninverted since it's more efficient (both CPU and memory) to use
the filterCache and calculate intersections.

 If so, when using both facet for multiValued field and facet for
 single valued field/facet query, it is difficult to estimate
 the size of filterCache.

Yep.  At least fieldValueCache (for UnInvertedField) tells you the
number of big terms in each field you are faceting on though.

-Yonik
http://www.lucidimagination.com


Re: Facet count of zero

2010-11-01 Thread Yonik Seeley
On Mon, Nov 1, 2010 at 12:55 PM, Tod listac...@gmail.com wrote:
 I'm trying to exclude certain facet results from a facet query.  It seems to
 work but rather than being excluded from the facet list its returned with a
 count of zero.

If you don't want to see 0 counts, use facet.mincount=1

http://wiki.apache.org/solr/SimpleFacetParameters
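
(Applied to the query above, that's simply:

q=(-foo:bar)&facet=true&facet.field=foo&facet.mincount=1

so any term with a zero count -- including "bar" -- is dropped from the facet list.)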

-Yonik
http://www.lucidimagination.com


 Ex:
 q=(-foo:bar)&facet=true&facet.field=foo&facet.sort=idx&wt=json&indent=true

 This returns bar with a count of zero.  All the other foo's show up with
 valid counts.

 Can I do this?  Is my syntax incorrect?



 Thanks - Tod



Re: solr 4.0 - pagination

2010-10-30 Thread Yonik Seeley
On Sat, Oct 30, 2010 at 12:22 PM, Papp Richard ccode...@gmail.com wrote:
  I'm using Solr 4.0 with grouping (field collapsing), but unfortunately I
 can't solve the pagination.

It's not implemented yet, but I'm working on that right now.

-Yonik
http://www.lucidimagination.com


Re: eDismax result differs from Dismax

2010-10-29 Thread Yonik Seeley
On Fri, Oct 29, 2010 at 9:30 AM, Ryan Walker r...@recruitmilitary.com wrote:

 We are launching a new version of our job board helping returning veterans 
 find a civilian job, and we chose Solr and Sunspot[1] to power our search. We 
 really didn't consider the power users in the HR world who are trained to use 
 boolean search, for example:

 Engineer AND (Electrical OR Mechanical)

 Sunspot supports the Dismax request handler, which unfortunately does not 
 handle the query above properly. So we read about eDismax and that it was 
 baked into Solr 1.5. At the same time, Sunspot has switched from LocalSolr 
 integration to storing a geohash in a full-text searchable field.

 We're having some problems with some complex queries that Sunspot generates:

 INFO: [] webapp=/solr path=/select 
 params={fl=+score&start=0&q=query:{!dismax+qf%3D'title_text+description_text'}Ruby+on+Rails+Developer+(location_details_s:dngythdb25fu^1.0+OR+location_details_s:dngythdb25f^0.0625+OR+location_details_s:dngythdb25*^0.00391+OR+location_details_s:dngythdb2*^0.000244+OR+location_details_s:dngythdb*^0.153+OR+location_details_s:dngythd*^0.00954+OR+location_details_s:dngyth*^0.000596+OR+location_details_s:dngyt*^0.373+OR+location_details_s:dngy*^0.0233+OR+location_details_s:dng*^0.00146)&wt=ruby&fq=type:Job&defType=edismax&rows=20}
  hits=1 status=0 QTime=13

 Under Dismax no results are returned for this query, however, as you can see 
 above with eDismax a result is returned -- the only difference between the 
 two queries are 'defType=edismax' vs 'defType=dismax'

That's to be expected.  Dismax doesn't even support fielded queries
(where you specify the fieldname in the query itself) so this clause
is treated all as text:

(location_details_s:dngythdb25fu^1.0

and dismax QP will be looking for tokens like location_details_s
dngythdb25fu (assuming tokenization would split on the
non-alphanumeric chars) in your text fields.

-Yonik
http://www.lucidimagination.com


Re: Custom Sorting in Solr

2010-10-29 Thread Yonik Seeley
On Fri, Oct 29, 2010 at 3:39 PM, Ezequiel Calderara ezech...@gmail.com wrote:
 Hi all guys!
 I'm in a weird situation here.
 We have index a set of documents which are ordered using a linked list (each
 documents has the reference of the previous and the next).

 Is there a way when sorting in the solr search, Use the linked list to sort?

It seems like you should be able to encode this linked list as an
integer instead, and sort by that?
If there are multiple linked lists in the index, it seems like you
could even use the high bits of the int to designate which list the
doc belongs to, and the low order bits as the order in that list.
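
(A sketch with made-up names: reserve the low bits for the position within a list and the high bits for the list id, index that as a single int field, and sort on it, e.g.

order_i = (listId << 20) + positionInList
...&sort=order_i asc

so documents from the same list stay together and appear in their linked order.)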

-Yonik
http://www.lucidimagination.com


Re: documentCache clarification

2010-10-29 Thread Yonik Seeley
On Fri, Oct 29, 2010 at 3:49 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : This is a limitation in the SolrCache API.
 : The key into the cache does not contain rows, so the cache returns the
 : first 10 docs and increments it's hit count.  Then the cache user
 : (SolrIndexSearcher) looks at the entry and determines it can't use it.

 Wow, I never realized that.

 Why don't we just include the start & rows (modulo the window size) in
 the cache key?

The implementation of equals() would be rather difficult... actually
impossible w/o abusing the semantics.
It would also be impossible w/o the Map implementation guaranteeing
what object was on the LHS vs the RHS when equals was called.

Unless I'm missing something obvious?

-Yonik
http://www.lucidimagination.com


Re: documentCache clarification

2010-10-29 Thread Yonik Seeley
On Fri, Oct 29, 2010 at 4:21 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 :  Why don't we just include the start & rows (modulo the window size) in
 :  the cache key?
 :
 : The implementation of equals() would be rather difficult... actually
 : impossible w/o abusing the semantics.
 : It would also be impossible w/o the Map implementation guaranteeing
 : what object was on the LHS vs the RHS when equals was called.
 :
 : Unless I'm missing something obvious?

 You've totally confused me.

 What i'm saying is that SolrIndexSearcher should consult the window size
 before consulting the cache -- the start param should be rounded down to
 the nearest multiple of the window size, and start+rows (ie: end) should
 be rounded up to one less than the nearest multiple of the window size,
 and then that should be looked up in the cache.

That's already done.
In example, do
q=*:*&rows=12
q=*:*&rows=16
and you should see a queryResultCache hit since queryResultWindowSize
is 20 and both requests round up to that.

*but* if you do this (with an index with more than 20 docs in it)
q=*:*&rows=25

Currently that query will round up to 40, but since nResults
(start+row) isn't in the key, it will still get a cache hit but then
not be usable.

Now, if your proposal is to put nResults into the key, we then have a
worse problem.
Assume we're starting over with a clean cache.
q=*:*&rows=25   // cached under a key including nResults=40
q=*:*&rows=15  // looked up under a key including nResults=20... not found!

 but that's why people are supposed to pick a window size greater
 than the largest number of rows typically requested)

Hmmm, I don't think so.  If that were the case, there would be no need
for two parameters (no need for queryResultWindowSize) since we would
always just pick queryResultMaxDocsCached.
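
(For reference, both settings live in solrconfig.xml; the example config has something like:

<queryResultWindowSize>20</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>

)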

-Yonik
http://www.lucidimagination.com


Re: SolrCore.getSearcher() and postCommit()

2010-10-29 Thread Yonik Seeley
On Fri, Oct 29, 2010 at 5:36 PM, Grant Ingersoll gsing...@apache.org wrote:
 Is it OK to call and increment a Searcher ref (i.e. SolrCore.getSearcher()) 
 in a SolrEventListener.postCommit() hook as long as I decrement it when I am 
 done?  I need to get a handle on an IndexReader so I can dump out a portion 
 of the index to an external process.

Yes, just be aware that the searcher you will get will not contain the
recently committed documents.
If you want that, look at the newSearcher hook instead.

-Yonik
http://www.lucidimagination.com


Re: How to index long words with StandardTokenizerFactory?

2010-10-24 Thread Yonik Seeley
On Sun, Oct 24, 2010 at 10:47 AM, Sergey Bartunov sbos@gmail.com wrote:
 I did it just as you recommended. Solr indexes files around 15kb, but
 no more. The same effect was with patched constants

Lucene also has max token sizes it can index.
IIRC, lengths used to be stored inline with the char data, and a
single char was used for the length.

The bigger question: Is this a problem for you (do you actually have a
use case)?

-Yonik
http://www.lucidimagination.com


Re: How to index long words with StandardTokenizerFactory?

2010-10-24 Thread Yonik Seeley
On Sun, Oct 24, 2010 at 11:29 AM, Sergey Bartunov sbos@gmail.com wrote:
 It's a kind of research. There is no particular practical use case as
 far as I know.
 Do you know how to set all these max token lengths?

It's a practical limit given how things are coded, not an arbitrary
one.  Given the lack of use cases, It would be a mistake to complicate
the code or make it less performant trying to support a larger limit.

-Yonik
http://www.lucidimagination.com


Re: How to index long words with StandardTokenizerFactory?

2010-10-23 Thread Yonik Seeley
On Fri, Oct 22, 2010 at 12:07 PM, Sergey Bartunov sbos@gmail.com wrote:
 I'm trying to force solr to index words whose length is more than 255

If the field is not a text field, the Solr's default analyzer is used,
which currently limits the token to 256 bytes.
Out of curiosity, what's your usecase that you really need a single 34KB token?

-Yonik
http://www.lucidimagination.com


Re: Date faceting +1MONTH problem

2010-10-22 Thread Yonik Seeley
On Fri, Sep 17, 2010 at 9:51 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:
  the default query parser
 doesn't support range queries with mixed upper/lower bound inclusion.

This has just been added to trunk.
Things like [0 TO 100} now work.
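
(For example, against the example data:

q=*:*&fq=price:[0 TO 100}

matches prices from 0 inclusive up to, but not including, 100.)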

-Yonik
http://www.lucidimagination.com


Re: Date faceting +1MONTH problem

2010-10-22 Thread Yonik Seeley
On Fri, Oct 22, 2010 at 6:02 PM, Shawn Heisey s...@elyograg.org wrote:
 On 10/22/2010 3:01 PM, Yonik Seeley wrote:

 On Fri, Sep 17, 2010 at 9:51 PM, Chris Hostetter
 hossman_luc...@fucit.org  wrote:

  the default query parser
 doesn't support range queries with mixed upper/lower bound inclusion.

 This has just been added to trunk.
 Things like [0 TO 100} now work.

 Awesome!  Is it easily ported back to branch_3x?

Between the refactoring work on the QP, and the back compat concerns,
it's not trivial.

-Yonik
http://www.lucidimagination.com


Re: why sorl is slower than lucene so much?

2010-10-21 Thread Yonik Seeley
2010/10/21 kafka0102 kafka0...@163.com:
 I found the problem's cause. It's the DocSetCollector. my filter query 
 result's size is about 300, so the DocSetCollector.getDocSet() is 
 OpenBitSet. And 300 OpenBitSet.fastSet(doc) ops are too slow.


As I said in my other response to you, that's a perfect reason why you
want Solr to cache that for you (unless the filter will be different
each time).

-Yonik
http://www.lucidimagination.com


Re: why solr search is slower than lucene so much?

2010-10-20 Thread Yonik Seeley
Careful comparing apples to oranges ;-)
For one, your lucene code doesn't retrieve stored fields.
Did you try the solr request more than once (with a different q, but
the same filters?)

Also, by default, Solr independently caches the filters.  This can be
higher up-front cost, but a win when filters are reused.  If you want
something closer to your lucene code, you could add all the filters to
 the main query and not use fq.
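
(i.e. something along the lines of

q=+xx +fid:1 +atm:[int_time1 TO int_time2]

with no fq params, so nothing gets executed or cached as a separate filter.)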

-Yonik
http://www.lucidimagination.com



On Wed, Oct 20, 2010 at 7:07 AM, kafka0102 kafka0...@163.com wrote:
 HI.
 my solr search has had some performance problems recently.
 my query is like this: q=xx&fq=fid:1&fq=atm:[int_time1 TO int_time2],
 fid's type is : <fieldType name="int" class="solr.TrieIntField"
 precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
 atm's type is : <fieldType name="sint" class="solr.TrieIntField"
 precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
 my index's size is about 500M and record num is 3984274.
 when I use solr's SolrIndexSearcher.search(QueryResult qr, QueryCommand
 cmd), it costs about 70ms. When I changed to use Lucene's API directly, like below:

      final SolrQueryRequest req = rb.req;
      final SolrIndexSearcher searcher = req.getSearcher();
      final SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand();
      final ExecuteTimeStatics timeStatics =
 ExecuteTimeStatics.getExecuteTimeStatics();
      final ExecuteTimeUnit staticUnit =
 timeStatics.addExecuteTimeUnit("test2");
      staticUnit.start();
      final List<Query> query = cmd.getFilterList();
      final BooleanQuery booleanFilter = new BooleanQuery();
      for (final Query q : query) {
        booleanFilter.add(new BooleanClause(q,Occur.MUST));
      }
      booleanFilter.add(new BooleanClause(cmd.getQuery(),Occur.MUST));
      logger.info("q:" + query);
      final Sort sort = cmd.getSort();
      final TopFieldDocs docs = searcher.search(booleanFilter,null,20,sort);
      final StringBuilder sbBuilder = new StringBuilder();
      for (final ScoreDoc doc :docs.scoreDocs) {
        sbBuilder.append(doc.doc + ",");
      }
      logger.info("hits:" + docs.totalHits + ",result:" + sbBuilder.toString());
      staticUnit.end();

 it cost only about 20ms.
 I'm so confused. For solr's config, I closed cache. For test, I first called
 lucene's, and then solr's.
 Maybe I should look solr's source more carefully. But now, can anyone knows
 the reason?





Re: filter query from external list of Solr unique IDs

2010-10-15 Thread Yonik Seeley
On Fri, Oct 15, 2010 at 11:49 AM, Burton-West, Tom tburt...@umich.edu wrote:
 At the Lucene Revolution conference I asked about efficiently building a 
 filter query from an external list of Solr unique ids.

Yeah, I've thought about a special query parser and query to deal with
this (relatively) efficiently, both from a query perspective and a
memory perspective.

Should be pretty quick to throw together:
- comma separated list of terms (unique ids are a special case of this)
- in the query, store as a single byte array for efficiency
- sort the ids if they aren't already sorted
- do lookups with a term enumerator and skip weighting or anything
else like that
- configurable caching... may, or may not want to cache this big query

That's only part of the stuff you mention, but seems like it would be
useful to a number of people.
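
(None of this exists yet -- purely as a sketch of what the request might look like, with a hypothetical parser name and parameter:

fq={!idlist f=id}DOC1,DOC5,DOC9

)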

-Yonik
http://www.lucidimagination.com


Re: facet.field :java.lang.NullPointerException

2010-10-15 Thread Yonik Seeley
This is https://issues.apache.org/jira/browse/SOLR-2142
I'll look into it soon.
-Yonik
http://www.lucidimagination.com



On Fri, Oct 15, 2010 at 3:12 PM, Pradeep Singh pksing...@gmail.com wrote:
 Faceting blows up when the field has no data. And this seems to be random.
 Sometimes it will work even with no data, other times not. Sometimes the
 error goes away if the field is set to multiValued=true (even though it's
 one value every time), other times it doesn't. In all cases setting
 facet.method to enum takes care of the problem. If this param is not set,
 the default leads to null pointer exception.


 09:18:52,218 SEVERE [SolrCore] Exception during facet.field of
 xyz:java.lang.NullPointerException

      at java.lang.System.arraycopy(Native Method)

      at org.apache.lucene.util.PagedBytes.copy(PagedBytes.java:247)

      at
 org.apache.solr.request.TermIndex$1.setTerm(UnInvertedField.java:1164)

      at
 org.apache.solr.request.NumberedTermsEnum.<init>(UnInvertedField.java:960)

      at
 org.apache.solr.request.TermIndex$1.<init>(UnInvertedField.java:1151)

      at
 org.apache.solr.request.TermIndex.getEnumerator(UnInvertedField.java:1151)

      at
 org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:204)

      at
 org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:188)

      at
 org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:911)

      at
 org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:298)

      at
 org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:354)

      at
 org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:190)

      at
 org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72)

      at
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:210)

      at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)

      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)

      at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)

      at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
                at



Re: Which version of Solr to use?

2010-10-14 Thread Yonik Seeley
On Thu, Oct 14, 2010 at 1:58 PM, Lukas Kahwe Smith m...@pooteeweet.org wrote:
 the current confusing list of branches is a result of the merge of the lucene 
 and solr svn repositories. what baffles me is that so far the countless
 pleas for at least a rough roadmap or even just explanation for why so many
 branches are needed

There is one branch users need  to be concerned about: branch_3x
All 3.x releases will be made from that branch.

trunk (which is technically not a branch) is 4.0

-Yonik
http://www.lucidimagination.com


Re: Which version of Solr to use?

2010-10-14 Thread Yonik Seeley
On Thu, Oct 14, 2010 at 1:50 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 I'm kind of confused about Solr development plans in general, highlighted by
 this thread.

 I think 1.4.1 is the latest officially stable release, yes?

 Why is there both a 1.5 and a 3.x, anyway?  Not to mention a 4.x?  Which of
 these will end up being a stable release? Both? From which will come the
 next stable release?

1.5 is pre lucene/solr merge, and is very unlikely to ever be released.
3.1 is the next lucene/solr point release (3x branch in svn)
4.0 is the next major release (trunk in svn)

-Yonik
http://www.lucidimagination.com


Re: Which version of Solr to use?

2010-10-14 Thread Yonik Seeley
On Thu, Oct 14, 2010 at 2:39 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 Thanks Yonik!  So I gather that the 1.5 branch has essentially been
 abandoned, we can pretend it doesn't exist at all, it's been entirely
 superseded by the 3.x branch, with the changes made just for the purposes of
 syncronizing versions with lucene.

Right.  Everything marked as 1.5 in the past is in 3.1-dev and 4.0-dev.

1.5 was always just a place-holder for the next release, which could
have been 2.0 if we had upgraded Lucene and changed enough stuff in
Solr.  So even before the Lucene/Solr merge, a 1.5 release was never
really guaranteed.

-Yonik
http://www.lucidimagination.com


Re: Which version of Solr to use?

2010-10-14 Thread Yonik Seeley
On Thu, Oct 14, 2010 at 2:55 PM, Mike Squire mike.squ...@gmail.com wrote:
 As pointed out before it would be useful to have some kind of
 documented road map for development, and some kind of indication of
 how close certain versions are to release.

Such things have proven to be very unreliable in the past, due to the
volunteer nature of open source.  It would also require everyone
agreeing up-front - which rarely happens ;-)

Specifically for 3.1, everyone seems to want to do a release, and we
have plenty of new features to support that.  I expect it's close, but
the work still needs to be done.

Anyway, our new split branch_3x / trunk development model *should*
allow for more frequent releases in the future, once we get things
rolling.

Side note: I would submit that those projects that release every few weeks
add no additional value over our (currently) infrequent releases.  Due
to our high quality test suites and peer reviewed patches, I'd bet the
stability of our nightly snapshots over some of those other projects
any day!

-Yonik
http://www.lucidimagination.com


Re: Faceting and first letter of fields

2010-10-14 Thread Yonik Seeley
On Thu, Oct 14, 2010 at 3:42 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 I believe that should work fine in Solr 1.4.1.  Creating a field with just
 first letter of author is definitely the right (possibly only) way to allow
 facetting on first letter of author's name.

 I have very voluminous facets (few facet values, many docs in each value)
 like that in my app too, works fine.

 I get confused over the different facetting methods available in 1.4.1, and
 exactly when each is called for. If you see initial problems, you could try
 switching the facet.method and see what happens.

Right - for faceting on first letter, you should probably use facet.method=enum
since there will only be 26 values (assuming english/western languages).

In the future, I'm hoping we can come up with a smarter way to pick
the facet.method if it's not supplied.  The new flex API in 4.0-dev
should help out here.
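
(Assuming a field that holds just the first letter -- say author_first_letter, the name is illustrative -- the request would be along the lines of:

facet=true&facet.field=author_first_letter&facet.method=enum&facet.mincount=1

)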

-Yonik
http://www.lucidimagination.com


Re: LuceneRevolution - NoSQL: A comparison

2010-10-13 Thread Yonik Seeley
On Tue, Oct 12, 2010 at 12:11 PM, Jan Høydahl / Cominvent
jan@cominvent.com wrote:
 I'm pretty sure the 2nd phase to fetch doc-summaries goes directly to same 
 server as first phase. But what if you stick a LB in between?

A related point - the load balancing implementation that's part of
SolrCloud (and looks like it will be committed to trunk soon), does
keep track of what server it used for the first phase and uses that
for subsequent phases.

-Yonik
http://www.lucidimagination.com


Re: Spatial search in Solr 1.5

2010-10-13 Thread Yonik Seeley
On Wed, Oct 13, 2010 at 7:28 AM, PeterKerk vettepa...@hotmail.com wrote:
 Hi,

 Thanks for the quick reply :)

 I downloaded the latest version from the trunk. Got it up and running, and
 got the error below:

Hopefully the QuickStart on the wiki all worked for you,
but you only got the error when customizing your own config?

Anyway, it looks like you haven't defined a _latLon dynamic field type
for the lat / lon components.

Here's what is in the example schema:

   <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
   <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>

   <field name="store" type="location" indexed="true" stored="true"/>

-Yonik
http://www.lucidimagination.com

 URL:
 http://localhost:8983/solr/db/select/?wt=xml&indent=on&facet=true&fl=id,title,lat,lng,city&facet.field=province_raw&q=*:*&fq={!geofilt%20pt=45.15,-93.85%20sfield=geolocation%20d=5}

 HTTP ERROR 400

 Problem accessing /solr/db/select/. Reason:

    undefined field geolocation_0_latLon

 Powered by Jetty://



 My field definition is:

 I added this in schema.xml:
 <field name="geolocation" type="latLon" indexed="true" stored="true"/>
 <fieldType name="latLon" class="solr.LatLonType" subFieldSuffix="_latLon"/>


 data-config.xml:
 <entity name="location_geolocations" query="select (lat+','+lng) as geoloc
 FROM locations WHERE id='${location.id}'">
        <field name="geolocation" column="geoloc" />
 </entity>


 I looked in the schema.xml of the latest download, but it turns out in the
 download there's nothing defined in that schema.xml on latLon type either.

 Any suggestions what im doing wrong?

 Thanks!
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Spatial-search-in-Solr-1-5-tp489948p1693797.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Spatial search in Solr 1.5

2010-10-13 Thread Yonik Seeley
On Wed, Oct 13, 2010 at 9:42 AM, PeterKerk vettepa...@hotmail.com wrote:
 Im now thinking I downloaded the wrong solr zip, I tried this one:
 https://hudson.apache.org/hudson/job/Solr-trunk/lastSuccessfulBuild/artifact/trunk/solr/dist/apache-solr-4.0-2010-10-12_08-05-48.zip

 In that example scheme
 (\apache-solr-4.0-2010-10-12_08-05-48\example\example-DIH\solr\db\conf\schema.xml)
 nothing is mentioned about a fieldtype of class solr.LatLonType.

Ah, right - DIH has a separate schema.   Blech.

-Yonik
http://www.lucidimagination.com


Re: Spatial search in Solr 1.5

2010-10-13 Thread Yonik Seeley
On Wed, Oct 13, 2010 at 10:06 AM, PeterKerk vettepa...@hotmail.com wrote:

 haha ;)

 But so I DO have the right solr version?

 Anyways...I have added the lines you mentioned, what else can I do?

The fact that the geolocation field does not show up in the results means that
it's not getting added (i.e. something is probably wrong with your DIH config).

-Yonik
http://www.lucidimagination.com


Re: Spatial search in Solr 1.5

2010-10-12 Thread Yonik Seeley
You may want to check the docs, which were recently updated to reflect
the state of trunk:
http://wiki.apache.org/solr/SpatialSearch

-Yonik
http://www.lucidimagination.com



On Tue, Oct 12, 2010 at 7:49 PM, PeterKerk vettepa...@hotmail.com wrote:

 Hey Grant,

 Just came accross this post of yours.

 Run a query:  http://localhost:8983/solr/select/?q=_val_:recip(dist(2,
 store, vector(34.0232,-81.0664)),1,1,0)&fl=*,score  // Note, I just updated
 this, it used to be point instead of vector and that was wrong.

 What does your suggested query actually do?

 I really need great circle calculation. Don't care if it's from the trunk, as
 long as I can have it in my projects asap :)

 Thanks ahead!
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Spatial-search-in-Solr-1-5-tp489948p1691361.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Spatial search in Solr 1.5

2010-10-12 Thread Yonik Seeley
On Tue, Oct 12, 2010 at 8:07 PM, PeterKerk vettepa...@hotmail.com wrote:

 Ok, so does this actually say:
 for now you have to do calculations based on bounding box instead of great
 circle?

I tried to make the documentation a little simpler... there's
 - geofilt... filters within a radius of d km  (i.e. great circle distance)
 - bbox... filters using a bounding box
 - geodist... function query that yields the distance (again, great
circle distance)
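
(For reference, the three forms from that page, using the example schema's store field and the same point used earlier in this thread:

pt=45.15,-93.85&sfield=store&d=5&fq={!geofilt}
pt=45.15,-93.85&sfield=store&d=5&fq={!bbox}
pt=45.15,-93.85&sfield=store&sort=geodist() asc

)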

If you point out the part to the docs you found confusing, I can try
and improve it.
Did you try and step through the quick start?  Those links actually work!

 And the fact that on top of the page it says Solr4.0, does that imply I
 cant use this right now? Or where could I find the latest trunk for this?

The wiki says If you haven't already, get a recent nightly build of Solr4.0...
and links to the Solr4.0 page, which points to
http://wiki.apache.org/solr/FrontPage#solr_development
for nightly builds.

-Yonik

http://www.lucidimagination.com


Re: LuceneRevolution - NoSQL: A comparison

2010-10-11 Thread Yonik Seeley
On Mon, Oct 11, 2010 at 8:32 PM, Peter Keegan peterlkee...@gmail.com wrote:
 I listened with great interest to Grant's presentation of the NoSQL
 comparisons/alternatives to Solr/Lucene. It sounds like the jury is still
 out on much of this. Here's a use case that might favor using a NoSQL
 alternative for storing 'stored fields' outside of Lucene.

 When Solr does a distributed search across shards, it does this in 2 phases
 (correct me if I'm wrong):

 1. 1st query to get the docIds and facet counts
 2. 2nd query to retrieve the stored fields of the top hits

 The problem here is that the index could change between (1) and (2), so it's
 not an atomic transaction.

Yep.

As I discussed with Peter at Lucene Revolution, if this feature is
important to people, I think the easiest way to solve it would be via
leases.

During a query, a client could request a lease for a certain amount of
time on whatever index version is used to generate the response.  Solr
would then return the index version to the client along with the
response, and keep the index open for that amount of time.  The client
could make consistent additional requests (such as the 2nd phase of a
distributed request)  by requesting the same version of the index.
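
(To be clear, nothing like this exists today; a purely hypothetical flow might look like: the first-phase request adds something like lease=30, the response reports the index version it was answered from, and the second-phase request passes that version back so it is served from the same held-open index.)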

-Yonik


Re: Upgrade to Solr 1.4, very slow at start up when loading all cores

2010-10-02 Thread Yonik Seeley
On Fri, Oct 1, 2010 at 5:42 PM, Renee Sun renee_...@mcafee.com wrote:
 Hi Yonik,

 I attached the solrconfig.xml to you in previous post, and we do have
 firstSearch and newSearch hook ups.

 I commented them out, all 130 cores loaded up in 1 minute, same as in solr
 1.3.  total memory took about 1GB. Whereas in 1.3, with hook ups, it took
 about 6.5GB for same amount of data.

For other's reference: here is the warming query (it's the same for
newSearcher too):

<listener event="firstSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst>
<str name="q">type:message</str>
<str name="start">0</str>
<str name="rows">10</str>
<str name="sort">message_date desc</str>
</lst>
</arr>
</listener>

The sort field message_date is what will be taking up the memory.

Starting with Lucene 2.9 (which is used in Solr 1.4), searching and
sorting is per-segment.
This is generally beneficial, but in this case I believe it is causing
the extra memory usage because the same date value that would have
been shared across all documents in the fieldcache is now repeated in
each segment it is used in.

One potential fix (that requires you to reindex) is to use the date
fieldType as defined in the new 1.4 schema:
<fieldType name="date" class="solr.TrieDateField" omitNorms="true"
precisionStep="0" positionIncrementGap="0"/>

This will use 8 bytes per document in your index, rather than 4 bytes
per doc + an array of unique string-date values per index.

Trunk (4.0-dev) is also much more efficient at storing string-based
fields in the FieldCache - but that will only help you if you're
comfortable with using development versions.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Local Solr, Spatial Search, and LatLonType clarification

2010-09-30 Thread Yonik Seeley
On Thu, Sep 30, 2010 at 1:09 PM, webdev1977 webdev1...@gmail.com wrote:
 1.  I noticed that it said that the type of LatLongType can not be
 mulitvalued. Does that mean that I can not have multiple lat/lon values for
 one document.

That means that if you want to have multiple points per document, each
point must be in a different field.
This often makes sense anyway, when the points have different
semantics - i.e. work and home locations.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Local Solr, Spatial Search, and LatLonType clarification

2010-09-30 Thread Yonik Seeley
On Thu, Sep 30, 2010 at 1:40 PM, webdev1977 webdev1...@gmail.com wrote:
 Or.. do you mean each field must have a unique name, but both be of type
 latLon(solr.LatLonType).
 <work>x,y</work>
 <home>x,y</home>

Yes.

 If the statement directly above is true (I hope that it is not), how does
 one dynamically create fields when adding geotags?

Dynamic field types.  You can configure it such that anything ending
with _latlon is of type LatLonType.
Perhaps we should do this in the example schema.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Local Solr, Spatial Search, and LatLonType clarification

2010-09-30 Thread Yonik Seeley
On Thu, Sep 30, 2010 at 1:48 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 Dynamic field types.  You can configure it such that anything ending
 with _latlon is of type LatLonType.
 Perhaps we should do this in the example schema.

Looks like we already have it:

   <dynamicField name="*_p" type="location" indexed="true" stored="true"/>

So you should be able to add stuff like home_p and work_p w/o defining
them ahead of time.  Anything ending in _p is of type location.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Upgrade to Solr 1.4, very slow at start up when loading all cores

2010-09-30 Thread Yonik Seeley
On Thu, Sep 30, 2010 at 10:41 AM, Renee Sun renee_...@mcafee.com wrote:

 Hi -
 I posted this problem but no response, I guess I need to post this in the
 Solr-User forum. Hopefully you will help me on this.

 We were running Solr 1.3 for long time, with 130 cores. Just upgrade to Solr
 1.4, then when we start the Solr, it took about 45 minutes. The catalina.log
 shows Solr is very slowly loading all the cores.

Have you tried 1.4.1 yet?
Could you open a JIRA issue for this and give whatever info you can?
Info like:
  - do you have any warming queries configured?
  - do the cores have documents already, and if so, how many per core?
  - are you using the same schema & solrconfig, or did you upgrade?
  - have you tried finding out what is taking up all the memory (or
all the CPU time)?

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Queries, Functions, and Params

2010-09-29 Thread Yonik Seeley
On Tue, Sep 28, 2010 at 6:08 PM, Robert Thayer
robert.tha...@bankserv.com wrote:
 On the http://wiki.apache.org/solr/FunctionQuery page, the following query 
 function is listed:

 q={!func}add($v1,$v2)&v1=sqrt(popularity)&v2=100.0

 When run against the default solr instance, server returns the error(400): 
 undefined field $v1.

 Any way to remedy this?

 Using version: 3.1-2010-09-28_05-53-44


The wiki page indicates this is a 4.0 feature - so you need a recent
4.0-dev build to try it out.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Conditional Function Queries

2010-09-28 Thread Yonik Seeley
On Tue, Sep 28, 2010 at 11:33 AM, Jan Høydahl / Cominvent
jan@cominvent.com wrote:
 Have anyone written any conditional functions yet for use in Function Queries?

Nope - but it makes sense and has been on my list of things to do for
a long time.

-Y
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: bi-grams for common terms - any analyzers do that?

2010-09-25 Thread Yonik Seeley
On Sat, Sep 25, 2010 at 8:21 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 Huh, okay, I didn't know that #2 happened at all. Can you explain or point me 
 to documentation to explain when it happens?  I'm afraid I'm having trouble 
 understanding   if the analyzer returns more than one position back from a 
 queryparser token (whitespace). 

 Not entirely sure what that means.  Can you give an example?

It's always happened, up until recently when it's been made configurable.
An example is IndexReader being split into two tokens by
WordDelimiterFilter and searched as index reader (i.e. the two terms
must be directly next to each other for the document to match).  If
the new autoGeneratePhraseQueries is off, position doesn't matter,
and the query will be treated as index OR reader.
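
(On recent trunk / 3.x builds the switch is an attribute on the field type in schema.xml, e.g.:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  ...
</fieldType>

)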

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: matches in result grouping

2010-09-23 Thread Yonik Seeley
2010/9/23 Koji Sekiguchi k...@r.email.ne.jp:
  (10/09/23 18:14), Koji Sekiguchi wrote:
  I'm using recent committed field collapsing / result grouping
 feature in trunk.

 I'm confusing matches parameter in the result at the second
 sample output of Wiki:

 http://wiki.apache.org/solr/FieldCollapsing#Quick_Start

 I cannot understand why there are two matches:5 entries
 in the result. Can anyone explain it?
 Probably multiple GroupCollectors are generated for each group.field,
 group.func and group.query and match can be counted per collector.

Correct.  The matches is the doc count before any grouping (and for
group.query that means before the restriction given by group.query is
applied).  It won't always be the same though - for example we might
implement filter excludes like we do with faceting, etc.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Range query not working

2010-09-23 Thread Yonik Seeley
On Thu, Sep 23, 2010 at 4:30 PM, PeterKerk vettepa...@hotmail.com wrote:
 I have this in my query:
  q=*:*&facet.query=location_rating_total:[3 TO 100]

 And this document:
 <result name="response" numFound="6" start="0" maxScore="1.0">
 <doc>
 <float name="score">1.0</float>
 <str name="id">1</str>
 <int name="location_rating_total">2</int>
 </doc>

 But still my total results equals 6 (total population) and not 0 as I would
 expect

 Why?

facet.query will give you the number of docs matching
location_rating_total:[3 TO 100], it does not restrict the results
list.  If you want that, you want a filter.

Try
q=*:*&fq=location_rating_total:[3 TO 100]

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Range query not working

2010-09-23 Thread Yonik Seeley
On Thu, Sep 23, 2010 at 5:44 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 The field type in a standard schema.xml that's defined as integer is NOT
 sortable.

Right - before 1.4.  There is no integer field type in 1.4 and
beyond in the example schema.

 You can not sort on this and get what you want. (What's the point
 of it even existing then, if it pretty much does the same thing as a string
 field?

You can sort on it... you just can't do range queries on it because
the term order isn't correct for numerics.
It's there only for support of legacy lucene indexes that indexed
numerics as plain strings.
They are now named pint for plain integer in 1.4 and above.

Perhaps we should retain support for that, but remove them from the
example schema and only document them somewhere (under supporting
lucene indexes built by other software or something?)
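
(For reference, the 1.4 example schema spells the two out roughly as:

<fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="pint" class="solr.IntField" omitNorms="true"/>

The trie-based int sorts and range-queries correctly; pint is only there for indexes built with plain string numerics.)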

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: multiple spatial values

2010-09-21 Thread Yonik Seeley
On Tue, Sep 21, 2010 at 12:12 PM, dan sutton danbsut...@gmail.com wrote:
 I was looking at the LatLonType and how it might represent multiple lon/lat
 values ... it looks to me like the lat would go in {latlongfield}_0_LatLon
 and the long in {latlongfield}_1_LatLon ... how then if we have multiple
 lat/long points for a doc when filtering for example we choose the correct
 points.

 e.g. if thinking in cartisean coords and we have

 P1(3,4), P2(6,7) ... x is stored with 3,6 and y with 4,7 ...

 then how does it ensure we're not erroneously picking (3,7) or (6,4) whilst
 filtering with the spatial query?

That's why it's a single-valued field only for now...

 don't we have to store both values together ? what am i missing here?

The problem is that we don't have a way to query both values together,
so we must index them separately.  The basic LatLonType uses numeric
queries on the lat and lon fields separately.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Version stability [was: svn branch issues]

2010-09-17 Thread Yonik Seeley
I think we aim for a stable trunk (4.0-dev) too, as we always have
(in the functional sense... i.e. operate correctly, don't crash, etc).

The stability is more a reference to API stability - the Java APIs are
much more likely to change on trunk.  Solr's *external* APIs are much
less likely to change for core services.  For example, I don't see us
ever changing the rows parameter or the XML update format in a
non-back-compat way.

Companies can (and do) go to production on trunk versions of Solr
after thorough testing in their scenario (as they should do with *any*
new version of solr that isn't strictly bugfix).

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8

On Fri, Sep 17, 2010 at 10:16 AM, Mark Miller markrmil...@gmail.com wrote:
 The 3.x line should be pretty stable. Hopefully we will do a release
 soon. A conversation was again started about more frequent releases
 recently, and hopefully that will lead to a 3.x release near term.

 In any case, 3.x is the stable branch - 4.x is where the more crazy
 stuff happens. If you are used to the terms, 4.x is the unstable branch,
 though some freak out if you call that for fear you think its 'really
 unstable'. In reality, it just means likely less stable than the stable
 branch (3.x), as we target 3.x for stability and 4.x for stickier or non
 back compat changes.

 Eventually 4.x will be stable and 5.x unstable, with possible
 maintenance support for previous stable lines as well.

 - Mark
 lucidimagination.com

 On 9/17/10 9:58 AM, Mark Allan wrote:
 OK, 1.5 won't be released, so we'll avoid that.  I've now got my code
 additions compiling against a version of 3.x so we'll stick with that
 rather than solr_trunk for the time being.

 Does anyone have any sense of when 3.x might be considered stable enough
 for a release?  We're hoping to go to service with something built on
 Solr in Jan 2011 and would like to avoid development phase software, but
 if needs must...

 Thanks
 Mark


 On 9 Sep 2010, at 12:10 pm, Markus Jelsma wrote:

 Well, it's under heavy development but the 3.x branch is more likely
 to become released than 1.5.x, which is highly unlikely to be ever
 released.


 On Thursday 09 September 2010 13:04:38 Mark Allan wrote:
 Thanks. Are you suggesting I use branch_3x and is that considered
 stable?
 Cheers
 Mark

 On 9 Sep 2010, at 10:47 am, Markus Jelsma wrote:
 http://svn.apache.org/repos/asf/lucene/dev/branches/






Re: Version stability [was: svn branch issues]

2010-09-17 Thread Yonik Seeley
On Fri, Sep 17, 2010 at 10:46 AM, Mark Miller markrmil...@gmail.com wrote:
 I agree it's mainly API wise, but there are other issues - largely due
 to Lucene right now - consider the bugs that have been dug up this year
 on the 4.x line because flex has been such a large rewrite deep in
 Lucene. We wouldn't do flex on the 3.x stable line and it's taken a
 while for everything to shake out in 4.x (and it's prob still swaying).

Right.  That big difference also has implications for the 3.x line too
though - possible backports of new features like field collapsing or
per-segment faceting that involve the flex API would involve a good
amount of re-writing (along with the introduction of new bugs).  I'd
put my money on 4.0-dev being actually *more* stable for these new
features.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: doc into doc

2010-09-17 Thread Yonik Seeley
On Fri, Sep 17, 2010 at 4:12 PM, facholi rfach...@gmail.com wrote:

 Hi,

 I would like a json result like that:

 {
   id:2342,
   name:Abracadabra,
   metadatas: [
      {type:tag, name:tutorial},
      {type:value, name:2323.434/434},
   ]
 }

Do you mean JSON with the tags not quoted (that's not legal JSON), or
do you mean the metadata part?
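
(For comparison, the strictly legal-JSON form of that structure would quote the keys and string values:

{"id":2342,"name":"Abracadabra","metadatas":[{"type":"tag","name":"tutorial"},{"type":"value","name":"2323.434/434"}]}

)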

Anyway, I assume you're not asking about how to get a JSON response in general?
If so, search for json here: http://lucene.apache.org/solr/tutorial.html

If you're looking for something else, you'll need to be more specific.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Null Pointer Exception while indexing

2010-09-16 Thread Yonik Seeley
On Wed, Sep 15, 2010 at 2:01 PM, andrewdps mstpa...@gmail.com wrote:
 I still get the same error when I try to index the mrc file...

If you get the exact same error, then you are still using GCJ.
When you type java it's probably going to GCJ because of your path
(i.e. change it or directly specify the path to the new JVM you just
installed).

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: SOLR interface with PHP using javabin?

2010-09-16 Thread Yonik Seeley
On Thu, Sep 16, 2010 at 2:30 PM, onlinespend...@gmail.com
onlinespend...@gmail.com wrote:
  I am planning on creating a website that has some SOLR search capabilities
 for the users, and was also planning on using PHP for the server-side
 scripting.

 My goal is to find the most efficient way to submit search queries from the
 website, interface with SOLR, and display the results back on the website.
  If I use PHP, it seems that all the solutions use some form of character
 based stream for the interface.  It would seem that using a binary
 representation, such as javabin, would be more efficient.

 If using javabin, or some similar efficient binary stream to interface SOLR
 with PHP is not possible, what do people recommend as the most efficient
 solution that provides the best performance, even if that means not using
 PHP and going with some other alternative?

I'd recommend going with JSON - it will be quite a bit smaller than
XML, and the parsers are generally quite efficient.
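
(i.e. add wt=json to the request, and optionally json.nl=map so named lists such as facet counts come back as JSON objects rather than flat arrays:

http://localhost:8983/solr/select?q=ipod&wt=json&json.nl=map

On the PHP side, json_decode() handles the rest; there is also wt=phps, which returns PHP's native serialized format.)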

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8

