Solr Queries

2009-10-06 Thread Pravin Karne
Hi,
I am new to Solr. I have the following queries:


1.   Does Solr work in a distributed environment? If yes, how do I configure it?



2.   Does Solr have Hadoop support? If yes, how do I set it up with Hadoop/HDFS?
(Note: I am familiar with Hadoop)



3.   I have 1 TB of employee information (id, name, address, cell no, personal
info). To post (index) this data to the Solr server, do I have to create an XML
file with this data and then post it, or is there another, more optimal way? In
future my data will grow up to 10 TB; how can I index that much data then?
(Creating XML files is a headache.)





Thanks in advance

-Pravin




DISCLAIMER
==
This e-mail may contain privileged and confidential information which is the 
property of Persistent Systems Ltd. It is intended only for the use of the 
individual or entity to which it is addressed. If you are not the intended 
recipient, you are not authorized to read, retain, copy, print, distribute or 
use this message. If you have received this communication in error, please 
notify the sender and delete all copies of this message. Persistent Systems 
Ltd. does not accept any liability for virus infected mails.


Re: solr 1.4 formats last_index_time for SQL differently than 1.3 ?!?

2009-10-06 Thread Noble Paul നോബിള്‍ नोब्ळ्
really?
I don't remember that being changed.

What difference do you notice?

On Wed, Oct 7, 2009 at 2:30 AM, michael8  wrote:
>
> Just looking for confirmation from others, but it appears that the formatting
> of last_index_time from dataimport.properties (using DataImportHandler) is
> different in 1.4 vs. that in 1.3.  I was troubleshooting why delta imports
> are no longer working for me after moving over to solr 1.4 (10/2 nightly) and
> noticed that format is different.
>
> Michael
> --
> View this message in context: 
> http://www.nabble.com/solr-1.4-formats-last_index_time-for-SQL-differently-than-1.3--%21--tp25776496p25776496.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: Weird Facet and KeywordTokenizerFactory Issue

2009-10-06 Thread Ravi Kiran
Hello Mr. Hostetter,
Thank you for patiently reading through my post. I apologize for being
cryptic in my previous messages.

>>when you cut/pasted the facet output, you excluded the field names.  based
>>on the schema & solrconfig.xml snippets you posted later, i'm assuming
>>they are usstate, and keyword, but you have to be explicit so that people
can help correlate the
>>results you are getting with the schema you posted

I had to be brief as my facets are on the order of 100K over 800K documents,
and I was afraid that if I gave the complete schema.xml nobody would read my
long message :-). Hence I showed only the relevant pieces of the result,
showing different fields having the same problem.

>>i'm assuming they are usstate, and keyword, but you have to be explicit so
that people can help correlate the
>>results you are getting with the schema you posted -- for example, you
haven't posted anything that would verify that the usstate >>field actually
uses your keywordText field

Yes, you are right. Here is the complete relevant snippet regarding
keywordText and associated fields. keyword, keywordlower and
keywordformatted are all aggregations of the other fields, like person,
personformatted, organization, location. location itself is an aggregation of
usstate and country. The aggregation is done separately in custom code even
before indexing into Solr.


  






  
  





  













>>A huge gap is in what your synonym files contain ... something weird in
>>there could easily explain superfluous terms getting added to your data.
Here are my synonym entries
---

#Persons
barack obama, barak obama, barack h. obama, barack hussein obama, barak
hussein obama
hillary clinton, hillary r. clinton, hillary rodham clinton
timothy geithner, tim geithner, timothy f. geithner, geithner, timothy franz
geithner
vladimir putin, putin

#Organizations
U.N, U.N., u.n, un, UN, United Nations => U.N
DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security => D.H.S
USCIS, United States Citizenship and Immigration Services, U.S.C.I.S. =>
United States Citizenship and Immigration Services, U.S.C.I.S
SEC, Securities and Exchange Commission, S.E.C, S.E.C, SEC. => Securities
and Exchange Commission, S.E.C
FCC, Federal Communications Commission, F.C.C, F.C.C. => Federal
Communications Commission, F.C.C
GSA, General Services Administration, G.S.A, G.S.A. => General Services
Administration, G.S.A
SBA, Small Business Administration, S.B.A, S.B.A. => Small Business
Administration, S.B.A.
FEMA, Federal Emergency Management Agency, FEMA. => FEMA
AT&T, ATT, ATT., AT&T., AT&T Wireless => AT&T
BBC, British Broadcasting Corporation, B.B.C, B.B.C. => B.B.C,BBC
Bank of America, BOA, B.O.A, Bank of America Corp, Bank of America Corp. =>
B.O.A
General Motors, G.M., G.M, GM, General Motors Corp., General Motors Corp =>
General Motors, G.M
NFL, National Football League, N.F.L, N.F.L. => N.F.L
Exxon Mobil, Exxon Mobil Corp => Exxon Mobil
Google, Google Inc, Google Inc. => Google
AIG, A.I.G, A.I.G., American International Group => American International
Group, A.I.G
Goldman Sachs, Goldman Sachs Inc., Goldman Sachs Group Inc, Goldman Sachs
Group Inc. => Goldman Sachs
GE, General Electric Co., General Electric Co, G.E, G.E., General Electric
=> G.E, General Electric
General Dynamics, General Dynamics Corp,General Dynamics Corp., General
Dynamics Information Technology, General Dynamics Advanced Information
Systems => General Dynamics
HP, Hewlett Packard Co,Hewlett Packard Co., Hewlett Packard,
Hewlett-Packard, Hewlett-Packard Corp,H.P, H.P. => Hewlett Packard, H.P
IBM, International Business Machines, I.B.M, International Business Machines
Corp => I.B.M
Johns Hopkins University, Johns Hopkins, JHU, J.H.U, J.H.U. => Johns Hopkins
University, JHU, J.H.U
J.C. Penney, J.C. Penney Co. => J.C. Penney
JPMorgan Chase, JPMorgan Chase & Co., JPMorgan Chase & Co, JPMorgan =>
JPMorgan Chase & Co.
Lockheed Martin, Lockheed Martin Corp, Lockheed Martin Corp., Lockheed,
Lockheed VH => Lockheed Martin
Merrill Lynch, Merrill Lynch & Co., Merrill, Merrill. => Merrill Lynch
Microsoft, Microsoft Corp., Microsoft Corp, Microsoft. => Microsoft
Northrop Grumman, Northrop Grumman Corp., Northrop Grumman Corp, Northrop,
Northrop Corp. => Northrop Grumman
Smyth Co., Smyth Co
Sony, Sony Corp., Sony Corp => Sony Corp.
TJX Companies, TJX, TJX Cos. => TJX Companies
Target Corp., Target Corp, Target Corp stores => Target Corp.
Walmart, WalMart Inc, WalMart Stores, WalMart Stores Inc, WalMart Stores
Inc. => WalMart Inc.
Yahoo, Yahoo Inc co, Yahoo Inc. => Yahoo Inc.
AP, AP., A.P, A.P., Associated Press => Associated Press

#Countries
USA,USA.,U.S.A.,u.s.a,u.s.a.,U.S,U.S.,US,US.,u.s, u.s.,United States,United
States of America,United States Of America,united states,united states of
america,

Re: DataImportHandler problem: Feeding the XPathEntityProcessor with the FieldReaderDataSource

2009-10-06 Thread Noble Paul നോബിള്‍ नोब्ळ्
Hi Lance. db.blob is the correct field name, so that is fine.
You can probably open an issue and provide the test case as a patch.
That can help us track this better.

On Wed, Oct 7, 2009 at 12:45 AM, Lance Norskog  wrote:
> A side note that might help: if I change the dataField from 'db.blob'
> to 'blob', this DIH stack emits no documents.
>
> On 10/5/09, Lance Norskog  wrote:
>> I've added a unit test for the problem down below. It feeds document
>> field data into the XPathEntityProcessor via the
>> FieldReaderDataSource, and the XPath EP does not emit unpacked fields.
>>
>> Running this under the debugger, I can see the supplied StringReader,
>> with the XML string, being piped into the XPath EP. But somehow the
>> XPath EP does not pick it apart the right way.
>>
>> Here is the DIH configuration file separately.
>>
>> 
>>   
>>   
>>   
>>       
>>         
>>         
>>         
>>         > processor='XPathEntityProcessor'
>>               forEach='/names' dataField='db.blob'>
>>           
>>         
>>       
>>   
>> 
>>
>> Any ideas?
>>
>> ---
>>
>> package org.apache.solr.handler.dataimport;
>>
>> import static
>> org.apache.solr.handler.dataimport.AbstractDataImportHandlerTest.createMap;
>> import junit.framework.TestCase;
>>
>> import java.util.ArrayList;
>> import java.util.HashMap;
>> import java.util.List;
>> import java.util.Map;
>>
>> import org.apache.solr.common.SolrInputDocument;
>> import org.apache.solr.common.SolrInputField;
>> import org.apache.solr.handler.dataimport.TestDocBuilder.SolrWriterImpl;
>> import org.junit.Test;
>>
>> /*
>>  * Demonstrate problem feeding XPathEntity from a FieldReaderDatasource
>>  */
>>
>> public class TestFieldReaderXPath extends TestCase {
>>       static final String KISSINGER = "Henry";
>>
>>       static final String[][][] DBDOCS = {
>>               {{"dbid", "1"}, {"blob", KISSINGER}},
>>       };
>>
>>       /*
>>        * Receive a row from SQL and fetch a row from Solr - no value matching
>>        * stolen from TestDocBuilder
>>        * */
>>
>>       @Test
>>       public void testSolrEmbedded() throws Exception {
>>               try {
>>                       DataImporter di = new DataImporter();
>>                       di.loadDataConfig(dih_config_FR_into_XP);
>>                       DataImporter.RequestParams rp = new 
>> DataImporter.RequestParams();
>>                       rp.command = "full-import";
>>                       rp.requestParams = new HashMap();
>>
>>                       DataConfig cfg = di.getConfig();
>>                       DataConfig.Entity entity = 
>> cfg.document.entities.get(0);
>>                       List<Map<String, Object>> l = new
>> ArrayList<Map<String, Object>>();
>>                       addDBDocuments(l);
>>                       MockDataSource.setIterator("select * from x", 
>> l.iterator());
>>                       entity.dataSrc = new MockDataSource();
>>                       entity.isDocRoot = true;
>>                       SolrWriterImpl swi = new SolrWriterImpl();
>>                       di.runCmd(rp, swi);
>>
>>                       assertEquals(1, swi.docs.size());
>>                       SolrInputDocument doc = swi.docs.get(0);
>>                       SolrInputField field;
>>                       field = doc.getField("dbid");
>>                       assertEquals(field.getValue().toString(), "1");
>>                       field = doc.getField("blob");
>>                       assertEquals(field.getValue().toString(), KISSINGER);
>>                       field = doc.getField("name");
>>                       assertNotNull(field);
>>                       assertEquals(field.getValue().toString(), "Henry");
>>               } finally {
>>                       MockDataSource.clearCache();
>>               }
>>       }
>>
>>
>>       private void addDBDocuments(List<Map<String, Object>> l) {
>>               for(String[][] dbdoc: DBDOCS) {
>>                       l.add(createMap(dbdoc[0][0], dbdoc[0][1], dbdoc[1][0], 
>> dbdoc[1][1]));
>>               }
>>       }
>>
>>        String dih_config_FR_into_XP = "\r\n" +
>>        "  \r\n" +
>>        "  \r\n" +
>>        "  \r\n" +
>>        "      > dataSource='db'>\r\n" +
>>        "        \r\n" +
>>        "        \r\n" +
>>        "        \r\n" +
>>        "        > processor='XPathEntityProcessor'\r\n" +
>>        "                      forEach='/names' dataField='db.blob'>\r\n" +
>>        "          \r\n" +
>>        "        \r\n" +
>>        "      \r\n" +
>>        "  \r\n" +
>>        "\r\n"
>>        ;
>>
>>
>> }
>>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>



-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: Problems with DIH XPath flatten

2009-10-06 Thread Noble Paul നോബിള്‍ नोब्ळ्
send a small sample xml snippet you are trying to index and it may help

On Tue, Oct 6, 2009 at 9:29 PM, Adam Foltzer  wrote:
> Hi all,
>
> I'm trying to set up DataImportHandler to index some XML documents available
> over web services. The XML includes both content and metadata, so for the
> indexable content, I'm trying to just index everything under the content
> tag:
>
>         url="resturl" processor="XPathEntityProcessor"
>        forEach="/document" transformer="HTMLStripTransformer"
> flatten="true">
>  flatten="true" stripHTML="true" />
> 
> 
>
> The result of this is that the title field gets populated and indexed (there
> are no child nodes of /document/kbml/kbq), but content does not get indexed
> at all. Since /document/kbml/body has many children, I expected that
> flatten="true" would store all of the body text in the field. Instead, it
> stores nothing at all. I've tried this with many combinations of
> transformers and flatten options, and the result is the same each time.
>
> Here are the relevant field declarations from the schema (the type="text" is
> just the one from the example's schema.xml). I have tried combinations here
> as well of stored= and multiValued=, with the same result each time.
>
>  multiValued="true" />
>  multiValued="true" />
>
> If it would help troubleshooting, I could send along some sample XML. I
> don't want to spam the list with an attachment unless it's necessary, though
> :)
>
> Thanks in advance for your help,
>
> Adam Foltzer
>



-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: search by some functionality

2009-10-06 Thread David Smiley @MITRE.org

Maybe I'm missing something, but function queries aren't involved in
determining whether a document matches or not, only its score.  How is a
custom function / value-source going to filter?

~ David Smiley


hossman wrote:
> 
> 
> : I read about this chapter before. It did not mention how to create my
> : own customized function.
> : Can you point me to some instructions?
> 
> The first step is to figure out how you can code your custom functionality 
> as an extension of the ValueSource class...
> 
> http://lucene.apache.org/solr/api/org/apache/solr/search/function/ValueSource.html
> 
> ...which has to be able to generate a DocValues implementation for an 
> IndexReader.  DocValues is where you would compute some numeric score for each 
> document based on whatever criteria you wanted (including arguments passed 
> in when your ValueSource is constructed, like field names and constants)
> 
> Then you need a simple ValueSourceParser class to be able to specify when 
> to use your ValueSource, and that's what you register in solrconfig.xml...
> 
> http://wiki.apache.org/solr/SolrPlugins#ValueSourceParser
> 
> 
> 
> -Hoss
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/search-by-some-functionality-tp25721533p25779702.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Importing CSV file slow/crashes

2009-10-06 Thread Nasseam Elkarra

Hello Yonik,

Thank you for looking into this. Your question about whether I'm using stock
Solr put me in the right direction. I am in fact using a patched version of
Solr to get hierarchical facet support
(http://issues.apache.org/jira/browse/SOLR-64). I took out the 4 hiefacet
fields from the schema and the import was back to normal times of less than a
minute. This same configuration worked fine with the 5/1 patched build.


Here is the field definition:
omitNorms="true" positionIncrementGap="0" indexed="true"  
stored="false" delimiter="/" />



multiValued="true"/>
stored="true" multiValued="true"/>



stored="false" multiValued="true"/>
stored="false" multiValued="true"/>





CSV file snippet:
category,category_seo
"T-Shirt Mens/Crew Neck/","t-shirt-mens/crew-neck/"

Thanks again!
Nasseam

On Oct 6, 2009, at 3:22 PM, Yonik Seeley wrote:

On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra  
 wrote:
I had a dev build of 1.4 from 5/1/2009 and importing a 20K row file took less
than a minute. Updating to the latest as of yesterday, the import is really
slow and I had to cancel it after a half hour. This prevented me from
upgrading a few months ago as well.


I haven't had any success at replicating this problem.

I just tried a 100K row CSV file, consisting of an id and a few text
fields.  The total size of the file is 79MB.

On trunk (today): 22 seconds to index, another 5-7 seconds to commit
5/21 version: 28 seconds to index, another 8 seconds to commit

Then I modified the 5/1 schema to closer match the trunk schema
(removing defaults, copyfields that could slow things down).
Modified 5/1 version: 25 seconds to index, another 8 seconds to commit

I only did 2 runs with trunk and 2 with one from 5/1, so the accuracy
is probably low... but good enough to see there wasn't a problem in
this test.

We really need more info to help reproduce this.
Are you using stock solr?  Do you have any custom plugins, analyzers,
token filters, etc?

You're going to need to provide something so others can reproduce  
this.


-Yonik
http://www.lucidimagination.com




RE: Need "OR" in DisMax Query

2009-10-06 Thread Dean Missikowski (Consultant), CLSA
Hi David,

See this thread for how I use OR with Dismax.
http://www.mail-archive.com/solr-user@lucene.apache.org/msg19375.html

-- Dean

-Original Message-
From: Ingo Renner [mailto:i...@typo3.org] 
Sent: 06 October 2009 05:00
To: solr-user@lucene.apache.org
Subject: Re: Need "OR" in DisMax Query


Am 05.10.2009 um 20:36 schrieb David Giffin:

Hi David,

> Maybe I'm missing something, but I can't seem to get the dismax
> request handler to perform an OR query. It appears that OR is removed
> by the stop words.

It's not the stop words; DisMax simply doesn't do any boolean operations.
The only thing you can do is use +searchWord and -searchWord, or change to
the standard request handler.


best
Ingo

-- 
Ingo Renner
TYPO3 Core Developer, Release Manager TYPO3 4.2

Apache Solr for TYPO3: http://www.typo3-solr.com



CLSA CLEAN & GREEN: Please consider our environment before printing this email. 
The content of this communication is subject to CLSA Legal and Regulatory 
Notices
These can be viewed at https://www.clsa.com/disclaimer.html or sent to you upon 
request.



Re: Authentication/Authorization with Master-Slave over HTTP

2009-10-06 Thread Chris Hostetter

: I want to be able to have SOLR Slave instance on publicly available host
: (accessible via HTTP), and synchronize with Master securely (via HTTP)

HTTP based replication only works with the new ReplicationHandler ... 
if you set up a proxy in front of your Master (either as a separate daemon, 
or using a custom ServletFilter, or by using special settings in your 
ServletContainer) that can require HTTP Basic Authentication, you can 
then configure the slave to use an arbitrary username/password of your 
choice (look for httpBasicAuthUser/httpBasicAuthPassword in the 
example slave configs)


-Hoss



Re: Why isn't the DateField implementation of ISO 8601 broader?

2009-10-06 Thread Chris Hostetter

: I would expect field:2001-03 to be a hit on a partial match such as
: field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z].  I suppose that my
: expectation would be that field:2001-03 would be counted once per day for each
: day in its range. It would follow that a user looking for documents relating

...meanwhile someone else might expect it not to match unless the ambiguous 
date is entirely contained within the range being queried on.

(your implication of counting once per day would have pretty weird results 
on faceting by the way)

with unambiguous dates, you can have exactly what you want just by being a 
little more verbose when indexing/querying (and someone else can have 
exactly what they want by being equally verbose using slightly different 
options/queries)

in your case: i would suggest that you use two fields: date_low and 
date_high ... when you have an exact date (down to the smallest level of 
granularity you care about) you put the same value in both fields, when 
you have an ambiguous value (like 2001-03) you put the largest value 
possible in date_high and the lowest value possible in date_low (ie: 
date_low:2001-03-01T00:00:00Z & date_high:2001-03-31T23:59:59.999Z) then a 
query for anything *overlapping* the range from feb28 to march 13 would 
be...

+date_low:[* TO 2001-03-13T00:00:00Z] +date_high:[2001-02-28T00:00:00Z TO *]

...it works for ambiguous dates, and it works for exact dates.

(someone else who only wants to see matches if the ranges *completely* 
overlap would just swap which end point they queried against which field)
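
For illustration, a minimal SolrJ sketch of the two-field approach described
above (the core URL, document id, and CommonsHttpSolrServer usage here are
assumptions, not part of the original suggestion):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrInputDocument;

  public class OverlappingDateExample {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

      // index time: an ambiguous "2001-03" gets the widest possible range
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "issue-2001-03");
      doc.addField("date_low",  "2001-03-01T00:00:00Z");     // earliest possible instant
      doc.addField("date_high", "2001-03-31T23:59:59.999Z"); // latest possible instant
      server.add(doc);
      server.commit();

      // query time: anything overlapping the range from Feb 28 to Mar 13
      SolrQuery q = new SolrQuery(
          "+date_low:[* TO 2001-03-13T00:00:00Z] +date_high:[2001-02-28T00:00:00Z TO *]");
      QueryResponse rsp = server.query(q);
      System.out.println("overlapping docs: " + rsp.getResults().getNumFound());
    }
  }

An exact date would simply put the same instant in both fields, and the same
query keeps working.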


-Hoss



Re: TermsComponent or auto-suggest with filter

2009-10-06 Thread R. Tan
Nice. In comparison, how do you do it with faceting?

> "Two other approaches are to use either the TermsComponent (new in Solr
> 1.4) or faceting."



On Wed, Oct 7, 2009 at 1:51 AM, Jay Hill  wrote:

> Have a look at a blog I posted on how to use EdgeNGrams to build an
> auto-suggest tool:
>
> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>
> You could easily add filter queries to this approach. For example, the
> query used in the blog could add filter queries like this:
>
> http://localhost:8983/solr/select/?q=user_query:
> "i"&wt=json&fl=user_query&indent=on&echoParams=none&rows=10&sort=count
> desc&fq=yourField:yourQuery&fq=anotherField:anotherQuery
>
> -Jay
> http://www.lucidimagination.com
>
>
>
>
> On Tue, Oct 6, 2009 at 4:40 AM, R. Tan  wrote:
>
> > Hello,
> > What's the best way to get auto-suggested terms/keywords that is filtered
> > by
> > one or more fields? TermsComponent should have been the solution but
> > filters
> > are not supported.
> >
> > Thanks,
> > Rihaed
> >
>


Re: Weird Facet and KeywordTokenizerFactory Issue

2009-10-06 Thread Chris Hostetter

A few comments about the info you've provided...

when you cut/pasted the facet output, you excluded the field names.  based 
on the schema & solrconfig.xml snippets you posted later, i'm assuming 
they are usstate, and keyword, but you have to be explicit so that people can 
help correlate the 
results you are getting with the schema you posted -- for example, you 
haven't posted anything that would verify that the usstate field actually 
uses your keywordText field, for all we know it has a different field type 
by mistake (which would explain your problem). ... you have to post 
everything that would let us connect the dots from input to output in 
order to see where things might be going wrong.

A huge gap is in what your synonym files contain ... something weird in 
there could easily explain superfluous terms getting added to your data.

all that said: my best guess is that you have old data in your index from 
an older version of your schema when you had different analyzers 
configured.

if a term is showing up in the facet counts, you can search on it -- find 
the first doc that matches, verify that the term isn't actually in the 
data, and then reindex that one doc -- if it stops matching your search 
(and the facet count drops by one) then i'm right, just reindex 
everything.

(this is where a timestamp field recording exactly when each doc was added 
to the index comes in handy, you can compare it with the file modification 
time on your schema.xml and be certain which docs were indexed prior to 
your changes)



-Hoss



Re: search by some functionality

2009-10-06 Thread Chris Hostetter

: I read about this chapter before. It did not mention how to create my
: own customized function.
: Can you point me to some instructions?

The first step is to figure out how you can code your custom functionality 
as an extension of the ValueSource class...

http://lucene.apache.org/solr/api/org/apache/solr/search/function/ValueSource.html

...which has to be able to generate a DocValues implementation for an 
IndexReader.  DocValues is where you would compute some numeric score for each 
document based on whatever criteria you wanted (including arguments passed 
in when your ValueSource is constructed, like field names and constants)

Then you need a simple ValueSourceParser class to be able to specify when 
to use your ValueSource, and that's what you register in solrconfig.xml...

http://wiki.apache.org/solr/SolrPlugins#ValueSourceParser
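
For illustration, here is a rough, untested sketch of that wiring. All the
class, field, and function names below are made up, and the exact abstract
method signatures (in particular whether getValues takes a context Map in
addition to the IndexReader) should be checked against the ValueSource javadoc
linked above for your Solr version:

  package com.example.solr;

  import java.io.IOException;
  import java.util.Map;

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.queryParser.ParseException;
  import org.apache.lucene.search.FieldCache;
  import org.apache.solr.search.FunctionQParser;
  import org.apache.solr.search.ValueSourceParser;
  import org.apache.solr.search.function.DocValues;
  import org.apache.solr.search.function.ValueSource;

  // Hypothetical plugin: scores documents with a made-up "agescore" function.
  // Registered in solrconfig.xml roughly as:
  //   <valueSourceParser name="agescore" class="com.example.solr.AgeScoreParser"/>
  // and then referenced like any other function (e.g. in a bf param).
  public class AgeScoreParser extends ValueSourceParser {
    public ValueSource parse(FunctionQParser fp) throws ParseException {
      // For brevity this sketch hard-codes the field name; a real parser
      // would read its arguments from fp instead.
      return new AgeScoreValueSource("age");
    }
  }

  class AgeScoreValueSource extends ValueSource {
    private final String field;

    AgeScoreValueSource(String field) { this.field = field; }

    public String description() { return "agescore(" + field + ")"; }

    // NOTE: some releases declare this as getValues(IndexReader) only.
    public DocValues getValues(Map context, IndexReader reader) throws IOException {
      // Pull the per-document values once per reader, then derive a score.
      final int[] ages = FieldCache.DEFAULT.getInts(reader, field);
      return new DocValues() {
        public float floatVal(int doc)   { return 100.0f / (1.0f + ages[doc]); }
        public int intVal(int doc)       { return (int) floatVal(doc); }
        public long longVal(int doc)     { return (long) floatVal(doc); }
        public double doubleVal(int doc) { return floatVal(doc); }
        public String strVal(int doc)    { return Float.toString(floatVal(doc)); }
        public String toString(int doc)  { return description() + "=" + floatVal(doc); }
      };
    }

    public boolean equals(Object o) {
      return o instanceof AgeScoreValueSource
          && field.equals(((AgeScoreValueSource) o).field);
    }

    public int hashCode() { return field.hashCode(); }
  }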



-Hoss



Re: Why isn't the DateField implementation of ISO 8601 broader?

2009-10-06 Thread Walter Lewis

On 6 Oct 09, at 5:31 PM, Chris Hostetter wrote:

...your expectations may be different than everyone else's.  by requiring
that the dates be explicit there is no ambiguity, you are in control of
the behavior.


The power of some of the other formulas in ISO 8601 is that you don't  
introduce false levels of precision.  The "October 2009" issue of a  
magazine is precisely tagged as "200910" or "2009-10" .  It doesn't  
have a day, hour or minute.  Most books come with a copyright year: no  
month, no day ...


In the library/book/periodical world these are a common set of  
expectations.


Walter







Re: Why isn't the DateField implementation of ISO 8601 broader?

2009-10-06 Thread Tricia Williams
Thanks for making me think about this a little bit deeper, Hoss.  
Comments in-line.


Chris Hostetter wrote:
because those would be ambiguous.  if you just indexed field:2001-03 would 
you expect it to match field:[2001-02-28T00:00:00Z TO 
2001-03-13T00:00:00Z] ... what about date faceting, what should the 
counts be if you facet per day?
  


I would expect field:2001-03 to be a hit on a partial match such as 
field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z].  I suppose that my 
expectation would be that field:2001-03 would be counted once per day 
for each day in its range. It would follow that a user looking for 
documents relating to 1919 might also be interested in 1910.  But 
conversely a user looking for documents relating to 1919 might really 
only want documents specifically related to 1919.  Maybe the 
implementation would be smart (or configurable) about precision so that 
it wouldn't be counted when the precision asked to be represented by 
facets had more significant figures than the indexed/stored value.  
Maybe there would be another facet category at each precision for 
"others" -- the documents that have less precision than the current date 
facet precision.  I'm envisioning a hierarchical system that starts 
general with century with click-throughs drilling down eventually to days.


...your expectations may be different than everyone else's.  by requiring 
that the dates be explicit there is no ambiguity, you are in control of 
the behavior.
  


I can see your point, but surely there are others out there with non-explicit
date data? Does my use case make sense to anyone else?


you can always just index the first date of whatever block of time (month, 
year, century, etc..) and then facet normally.


  
Until a better solution presents itself we've gone the route of creating 
more fields for faceting on different blocks of time.  So fields for 
century, decade, year, month, and day will let us facet on each of these 
time periods as needed.  Documents with dates with less precision will 
not show up in date facets with more precision.  I was hoping there was 
an elegant hack for faceting on prefix of a defined number of characters 
(prefix=*, prefix=**, prefix=***, ...) without having to explicitly 
specify ..., prefix=188, prefix=189, prefix=190, prefix=191, ...


Regards,
Tricia


Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-06 Thread Chris Hostetter

:  I ll try to explain with an example. Given the term 'it!' in the title, it
: should match both 'it' and 'it!' in the query as an exact match. Currently,
: this is done by using a synonym entry  (and index time SynonymFilter) as
: follows:
: 
:  it! => it, it!
: 
:  Now, the above holds true for all cases where you have a title token of the
: form [aA-zZ]*!. Handling all of those cases requires adding synonyms
: manually for each case which is not easy to manage and does not scale.
: 
:  I am hoping to do the same by using a index time filter that takes in a
: pattern like the PatternReplace filter and adds the newly created token
: instead of replacing the original one. Does this make sense? Am I missing
: something that would break this approach?

something like this would be fairly easy to implement in Lucene, but 
somewhat confusing to try and configure in Solr.  I was going to suggest 
that you use something like...
 

..and then have a subsequent filter that splits the tokens on the 
whitespace (or any other special character you could use in the 
replacement) ... but apparently we don't have any built in filters that 
will just split tokens on a character/pattern for you.  that would also be 
fairly easy to write if someone wants to submit a patch.
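
For what it's worth, an untested sketch of such a filter against the Lucene 2.9
attribute API might look like the following (the class name and delimiter
handling are assumptions; a small factory extending BaseTokenFilterFactory
would still be needed to use it from schema.xml):

  package com.example.analysis;

  import java.io.IOException;
  import java.util.LinkedList;
  import java.util.regex.Pattern;

  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;

  // Splits each incoming token on a delimiter character and emits the pieces
  // as additional tokens at the same position (position increment 0).
  public final class SplitOnDelimiterFilter extends TokenFilter {
    private final String delimiterRegex;
    private final LinkedList<String> pending = new LinkedList<String>();

    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt =
        addAttribute(PositionIncrementAttribute.class);

    public SplitOnDelimiterFilter(TokenStream input, char delimiter) {
      super(input);
      this.delimiterRegex = Pattern.quote(String.valueOf(delimiter));
    }

    public boolean incrementToken() throws IOException {
      if (!pending.isEmpty()) {
        // emit a leftover piece at the same position as the original token
        termAtt.setTermBuffer(pending.removeFirst());
        posIncrAtt.setPositionIncrement(0);
        return true;
      }
      if (!input.incrementToken()) {
        return false;
      }
      LinkedList<String> parts = new LinkedList<String>();
      for (String p : termAtt.term().split(delimiterRegex)) {
        if (p.length() > 0) {
          parts.add(p);
        }
      }
      if (!parts.isEmpty()) {
        termAtt.setTermBuffer(parts.removeFirst()); // keep the first piece now
        pending.addAll(parts);                      // queue the rest
      }
      return true;
    }

    public void reset() throws IOException {
      super.reset();
      pending.clear();
    }
  }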


-Hoss



Re: Importing CSV file slow/crashes

2009-10-06 Thread Yonik Seeley
On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra  wrote:
> I had a dev build of 1.4 from 5/1/2009 and importing a 20K row file took less
> than a minute. Updating to the latest as of yesterday, the import is really
> slow and I had to cancel it after a half hour. This prevented me from
> upgrading a few months ago as well.

I haven't had any success at replicating this problem.

I just tried a 100K row CSV file, consisting of an id and a few text
fields.  The total size of the file is 79MB.

On trunk (today): 22 seconds to index, another 5-7 seconds to commit
5/21 version: 28 seconds to index, another 8 seconds to commit

Then I modified the 5/1 schema to closer match the trunk schema
(removing defaults, copyfields that could slow things down).
Modified 5/1 version: 25 seconds to index, another 8 seconds to commit

I only did 2 runs with trunk and 2 with one from 5/1, so the accuracy
is probably low... but good enough to see there wasn't a problem in
this test.

We really need more info to help reproduce this.
Are you using stock solr?  Do you have any custom plugins, analyzers,
token filters, etc?

You're going to need to provide something so others can reproduce this.

-Yonik
http://www.lucidimagination.com


Re: conditional sorting

2009-10-06 Thread Chris Hostetter


: I tried to simplify the problem, but the point is that I could have 
: really: complex requirements. For instance, "if in the first 5 results 
: none are older than one year, use sort by X, otherwise sort by Y".

First 5 in what order? X? Y or something else?

: So, the question is, is there a way to make Solr recognize complex
: situations and apply different sorting criterion.

your question may seem simple to you, but unless you codify all the 
examples of what you consider a "complex situation" and how you expect 
those to be specified at run time, it's pretty much impossible to give you 
an answer as to what the best way to achieve your goal is.

the simplest answer based on the information available: if you can express 
your requirements in java, and put them in a custom Search Component, then 
Solr can do it.


-Hoss



Re: Weird Facet and KeywordTokenizerFactory Issue

2009-10-06 Thread Christian Zambrano

Got it. Sorry for not having an answer for your problem.

On 10/06/2009 04:58 PM, Ravi Kiran wrote:

You don't see any facet fields in my query because I have configured them in
the solrconfig.xml to give specific fields as facets by default in the
dismax and standard handlers, so that I don't have to specify all those fields
individually every time I query; all I need to do is just set facet=true,
that's all

   
 
  dismax
  explicit
  0.01
  
 systemid^20.0 headline^20.0 keyword^18.0 person^18.0
organization^18.0 usstate^18.0 country^18.0 subject^18.0 quote^18.0
blurb^15.0 articlesubhead^8.0 byline^7.0 articleblurb^2.0 body^1.5
multimediablurb^1.5
  
  
 headline^20.5 keyword^18.5 person^18.5 organization^18.5
usstate^18.5 country^18.5 subject^18.5 quote^18.5 blurb^15.5
articlesubhead^8.5 byline^7.5 articleblurb^2.5 body^2.0 multimediablurb^2.0
  
  
 recip(rord(pubdatetime),1,1000,1000)^1.0
  
  
 *
  
  
 2<-1 5<-3 6<90%
  
  100
  *:*
  
  keyword
  
  0
  
  keyword
  regex  
  false
  1
  5
  5
  5
  5
  5
  5
  contenttype
  keyword
  keywordlower
  keywordformatted
  person
  personformatted
  organization
  usstate
  country
  subject
 
   


On Tue, Oct 6, 2009 at 5:45 PM, Christian Zambranowrote:

   

I am stumped then. I had a similar issue when I was using a field that was
being heavily tokenized, but I corrected the issue by using a
field(generated using copyField) that doesn't get analyzed at all.

On the query you provided before I didn't see the parameters to tell solr
for which field it should produce facets.

Something like:


http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&*facet.field=location*




On 10/06/2009 04:09 PM, Ravi Kiran wrote:

 

Yes Exactly the same

On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano   

wrote:
 



   

And you had the analyzer for that field set-up the same way as shown on
your previous e-mail when you indexed the data?




On 10/06/2009 03:46 PM, Ravi Kiran wrote:



 

I did infact check it out any there is no weirdness in analysis
page...see
result below

Index Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {}  term
position 1 term text New York term type word source start,end 0,8
payload
  org.apache.solr.analysis.TrimFilterFactory {}  term position 1 term
text
New
York term type word source start,end 0,8 payload
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt,
ignoreCase=true, enablePositionIncrements=true}  term position 1 term
text
New
York term type word source start,end 0,8 payload
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
expand=false, ignoreCase=true}  term position 1 term text New York term
type
word source start,end 0,8 payload
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}  term
position 1 term text New York term type word source start,end 0,8
payload
  Query Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {}
  term
position 1 term text New York term type word source start,end 0,8
payload
  org.apache.solr.analysis.TrimFilterFactory {}  term position 1 term
text
New
York term type word source start,end 0,8 payload
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt,
ignoreCase=true, enablePositionIncrements=true}  term position 1 term
text
New
York term type word source start,end 0,8 payload
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
expand=false, ignoreCase=true}  term position 1 term text New York term
type
word source start,end 0,8 payload
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term
position 1 term text New York term type word source start,end 0,8
payload


On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano   

wrote:


 




   

Have you tried using the Analysis page to see what tokens are generated
for
the string "New York"? It could be one of the token filter is adding
the
token 'new' for all strings that start with 'new'


On 10/06/2009 02:54 PM, Ravi Kiran wrote:





 

Hello All,
   Iam getting some ghost facets in solr 1.4. Can anybody
kindly
help me understand why I get them and how to eliminate them. My
schema.xml
snippet is given at the end. Iam indexing Named Entities extracted via
OpenNLP into solr. My understanding regarding KeywordTokenizerFactory
is
that it will use all words as a single token, am I right ? for
example:
"New
York" will be indexed as 'New York' and will not be split right???
However
I
see then splitup in facets as follows when running the query "



http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
"...but
when I search with standard handler qt=standard&q=keyword:"New" I dont
find
any doc which has 

Re: Weird Facet and KeywordTokenizerFactory Issue

2009-10-06 Thread Ravi Kiran
You don't see any facet fields in my query because I have configured them in
the solrconfig.xml to give specific fields as facets by default in the
dismax and standard handlers, so that I don't have to specify all those fields
individually every time I query; all I need to do is just set facet=true,
that's all

  

 dismax
 explicit
 0.01
 
systemid^20.0 headline^20.0 keyword^18.0 person^18.0
organization^18.0 usstate^18.0 country^18.0 subject^18.0 quote^18.0
blurb^15.0 articlesubhead^8.0 byline^7.0 articleblurb^2.0 body^1.5
multimediablurb^1.5
 
 
headline^20.5 keyword^18.5 person^18.5 organization^18.5
usstate^18.5 country^18.5 subject^18.5 quote^18.5 blurb^15.5
articlesubhead^8.5 byline^7.5 articleblurb^2.5 body^2.0 multimediablurb^2.0
 
 
recip(rord(pubdatetime),1,1000,1000)^1.0
 
 
*
 
 
2<-1 5<-3 6<90%
 
 100
 *:*
 
 keyword
 
 0
 
 keyword
 regex 
 false
 1
 5
 5
 5
 5
 5
 5
 contenttype
 keyword
 keywordlower
 keywordformatted
 person
 personformatted
 organization
 usstate
 country
 subject

  


On Tue, Oct 6, 2009 at 5:45 PM, Christian Zambrano wrote:

> I am stumped then. I had a similar issue when I was using a field that was
> being heavily tokenized, but I corrected the issue by using a
> field(generated using copyField) that doesn't get analyzed at all.
>
> On the query you provided before I didn't see the parameters to tell solr
> for which field it should produce facets.
>
> Something like:
>
>
> http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&*facet.field=location*
>
>
>
>
> On 10/06/2009 04:09 PM, Ravi Kiran wrote:
>
>> Yes Exactly the same
>>
>> On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano> >wrote:
>>
>>
>>
>>> And you had the analyzer for that field set-up the same way as shown on
>>> your previous e-mail when you indexed the data?
>>>
>>>
>>>
>>>
>>> On 10/06/2009 03:46 PM, Ravi Kiran wrote:
>>>
>>>
>>>
 I did infact check it out any there is no weirdness in analysis
 page...see
 result below

 Index Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {}  term
 position 1 term text New York term type word source start,end 0,8
 payload
  org.apache.solr.analysis.TrimFilterFactory {}  term position 1 term
 text
 New
 York term type word source start,end 0,8 payload
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt,
 ignoreCase=true, enablePositionIncrements=true}  term position 1 term
 text
 New
 York term type word source start,end 0,8 payload
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
 expand=false, ignoreCase=true}  term position 1 term text New York term
 type
 word source start,end 0,8 payload
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}  term
 position 1 term text New York term type word source start,end 0,8
 payload
  Query Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {}
  term
 position 1 term text New York term type word source start,end 0,8
 payload
  org.apache.solr.analysis.TrimFilterFactory {}  term position 1 term
 text
 New
 York term type word source start,end 0,8 payload
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt,
 ignoreCase=true, enablePositionIncrements=true}  term position 1 term
 text
 New
 York term type word source start,end 0,8 payload
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
 expand=false, ignoreCase=true}  term position 1 term text New York term
 type
 word source start,end 0,8 payload
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term
 position 1 term text New York term type word source start,end 0,8
 payload


 On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano>>>

> wrote:
>
>




> Have you tried using the Analysis page to see what tokens are generated
> for
> the string "New York"? It could be one of the token filter is adding
> the
> token 'new' for all strings that start with 'new'
>
>
> On 10/06/2009 02:54 PM, Ravi Kiran wrote:
>
>
>
>
>
>> Hello All,
>>   Iam getting some ghost facets in solr 1.4. Can anybody
>> kindly
>> help me understand why I get them and how to eliminate them. My
>> schema.xml
>> snippet is given at the end. Iam indexing Named Entities extracted via
>> OpenNLP into solr. My understanding regarding KeywordTokenizerFactory
>> is
>> that it will use all words as a single token, am I right ? for
>> example:
>> "New
>> York" will be indexed as 'New York' and will not be split right???
>> However
>> I
>>

Re: What to set in query.setMaxRows()?

2009-10-06 Thread Chris Hostetter

: Sorry about asking this here, but I can't reach wiki.apache.org right now.
:  What do I set in query.setMaxRows() to get all the rows?


http://wiki.apache.org/solr/FAQ#How_can_I_get_ALL_the_matching_documents_back.3F_..._How_can_I_return_an_unlimited_number_of_rows.3F

How can I get ALL the matching documents back? ... How can I return an 
unlimited number of rows?

This is impractical in most cases. People typically only want to do this 
when they know they are dealing with an index whose size guarantees the 
result sets will be always be small enough that they can feasibly be 
transmitted in a manageable amount -- but if that's the case just specify 
what you consider a "manageable amount" as your rows param and get the 
best of both worlds (all the results when your assumption is right, and a 
sanity cap on the result size if it turns out your assumptions are wrong)
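
For example, with SolrJ the rows parameter would be capped along these lines
(SolrQuery rather than setMaxRows, and the 1000 below is just a stand-in for
whatever you consider a manageable amount):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class CappedRowsExample {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

      SolrQuery query = new SolrQuery("*:*");
      query.setStart(0);
      query.setRows(1000); // sanity cap instead of asking for "all" rows

      QueryResponse rsp = server.query(query);
      System.out.println("total matches: " + rsp.getResults().getNumFound()
          + ", rows returned: " + rsp.getResults().size());
    }
  }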




-Hoss



Re: stats page slow in latest nightly

2009-10-06 Thread Chris Hostetter

: When I was working on it, I was actually going to default to not show
: the size, and make you click a link that added a param to get the sizes
: in the display too. But I foolishly didn't bring it up when Hoss made my
: life easier with his simpler patch.

we can always turn the size estimator off ... or turn it on only when 
doing the insanity checks (so normal stats are fast, but if anything is 
duplicated you'll get info on the size of the discrepancy)




-Hoss



Re: Weird Facet and KeywordTokenizerFactory Issue

2009-10-06 Thread Christian Zambrano
I am stumped then. I had a similar issue when I was using a field that 
was being heavily tokenized, but I corrected the issue by using a 
field (generated using copyField) that doesn't get analyzed at all.


On the query you provided before I didn't see the parameters to tell 
solr for which field it should produce facets.


Something like:

http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&*facet.field=location*



On 10/06/2009 04:09 PM, Ravi Kiran wrote:

Yes Exactly the same

On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambranowrote:

   

And you had the analyzer for that field set-up the same way as shown on
your previous e-mail when you indexed the data?




On 10/06/2009 03:46 PM, Ravi Kiran wrote:

 

I did infact check it out any there is no weirdness in analysis page...see
result below

Index Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {}  term
position 1 term text New York term type word source start,end 0,8 payload
  org.apache.solr.analysis.TrimFilterFactory {}  term position 1 term text
New
York term type word source start,end 0,8 payload
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt,
ignoreCase=true, enablePositionIncrements=true}  term position 1 term text
New
York term type word source start,end 0,8 payload
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
expand=false, ignoreCase=true}  term position 1 term text New York term
type
word source start,end 0,8 payload
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}  term
position 1 term text New York term type word source start,end 0,8 payload
  Query Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {}  term
position 1 term text New York term type word source start,end 0,8 payload
  org.apache.solr.analysis.TrimFilterFactory {}  term position 1 term text
New
York term type word source start,end 0,8 payload
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt,
ignoreCase=true, enablePositionIncrements=true}  term position 1 term text
New
York term type word source start,end 0,8 payload
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
expand=false, ignoreCase=true}  term position 1 term text New York term
type
word source start,end 0,8 payload
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term
position 1 term text New York term type word source start,end 0,8 payload


On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano   

wrote:
 



   

Have you tried using the Analysis page to see what tokens are generated
for
the string "New York"? It could be one of the token filter is adding the
token 'new' for all strings that start with 'new'


On 10/06/2009 02:54 PM, Ravi Kiran wrote:



 

Hello All,
   Iam getting some ghost facets in solr 1.4. Can anybody
kindly
help me understand why I get them and how to eliminate them. My
schema.xml
snippet is given at the end. Iam indexing Named Entities extracted via
OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is
that it will use all words as a single token, am I right ? for example:
"New
York" will be indexed as 'New York' and will not be split right???
However
I
see then splitup in facets as follows when running the query "


http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
"...but
when I search with standard handler qt=standard&q=keyword:"New" I dont
find
any doc which has just "New". After digging in a bit I found that if
several
keywords have a common starting word it is being pulled out as another
facet
like the following. Any help is greatly appreciated

Result

47   >Ghost
7
16
10
147
23
8
5
6
8
10
8
5
7

7   -->Ghost
5
5


7 -->Ghost
6
26
6

27
8
7
12

Schema.xml
-

 
   
 
 
 

 
 
   
   
 
 
 
 
 
   
 

 
 
 
 





   


 


   
 
   


Re: Why isn't the DateField implementation of ISO 8601 broader?

2009-10-06 Thread Chris Hostetter

:My question is why isn't the DateField implementation of ISO 8601 broader
: so that it could include  and MM as acceptable date strings?  What

because those would be ambiguous.  if you just indexed field:2001-03 would 
you expect it to match field:[2001-02-28T00:00:00Z TO 
2001-03-13T00:00:00Z] ... what about date faceting, what should the 
counts be if you facet per day?

...your expectations may be different than everyone else's.  by requiring 
that the dates be explicit there is no ambiguity, you are in control of 
the behavior.

: would it take to do so?  Are there any work-arounds for faceting by century,
: year, month without creating new fields in my schema?  The last resort would

you can always just index the first date of whatever block of time (month, 
year, century, etc..) and then facet normally.


-Hoss



Re: Merging multicore indexes

2009-10-06 Thread Shalin Shekhar Mangar
On Wed, Oct 7, 2009 at 2:40 AM, Paul Rosen wrote:

> Shalin Shekhar Mangar wrote:
>
>  The path on the wiki page was wrong. You need to use the adminPath in the
>> url. Look at the adminPath attribute in solr.xml. It is typically
>> /admin/cores
>>
>> So the correct path for you would be:
>>
>>
>> http://localhost:8983/solr/admin/cores?action=mergeindexes&core=merged&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_marc/index&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_rdf/index
>> <
>> http://localhost:8983/solr/merged/admin/?action=mergeindexes&core=merged&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_marc/index&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_rdf/index
>>
>> I've fixed the wiki too.
>>
>
> I think I've got it working. The only difference to the above is that it
> seems to want a relative path, so when I took off the
> "/Users/my/path/solr_1.4/" part I stopped getting errors.
>
>
There's no reason why it won't work with an absolute path. Can you post the
error? Also, did you correctly urlencode the parameters (if you are using
the browser to make such a request, perhaps the '/' character is causing a
problem)?
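
One quick way to rule out an encoding problem is to build the request with the
indexDir values explicitly URL-encoded, e.g. (a sketch reusing the absolute
paths from above):

  import java.net.URLEncoder;

  public class MergeIndexesUrl {
    public static void main(String[] args) throws Exception {
      String base = "http://localhost:8983/solr/admin/cores?action=mergeindexes&core=merged";
      String dir1 = "/Users/my/path/solr_1.4/solr/data/reindexed_marc/index";
      String dir2 = "/Users/my/path/solr_1.4/solr/data/reindexed_rdf/index";

      String url = base
          + "&indexDir=" + URLEncoder.encode(dir1, "UTF-8")
          + "&indexDir=" + URLEncoder.encode(dir2, "UTF-8");

      System.out.println(url); // fetch this with curl or any HTTP client
    }
  }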


> (Also, I had an insidious problem when using the interface to the browser
> in FF 3.5. It cached my results, so when I queried the core with "*:*" I got
> no results until I cleared my cache. - Hopefully that will save someone else
> a little time.)
>
>
Yeah, if you are not using any HTTP caches, you can turn it off by adding
the following in the  section:



-- 
Regards,
Shalin Shekhar Mangar.


Re: Solr Trunk Heap Space Issues

2009-10-06 Thread Mark Miller
Mark Miller wrote:
> Jeff Newburn wrote:
>   
>> So could that potentially explain our use of more ram on indexing? Or is
>> this a rare edge case.
>>   
>> 
> I think it could explain the JVM using more RAM while indexing - but it
> should be fairly easily recoverable from what I can tell - so no
> explanation on the OOM yet. Still looking at that one.
>
> Is your system basically stock, or do you have custom plugins in it?
>
>   
No matter what I try with however many cores, I can't duplicate your
problem.

-- 
- Mark

http://www.lucidimagination.com





Re: Problems with DIH XPath flatten

2009-10-06 Thread Adam Foltzer
Hi Shalin,

Good question; sorry I forgot it in the initial post. I have tried with both
a nightly build from earlier this month (Oct 2 I believe) as well as a build
from the trunk as of yesterday afternoon.

Adam

On Tue, Oct 6, 2009 at 5:04 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> On Tue, Oct 6, 2009 at 9:29 PM, Adam Foltzer  wrote:
>
> > Hi all,
> >
> > I'm trying to set up DataImportHandler to index some XML documents
> > available
> > over web services. The XML includes both content and metadata, so for the
> > indexable content, I'm trying to just index everything under the content
> > tag:
> >
> >  >url="resturl" processor="XPathEntityProcessor"
> >forEach="/document" transformer="HTMLStripTransformer"
> > flatten="true">
> >  > flatten="true" stripHTML="true" />
> > 
> > 
> >
> > The result of this is that the title field gets populated and indexed
> > (there
> > are no child nodes of /document/kbml/kbq), but content does not get
> indexed
> > at all. Since /document/kbml/body has many children, I expected that
> > flatten="true" would store all of the body text in the field. Instead, it
> > stores nothing at all. I've tried this with many combinations of
> > transformers and flatten options, and the result is the same each time.
> >
> >
> Which Solr version are you using? The flatten attribute was introduced
> after
> 1.3 released.
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: Merging multicore indexes

2009-10-06 Thread Paul Rosen

Shalin Shekhar Mangar wrote:


The path on the wiki page was wrong. You need to use the adminPath in the
url. Look at the adminPath attribute in solr.xml. It is typically
/admin/cores

So the correct path for you would be:

http://localhost:8983/solr/admin/cores?action=mergeindexes&core=merged&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_marc/index&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_rdf/index

Re: Weird Facet and KeywordTokenizerFactory Issue

2009-10-06 Thread Ravi Kiran
Yes Exactly the same

On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano wrote:

> And you had the analyzer for that field set-up the same way as shown on
> your previous e-mail when you indexed the data?
>
>
>
>
> On 10/06/2009 03:46 PM, Ravi Kiran wrote:
>
>> I did infact check it out any there is no weirdness in analysis page...see
>> result below
>>
>> Index Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {}  term
>> position 1 term text New York term type word source start,end 0,8 payload
>>  org.apache.solr.analysis.TrimFilterFactory {}  term position 1 term text
>> New
>> York term type word source start,end 0,8 payload
>>  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt,
>> ignoreCase=true, enablePositionIncrements=true}  term position 1 term text
>> New
>> York term type word source start,end 0,8 payload
>>  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
>> expand=false, ignoreCase=true}  term position 1 term text New York term
>> type
>> word source start,end 0,8 payload
>>  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}  term
>> position 1 term text New York term type word source start,end 0,8 payload
>>  Query Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {}  term
>> position 1 term text New York term type word source start,end 0,8 payload
>>  org.apache.solr.analysis.TrimFilterFactory {}  term position 1 term text
>> New
>> York term type word source start,end 0,8 payload
>>  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt,
>> ignoreCase=true, enablePositionIncrements=true}  term position 1 term text
>> New
>> York term type word source start,end 0,8 payload
>>  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
>> expand=false, ignoreCase=true}  term position 1 term text New York term
>> type
>> word source start,end 0,8 payload
>>  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term
>> position 1 term text New York term type word source start,end 0,8 payload
>>
>>
>> On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano> >wrote:
>>
>>
>>
>>> Have you tried using the Analysis page to see what tokens are generated
>>> for
>>> the string "New York"? It could be one of the token filter is adding the
>>> token 'new' for all strings that start with 'new'
>>>
>>>
>>> On 10/06/2009 02:54 PM, Ravi Kiran wrote:
>>>
>>>
>>>
 Hello All,
   Iam getting some ghost facets in solr 1.4. Can anybody
 kindly
 help me understand why I get them and how to eliminate them. My
 schema.xml
 snippet is given at the end. Iam indexing Named Entities extracted via
 OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is
 that it will use all words as a single token, am I right ? for example:
 "New
 York" will be indexed as 'New York' and will not be split right???
 However
 I
 see then splitup in facets as follows when running the query "


 http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
 "...but
 when I search with standard handler qt=standard&q=keyword:"New" I dont
 find
 any doc which has just "New". After digging in a bit I found that if
 several
 keywords have a common starting word it is being pulled out as another
 facet
 like the following. Any help is greatly appreciated

 Result
 
 47  >   Ghost
 7
 16
 10
 147
 23
 8
 5
 6
 8
 10
 8
 5
 7

 7  -->   Ghost
 5
 5


 7-->   Ghost
 6
 26
 6

 27
 8
 7
 12

 Schema.xml
 -

 >>> sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
   
 
 
 >>> words="stopwords.txt,entity-stopwords.txt"
 enablePositionIncrements="true"/>

 >>> synonyms="synonyms.txt"
 ignoreCase="true" expand="false" />
 
   
   
 
 
 >>> words="stopwords.txt,entity-stopwords.txt"
 enablePositionIncrements="true"
 />
 >>> synonyms="synonyms.txt"
 ignoreCase="true" expand="false" />
 
   
 

 >>> multiValued="true" termVectors="false" termPositions="false"
 termOffsets="false"/>
 >>> stored="true" multiValued="true" termVectors="false"
 termPositions="false"
 termOffsets="false"/>
 >>> stored="true"
 multiValued="true" termVectors="false" termPositions="false"
 termOffsets="false"/>
 >>> stored="true"
 multiValued="true" termVectors="false" termPositions="false"
 termOffsets="false"/>





>>>
>>>
>>
>>
>


Re: Problems with DIH XPath flatten

2009-10-06 Thread Shalin Shekhar Mangar
On Tue, Oct 6, 2009 at 9:29 PM, Adam Foltzer  wrote:

> Hi all,
>
> I'm trying to set up DataImportHandler to index some XML documents
> available
> over web services. The XML includes both content and metadata, so for the
> indexable content, I'm trying to just index everything under the content
> tag:
>
> url="resturl" processor="XPathEntityProcessor"
>forEach="/document" transformer="HTMLStripTransformer"
> flatten="true">
>  flatten="true" stripHTML="true" />
> 
> 
>
> The result of this is that the title field gets populated and indexed
> (there
> are no child nodes of /document/kbml/kbq), but content does not get indexed
> at all. Since /document/kbml/body has many children, I expected that
> flatten="true" would store all of the body text in the field. Instead, it
> stores nothing at all. I've tried this with many combinations of
> transformers and flatten options, and the result is the same each time.
>
>
Which Solr version are you using? The flatten attribute was introduced after
1.3 was released.

-- 
Regards,
Shalin Shekhar Mangar.


solr 1.4 formats last_index_time for SQL differently than 1.3 ?!?

2009-10-06 Thread michael8

Just looking for confirmation from others, but it appears that the formatting
of last_index_time from dataimport.properties (using DataImportHandler) is
different in 1.4 vs. that in 1.3.  I was troubleshooting why delta imports
are no longer working for me after moving over to solr 1.4 (10/2 nightly) and
noticed that format is different.
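
For context, the delta query references that value directly, so the stored format
matters. A typical delta entity looks roughly like this (a sketch - the table and
column names are made up):

  <entity name="item" pk="id"
          query="select id, name from item"
          deltaQuery="select id from item where last_modified &gt; '${dataimporter.last_index_time}'">
    ...
  </entity>

If 1.4 writes last_index_time to dataimport.properties in a different format than
1.3 did, a deltaQuery written against the old format can silently stop matching rows.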

Michael
-- 
View this message in context: 
http://www.nabble.com/solr-1.4-formats-last_index_time-for-SQL-differently-than-1.3--%21--tp25776496p25776496.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Weird Facet and KeywordTokenizerFactory Issue

2009-10-06 Thread Christian Zambrano
And you had the analyzer for that field set-up the same way as shown on 
your previous e-mail when you indexed the data?




On 10/06/2009 03:46 PM, Ravi Kiran wrote:

I did infact check it out any there is no weirdness in analysis page...see
result below

Index Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {}  term
position 1 term text New York term type word source start,end 0,8 payload
  org.apache.solr.analysis.TrimFilterFactory {}  term position 1 term text New
York term type word source start,end 0,8 payload
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt,
ignoreCase=true, enablePositionIncrements=true}  term position 1 term text New
York term type word source start,end 0,8 payload
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
expand=false, ignoreCase=true}  term position 1 term text New York term type
word source start,end 0,8 payload
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}  term
position 1 term text New York term type word source start,end 0,8 payload
  Query Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {}  term
position 1 term text New York term type word source start,end 0,8 payload
  org.apache.solr.analysis.TrimFilterFactory {}  term position 1 term text New
York term type word source start,end 0,8 payload
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt,
ignoreCase=true, enablePositionIncrements=true}  term position 1 term text New
York term type word source start,end 0,8 payload
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
expand=false, ignoreCase=true}  term position 1 term text New York term type
word source start,end 0,8 payload
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term
position 1 term text New York term type word source start,end 0,8 payload


On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote:

   

Have you tried using the Analysis page to see what tokens are generated for
the string "New York"? It could be one of the token filter is adding the
token 'new' for all strings that start with 'new'


On 10/06/2009 02:54 PM, Ravi Kiran wrote:

 

Hello All,
   Iam getting some ghost facets in solr 1.4. Can anybody
kindly
help me understand why I get them and how to eliminate them. My schema.xml
snippet is given at the end. Iam indexing Named Entities extracted via
OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is
that it will use all words as a single token, am I right ? for example:
"New
York" will be indexed as 'New York' and will not be split right??? However
I
see then splitup in facets as follows when running the query "

http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
"...but
when I search with standard handler qt=standard&q=keyword:"New" I dont
find
any doc which has just "New". After digging in a bit I found that if
several
keywords have a common starting word it is being pulled out as another
facet
like the following. Any help is greatly appreciated

Result

47  >   Ghost
7
16
10
147
23
8
5
6
8
10
8
5
7

7  -->   Ghost
5
5


7-->   Ghost
6
26
6

27
8
7
12

Schema.xml
-

 
   
 
 
 

 
 
   
   
 
 
 
 
 
   
 

 
 
 
 



   
 
   


Re: stats page slow in latest nightly

2009-10-06 Thread Joe Calderon
Thanks much, guys. No biggie for me; I just wanted to get to the bottom of
it in case I had screwed something else up.

--joe

On Tue, Oct 6, 2009 at 1:19 PM, Mark Miller  wrote:
> I was worried about that actually. I havn't tested how fast the RAM
> estimator is on huge String FieldCaches - it will be fast on everything
> else, but it checks the size of each String in the array.
>
> When I was working on it, I was actually going to default to not show
> the size, and make you click a link that added a param to get the sizes
> in the display too. But I foolishly didn't bring it up when Hoss made my
> life easier with his simpler patch.
>
> Yonik Seeley wrote:
>> Might be the new Lucene fieldCache stats stuff that was recently added?
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>>
>> On Tue, Oct 6, 2009 at 3:56 PM, Joe Calderon  wrote:
>>
>>> hello *, ive been noticing that /admin/stats.jsp is really slow in the
>>> recent builds, has anyone else encountered this?
>>>
>>>
>>> --joe
>>>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>


RE: Solr Timeouts

2009-10-06 Thread Giovanni Fernandez-Kincade
Yeah, that's exactly right, Mark.

What does the "maxCommitsToKeep" parameter (from SolrDeletionPolicy in
solrconfig.xml) actually do? Increasing this value seems to have helped a little,
but I'm wary of cranking it up further without a better understanding of what it does.
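
For reference, the section I'm changing in solrconfig.xml looks roughly like this
(values illustrative):

  <deletionPolicy class="solr.SolrDeletionPolicy">
    <!-- number of commit points to keep around; 1 keeps only the latest -->
    <str name="maxCommitsToKeep">1</str>
    <!-- number of optimized commit points to keep -->
    <str name="maxOptimizedCommitsToKeep">0</str>
    <!-- optionally keep any commit point newer than this age, e.g. 30MINUTES, 1DAY -->
    <!-- <str name="maxCommitAge">1DAY</str> -->
  </deletionPolicy>

From the comments in the stock config it appears to only control how many old
commit points are retained on disk (relevant for readers on NFS and for
replication), not when or why commits happen - but confirmation would be
appreciated.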

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Tuesday, October 06, 2009 4:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Timeouts

It sounds like he is indexing on a local disk, but reading the files to
be index from NFS - which would be fine.

You can get Lucene indexes to work on NFS (though still not recommended)
, but you need to use a custom IndexDeletionPolicy to keep older commit
points around longer and be sure not to use NIOFSDirectory.

Feak, Todd wrote:
> I seem to recall hearing something about *not* putting a Solr index directory 
> on an NFS mount. Might want to search on that.
>
> That, of course, doesn't have anything to do with commits showing up 
> unexpectedly in stack traces, per your original email.
>
> -Todd
>
> -Original Message-
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Tuesday, October 06, 2009 12:39 PM
> To: solr-user@lucene.apache.org; yo...@lucidimagination.com
> Subject: RE: Solr Timeouts
>
> That thread was blocking for an hour while all other threads were idle or 
> blocked.
>
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Tuesday, October 06, 2009 3:07 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Timeouts
>
> This specific thread was blocked for an hour?
> If so, I'd echo Lance... this is a local disk right?
>
> -Yonik
> http://www.lucidimagination.com
>
>
> On Mon, Oct 5, 2009 at 2:11 PM, Giovanni Fernandez-Kincade
>  wrote:
>
>> I just grabbed another stack trace for a thread that has been similarly 
>> blocking for over an hour. Notice that there is no Commit in this one:
>>
>> http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05
>> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
>> org.apache.lucene.index.SegmentTermEnum.next()
>> org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
>> org.apache.lucene.index.TermInfosReader.get(Term, boolean)
>> org.apache.lucene.index.TermInfosReader.get(Term)
>> org.apache.lucene.index.SegmentTermDocs.seek(Term)
>> org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
>> org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
>> org.apache.lucene.index.IndexWriter.applyDeletes()
>> org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
>> org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
>> org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
>> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer)
>> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document)
>> org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
>> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand)
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler,
>>  AddUpdateCommand)
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler)
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
>>  SolrQueryResponse, ContentStream)
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
>>  SolrQueryResponse)
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
>> SolrQueryResponse)
>> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
>>  SolrQueryResponse)
>> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
>> SolrQueryResponse)
>> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
>> SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
>> ServletResponse, FilterChain)
>> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
>>  ServletResponse)
>> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
>> ServletResponse)
>> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
>> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
>> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
>> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
>> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
>> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
>> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
>> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
>>  Object[])
>> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Sock

Re: Weird Facet and KeywordTokenizerFactory Issue

2009-10-06 Thread Ravi Kiran
I did in fact check it out and there is no weirdness on the analysis page... see
the result below.

Index Analyzer
  org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position 1, term text "New York", term type word, source start,end 0,8
  org.apache.solr.analysis.TrimFilterFactory {}
    term position 1, term text "New York", term type word, source start,end 0,8
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position 1, term text "New York", term type word, source start,end 0,8
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position 1, term text "New York", term type word, source start,end 0,8
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position 1, term text "New York", term type word, source start,end 0,8

Query Analyzer
  org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position 1, term text "New York", term type word, source start,end 0,8
  org.apache.solr.analysis.TrimFilterFactory {}
    term position 1, term text "New York", term type word, source start,end 0,8
  org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position 1, term text "New York", term type word, source start,end 0,8
  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position 1, term text "New York", term type word, source start,end 0,8
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position 1, term text "New York", term type word, source start,end 0,8


On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote:

> Have you tried using the Analysis page to see what tokens are generated for
> the string "New York"? It could be one of the token filter is adding the
> token 'new' for all strings that start with 'new'
>
>
> On 10/06/2009 02:54 PM, Ravi Kiran wrote:
>
>> Hello All,
>>   Iam getting some ghost facets in solr 1.4. Can anybody
>> kindly
>> help me understand why I get them and how to eliminate them. My schema.xml
>> snippet is given at the end. Iam indexing Named Entities extracted via
>> OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is
>> that it will use all words as a single token, am I right ? for example:
>> "New
>> York" will be indexed as 'New York' and will not be split right??? However
>> I
>> see then splitup in facets as follows when running the query "
>>
>> http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
>> "...but
>> when I search with standard handler qt=standard&q=keyword:"New" I dont
>> find
>> any doc which has just "New". After digging in a bit I found that if
>> several
>> keywords have a common starting word it is being pulled out as another
>> facet
>> like the following. Any help is greatly appreciated
>>
>> Result
>> 
>> 47 >  Ghost
>> 7
>> 16
>> 10
>> 147
>> 23
>> 8
>> 5
>> 6
>> 8
>> 10
>> 8
>> 5
>> 7
>>
>> 7 -->  Ghost
>> 5
>> 5
>>
>>
>> 7   -->  Ghost
>> 6
>> 26
>> 6
>>
>> 27
>> 8
>> 7
>> 12
>>
>> Schema.xml
>> -
>>
>> > sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>>   
>> 
>> 
>> > words="stopwords.txt,entity-stopwords.txt"
>> enablePositionIncrements="true"/>
>>
>> > ignoreCase="true" expand="false" />
>> 
>>   
>>   
>> 
>> 
>> > words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"
>> />
>> > ignoreCase="true" expand="false" />
>> 
>>   
>> 
>>
>> > multiValued="true" termVectors="false" termPositions="false"
>> termOffsets="false"/>
>> > stored="true" multiValued="true" termVectors="false" termPositions="false"
>> termOffsets="false"/>
>> > multiValued="true" termVectors="false" termPositions="false"
>> termOffsets="false"/>
>> > multiValued="true" termVectors="false" termPositions="false"
>> termOffsets="false"/>
>>
>>
>>
>


Re: Solr Timeouts

2009-10-06 Thread Mark Miller
It sounds like he is indexing on a local disk, but reading the files to
be indexed from NFS - which would be fine.

You can get Lucene indexes to work on NFS (though still not recommended)
, but you need to use a custom IndexDeletionPolicy to keep older commit
points around longer and be sure not to use NIOFSDirectory.

Feak, Todd wrote:
> I seem to recall hearing something about *not* putting a Solr index directory 
> on an NFS mount. Might want to search on that.
>
> That, of course, doesn't have anything to do with commits showing up 
> unexpectedly in stack traces, per your original email.
>
> -Todd
>
> -Original Message-
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Tuesday, October 06, 2009 12:39 PM
> To: solr-user@lucene.apache.org; yo...@lucidimagination.com
> Subject: RE: Solr Timeouts
>
> That thread was blocking for an hour while all other threads were idle or 
> blocked.
>
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Tuesday, October 06, 2009 3:07 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Timeouts
>
> This specific thread was blocked for an hour?
> If so, I'd echo Lance... this is a local disk right?
>
> -Yonik
> http://www.lucidimagination.com
>
>
> On Mon, Oct 5, 2009 at 2:11 PM, Giovanni Fernandez-Kincade
>  wrote:
>   
>> I just grabbed another stack trace for a thread that has been similarly 
>> blocking for over an hour. Notice that there is no Commit in this one:
>>
>> http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05
>> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
>> org.apache.lucene.index.SegmentTermEnum.next()
>> org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
>> org.apache.lucene.index.TermInfosReader.get(Term, boolean)
>> org.apache.lucene.index.TermInfosReader.get(Term)
>> org.apache.lucene.index.SegmentTermDocs.seek(Term)
>> org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
>> org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
>> org.apache.lucene.index.IndexWriter.applyDeletes()
>> org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
>> org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
>> org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
>> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer)
>> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document)
>> org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
>> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand)
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler,
>>  AddUpdateCommand)
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler)
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
>>  SolrQueryResponse, ContentStream)
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
>>  SolrQueryResponse)
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
>> SolrQueryResponse)
>> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
>>  SolrQueryResponse)
>> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
>> SolrQueryResponse)
>> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
>> SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
>> ServletResponse, FilterChain)
>> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
>>  ServletResponse)
>> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
>> ServletResponse)
>> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
>> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
>> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
>> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
>> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
>> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
>> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
>> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
>>  Object[])
>> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, 
>> TcpConnection, Object[])
>> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
>> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
>> java.lang.Thread.run()
>>
>>
>> -Original Message-
>> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
>> Sent: Monday, October 05, 2009 1:18 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr Timeouts
>>
>> OK... next step 

RE: Solr Timeouts

2009-10-06 Thread Feak, Todd
I seem to recall hearing something about *not* putting a Solr index directory 
on an NFS mount. Might want to search on that.

That, of course, doesn't have anything to do with commits showing up 
unexpectedly in stack traces, per your original email.

-Todd

-Original Message-
From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
Sent: Tuesday, October 06, 2009 12:39 PM
To: solr-user@lucene.apache.org; yo...@lucidimagination.com
Subject: RE: Solr Timeouts

That thread was blocking for an hour while all other threads were idle or 
blocked.

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Tuesday, October 06, 2009 3:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Timeouts

This specific thread was blocked for an hour?
If so, I'd echo Lance... this is a local disk right?

-Yonik
http://www.lucidimagination.com


On Mon, Oct 5, 2009 at 2:11 PM, Giovanni Fernandez-Kincade
 wrote:
> I just grabbed another stack trace for a thread that has been similarly 
> blocking for over an hour. Notice that there is no Commit in this one:
>
> http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05
> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
> org.apache.lucene.index.SegmentTermEnum.next()
> org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
> org.apache.lucene.index.TermInfosReader.get(Term, boolean)
> org.apache.lucene.index.TermInfosReader.get(Term)
> org.apache.lucene.index.SegmentTermDocs.seek(Term)
> org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
> org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
> org.apache.lucene.index.IndexWriter.applyDeletes()
> org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
> org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
> org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer)
> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document)
> org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler,
>  AddUpdateCommand)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
>  SolrQueryResponse, ContentStream)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
>  SolrQueryResponse)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
> SolrQueryResponse)
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
>  SolrQueryResponse)
> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
> SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
> SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
> ServletResponse, FilterChain)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
>  ServletResponse)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
> ServletResponse)
> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
>  Object[])
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, 
> TcpConnection, Object[])
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
> java.lang.Thread.run()
>
>
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Monday, October 05, 2009 1:18 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Timeouts
>
> OK... next step is to verify that SolrCell doesn't have a bug that
> causes it to commit.
> I'll try and verify today unless someone else beats me to it.
>
> -Yonik
> http://www.lucidimagination.com
>
> On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade
>  wrote:
>> I'm fairly certain that all of the indexing jobs are calling SOLR with 
>> commit=false. They all construct the indexing URLs using a CLR function I 
>> wrote, which takes in a Commit parameter, which is

Re: Weird Facet and KeywordTokenizerFactory Issue

2009-10-06 Thread Christian Zambrano
Have you tried using the Analysis page to see what tokens are generated
for the string "New York"? It could be that one of the token filters is adding
the token 'new' for all strings that start with 'new'.
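
Another quick check (a sketch - adjust the core name, field, and prefix to your
setup): faceting with a prefix will list exactly which indexed terms start with
'New', without any query-time analysis getting in the way:

  http://localhost:8080/solr-admin/topicscore/select/?q=*:*&rows=0&facet=true&facet.field=keyword&facet.prefix=New&facet.limit=-1

If a bare 'New' term shows up there, it was indexed that way rather than produced
at query time.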


On 10/06/2009 02:54 PM, Ravi Kiran wrote:

Hello All,
   Iam getting some ghost facets in solr 1.4. Can anybody kindly
help me understand why I get them and how to eliminate them. My schema.xml
snippet is given at the end. Iam indexing Named Entities extracted via
OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is
that it will use all words as a single token, am I right ? for example: "New
York" will be indexed as 'New York' and will not be split right??? However I
see then splitup in facets as follows when running the query "
http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1"...but
when I search with standard handler qt=standard&q=keyword:"New" I dont find
any doc which has just "New". After digging in a bit I found that if several
keywords have a common starting word it is being pulled out as another facet
like the following. Any help is greatly appreciated

Result

47 >  Ghost
7
16
10
147
23
8
5
6
8
10
8
5
7

7 -->  Ghost
5
5


7   -->  Ghost
6
26
6

27
8
7
12

Schema.xml
-

 
   
 
 
 

 
 
   
   
 
 
 
 
 
   
 

 
 
 
 

   


Re: stats page slow in latest nightly

2009-10-06 Thread Mark Miller
I was worried about that actually. I haven't tested how fast the RAM
estimator is on huge String FieldCaches - it will be fast on everything
else, but it checks the size of each String in the array.

When I was working on it, I was actually going to default to not show
the size, and make you click a link that added a param to get the sizes
in the display too. But I foolishly didn't bring it up when Hoss made my
life easier with his simpler patch.

Yonik Seeley wrote:
> Might be the new Lucene fieldCache stats stuff that was recently added?
>
> -Yonik
> http://www.lucidimagination.com
>
>
> On Tue, Oct 6, 2009 at 3:56 PM, Joe Calderon  wrote:
>   
>> hello *, ive been noticing that /admin/stats.jsp is really slow in the
>> recent builds, has anyone else encountered this?
>>
>>
>> --joe
>> 


-- 
- Mark

http://www.lucidimagination.com





RE: Solr and Garbage Collection

2009-10-06 Thread Fuad Efendi
Master-slave replication: new caches will be warmed and prepopulated _before_
the new IndexReader is made available for _new_ requests and _before_ the old
one is discarded - it means that the theoretical sizing for the FieldCache
(which is determined by the number of docs in the index and the cardinality of
the field) should be doubled... of course we need to play with GC options too
for performance tuning (mostly).
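
As a rough illustration (numbers assumed): for a single-valued String field,
Lucene's FieldCache keeps an int ord per document plus the array of unique
values, so 10,000,000 docs cost on the order of 40MB for the ords alone; while
the new searcher is warming, the old and new entries coexist, so peak usage is
roughly double that steady-state figure.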


> > I read pretty much all posts on this thread (before and after this one).
> Looks
> > like the main suggestion from you and others is to keep max heap size
> (-Xmx)
> > as small as possible (as long as you don't see OOM exception).
> 
> 
> I suggested absolute opposite; please note also that "as small as
possible"
> does not have any meaning in multiuser environment of Tomcat. It depends
on
> query types (10 documents per request? OR, may be 1???) AND it depends
> on average server loading (one concurrent request? Or, may be 200 threads
> trying to deal with 2000 concurrent requests?) AND it depends on whether
it
> is Master (used for updates - parses tons of docs in a single file???) -
and
> it depends on unpredictable memory fragmentation - it all depends on use
> case too(!!!), additionally to schema / index size.
> 
> 
> Please note also, such staff depends on JVM vendor too: what if it
> precompiles everything into CPU native code (including memory dealloc
after
> each call)? Some do!
> 
> -Fuad
> http://www.linkedin.com/in/liferay
> 
> 
> ...but 'core' constantly disagrees with me :)
> 
> 
> 





Re: stats page slow in latest nightly

2009-10-06 Thread Yonik Seeley
Might be the new Lucene fieldCache stats stuff that was recently added?

-Yonik
http://www.lucidimagination.com


On Tue, Oct 6, 2009 at 3:56 PM, Joe Calderon  wrote:
> hello *, ive been noticing that /admin/stats.jsp is really slow in the
> recent builds, has anyone else encountered this?
>
>
> --joe


Re: Solr Trunk Heap Space Issues

2009-10-06 Thread Mark Miller
Jeff Newburn wrote:
> So could that potentially explain our use of more ram on indexing? Or is
> this a rare edge case.
>   
I think it could explain the JVM using more RAM while indexing - but it
should be fairly easily recoverable from what I can tell - so no
explanation on the OOM yet. Still looking at that one.

Is your system basically stock, or do you have custom plugins in it?

-- 
- Mark

http://www.lucidimagination.com





stats page slow in latest nightly

2009-10-06 Thread Joe Calderon
Hello all, I've been noticing that /admin/stats.jsp is really slow in the
recent builds; has anyone else encountered this?


--joe


Weird Facet and KeywordTokenizerFactory Issue

2009-10-06 Thread Ravi Kiran
Hello All,
          I am getting some ghost facets in Solr 1.4. Can anybody kindly
help me understand why I get them and how to eliminate them? My schema.xml
snippet is given at the end. I am indexing named entities extracted via
OpenNLP into Solr. My understanding regarding KeywordTokenizerFactory is
that it will treat the whole value as a single token, am I right? For example, "New
York" will be indexed as 'New York' and will not be split, right? However I
see them split up in facets as follows when running the query
http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1 ...but
when I search with the standard handler qt=standard&q=keyword:"New" I don't find
any doc which has just "New". After digging in a bit I found that if several
keywords have a common starting word it is being pulled out as another facet,
like the following. Any help is greatly appreciated.

Result

47> Ghost
7
16
10
147
23
8
5
6
8
10
8
5
7

7--> Ghost
5
5


7  --> Ghost
6
26
6

27
8
7
12

Schema.xml
-


  






  
  





  






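
Roughly, the field type is a KeywordTokenizer chain like this (a sketch - the
type name is just for illustration, the filter chain is what I have configured):

  <fieldType name="keywordText" class="solr.TextField" sortMissingLast="true"
             omitNorms="true" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="false"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="false"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

With this chain every value stays a single token, so a bare 'New' facet entry
would have to come from a value that was actually indexed as 'New' (or from a
synonym mapping), not from tokenization.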


Re: Solr Trunk Heap Space Issues

2009-10-06 Thread Jeff Newburn
So could that potentially explain our use of more RAM on indexing? Or is
this a rare edge case?
-- 
Jeff Newburn
Software Engineer, Zappos.com
jnewb...@zappos.com - 702-943-7562


> From: Mark Miller 
> Reply-To: 
> Date: Tue, 06 Oct 2009 15:30:50 -0400
> To: 
> Subject: Re: Solr Trunk Heap Space Issues
> 
> This is looking like its just a Lucene oddity you get when adding a
> single doc due to some changes with the NRT stuff.
> 
> Mark Miller wrote:
>> Okay - I'm sorry - serves me right for working sick.
>> 
>> Now that I have put on my glasses and correctly tagged my two eclipse tests:
>> 
>> It still appears that trunk likes to use more RAM.
>> 
>> I switched both tests to one million iterations and watched the heap.
>> 
>> The test from the build around may 5th (I promise :) ) regularly GC's
>> down to about 70-80MB after a fair time
>> of running. It doesn't appear to climb - keeps GC'ing back to 70-80
>> (after starting at by GC'ing down to 40 for a bit).
>> 
>> The test from trunk, after a fair time of running, keeps GC'ing down to
>> about 120-150MB - 150 at the end, slowly working its
>> way up from 90-110 at the beginning.
>> 
>> Don't know what that means yet - but it appears trunk likes to use a bit
>> more RAM while indexing. Odd that its so much more because these docs
>> are tiny:
>> 
>> String[] fields = {"text","simple"
>> ,"text","test"
>> ,"text","how now brown cow"
>> ,"text","what's that?"
>> ,"text","radical!"
>> ,"text","what's all this about, anyway?"
>> ,"text","just how fast is this text indexing?"
>> };
>> 
>> Mark Miller wrote:
>>   
>>> Okay, I juggled the tests in eclipse and flipped the results. So they
>>> make sense.
>>> 
>>> Sorry - goose chase on this one.
>>> 
>>> Yonik Seeley wrote:
>>>   
>>> 
 I don't see this with trunk... I just tried TestIndexingPerformance
 with 1M docs, and it seemed to work fine.
 Memory use stabilized at 40MB.
 Most memory use was for indexing (not analysis).
 char[] topped out at 4.5MB
 
 -Yonik
 http://www.lucidimagination.com
 
 
 On Tue, Oct 6, 2009 at 12:31 PM, Mark Miller  wrote:
   
 
   
> Yeah - I was wondering about that ... not sure how these guys are
> stacking up ...
> 
> Yonik Seeley wrote:
> 
>   
> 
>> TestIndexingPerformance?
>> What the heck... that's not even multi-threaded!
>> 
>> -Yonik
>> http://www.lucidimagination.com
>> 
>> 
>> 
>> On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller 
>> wrote:
>> 
>>   
>> 
>>   
>>> Darnit - didn't finish that email. This is after running your old short
>>> doc perf test for 10,000 iterations. You see the same thing with 1000
>>> iterations but much less pronounced eg gettin' worse with more
>>> iterations.
>>> 
>>> Mark Miller wrote:
>>> 
>>> 
>>>   
>>>
 A little before and after. The before is around may 5th'is - the after
 is trunk.
 
 http://myhardshadow.com/memanalysis/before.png
 http://myhardshadow.com/memanalysis/after.png
 
 Mark Miller wrote:
 
 
   
   
   
> Took a peak at the checkout around the time he says he's using.
> 
> CharTokenizer appears to be holding onto much large char[] arrays now
> than before. Same with snowball.Among - used to be almost nothing, now
> its largio.
> 
> The new TokenStream stuff appears to be clinging. Needs to find some
> inner peace.
>  
>  
>  
>>>   
>>> 
>> 
>> 
>>   
> 
> 
> -- 
> - Mark
> 
> http://www.lucidimagination.com
> 
> 
> 



RE: Solr Timeouts

2009-10-06 Thread Giovanni Fernandez-Kincade
That thread was blocking for an hour while all other threads were idle or 
blocked.

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Tuesday, October 06, 2009 3:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Timeouts

This specific thread was blocked for an hour?
If so, I'd echo Lance... this is a local disk right?

-Yonik
http://www.lucidimagination.com


On Mon, Oct 5, 2009 at 2:11 PM, Giovanni Fernandez-Kincade
 wrote:
> I just grabbed another stack trace for a thread that has been similarly 
> blocking for over an hour. Notice that there is no Commit in this one:
>
> http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05
> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
> org.apache.lucene.index.SegmentTermEnum.next()
> org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
> org.apache.lucene.index.TermInfosReader.get(Term, boolean)
> org.apache.lucene.index.TermInfosReader.get(Term)
> org.apache.lucene.index.SegmentTermDocs.seek(Term)
> org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
> org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
> org.apache.lucene.index.IndexWriter.applyDeletes()
> org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
> org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
> org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer)
> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document)
> org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler,
>  AddUpdateCommand)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
>  SolrQueryResponse, ContentStream)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
>  SolrQueryResponse)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
> SolrQueryResponse)
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
>  SolrQueryResponse)
> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
> SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
> SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
> ServletResponse, FilterChain)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
>  ServletResponse)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
> ServletResponse)
> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
>  Object[])
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, 
> TcpConnection, Object[])
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
> java.lang.Thread.run()
>
>
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Monday, October 05, 2009 1:18 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Timeouts
>
> OK... next step is to verify that SolrCell doesn't have a bug that
> causes it to commit.
> I'll try and verify today unless someone else beats me to it.
>
> -Yonik
> http://www.lucidimagination.com
>
> On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade
>  wrote:
>> I'm fairly certain that all of the indexing jobs are calling SOLR with 
>> commit=false. They all construct the indexing URLs using a CLR function I 
>> wrote, which takes in a Commit parameter, which is always set to false.
>>
>> Also, I don't see any calls to commit in the Tomcat logs (whereas normally 
>> when I make a commit call I do).
>>
>> This suggests that Solr is doing it automatically, but the extract handler 
>> doesn't seem to be the problem:
>>  > class="org.apache.solr.handler.extraction.ExtractingRequestHandler" 
>> startup="lazy">
>>
>>  ignored_
>>  fileData
>>
>>  
>>
>>
>> There is no external config file specified, and I don't see anything about 
>>

RE: Solr Timeouts

2009-10-06 Thread Giovanni Fernandez-Kincade
Yeah, this is Java 1.6.

The indexes are being written to a local disk, but the files being indexed
live on an NFS mount.

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Tuesday, October 06, 2009 2:59 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Timeouts

Is this Java 1.5? There are known threading bugs in 1.5 that were
fixed in Java 1.6. Also, there was one short series of 1.6 releases
that wrote bogus Lucene index files.

So, make sure you use the latest Java 1.6 release.

Also, I hope this is a local disk. Some shops try running over NFS or
Windows file sharing and this often does not work well.

Lance



On 10/6/09, Giovanni Fernandez-Kincade  wrote:
> Is it possible that deletions are triggering these commits? Some of the
> documents that I'm making indexing requests for already exist in the index,
> so they would result in deletions. I tried messing with some of these
> parameters but I'm still running into the same problem:
>
> 
>   
>   false
>   
>   100
>   
> 
>
> This is happening like every 30-40minutes and it's really hampering the
> indexing progress...
>
>
> -Original Message-
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Monday, October 05, 2009 2:11 PM
> To: solr-user@lucene.apache.org; yo...@lucidimagination.com
> Subject: RE: Solr Timeouts
>
> I just grabbed another stack trace for a thread that has been similarly
> blocking for over an hour. Notice that there is no Commit in this one:
>
> http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05
> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
> org.apache.lucene.index.SegmentTermEnum.next()
> org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
> org.apache.lucene.index.TermInfosReader.get(Term, boolean)
> org.apache.lucene.index.TermInfosReader.get(Term)
> org.apache.lucene.index.SegmentTermDocs.seek(Term)
> org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
> org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
> org.apache.lucene.index.IndexWriter.applyDeletes()
> org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
> org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
> org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer)
> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document)
> org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler,
> AddUpdateCommand)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
> SolrQueryResponse, ContentStream)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
> SolrQueryResponse)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest,
> SolrQueryResponse)
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
> SolrQueryResponse)
> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest,
> SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest,
> SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest,
> ServletResponse, FilterChain)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
> ServletResponse)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest,
> ServletResponse)
> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
> Object[])
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket,
> TcpConnection, Object[])
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
> java.lang.Thread.run()
>
>
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Monday, October 05, 2009 1:18 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Timeouts
>
> OK... next step is to verify that SolrCell doesn't have a bug that
> causes it to commit.
>

Re: Solr Trunk Heap Space Issues

2009-10-06 Thread Mark Miller
This is looking like it's just a Lucene oddity you get when adding a
single doc, due to some changes with the NRT stuff.

Mark Miller wrote:
> Okay - I'm sorry - serves me right for working sick.
>
> Now that I have put on my glasses and correctly tagged my two eclipse tests:
>
> It still appears that trunk likes to use more RAM.
>
> I switched both tests to one million iterations and watched the heap.
>
> The test from the build around may 5th (I promise :) ) regularly GC's
> down to about 70-80MB after a fair time
> of running. It doesn't appear to climb - keeps GC'ing back to 70-80
> (after starting at by GC'ing down to 40 for a bit).
>
> The test from trunk, after a fair time of running, keeps GC'ing down to
> about 120-150MB - 150 at the end, slowly working its
> way up from 90-110 at the beginning.
>
> Don't know what that means yet - but it appears trunk likes to use a bit
> more RAM while indexing. Odd that its so much more because these docs
> are tiny:
>
> String[] fields = {"text","simple"
> ,"text","test"
> ,"text","how now brown cow"
> ,"text","what's that?"
> ,"text","radical!"
> ,"text","what's all this about, anyway?"
> ,"text","just how fast is this text indexing?"
> };
>
> Mark Miller wrote:
>   
>> Okay, I juggled the tests in eclipse and flipped the results. So they
>> make sense.
>>
>> Sorry - goose chase on this one.
>>
>> Yonik Seeley wrote:
>>   
>> 
>>> I don't see this with trunk... I just tried TestIndexingPerformance
>>> with 1M docs, and it seemed to work fine.
>>> Memory use stabilized at 40MB.
>>> Most memory use was for indexing (not analysis).
>>> char[] topped out at 4.5MB
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>>
>>> On Tue, Oct 6, 2009 at 12:31 PM, Mark Miller  wrote:
>>>   
>>> 
>>>   
 Yeah - I was wondering about that ... not sure how these guys are
 stacking up ...

 Yonik Seeley wrote:
 
   
 
> TestIndexingPerformance?
> What the heck... that's not even multi-threaded!
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
> On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller  
> wrote:
>
>   
> 
>   
>> Darnit - didn't finish that email. This is after running your old short
>> doc perf test for 10,000 iterations. You see the same thing with 1000
>> iterations but much less pronounced eg gettin' worse with more 
>> iterations.
>>
>> Mark Miller wrote:
>>
>> 
>>   
>> 
>>> A little before and after. The before is around may 5th'is - the after
>>> is trunk.
>>>
>>> http://myhardshadow.com/memanalysis/before.png
>>> http://myhardshadow.com/memanalysis/after.png
>>>
>>> Mark Miller wrote:
>>>
>>>
>>>   
>>> 
>>>   
 Took a peak at the checkout around the time he says he's using.

 CharTokenizer appears to be holding onto much large char[] arrays now
 than before. Same with snowball.Among - used to be almost nothing, now
 its largio.

 The new TokenStream stuff appears to be clinging. Needs to find some
 inner peace.
 
   
 
>>   
>> 
>
>
>   


-- 
- Mark

http://www.lucidimagination.com





RE: Solr and Garbage Collection

2009-10-06 Thread Fuad Efendi
> I read pretty much all posts on this thread (before and after this one).
Looks
> like the main suggestion from you and others is to keep max heap size
(-Xmx)
> as small as possible (as long as you don't see OOM exception). 


I suggested absolute opposite; please note also that "as small as possible"
does not have any meaning in multiuser environment of Tomcat. It depends on
query types (10 documents per request? OR, may be 1???) AND it depends
on average server loading (one concurrent request? Or, may be 200 threads
trying to deal with 2000 concurrent requests?) AND it depends on whether it
is Master (used for updates - parses tons of docs in a single file???) - and
it depends on unpredictable memory fragmentation - it all depends on use
case too(!!!), additionally to schema / index size.


Please note also, such staff depends on JVM vendor too: what if it
precompiles everything into CPU native code (including memory dealloc after
each call)? Some do!

-Fuad
http://www.linkedin.com/in/liferay


...but 'core' constantly disagrees with me :)






RE: Geo Coding Service

2009-10-06 Thread Fuad Efendi
If you are looking for (simplified) ZIP/PostalCode -> Longitude-Latitude
mapping (North America) check this: http://www.zipcodedownload.com

I am using it for service area calculations for casaGURU renovation
professionals at http://www.casaguru.com

They even have an API library (including stored procedures for MySQL, Oracle,
etc., plus a Java API) to calculate the distance between two postal codes, execute
queries, etc.


-Fuad
http://www.linkedin.com/in/liferay



> -Original Message-
> From: ram_sj [mailto:rpachaiyap...@gmail.com]
> Sent: October-06-09 2:33 PM
> To: solr-user@lucene.apache.org
> Subject: Geo Coding Service
> 
> 
> Hi,
> 
> Can someone suggest me a good geo-coding service or software for
commercial
> use. I want to find gecodes for large collection of address. I'm looking
for
> a good long term service.
> 
> Thanks
> Ram
> --
> View this message in context: http://www.nabble.com/Geo-Coding-Service-
> tp25774277p25774277.html
> Sent from the Solr - User mailing list archive at Nabble.com.





Re: Importing CSV file slow/crashes

2009-10-06 Thread Chris Hostetter

: Is it possible to narrow down what fields/field-types are causing the 
problems?
: Or perhaps profile and see what's taking up time compared to the older 
version?

Or: could you post your solrconfig + schema + csv files online so other 
people could help debug the problem?


: 
: -Yonik
: http://www.lucidimagination.com
: 
: 
: 
: On Tue, Oct 6, 2009 at 1:48 PM, Nasseam Elkarra  wrote:
: > Hello Erick,
: >
: > Sorry about that. I'm using the CSV update handler. Uploading a local CSV
: > using the stream.file parameter. There are 94 fields and 36 copyFields.
: >
: > Thank you,
: > Nasseam
: >
: > On Oct 6, 2009, at 10:09 AM, Erick Erickson wrote:
: >
: >> Well, without some better idea of *how* you're doing the import, it's a
: >> little hard to say anything meaningful (hint, hint).
: >> Best
: >> Erick
: >>
: >> On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra 
: >> wrote:
: >>
: >>> Hello all,
: >>>
: >>> I had a dev build of 1.4 from 5/1/2009 and importing a 20K row took less
: >>> than a minute. Updating to the latest as of yesterday, the import is
: >>> really
: >>> slow and I had to cancel it after a half hour. This prevented me from
: >>> upgrading a few months ago as well.
: >>>
: >>> Any ideas as to the cause of this?
: >>>
: >>> Thank you,
: >>>
: >>> Nasseam Elkarra
: >>> http://bodukai.com/boutique/
: >>> The fastest possible shopping experience.
: >>>
: >>>
: >
: >
: 



-Hoss



Re: DataImportHandler problem: Feeding the XPathEntityProcessor with the FieldReaderDataSource

2009-10-06 Thread Lance Norskog
A side note that might help: if I change the dataField from 'db.blob'
to 'blob', this DIH stack emits no documents.

On 10/5/09, Lance Norskog  wrote:
> I've added a unit test for the problem down below. It feeds document
> field data into the XPathEntityProcessor via the
> FieldReaderDataSource, and the XPath EP does not emit unpacked fields.
>
> Running this under the debugger, I can see the supplied StringReader,
> with the XML string, being piped into the XPath EP. But somehow the
> XPath EP does not pick it apart the right way.
>
> Here is the DIH configuration file separately.
>
> 
>   
>   
>   
>   
> 
> 
> 
>  processor='XPathEntityProcessor'
>   forEach='/names' dataField='db.blob'>
>   
> 
>   
>   
> 
>
> Any ideas?
>
> ---
>
> package org.apache.solr.handler.dataimport;
>
> import static
> org.apache.solr.handler.dataimport.AbstractDataImportHandlerTest.createMap;
> import junit.framework.TestCase;
>
> import java.util.ArrayList;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> import org.apache.solr.common.SolrInputDocument;
> import org.apache.solr.common.SolrInputField;
> import org.apache.solr.handler.dataimport.TestDocBuilder.SolrWriterImpl;
> import org.junit.Test;
>
> /*
>  * Demonstrate problem feeding XPathEntity from a FieldReaderDatasource
>  */
>
> public class TestFieldReaderXPath extends TestCase {
>   static final String KISSINGER = "Henry";
>   
>   static final String[][][] DBDOCS = {
>   {{"dbid", "1"}, {"blob", KISSINGER}},
>   };
>   
>   /*
>* Receive a row from SQL and fetch a row from Solr - no value matching
>* stolen from TestDocBuilder
>* */
>
>   @Test
>   public void testSolrEmbedded() throws Exception {
>   try {
>   DataImporter di = new DataImporter();
>   di.loadDataConfig(dih_config_FR_into_XP);
>   DataImporter.RequestParams rp = new 
> DataImporter.RequestParams();
>   rp.command = "full-import";
>   rp.requestParams = new HashMap();
>
>   DataConfig cfg = di.getConfig();
>   DataConfig.Entity entity = cfg.document.entities.get(0);
>   List<Map<String, Object>> l = new
> ArrayList<Map<String, Object>>();
>   addDBDocuments(l);
>   MockDataSource.setIterator("select * from x", 
> l.iterator());
>   entity.dataSrc = new MockDataSource();
>   entity.isDocRoot = true;
>   SolrWriterImpl swi = new SolrWriterImpl();
>   di.runCmd(rp, swi);
>
>   assertEquals(1, swi.docs.size());
>   SolrInputDocument doc = swi.docs.get(0);
>   SolrInputField field;
>   field = doc.getField("dbid");
>   assertEquals(field.getValue().toString(), "1");
>   field = doc.getField("blob");
>   assertEquals(field.getValue().toString(), KISSINGER);
>   field = doc.getField("name");
>   assertNotNull(field);
>   assertEquals(field.getValue().toString(), "Henry");
>   } finally {
>   MockDataSource.clearCache();
>   }
>   }
>
>   
>   private void addDBDocuments(List<Map<String, Object>> l) {
>   for(String[][] dbdoc: DBDOCS) {
>   l.add(createMap(dbdoc[0][0], dbdoc[0][1], dbdoc[1][0], 
> dbdoc[1][1]));
>   }
>   }
>
>String dih_config_FR_into_XP = "\r\n" +
>"  \r\n" +
>"  \r\n" +
>"  \r\n" +
>"  \r\n" 
> +
>"\r\n" +
>"\r\n" +
>"\r\n" +
>" processor='XPathEntityProcessor'\r\n" +
>"  forEach='/names' dataField='db.blob'>\r\n" +
>"  \r\n" +
>"\r\n" +
>"  \r\n" +
>"  \r\n" +
>"\r\n"
>;
>
>
> }
>


-- 
Lance Norskog
goks...@gmail.com


Re: Different sort behavior on same code

2009-10-06 Thread Yonik Seeley
Lucene's test for multi-valued fields is crude... it's essentially if
the number of values (un-inverted term instances) becomes greater than
the number of documents.
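
Since sorting wants exactly one indexed term per document, the usual fix is a
separate single-valued field populated with one chosen value at index time and
used only for sorting. A sketch (field names and attributes illustrative):

  <field name="myDate"     type="date" indexed="true" stored="true"  multiValued="true"/>
  <field name="myDateSort" type="date" indexed="true" stored="false" multiValued="false"/>

Then sort with &sort=myDateSort+desc (or asc) instead of sorting on myDate.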

-Yonik
http://www.lucidimagination.com

On Tue, Oct 6, 2009 at 3:04 PM, wojtekpia  wrote:
>
> Hi,
>
> I'm running Solr version 1.3.0.2009.07.08.08.05.45 in 2 environments. I have
> a field defined as:
>
>  multiValued="true"/>
>
> The two environments have different data, but both have single and multi
> valued entries for myDate.
>
> On one environment sorting by myDate works (sort seems to be by the 'last'
> value if multi valued).
>
> On the other environment I get:
> HTTP Status 500 - there are more terms than documents in field "myDate", but
> it's impossible to sort on tokenized fields java.lang.RuntimeException:
> there are more terms than documents in field
>
> I've read that I shouldn't sort by multi-valued fields, so my solution will
> be to add a single-valued date field for sorting. But I don't understand why
> my two environments behave differently, and it doesn't seem like the error
> message makes sense (are date fields tokenized?). Any thoughts?
>
> Thanks,
>
> Wojtek
> --
> View this message in context: 
> http://www.nabble.com/Different-sort-behavior-on-same-code-tp25774769p25774769.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Solr Timeouts

2009-10-06 Thread Yonik Seeley
This specific thread was blocked for an hour?
If so, I'd echo Lance... this is a local disk right?

-Yonik
http://www.lucidimagination.com


On Mon, Oct 5, 2009 at 2:11 PM, Giovanni Fernandez-Kincade
 wrote:
> I just grabbed another stack trace for a thread that has been similarly 
> blocking for over an hour. Notice that there is no Commit in this one:
>
> http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05
> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
> org.apache.lucene.index.SegmentTermEnum.next()
> org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
> org.apache.lucene.index.TermInfosReader.get(Term, boolean)
> org.apache.lucene.index.TermInfosReader.get(Term)
> org.apache.lucene.index.SegmentTermDocs.seek(Term)
> org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
> org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
> org.apache.lucene.index.IndexWriter.applyDeletes()
> org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
> org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
> org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer)
> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document)
> org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler,
>  AddUpdateCommand)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
>  SolrQueryResponse, ContentStream)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
>  SolrQueryResponse)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
> SolrQueryResponse)
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
>  SolrQueryResponse)
> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
> SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
> SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
> ServletResponse, FilterChain)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
>  ServletResponse)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
> ServletResponse)
> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
>  Object[])
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, 
> TcpConnection, Object[])
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
> java.lang.Thread.run()
>
>
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Monday, October 05, 2009 1:18 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Timeouts
>
> OK... next step is to verify that SolrCell doesn't have a bug that
> causes it to commit.
> I'll try and verify today unless someone else beats me to it.
>
> -Yonik
> http://www.lucidimagination.com
>
> On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade
>  wrote:
>> I'm fairly certain that all of the indexing jobs are calling SOLR with 
>> commit=false. They all construct the indexing URLs using a CLR function I 
>> wrote, which takes in a Commit parameter, which is always set to false.
>>
>> Also, I don't see any calls to commit in the Tomcat logs (whereas normally 
>> when I make a commit call I do).
>>
>> This suggests that Solr is doing it automatically, but the extract handler 
>> doesn't seem to be the problem:
>>  > class="org.apache.solr.handler.extraction.ExtractingRequestHandler" 
>> startup="lazy">
>>    
>>      ignored_
>>      fileData
>>    
>>  
>>
>>
>> There is no external config file specified, and I don't see anything about 
>> commits here.
>>
>> I've tried setting up more detailed indexer logging but haven't been able to 
>> get it to work:
>> true
>>
>> I tried relative and absolute paths, but no dice so far.
>>
>> Any other ideas?
>>
>> -Gio.
>>
>> -Original Message-
>> From: ysee...@gmail.com [mail

Different sort behavior on same code

2009-10-06 Thread wojtekpia

Hi,

I'm running Solr version 1.3.0.2009.07.08.08.05.45 in 2 environments. I have
a field defined as:



The two environments have different data, but both have single and multi
valued entries for myDate.

On one environment sorting by myDate works (sort seems to be by the 'last'
value if multi valued).

On the other environment I get: 
HTTP Status 500 - there are more terms than documents in field "myDate", but
it's impossible to sort on tokenized fields java.lang.RuntimeException:
there are more terms than documents in field 

I've read that I shouldn't sort by multi-valued fields, so my solution will
be to add a single-valued date field for sorting. But I don't understand why
my two environments behave differently, and it doesn't seem like the error
message makes sense (are date fields tokenized?). Any thoughts?

Thanks,

Wojtek
-- 
View this message in context: 
http://www.nabble.com/Different-sort-behavior-on-same-code-tp25774769p25774769.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Timeouts

2009-10-06 Thread Lance Norskog
Is this Java 1.5? There are known threading bugs in 1.5 that were
fixed in Java 1.6. Also, there was one short series of 1.6 releases
that wrote bogus Lucene index files.

So, make sure you use the latest Java 1.6 release.

Also, I hope this is a local disk. Some shops try running over NFS or
Windows file sharing and this often does not work well.

Lance



On 10/6/09, Giovanni Fernandez-Kincade  wrote:
> Is it possible that deletions are triggering these commits? Some of the
> documents that I'm making indexing requests for already exist in the index,
> so they would result in deletions. I tried messing with some of these
> parameters but I'm still running into the same problem:
>
> 
>   
>   false
>   
>   100
>   
> 
>
> This is happening like every 30-40minutes and it's really hampering the
> indexing progress...
>
>
> -Original Message-
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Monday, October 05, 2009 2:11 PM
> To: solr-user@lucene.apache.org; yo...@lucidimagination.com
> Subject: RE: Solr Timeouts
>
> I just grabbed another stack trace for a thread that has been similarly
> blocking for over an hour. Notice that there is no Commit in this one:
>
> http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05
> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
> org.apache.lucene.index.SegmentTermEnum.next()
> org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
> org.apache.lucene.index.TermInfosReader.get(Term, boolean)
> org.apache.lucene.index.TermInfosReader.get(Term)
> org.apache.lucene.index.SegmentTermDocs.seek(Term)
> org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
> org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
> org.apache.lucene.index.IndexWriter.applyDeletes()
> org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
> org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
> org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer)
> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document)
> org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler,
> AddUpdateCommand)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
> SolrQueryResponse, ContentStream)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
> SolrQueryResponse)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest,
> SolrQueryResponse)
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
> SolrQueryResponse)
> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest,
> SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest,
> SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest,
> ServletResponse, FilterChain)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
> ServletResponse)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest,
> ServletResponse)
> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
> Object[])
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket,
> TcpConnection, Object[])
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
> java.lang.Thread.run()
>
>
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Monday, October 05, 2009 1:18 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Timeouts
>
> OK... next step is to verify that SolrCell doesn't have a bug that
> causes it to commit.
> I'll try and verify today unless someone else beats me to it.
>
> -Yonik
> http://www.lucidimagination.com
>
> On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade
>  wrote:
>> I'm fairly certain that all of the indexing jobs are calling SOLR with
>> commit=false. They all construct 

Re: Importing CSV file slow/crashes

2009-10-06 Thread Yonik Seeley
Is it possible to narrow down what fields/field-types are causing the problems?
Or perhaps profile and see what's taking up time compared to the older version?

-Yonik
http://www.lucidimagination.com



On Tue, Oct 6, 2009 at 1:48 PM, Nasseam Elkarra  wrote:
> Hello Erick,
>
> Sorry about that. I'm using the CSV update handler. Uploading a local CSV
> using the stream.file parameter. There are 94 fields and 36 copyFields.
>
> Thank you,
> Nasseam
>
> On Oct 6, 2009, at 10:09 AM, Erick Erickson wrote:
>
>> Well, without some better idea of *how* you're doing the import, it's a
>> little hard to say anything meaningful (hint, hint).
>> Best
>> Erick
>>
>> On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra 
>> wrote:
>>
>>> Hello all,
>>>
>>> I had a dev build of 1.4 from 5/1/2009 and importing a 20K row took less
>>> than a minute. Updating to the latest as of yesterday, the import is
>>> really
>>> slow and I had to cancel it after a half hour. This prevented me from
>>> upgrading a few months ago as well.
>>>
>>> Any ideas as to the cause of this?
>>>
>>> Thank you,
>>>
>>> Nasseam Elkarra
>>> http://bodukai.com/boutique/
>>> The fastest possible shopping experience.
>>>
>>>
>
>


Geo Coding Service

2009-10-06 Thread ram_sj

Hi,

Can someone suggest a good geo-coding service or software for commercial
use? I want to find geocodes for a large collection of addresses. I'm looking
for a good long-term service. 

Thanks
Ram
-- 
View this message in context: 
http://www.nabble.com/Geo-Coding-Service-tp25774277p25774277.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: TermsComponent or auto-suggest with filter

2009-10-06 Thread Jay Hill
Have a look at a blog I posted on how to use EdgeNGrams to build an
auto-suggest tool:
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

You could easily add filter queries to this approach. For example, the
query used in the blog could add filter queries like this:

http://localhost:8983/solr/select/?q=user_query:"i"&wt=json&fl=user_query&indent=on&echoParams=none&rows=10&sort=count
desc&fq=yourField:yourQuery&fq=anotherField:anotherQuery

-Jay
http://www.lucidimagination.com




On Tue, Oct 6, 2009 at 4:40 AM, R. Tan  wrote:

> Hello,
> What's the best way to get auto-suggested terms/keywords that are filtered
> by one or more fields? TermsComponent should have been the solution, but
> filters are not supported.
>
> Thanks,
> Rihaed
>


Re: Importing CSV file slow/crashes

2009-10-06 Thread Nasseam Elkarra

Hello Erick,

Sorry about that. I'm using the CSV update handler. Uploading a local  
CSV using the stream.file parameter. There are 94 fields and 36  
copyFields.


Thank you,
Nasseam

On Oct 6, 2009, at 10:09 AM, Erick Erickson wrote:

Well, without some better idea of *how* you're doing the import,  
it's a

little hard to say anything meaningful (hint, hint).
Best
Erick

On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra  
 wrote:



Hello all,

I had a dev build of 1.4 from 5/1/2009 and importing a 20K row took  
less
than a minute. Updating to the latest as of yesterday, the import  
is really

slow and I had to cancel it after a half hour. This prevented me from
upgrading a few months ago as well.

Any ideas as to the cause of this?

Thank you,

Nasseam Elkarra
http://bodukai.com/boutique/
The fastest possible shopping experience.






Re: Solr Trunk Heap Space Issues

2009-10-06 Thread Mark Miller
Okay - I'm sorry - serves me right for working sick.

Now that I have put on my glasses and correctly tagged my two eclipse tests:

It still appears that trunk likes to use more RAM.

I switched both tests to one million iterations and watched the heap.

The test from the build around may 5th (I promise :) ) regularly GC's
down to about 70-80MB after a fair time
of running. It doesn't appear to climb - keeps GC'ing back to 70-80
(after starting out by GC'ing down to 40 for a bit).

The test from trunk, after a fair time of running, keeps GC'ing down to
about 120-150MB - 150 at the end, slowly working its
way up from 90-110 at the beginning.

Don't know what that means yet - but it appears trunk likes to use a bit
more RAM while indexing. Odd that it's so much more because these docs
are tiny:

String[] fields = {"text","simple"
,"text","test"
,"text","how now brown cow"
,"text","what's that?"
,"text","radical!"
,"text","what's all this about, anyway?"
,"text","just how fast is this text indexing?"
};

Mark Miller wrote:
> Okay, I juggled the tests in eclipse and flipped the results. So they
> make sense.
>
> Sorry - goose chase on this one.
>
> Yonik Seeley wrote:
>   
>> I don't see this with trunk... I just tried TestIndexingPerformance
>> with 1M docs, and it seemed to work fine.
>> Memory use stabilized at 40MB.
>> Most memory use was for indexing (not analysis).
>> char[] topped out at 4.5MB
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>>
>> On Tue, Oct 6, 2009 at 12:31 PM, Mark Miller  wrote:
>>   
>> 
>>> Yeah - I was wondering about that ... not sure how these guys are
>>> stacking up ...
>>>
>>> Yonik Seeley wrote:
>>> 
>>>   
 TestIndexingPerformance?
 What the heck... that's not even multi-threaded!

 -Yonik
 http://www.lucidimagination.com



 On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller  wrote:

   
 
> Darnit - didn't finish that email. This is after running your old short
> doc perf test for 10,000 iterations. You see the same thing with 1000
> iterations but much less pronounced eg gettin' worse with more iterations.
>
> Mark Miller wrote:
>
> 
>   
>> A little before and after. The before is around may 5th'is - the after
>> is trunk.
>>
>> http://myhardshadow.com/memanalysis/before.png
>> http://myhardshadow.com/memanalysis/after.png
>>
>> Mark Miller wrote:
>>
>>
>>   
>> 
>>> Took a peak at the checkout around the time he says he's using.
>>>
>>> CharTokenizer appears to be holding onto much large char[] arrays now
>>> than before. Same with snowball.Among - used to be almost nothing, now
>>> its largio.
>>>
>>> The new TokenStream stuff appears to be clinging. Needs to find some
>>> inner peace.
>>> 
>>>   
>
>
>   


-- 
- Mark

http://www.lucidimagination.com





Re: Importing CSV file slow/crashes

2009-10-06 Thread Erick Erickson
Well, without some better idea of *how* you're doing the import, it's a
little hard to say anything meaningful (hint, hint).
Best
Erick

On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra  wrote:

> Hello all,
>
> I had a dev build of 1.4 from 5/1/2009 and importing a 20K row took less
> than a minute. Updating to the latest as of yesterday, the import is really
> slow and I had to cancel it after a half hour. This prevented me from
> upgrading a few months ago as well.
>
> Any ideas as to the cause of this?
>
> Thank you,
>
> Nasseam Elkarra
> http://bodukai.com/boutique/
> The fastest possible shopping experience.
>
>


Re: Solr Trunk Heap Space Issues

2009-10-06 Thread Mark Miller
Okay, I juggled the tests in eclipse and flipped the results. So they
make sense.

Sorry - goose chase on this one.

Yonik Seeley wrote:
> I don't see this with trunk... I just tried TestIndexingPerformance
> with 1M docs, and it seemed to work fine.
> Memory use stabilized at 40MB.
> Most memory use was for indexing (not analysis).
> char[] topped out at 4.5MB
>
> -Yonik
> http://www.lucidimagination.com
>
>
> On Tue, Oct 6, 2009 at 12:31 PM, Mark Miller  wrote:
>   
>> Yeah - I was wondering about that ... not sure how these guys are
>> stacking up ...
>>
>> Yonik Seeley wrote:
>> 
>>> TestIndexingPerformance?
>>> What the heck... that's not even multi-threaded!
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>>
>>>
>>> On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller  wrote:
>>>
>>>   
 Darnit - didn't finish that email. This is after running your old short
 doc perf test for 10,000 iterations. You see the same thing with 1000
 iterations but much less pronounced eg gettin' worse with more iterations.

 Mark Miller wrote:

 
> A little before and after. The before is around may 5th'is - the after
> is trunk.
>
> http://myhardshadow.com/memanalysis/before.png
> http://myhardshadow.com/memanalysis/after.png
>
> Mark Miller wrote:
>
>
>   
>> Took a peak at the checkout around the time he says he's using.
>>
>> CharTokenizer appears to be holding onto much large char[] arrays now
>> than before. Same with snowball.Among - used to be almost nothing, now
>> its largio.
>>
>> The new TokenStream stuff appears to be clinging. Needs to find some
>> inner peace.
>> 


-- 
- Mark

http://www.lucidimagination.com





Re: Solr Trunk Heap Space Issues

2009-10-06 Thread Yonik Seeley
I don't see this with trunk... I just tried TestIndexingPerformance
with 1M docs, and it seemed to work fine.
Memory use stabilized at 40MB.
Most memory use was for indexing (not analysis).
char[] topped out at 4.5MB

-Yonik
http://www.lucidimagination.com


On Tue, Oct 6, 2009 at 12:31 PM, Mark Miller  wrote:
> Yeah - I was wondering about that ... not sure how these guys are
> stacking up ...
>
> Yonik Seeley wrote:
>> TestIndexingPerformance?
>> What the heck... that's not even multi-threaded!
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>>
>>
>> On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller  wrote:
>>
>>> Darnit - didn't finish that email. This is after running your old short
>>> doc perf test for 10,000 iterations. You see the same thing with 1000
>>> iterations but much less pronounced eg gettin' worse with more iterations.
>>>
>>> Mark Miller wrote:
>>>
 A little before and after. The before is around may 5th'is - the after
 is trunk.

 http://myhardshadow.com/memanalysis/before.png
 http://myhardshadow.com/memanalysis/after.png

 Mark Miller wrote:


> Took a peak at the checkout around the time he says he's using.
>
> CharTokenizer appears to be holding onto much large char[] arrays now
> than before. Same with snowball.Among - used to be almost nothing, now
> its largio.
>
> The new TokenStream stuff appears to be clinging. Needs to find some
> inner peace.


De-basing / re-basing docIDs, or how to effectively pass calculated values from a Scorer or Filter up to (Solr's) QueryComponent.process

2009-10-06 Thread Aaron McKee

(Posted here, per Yonik's suggestion)

In the code I'm working with, I generate a cache of calculated values as 
a by-product within a Filter.getDocidSet implementation (and within a 
Query-ized version of the filter and its Scorer method) . These values 
are keyed off the IndexReader's docID values, since that's all that's 
accessible at that level. Ultimately, however, I need to be able to 
access these values much higher up in the stack (Solr's 
QueryComponent.process method), so that I can inject the dynamic values 
into the response as a fake field. The IDs available here, however, are 
for the entire index and not just relative to the current IndexReader. 
I'm still fairly new to Lucene and I've been scratching my head a bit 
trying to find a reliable way to map these values into the same space, 
without having to hack up too many base classes. I noticed that there 
was a related discussion at:


http://issues.apache.org/jira/browse/LUCENE-1821?focusedCommentId=12745041#action_12745041 



... but also a bit of disagreement on the suggested strategies. Ideally, 
I'm also hoping there's a strategy that won't require me to hack up too 
much of the core product; subclassing IndexSearcher in the way suggested 
would basically require me to change all of the various SearchComponents 
I use in Solr, and that sounds like it'd end up a real maintenance 
nightmare. I was looking at the Collector class as possible solution, 
since it has knowledge of the docbase, but it looks like I'd then need 
to change every derived collector that the code ultimately uses and, 
including the various anonymous Collectors in Solr, that also looks like 
it'd be a fairly ghoulish solution. I suppose I'm being wishful, or 
lazy, but is there a reasonable and reliable way to do this, without 
having to fork the core code? If not, any suggestion on the best 
strategy to accomplish this, without adding too much overhead every time 
I wanted to up-rev the core Lucene and/or Solr code to the latest version?


Thanks a ton,
Aaron



Re: HighLithing exact phrases with solr

2009-10-06 Thread Koji Sekiguchi

Please try hl.usePhraseHighlighter=true parameter.
(It should be true by default if you use the latest nightly, but I think 
you don't)
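
For example (URL, handler and field names here are only illustrative, and the
quoted phrase would need URL-encoding in a real request):

http://localhost:8983/solr/select?q=content:"quick brown fox"&hl=true&hl.fl=content&hl.usePhraseHighlighter=true

With the parameter enabled, the highlighter only marks positions that actually
satisfy the phrase query, rather than every occurrence of the individual terms.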


Koji

Antonio Calò wrote:

Hi Guys

I'm going crazy with the highlighting in Solr. The problem is the following:
when I submit an exact phrase query, I get the related results and the
related snippets with highlighting. But I've noticed that the *single terms of
the phrase are highlighted too*. Here is an example:

If I start a search for "quick brown fox", I obtain the correct result with
the doc which contains the phrase, but the snippets come back like this:


 


The quick brown fox jump over the lazy dog. The fox is a
nice animal.

 
  



Also, with some documents, only single terms are highlighted instead of the
exact sentence, even if the exact phrase is contained in the document, i.e.:

 


The fox is a nice animal.

 
  



My understanding of highlighting is that if I search for an exact phrase, only
the exact phrase should be highlighted.

Here an extract of my solrconfig.xml & schema.xml

solrconfig.xml:


   
   
   

 500

   

   
   

  
  700
  
  0.5
  
  [-\w ,/\n\"']{20,200}

  true

  true

   

   
   

 
 

   


schema.xml:



 


  














Maybe I'm missing something, or my understanding of the highlighting feature
is not correct. Any Idea?

As always, thanks for your support!

Regards, Antonio

  




RE: Solr Timeouts

2009-10-06 Thread Giovanni Fernandez-Kincade
Is it possible that deletions are triggering these commits? Some of the 
documents that I'm making indexing requests for already exist in the index, so 
they would result in deletions. I tried messing with some of these parameters 
but I'm still running into the same problem:


  
  false
  
  100
  


This is happening like every 30-40 minutes and it's really hampering the 
indexing progress...


-Original Message-
From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
Sent: Monday, October 05, 2009 2:11 PM
To: solr-user@lucene.apache.org; yo...@lucidimagination.com
Subject: RE: Solr Timeouts

I just grabbed another stack trace for a thread that has been similarly 
blocking for over an hour. Notice that there is no Commit in this one:

http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05
org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
org.apache.lucene.index.SegmentTermEnum.next()
org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
org.apache.lucene.index.TermInfosReader.get(Term, boolean)
org.apache.lucene.index.TermInfosReader.get(Term)
org.apache.lucene.index.SegmentTermDocs.seek(Term)
org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
org.apache.lucene.index.IndexWriter.applyDeletes()
org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer)
org.apache.lucene.index.IndexWriter.updateDocument(Term, Document)
org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand)
org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler,
 AddUpdateCommand)
org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler)
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
 SolrQueryResponse, ContentStream)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
ServletResponse, FilterChain)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
 ServletResponse)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
ServletResponse)
org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
 Object[])
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, 
Object[])
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
java.lang.Thread.run()


-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Monday, October 05, 2009 1:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Timeouts

OK... next step is to verify that SolrCell doesn't have a bug that
causes it to commit.
I'll try and verify today unless someone else beats me to it.

-Yonik
http://www.lucidimagination.com

On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade
 wrote:
> I'm fairly certain that all of the indexing jobs are calling SOLR with 
> commit=false. They all construct the indexing URLs using a CLR function I 
> wrote, which takes in a Commit parameter, which is always set to false.
>
> Also, I don't see any calls to commit in the Tomcat logs (whereas normally 
> when I make a commit call I do).
>
> This suggests that Solr is doing it automatically, but the extract handler 
> doesn't seem to be the problem:
>   class="org.apache.solr.handler.extraction.ExtractingRequestHandler" 
> startup="lazy">
>
>  ignored_
>  fileData
>
>  
>
>
> There is no external config file specified, and I don't see anything about 
> co

Re: Solr Trunk Heap Space Issues

2009-10-06 Thread Mark Miller
Yeah - I was wondering about that ... not sure how these guys are
stacking up ...

Yonik Seeley wrote:
> TestIndexingPerformance?
> What the heck... that's not even multi-threaded!
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
> On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller  wrote:
>   
>> Darnit - didn't finish that email. This is after running your old short
>> doc perf test for 10,000 iterations. You see the same thing with 1000
>> iterations but much less pronounced eg gettin' worse with more iterations.
>>
>> Mark Miller wrote:
>> 
>>> A little before and after. The before is around may 5th'is - the after
>>> is trunk.
>>>
>>> http://myhardshadow.com/memanalysis/before.png
>>> http://myhardshadow.com/memanalysis/after.png
>>>
>>> Mark Miller wrote:
>>>
>>>   
 Took a peak at the checkout around the time he says he's using.

 CharTokenizer appears to be holding onto much large char[] arrays now
 than before. Same with snowball.Among - used to be almost nothing, now
 its largio.

 The new TokenStream stuff appears to be clinging. Needs to find some
 inner peace.

 Yonik Seeley wrote:


 
> On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn  wrote:
>
>
>
>   
>> Ok we have done some more testing on this issue.  When I only have the 1
>> core the reindex completes fine.  However, when I added a second core 
>> with
>> no documents it runs out of heap again.  This time the heap was 322Mb of
>> LRUCache.  The 1 query that warms returns exactly 2 documents so I have 
>> no
>> idea where the LRUCache is getting its information or what is even in 
>> there.
>>
>>
>>
>> 
> I guess the obvious thing to check would be the custom search component.
> Does it access documents?  I don't see how else the document cache
> could self populate with so many entries (assuming it is the document
> cache again).
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
>
>
>
>
>   
>> --
>> Jeff Newburn
>> Software Engineer, Zappos.com
>> jnewb...@zappos.com - 702-943-7562
>>
>>
>>
>>
>>
>> 
>>> From: Yonik Seeley 
>>> Reply-To: 
>>> Date: Mon, 5 Oct 2009 13:32:32 -0400
>>> To: 
>>> Subject: Re: Solr Trunk Heap Space Issues
>>>
>>> On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn  
>>> wrote:
>>>
>>>
>>>
>>>   
 Ok I have eliminated all queries for warming and am still getting the 
 heap
 space dump.  Any ideas at this point what could be wrong?  This seems 
 like a
 huge increase in memory to go from indexing without issues to not 
 being able
 to even with warming off.



 
>>> Do you have any custom Analyzers, Tokenizers, TokenFilters?
>>> Another change is that token streams are reused by caching in a
>>> thread-local, so every thread in your server could potentially have a
>>> copy of an analysis chain (token stream) per field that you have used.
>>>  This normally shouldn't be an issue since these will be small.  Also,
>>> how many unique fields do you have?
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>>
>>>
>>>
>>>
>>>
>>>   
 Jeff Newburn
 Software Engineer, Zappos.com
 jnewb...@zappos.com - 702-943-7562





 
> From: Jeff Newburn 
> Reply-To: 
> Date: Thu, 01 Oct 2009 08:41:18 -0700
> To: "solr-user@lucene.apache.org" 
> Subject: Solr Trunk Heap Space Issues
>
> I am trying to update to the newest version of solr from trunk as of 
> May
> 5th.  I updated and compiled from trunk as of yesterday (09/30/2009). 
>  When
> I try to do a full import I am receiving a GC heap error after 
> changing
> nothing in the configuration files.  Why would this happen in the most
> recent versions but not in the version from a few months ago.  The 
> stack
> trace is below.
>
> Oct 1, 2009 8:34:32 AM 
> org.apache.solr.update.processor.LogUpdateProcessor
> finish
> INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, 
> 167353,
> ...(83 more)]} 0 35991
> Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log
> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
> at java.util.Arrays.copyOfRange(Arrays.java:3209)
> at java.lang.String.<init>(String.java:215)
> at 
> com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384)
> at 
> com.ctc.wstx.s

Re: Solr Trunk Heap Space Issues

2009-10-06 Thread Yonik Seeley
TestIndexingPerformance?
What the heck... that's not even multi-threaded!

-Yonik
http://www.lucidimagination.com



On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller  wrote:
> Darnit - didn't finish that email. This is after running your old short
> doc perf test for 10,000 iterations. You see the same thing with 1000
> iterations but much less pronounced eg gettin' worse with more iterations.
>
> Mark Miller wrote:
>> A little before and after. The before is around may 5th'is - the after
>> is trunk.
>>
>> http://myhardshadow.com/memanalysis/before.png
>> http://myhardshadow.com/memanalysis/after.png
>>
>> Mark Miller wrote:
>>
>>> Took a peak at the checkout around the time he says he's using.
>>>
>>> CharTokenizer appears to be holding onto much large char[] arrays now
>>> than before. Same with snowball.Among - used to be almost nothing, now
>>> its largio.
>>>
>>> The new TokenStream stuff appears to be clinging. Needs to find some
>>> inner peace.
>>>
>>> Yonik Seeley wrote:
>>>
>>>
 On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn  wrote:



> Ok we have done some more testing on this issue.  When I only have the 1
> core the reindex completes fine.  However, when I added a second core with
> no documents it runs out of heap again.  This time the heap was 322Mb of
> LRUCache.  The 1 query that warms returns exactly 2 documents so I have no
> idea where the LRUCache is getting its information or what is even in 
> there.
>
>
>
 I guess the obvious thing to check would be the custom search component.
 Does it access documents?  I don't see how else the document cache
 could self populate with so many entries (assuming it is the document
 cache again).

 -Yonik
 http://www.lucidimagination.com







> --
> Jeff Newburn
> Software Engineer, Zappos.com
> jnewb...@zappos.com - 702-943-7562
>
>
>
>
>
>> From: Yonik Seeley 
>> Reply-To: 
>> Date: Mon, 5 Oct 2009 13:32:32 -0400
>> To: 
>> Subject: Re: Solr Trunk Heap Space Issues
>>
>> On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn  wrote:
>>
>>
>>
>>> Ok I have eliminated all queries for warming and am still getting the 
>>> heap
>>> space dump.  Any ideas at this point what could be wrong?  This seems 
>>> like a
>>> huge increase in memory to go from indexing without issues to not being 
>>> able
>>> to even with warming off.
>>>
>>>
>>>
>> Do you have any custom Analyzers, Tokenizers, TokenFilters?
>> Another change is that token streams are reused by caching in a
>> thread-local, so every thread in your server could potentially have a
>> copy of an analysis chain (token stream) per field that you have used.
>>  This normally shouldn't be an issue since these will be small.  Also,
>> how many unique fields do you have?
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>>
>>
>>
>>
>>
>>> Jeff Newburn
>>> Software Engineer, Zappos.com
>>> jnewb...@zappos.com - 702-943-7562
>>>
>>>
>>>
>>>
>>>
 From: Jeff Newburn 
 Reply-To: 
 Date: Thu, 01 Oct 2009 08:41:18 -0700
 To: "solr-user@lucene.apache.org" 
 Subject: Solr Trunk Heap Space Issues

 I am trying to update to the newest version of solr from trunk as of 
 May
 5th.  I updated and compiled from trunk as of yesterday (09/30/2009).  
 When
 I try to do a full import I am receiving a GC heap error after changing
 nothing in the configuration files.  Why would this happen in the most
 recent versions but not in the version from a few months ago.  The 
 stack
 trace is below.

 Oct 1, 2009 8:34:32 AM 
 org.apache.solr.update.processor.LogUpdateProcessor
 finish
 INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, 
 167353,
 ...(83 more)]} 0 35991
 Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log
 SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
     at java.util.Arrays.copyOfRange(Arrays.java:3209)
     at java.lang.String.<init>(String.java:215)
     at 
 com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384)
     at 
 com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821)
     at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280)
     at 
 org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
     at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt
 reamHandlerBase.java:54)
     at
 org.apache.solr.handler.RequestHan

Re: Solr Trunk Heap Space Issues

2009-10-06 Thread Mark Miller
Darnit - didn't finish that email. This is after running your old short
doc perf test for 10,000 iterations. You see the same thing with 1000
iterations, but much less pronounced - i.e. it gets worse with more iterations.

Mark Miller wrote:
> A little before and after. The before is around may 5th'is - the after
> is trunk.
>
> http://myhardshadow.com/memanalysis/before.png
> http://myhardshadow.com/memanalysis/after.png
>
> Mark Miller wrote:
>   
>> Took a peak at the checkout around the time he says he's using.
>>
>> CharTokenizer appears to be holding onto much large char[] arrays now
>> than before. Same with snowball.Among - used to be almost nothing, now
>> its largio.
>>
>> The new TokenStream stuff appears to be clinging. Needs to find some
>> inner peace.
>>
>> Yonik Seeley wrote:
>>   
>> 
>>> On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn  wrote:
>>>   
>>> 
>>>   
 Ok we have done some more testing on this issue.  When I only have the 1
 core the reindex completes fine.  However, when I added a second core with
 no documents it runs out of heap again.  This time the heap was 322Mb of
 LRUCache.  The 1 query that warms returns exactly 2 documents so I have no
 idea where the LRUCache is getting its information or what is even in 
 there.
 
   
 
>>> I guess the obvious thing to check would be the custom search component.
>>> Does it access documents?  I don't see how else the document cache
>>> could self populate with so many entries (assuming it is the document
>>> cache again).
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>>
>>>
>>>
>>>   
>>> 
>>>   
 --
 Jeff Newburn
 Software Engineer, Zappos.com
 jnewb...@zappos.com - 702-943-7562


 
   
 
> From: Yonik Seeley 
> Reply-To: 
> Date: Mon, 5 Oct 2009 13:32:32 -0400
> To: 
> Subject: Re: Solr Trunk Heap Space Issues
>
> On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn  wrote:
>   
> 
>   
>> Ok I have eliminated all queries for warming and am still getting the 
>> heap
>> space dump.  Any ideas at this point what could be wrong?  This seems 
>> like a
>> huge increase in memory to go from indexing without issues to not being 
>> able
>> to even with warming off.
>> 
>>   
>> 
> Do you have any custom Analyzers, Tokenizers, TokenFilters?
> Another change is that token streams are reused by caching in a
> thread-local, so every thread in your server could potentially have a
> copy of an analysis chain (token stream) per field that you have used.
>  This normally shouldn't be an issue since these will be small.  Also,
> how many unique fields do you have?
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
>   
> 
>   
>> Jeff Newburn
>> Software Engineer, Zappos.com
>> jnewb...@zappos.com - 702-943-7562
>>
>>
>> 
>>   
>> 
>>> From: Jeff Newburn 
>>> Reply-To: 
>>> Date: Thu, 01 Oct 2009 08:41:18 -0700
>>> To: "solr-user@lucene.apache.org" 
>>> Subject: Solr Trunk Heap Space Issues
>>>
>>> I am trying to update to the newest version of solr from trunk as of May
>>> 5th.  I updated and compiled from trunk as of yesterday (09/30/2009).  
>>> When
>>> I try to do a full import I am receiving a GC heap error after changing
>>> nothing in the configuration files.  Why would this happen in the most
>>> recent versions but not in the version from a few months ago.  The stack
>>> trace is below.
>>>
>>> Oct 1, 2009 8:34:32 AM 
>>> org.apache.solr.update.processor.LogUpdateProcessor
>>> finish
>>> INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, 
>>> 167353,
>>> ...(83 more)]} 0 35991
>>> Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log
>>> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> at java.util.Arrays.copyOfRange(Arrays.java:3209)
>>> at java.lang.String.<init>(String.java:215)
>>> at 
>>> com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384)
>>> at 
>>> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821)
>>> at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280)
>>> at 
>>> org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
>>> at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
>>> at
>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt
>>> reamHandlerBase.java:54)
>>> at
>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
>>> java:131)
>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316

Re: Solr Trunk Heap Space Issues

2009-10-06 Thread Mark Miller
A little before and after. The before is from around May 5th-ish - the after
is trunk.

http://myhardshadow.com/memanalysis/before.png
http://myhardshadow.com/memanalysis/after.png

Mark Miller wrote:
> Took a peak at the checkout around the time he says he's using.
>
> CharTokenizer appears to be holding onto much large char[] arrays now
> than before. Same with snowball.Among - used to be almost nothing, now
> its largio.
>
> The new TokenStream stuff appears to be clinging. Needs to find some
> inner peace.
>
> Yonik Seeley wrote:
>   
>> On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn  wrote:
>>   
>> 
>>> Ok we have done some more testing on this issue.  When I only have the 1
>>> core the reindex completes fine.  However, when I added a second core with
>>> no documents it runs out of heap again.  This time the heap was 322Mb of
>>> LRUCache.  The 1 query that warms returns exactly 2 documents so I have no
>>> idea where the LRUCache is getting its information or what is even in there.
>>> 
>>>   
>> I guess the obvious thing to check would be the custom search component.
>> Does it access documents?  I don't see how else the document cache
>> could self populate with so many entries (assuming it is the document
>> cache again).
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>>
>>
>>
>>   
>> 
>>> --
>>> Jeff Newburn
>>> Software Engineer, Zappos.com
>>> jnewb...@zappos.com - 702-943-7562
>>>
>>>
>>> 
>>>   
 From: Yonik Seeley 
 Reply-To: 
 Date: Mon, 5 Oct 2009 13:32:32 -0400
 To: 
 Subject: Re: Solr Trunk Heap Space Issues

 On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn  wrote:
   
 
> Ok I have eliminated all queries for warming and am still getting the heap
> space dump.  Any ideas at this point what could be wrong?  This seems 
> like a
> huge increase in memory to go from indexing without issues to not being 
> able
> to even with warming off.
> 
>   
 Do you have any custom Analyzers, Tokenizers, TokenFilters?
 Another change is that token streams are reused by caching in a
 thread-local, so every thread in your server could potentially have a
 copy of an analysis chain (token stream) per field that you have used.
  This normally shouldn't be an issue since these will be small.  Also,
 how many unique fields do you have?

 -Yonik
 http://www.lucidimagination.com



   
 
> Jeff Newburn
> Software Engineer, Zappos.com
> jnewb...@zappos.com - 702-943-7562
>
>
> 
>   
>> From: Jeff Newburn 
>> Reply-To: 
>> Date: Thu, 01 Oct 2009 08:41:18 -0700
>> To: "solr-user@lucene.apache.org" 
>> Subject: Solr Trunk Heap Space Issues
>>
>> I am trying to update to the newest version of solr from trunk as of May
>> 5th.  I updated and compiled from trunk as of yesterday (09/30/2009).  
>> When
>> I try to do a full import I am receiving a GC heap error after changing
>> nothing in the configuration files.  Why would this happen in the most
>> recent versions but not in the version from a few months ago.  The stack
>> trace is below.
>>
>> Oct 1, 2009 8:34:32 AM 
>> org.apache.solr.update.processor.LogUpdateProcessor
>> finish
>> INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, 
>> 167353,
>> ...(83 more)]} 0 35991
>> Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log
>> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
>> at java.util.Arrays.copyOfRange(Arrays.java:3209)
>> at java.lang.String.<init>(String.java:215)
>> at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384)
>> at 
>> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821)
>> at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280)
>> at 
>> org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
>> at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
>> at
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt
>> reamHandlerBase.java:54)
>> at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
>> java:131)
>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>> at
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3
>> 38)
>> at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
>> 241)
>> at
>> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
>> FilterChain.java:235)
>> at
>> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
>> ain.java:206)
>> at
>> org.apache.catalina.core.StandardWrappe

How to retrieve the index of a string within a field?

2009-10-06 Thread Elaine Li
Hi,

I have a field. The field has a sentence. If the user types in a word
or a phrase, how can I return the index of this word or the index of
the first word of the phrase?
I tried to use &bf=ord..., but it does not work as I expected.

Thanks.

Elaine


Re: Solr Trunk Heap Space Issues

2009-10-06 Thread Mark Miller
Took a peek at the checkout around the time he says he's using.

CharTokenizer appears to be holding onto much larger char[] arrays now
than before. Same with snowball.Among - used to be almost nothing, now
it's largio.

The new TokenStream stuff appears to be clinging. Needs to find some
inner peace.

Yonik Seeley wrote:
> On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn  wrote:
>   
>> Ok we have done some more testing on this issue.  When I only have the 1
>> core the reindex completes fine.  However, when I added a second core with
>> no documents it runs out of heap again.  This time the heap was 322Mb of
>> LRUCache.  The 1 query that warms returns exactly 2 documents so I have no
>> idea where the LRUCache is getting its information or what is even in there.
>> 
>
> I guess the obvious thing to check would be the custom search component.
> Does it access documents?  I don't see how else the document cache
> could self populate with so many entries (assuming it is the document
> cache again).
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
>
>   
>> --
>> Jeff Newburn
>> Software Engineer, Zappos.com
>> jnewb...@zappos.com - 702-943-7562
>>
>>
>> 
>>> From: Yonik Seeley 
>>> Reply-To: 
>>> Date: Mon, 5 Oct 2009 13:32:32 -0400
>>> To: 
>>> Subject: Re: Solr Trunk Heap Space Issues
>>>
>>> On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn  wrote:
>>>   
 Ok I have eliminated all queries for warming and am still getting the heap
 space dump.  Any ideas at this point what could be wrong?  This seems like 
 a
 huge increase in memory to go from indexing without issues to not being 
 able
 to even with warming off.
 
>>> Do you have any custom Analyzers, Tokenizers, TokenFilters?
>>> Another change is that token streams are reused by caching in a
>>> thread-local, so every thread in your server could potentially have a
>>> copy of an analysis chain (token stream) per field that you have used.
>>>  This normally shouldn't be an issue since these will be small.  Also,
>>> how many unique fields do you have?
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>>
>>>
>>>   
 Jeff Newburn
 Software Engineer, Zappos.com
 jnewb...@zappos.com - 702-943-7562


 
> From: Jeff Newburn 
> Reply-To: 
> Date: Thu, 01 Oct 2009 08:41:18 -0700
> To: "solr-user@lucene.apache.org" 
> Subject: Solr Trunk Heap Space Issues
>
> I am trying to update to the newest version of solr from trunk as of May
> 5th.  I updated and compiled from trunk as of yesterday (09/30/2009).  
> When
> I try to do a full import I am receiving a GC heap error after changing
> nothing in the configuration files.  Why would this happen in the most
> recent versions but not in the version from a few months ago.  The stack
> trace is below.
>
> Oct 1, 2009 8:34:32 AM org.apache.solr.update.processor.LogUpdateProcessor
> finish
> INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, 
> 167353,
> ...(83 more)]} 0 35991
> Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log
> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
> at java.util.Arrays.copyOfRange(Arrays.java:3209)
> at java.lang.String.<init>(String.java:215)
> at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384)
> at 
> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821)
> at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280)
> at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
> at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt
> reamHandlerBase.java:54)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
> java:131)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3
> 38)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
> 241)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
> FilterChain.java:235)
> at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
> ain.java:206)
> at
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
> va:233)
> at
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
> va:175)
> at
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128
> )
> at
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102
> )
> at
> org.apache.catalina.core.StandardEngineValve.invoke(S

Problems with DIH XPath flatten

2009-10-06 Thread Adam Foltzer
Hi all,

I'm trying to set up DataImportHandler to index some XML documents available
over web services. The XML includes both content and metadata, so for the
indexable content, I'm trying to just index everything under the content
tag:






The result of this is that the title field gets populated and indexed (there
are no child nodes of /document/kbml/kbq), but content does not get indexed
at all. Since /document/kbml/body has many children, I expected that
flatten="true" would store all of the body text in the field. Instead, it
stores nothing at all. I've tried this with many combinations of
transformers and flatten options, and the result is the same each time.

Here are the relevant field declarations from the schema (the type="text" is
just the one from the example's schema.xml). I have tried combinations here
as well of stored= and multiValued=, with the same result each time.




If it would help troubleshooting, I could send along some sample XML. I
don't want to spam the list with an attachment unless it's necessary, though
:)

Thanks in advance for your help,

Adam Foltzer


Tr : Questions about synonyms and highlighting

2009-10-06 Thread Nourredine K.
Hello,

Even short/partial answers could satisfy me :)


Nourredine.


>Hi,

>Can you please give me some answers for those questions : 
>
>1 - How can I get synonyms found for  a keyword ? 
>  
>I mean i search "foo" and i have in my synonyms.txt file the following tokens 
>: "foo, foobar, fee" (with expand = true)
>My index contains "foo" and "foobar". I want to display a message in a result 
>page, in the header for example, with only the 2 matched tokens and not "fee", 
>like "Results found for foo and foobar". 
>
>2 - Can solR make analysis on an index to extract associations between tokens ?
>
>for example , if "foo" often appears with "fee" in a field, it will associate 
>the 2 tokens.
>
>3 - Is it possible, and if so how can I configure Solr to enable or disable 
>highlighting for tokens with diacritics? 
>
>Settings for "vélo" (all highlighted) ==> the two words "vélo" and 
>"velo" are highlighted
>Settings for "vélo" ==> the first word "vélo" is highlighted but not 
>the second  : "velo"
>
>4 - the same question for highlighting with lemmatisation?
>
>Settings for "manage" (all highlighted) ==> the two wordsmanage and 
>"management" are highlighted
>Settings for "manage" ==> the first word "manage" is highlighted but 
>not the second  : "management"
>
>
>Thanks in advance.
>
>Regards 
>
>Nourredine.

__
Do You Yahoo!?
En finir avec le spam? Yahoo! Mail vous offre la meilleure protection possible 
contre les messages non sollicités 
http://mail.yahoo.fr Yahoo! Mail 

Re: FACET_SORT_INDEX descending?

2009-10-06 Thread Gerald Snyder
Reverse alphabetical ordering.   The option "index" provides 
alphabetical ordering. 

I have a year_facet field, that I would like to display in reverse order 
(most recent years first).  Perhaps there is some other way to 
accomplish this.


Thanks.
--Gerald


Chris Hostetter wrote:

: Is there any value for the "f.my_year_facet.facet.sort"  parameter that will
: return the facet values in descending order?   So far I only see "index" and
: "count" as the choices. 


descending what?

(count is descending order by count)



-Hoss


  


Re: Creating cores using SolrJ

2009-10-06 Thread Noble Paul നോബിള്‍ नोब्ळ्
Yeah, that is missing.
I've just committed a setter/getter for dataDir in the create command.

do this

CoreAdminRequest.Create req = new CoreAdminRequest.Create();
req.setCoreName(name);
req.setInstanceDir(instanceDir);
req.setDataDir(dataDir);
return req.process(solrServer);



2009/10/6 Licinio Fernández Maurelo :
> Hi there,
>
> i want to create cores using SolrJ, but i also want to create then in a
> given datadir. How can i do this? Looking CoreAdminRequest methods i only
> found:
>
>
>   - createCore(name, instanceDir, server)
>   - createCore(name, instanceDir, server, configFile, schemaFile)
>
> None of above methods allow datadir param.
>
> Thx
>
> --
> Lici
>



-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Creating cores using SolrJ

2009-10-06 Thread Licinio Fernández Maurelo
Hi there,

I want to create cores using SolrJ, but I also want to create them in a
given dataDir. How can I do this? Looking at CoreAdminRequest methods, I only
found:


   - createCore(name, instanceDir, server)
   - createCore(name, instanceDir, server, configFile, schemaFile)

None of the above methods allows a dataDir param.

Thx

-- 
Lici


RE: using regular expressions in solr query

2009-10-06 Thread Feak, Todd
Any particular reason for the double quotes in the 2nd and 3rd query example, 
but not the 1st, or is this just an artifact of your email?

-Todd

-Original Message-
From: Rakhi Khatwani [mailto:rkhatw...@gmail.com] 
Sent: Tuesday, October 06, 2009 2:26 AM
To: solr-user@lucene.apache.org
Subject: using regular expressions in solr query

Hi,
  I have an example in which I want to use a regular expression in my
 Solr query:
for example: suppose i wanna search on a sample :
raakhi rajnish ninad goureya sheetal
ritesh rajnish ninad goureya sheetal
where my content field is of type text
when i type in
QUERY:   content:raa*
RESPONSE :   raakhi rajnish ninad goureya sheetal
QUERY: content:"ra*"
RESPONSE: 0 results
Because of this I am facing problems with the next query:
QUERY: content: "r* rajnish"
RESPONSE: 0 results
which should ideally return both results.
Any pointers?
Regards,
Raakhi



Re: search by some functionality

2009-10-06 Thread Sandeep Tagore

Hi Elaine,
You can implement a function query in Solr in two ways:
 1. Using the dismax request handler (with the bf parameter).
 2. Using the standard request handler (with the _val_ field).
I recommend the first option. 
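
A minimal sketch of each (field and function names are only illustrative):

  dismax:    q=ipod&defType=dismax&bf=recip(rord(popularity),1,1000,1000)
  standard:  q=ipod _val_:"recip(rord(popularity),1,1000,1000)"

With dismax, bf adds the function's value to each document's score; with the
standard handler, the _val_ hook embeds the function directly in the query.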

Sandeep


Elaine Li wrote:
> 
> Hi Sandeep,
> 
> I read about this chapter before. It did not mention how to create my
> own customized function.
> Can you point me to some instructions?
> Thanks.
> Elaine
> 
> 

-- 
View this message in context: 
http://www.nabble.com/search-by-some-functionality-tp25721533p25767741.html
Sent from the Solr - User mailing list archive at Nabble.com.



solr optimize - no space left on device

2009-10-06 Thread Phillip Farber
I am attempting to optimize a large shard on solr 1.4 and repeatedly get 
java.io.IOException: No space left on device. The shard, after a final 
commit before optimize, shows a size of about 192GB on a 400GB volume.  
I have successfully optimized 2 other shards that were similarly large 
without this problem on identical hardware boxes.


Before the optimize I see:

% df -B1 .
Filesystem 1B-blocks Used Available Use% Mounted on
/dev/mapper/internal-solr--build--2
 435440427008 205681356800 225335255040 48% 
/l/solrs/build-2


slurm-4:/l/solrs/build-2/data/index % du -B1
205441486848 .

There's a slight discrepancy between the du and df output, which appears to be 
orphaned inodes, but the du says there should be enough space to handle 
the doubling in size during optimization. However, for the second time 
we have run out of space; at that point du and df are wildly different 
and the volume is at 100%:



% df -B1 .

Filesystem   1B-blocks  Used Available Use% Mounted on
/dev/mapper/internal-solr--build--2
435440427008 430985760768  30851072 100% 
/l/solrs/build-2


slurm-4:/l/solrs/build-2/data/index % du -B1
252552298496.

At this point it appears orphaned inodes are consuming space and not 
being freed up. Any clue as to whether this is a Lucene bug, a Solr bug, 
or some other problem? Error traces follow.


Thanks!

Phil

---

Oct 6, 2009 2:12:37 AM 
org.apache.solr.update.processor.LogUpdateProcessor finish

INFO: {} 0 9110523
Oct 6, 2009 2:12:37 AM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: background merge hit exception: 
_ojl:C151080 _169w:C141302 _1j36:C80405 _1j35:C2043 _1j34:C192 into 
_1j37 [optimize]

   at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2737)
   at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2658)
   at 
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:401)
   at 
org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)

   at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:168)
   at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
   at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
   at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)

   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299)
   at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
   at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
   at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
   at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
   at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
   at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
   at 
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548)
   at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
   at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
   at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
   at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
   at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
   at 
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
   at 
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
   at 
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
   at 
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)

   at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: No space left on device
   at java.io.RandomAccessFile.writeBytes(Native Method)
   at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
   at 
org.apache.lucene.store.FSDirectory$FSIndexOutput.flushBuffer(FSDirectory.java:719)
   at 
org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:96)
   at 
org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:85)
   at 
org.apache.lucene.store.BufferedIndexOutput.seek(BufferedIndexOutput.java:124)
   at 
org.apache.lucene.store.FSDirectory$FSIndexOutput.seek(FSDirectory.java:744)
   at 
org.apache.lucene.index.TermInfosWriter.close(TermInfosWriter.java:220)
   at 
org.apache.lucene.index.FormatPostingsFieldsWriter.finish(FormatPostingsFieldsWriter.java:70)
   at 
org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:493)

   at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:140)
   at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWri
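
[Editor's note: one thing worth checking when du and df disagree like this (a
suggestion, not from the thread): files that have been deleted from the directory but
are still held open by the Solr/Tomcat process keep consuming blocks until the process
closes them, and they show up marked as deleted:

  # open files on that filesystem that have already been unlinked
  lsof -nP /l/solrs/build-2 | grep -i deleted

  # or for the Solr process specifically (replace <pid> with the actual pid)
  ls -l /proc/<pid>/fd | grep -i deleted
]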

TermsComponent or auto-suggest with filter

2009-10-06 Thread R. Tan
Hello,
What's the best way to get auto-suggested terms/keywords that are filtered by
one or more fields? TermsComponent would have been the solution, but filters
are not supported.

Thanks,
Rihaed
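
[Editor's note: one common workaround, sketched with made-up field and filter names.
Faceting honours fq filters, so facet.prefix can serve as a filtered auto-suggest
source:

  q=*:*&rows=0&fq=category:phones&facet=true&facet.field=suggest_field
    &facet.prefix=ip&facet.limit=10&facet.mincount=1

The facet counts on suggest_field are then computed only over documents matching the
fq, which is exactly what TermsComponent cannot do.]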


ISOLatin1AccentFilter before or after Snowball?

2009-10-06 Thread Chantal Ackermann

Hi all,

from reading through previous posts on that subject, it seems like the
accent filter has to come before the snowball filter.

I'd just like to make sure this is so. If it is the case, I'm wondering
whether snowball filters for, e.g., French process accented language
correctly at all, or whether they remove accents anyway... Or whether
accents should be removed whenever making use of snowball filters.


And also: it really is meant to take UTF-8 as input, even though it is
named ISOLatin1AccentFilter, isn't it?


Thanks in advance!
Chantal


--
Chantal Ackermann
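
[Editor's note: for reference, a schema.xml sketch with the accent filter placed
before the stemmer (an illustration, not Chantal's actual field type):

  <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    </analyzer>
  </fieldType>

As to the second question: the filter operates on already-decoded character streams,
so UTF-8 input is fine; the "ISOLatin1" in the name refers to the range of accented
characters it folds, not to the input encoding.]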


Re: Re : Re : wildcard searches

2009-10-06 Thread Avlesh Singh
You are right, Angel. The problem would still persist.
Why don't you consider putting the original data in a separate field? While
querying, you can query on both fields, the analyzed and the original one.
Wildcard queries will not give you any results from the analyzed field but
would match the data in your original field.

Works?

Cheers
Avlesh

On Tue, Oct 6, 2009 at 2:27 PM, Angel Ice  wrote:

> Ah yes, got it.
> But i'm not sure this will solve my problem.
> Because, I'm aloso using the IsoLatin1 filter, that remove the accentued
> characters.
> So I will have the same problem with accentued characters. Cause the
> original token is not stored with this filter.
>
> Laurent
>
>
>
>
>
>
> 
> De : Avlesh Singh 
> À : solr-user@lucene.apache.org
> Envoyé le : Mardi, 6 Octobre 2009, 10h41mn 56s
> Objet : Re: Re : wildcard searches
>
> You are processing your tokens in the filter that you wrote. I am assuming
> it is the first filter being applied and removes the character 'h' from
> tokens. When you are doing that, you can preserve the original token in the
> same field as well. Because as of now, you are simply removing the
> character. Subsequent filters don't even know that there was an 'h'
> character in the original token.
>
> Since wild card queries are not analyzed, the 'h' character in the query
> "hésita*" does NOT get removed during query time. This means that unless
> the
> original token was preserved in the field it wouldn't find any matches.
>
> This helps?
>
> Cheers
> Avlesh
>
> On Tue, Oct 6, 2009 at 2:02 PM, Angel Ice  wrote:
>
> > Hi.
> >
> > Thanks for your answers Christian and Avlesh.
> >
> > But I don't understant what you mean by :
> > "If you want to enable wildcard queries, preserving the original token
> > (while processing each token in your filter) might work."
> >
> > Could you explain this point please ?
> >
> > Laurent
> >
> >
> >
> >
> >
> > 
> > De : Avlesh Singh 
> > À : solr-user@lucene.apache.org
> > Envoyé le : Lundi, 5 Octobre 2009, 20h30mn 54s
> > Objet : Re: wildcard searches
> >
> > Zambrano is right, Laurent. The analyzers for a field are not invoked for
> > wildcard queries. You custom filter is not even getting executed at
> > query-time.
> > If you want to enable wildcard queries, preserving the original token
> > (while
> > processing each token in your filter) might work.
> >
> > Cheers
> > Avlesh
> >
> > On Mon, Oct 5, 2009 at 10:39 PM, Angel Ice  wrote:
> >
> > > Hi everyone,
> > >
> > > I have a little question regarding the search engine when a wildcard
> > > character is used in the query.
> > > Let's take the following example :
> > >
> > > - I have sent in indexation the word Hésitation (with an accent on the
> > "e")
> > > - The filters applied to the field that will handle this word, result
> in
> > > the indexation of "esit" (the mute H is suppressed (home made filter),
> > the
> > > accent too (IsoLatin1Filter), and the SnowballPorterFilter suppress the
> > > "ation".
> > >
> > > When i search for "hesitation", "esitation", "ésitation" etc ... all is
> > OK,
> > > the document is returned.
> > > But as soon as I use a wildcard, like "hésita*", the document is not
> > > returned. In fact, I have to put the wildcard in a manner that match
> the
> > > indexed term exactly (example "esi*")
> > >
> > > Does the search engine applies the filters to the word that prefix the
> > > wildcard ? Or does it use this prefix verbatim ?
> > >
> > > Thanks for you help.
> > >
> > > Laurent
> > >
> > >
> > >
> > >
> >
> >
> >
> >
>
>
>
>
>
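
[Editor's note: a schema-level sketch of the suggestion above, with invented field and
type names. A lowercased-but-otherwise-verbatim copy of the text is kept in a second
field via copyField, and wildcard queries are sent against both fields:

  <fieldType name="text_verbatim" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="body" type="text_fr" indexed="true" stored="true"/>
  <field name="body_verbatim" type="text_verbatim" indexed="true" stored="false"/>
  <copyField source="body" dest="body_verbatim"/>

A query such as q=body:hésita* OR body_verbatim:hésita* then matches via the verbatim
field even though the analyzed field indexed only "esit". The lowercase filter is kept
so that lowercase wildcard input still matches capitalized source text.]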


using regular expressions in solr query

2009-10-06 Thread Rakhi Khatwani
Hi,
  I have an example in which I want to use a regular expression in my
Solr query.
For example, suppose I want to search over these samples:
raakhi rajnish ninad goureya sheetal
ritesh rajnish ninad goureya sheetal
where my content field is of type text.
When I type in
QUERY:   content:raa*
RESPONSE :   raakhi rajnish ninad goureya sheetal
QUERY: content:"ra*"
RESPONSE: 0 results
Because of this I am facing problems with the next query:
QUERY: content: "r* rajnish"
RESPONSE: 0 results
which should ideally return both results.
Any pointers?
Regards,
Raakhi
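
[Editor's note: inside double quotes the query parser builds a phrase query, and the
asterisk loses its wildcard meaning there, which is why the quoted forms return
nothing. If adjacency of the two words is not strictly required, a workaround is to
combine a prefix clause with a normal term clause, e.g.:

  content:r* AND content:rajnish

A true "wildcard inside a phrase" is not something the standard query parser
supports.]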


Re: Need "OR" in DisMax Query

2009-10-06 Thread Ingo Renner


On 05.10.2009 at 20:36, David Giffin wrote:

Hi David,


Maybe I'm missing something, but I can't seem to get the dismax
request handler to perform an OR query. It appears that OR is removed
by the stop words.


It's not the stop words; DisMax simply doesn't do any boolean
operations. The only thing you can do is use +searchWord and
-searchWord, or switch to the standard request handler.



best
Ingo

--
Ingo Renner
TYPO3 Core Developer, Release Manager TYPO3 4.2

Apache Solr for TYPO3: http://www.typo3-solr.com
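
[Editor's note: in addition to +/- and the standard handler, DisMax's mm (minimum
should match) parameter can emulate OR semantics across the entered terms; mm=1 means
any single term is enough to match, e.g. (field names hypothetical):

  q=ipod nano&defType=dismax&qf=title description&mm=1
]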



Re : Re : wildcard searches

2009-10-06 Thread Angel Ice
Ah yes, got it.
But I'm not sure this will solve my problem, because I'm also using the IsoLatin1 
filter, which removes the accented characters.
So I will have the same problem with accented characters, since the original 
token is not stored with this filter.

Laurent







From: Avlesh Singh 
To: solr-user@lucene.apache.org
Sent: Tuesday, 6 October 2009, 10:41:56
Subject: Re: Re : wildcard searches

You are processing your tokens in the filter that you wrote. I am assuming
it is the first filter being applied and removes the character 'h' from
tokens. When you are doing that, you can preserve the original token in the
same field as well. Because as of now, you are simply removing the
character. Subsequent filters don't even know that there was an 'h'
character in the original token.

Since wild card queries are not analyzed, the 'h' character in the query
"hésita*" does NOT get removed during query time. This means that unless the
original token was preserved in the field it wouldn't find any matches.

This helps?

Cheers
Avlesh

On Tue, Oct 6, 2009 at 2:02 PM, Angel Ice  wrote:

> Hi.
>
> Thanks for your answers Christian and Avlesh.
>
> But I don't understant what you mean by :
> "If you want to enable wildcard queries, preserving the original token
> (while processing each token in your filter) might work."
>
> Could you explain this point please ?
>
> Laurent
>
>
>
>
>
> 
> De : Avlesh Singh 
> À : solr-user@lucene.apache.org
> Envoyé le : Lundi, 5 Octobre 2009, 20h30mn 54s
> Objet : Re: wildcard searches
>
> Zambrano is right, Laurent. The analyzers for a field are not invoked for
> wildcard queries. You custom filter is not even getting executed at
> query-time.
> If you want to enable wildcard queries, preserving the original token
> (while
> processing each token in your filter) might work.
>
> Cheers
> Avlesh
>
> On Mon, Oct 5, 2009 at 10:39 PM, Angel Ice  wrote:
>
> > Hi everyone,
> >
> > I have a little question regarding the search engine when a wildcard
> > character is used in the query.
> > Let's take the following example :
> >
> > - I have sent in indexation the word Hésitation (with an accent on the
> "e")
> > - The filters applied to the field that will handle this word, result in
> > the indexation of "esit" (the mute H is suppressed (home made filter),
> the
> > accent too (IsoLatin1Filter), and the SnowballPorterFilter suppress the
> > "ation".
> >
> > When i search for "hesitation", "esitation", "ésitation" etc ... all is
> OK,
> > the document is returned.
> > But as soon as I use a wildcard, like "hésita*", the document is not
> > returned. In fact, I have to put the wildcard in a manner that match the
> > indexed term exactly (example "esi*")
> >
> > Does the search engine applies the filters to the word that prefix the
> > wildcard ? Or does it use this prefix verbatim ?
> >
> > Thanks for you help.
> >
> > Laurent
> >
> >
> >
> >
>
>
>
>



  

Re: Date field being null

2009-10-06 Thread Avlesh Singh
>
> I am defining a field:
> 
>
indexed="false" and stored="false"? really? This field is as good as
nothing. What would you use it for?

Can I have a null for such a field?
>
Yes, you can. Moreover, as you have sortMissingLast="true" specified in your
field type definition, documents having null values in this field would
appear at the end for any kind of sorting.

Cheers
Avlesh

On Tue, Oct 6, 2009 at 1:16 PM, Pooja Verlani wrote:

> Hi,
> My fieldtype definition is like:
>  omitNorms="true"/>
>
> I am defining a field:
> 
>
> Can I have a null for such a field? or is there a way I can use it as a
> date
> field only if the value is null. I cant put the field as a string type as I
> have to apply recency sort and some filters for that field.
> Regards,
> Pooja
>
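
[Editor's note: a query-side sketch of the recency-sort case, with a hypothetical
field name publish_date. With sortMissingLast="true", documents that have no value in
the field sort after all documents that do, for both sort directions:

  sort by recency:              q=*:*&sort=publish_date desc
  restrict to the last 30 days: q=*:*&fq=publish_date:[NOW-30DAYS TO NOW]

Documents with a null publish_date are simply excluded by the range filter in the
second example.]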


Re: Re : wildcard searches

2009-10-06 Thread Avlesh Singh
You are processing your tokens in the filter that you wrote. I am assuming
it is the first filter being applied and removes the character 'h' from
tokens. When you are doing that, you can preserve the original token in the
same field as well. Because as of now, you are simply removing the
character. Subsequent filters don't even know that there was an 'h'
character in the original token.

Since wild card queries are not analyzed, the 'h' character in the query
"hésita*" does NOT get removed during query time. This means that unless the
original token was preserved in the field it wouldn't find any matches.

This helps?

Cheers
Avlesh

On Tue, Oct 6, 2009 at 2:02 PM, Angel Ice  wrote:

> Hi.
>
> Thanks for your answers Christian and Avlesh.
>
> But I don't understant what you mean by :
> "If you want to enable wildcard queries, preserving the original token
> (while processing each token in your filter) might work."
>
> Could you explain this point please ?
>
> Laurent
>
>
>
>
>
> 
> De : Avlesh Singh 
> À : solr-user@lucene.apache.org
> Envoyé le : Lundi, 5 Octobre 2009, 20h30mn 54s
> Objet : Re: wildcard searches
>
> Zambrano is right, Laurent. The analyzers for a field are not invoked for
> wildcard queries. You custom filter is not even getting executed at
> query-time.
> If you want to enable wildcard queries, preserving the original token
> (while
> processing each token in your filter) might work.
>
> Cheers
> Avlesh
>
> On Mon, Oct 5, 2009 at 10:39 PM, Angel Ice  wrote:
>
> > Hi everyone,
> >
> > I have a little question regarding the search engine when a wildcard
> > character is used in the query.
> > Let's take the following example :
> >
> > - I have sent in indexation the word Hésitation (with an accent on the
> "e")
> > - The filters applied to the field that will handle this word, result in
> > the indexation of "esit" (the mute H is suppressed (home made filter),
> the
> > accent too (IsoLatin1Filter), and the SnowballPorterFilter suppress the
> > "ation".
> >
> > When i search for "hesitation", "esitation", "ésitation" etc ... all is
> OK,
> > the document is returned.
> > But as soon as I use a wildcard, like "hésita*", the document is not
> > returned. In fact, I have to put the wildcard in a manner that match the
> > indexed term exactly (example "esi*")
> >
> > Does the search engine applies the filters to the word that prefix the
> > wildcard ? Or does it use this prefix verbatim ?
> >
> > Thanks for you help.
> >
> > Laurent
> >
> >
> >
> >
>
>
>
>


Re : wildcard searches

2009-10-06 Thread Angel Ice
Hi.

Thanks for your answers Christian and Avlesh.

But I don't understand what you mean by:
"If you want to enable wildcard queries, preserving the original token (while 
processing each token in your filter) might work."

Could you explain this point please ?

Laurent






From: Avlesh Singh 
To: solr-user@lucene.apache.org
Sent: Monday, 5 October 2009, 20:30:54
Subject: Re: wildcard searches

Zambrano is right, Laurent. The analyzers for a field are not invoked for
wildcard queries. You custom filter is not even getting executed at
query-time.
If you want to enable wildcard queries, preserving the original token (while
processing each token in your filter) might work.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 10:39 PM, Angel Ice  wrote:

> Hi everyone,
>
> I have a little question regarding the search engine when a wildcard
> character is used in the query.
> Let's take the following example :
>
> - I have sent in indexation the word Hésitation (with an accent on the "e")
> - The filters applied to the field that will handle this word, result in
> the indexation of "esit" (the mute H is suppressed (home made filter), the
> accent too (IsoLatin1Filter), and the SnowballPorterFilter suppress the
> "ation".
>
> When i search for "hesitation", "esitation", "ésitation" etc ... all is OK,
> the document is returned.
> But as soon as I use a wildcard, like "hésita*", the document is not
> returned. In fact, I have to put the wildcard in a manner that match the
> indexed term exactly (example "esi*")
>
> Does the search engine applies the filters to the word that prefix the
> wildcard ? Or does it use this prefix verbatim ?
>
> Thanks for you help.
>
> Laurent
>
>
>
>



  

Date field being null

2009-10-06 Thread Pooja Verlani
Hi,
My fieldtype definition is like:


I am defining a field:


Can I have a null for such a field? Or is there a way I can use it as a date
field only if the value is null? I can't put the field as a string type, as I
have to apply recency sort and some filters for that field.
Regards,
Pooja


solr reporting tool adapter

2009-10-06 Thread Rakhi Khatwani
Hi,
I wanted to query Solr and send the output to some reporting tool. Has
anyone done something like that? Moreover, which reporting tool is good?
Any suggestions?

Regards,
Raakhi
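
[Editor's note: a generic approach, sketched with SolrJ 1.4 and hypothetical field
names; nothing here is specific to any particular reporting tool. Pull the fields you
need and write them in a format the tool can import, e.g. CSV (quoting/escaping
omitted for brevity):

  import java.io.FileWriter;
  import java.io.PrintWriter;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;

  public class ExportForReporting {
      public static void main(String[] args) throws Exception {
          CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

          SolrQuery query = new SolrQuery("*:*");
          query.set("fl", "id,name,price");   // hypothetical field names
          query.setRows(1000);                // page through for larger result sets

          QueryResponse response = server.query(query);
          PrintWriter out = new PrintWriter(new FileWriter("report.csv"));
          out.println("id,name,price");
          for (SolrDocument doc : response.getResults()) {
              out.println(doc.getFieldValue("id") + "," +
                          doc.getFieldValue("name") + "," +
                          doc.getFieldValue("price"));
          }
          out.close();
      }
  }
]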