Solr Queries
Hi, I am new to Solr. I have the following queries:

1. Does Solr work in a distributed environment? If yes, how do I configure it?
2. Does Solr have Hadoop support? If yes, how do I set it up with Hadoop/HDFS? (Note: I am familiar with Hadoop.)
3. I have 1 TB of employee information (id, name, address, cell number, personal info). To post (index) this data to the Solr server, do I have to create an XML file with the data and then post it to Solr? Or is there a more optimal way? In the future my data will grow to 10 TB; how can I index that much data? (Creating XML files is a headache.)

Thanks in advance
-Pravin
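For question 3, one way to avoid hand-building XML files is the SolrJ Java client, which posts documents to Solr programmatically over HTTP. A minimal sketch against the Solr 1.4-era SolrJ API follows; the server URL and field names are illustrative, not taken from any actual schema:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class EmployeeIndexer {
    public static void main(String[] args) throws Exception {
        // Point this at your own Solr instance.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // In practice, stream rows out of the employee database here and
        // send them in batches of a few hundred documents per add() call.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "emp-1");
        doc.addField("name", "Jane Doe");
        doc.addField("address", "1 Main St");
        doc.addField("cellno", "555-0100");
        server.add(doc);

        // Commit once at the end (or periodically), not per document.
        server.commit();
    }
}

Other XML-free routes worth looking at are the CSV update handler and the DataImportHandler, which can pull rows straight from a JDBC source.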
Re: solr 1.4 formats last_index_time for SQL differently than 1.3 ?!?
really? I don't remember that being changed. What difference do you notice?

On Wed, Oct 7, 2009 at 2:30 AM, michael8 wrote:
>
> Just looking for confirmation from others, but it appears that the formatting
> of last_index_time from dataimport.properties (using DataImportHandler) is
> different in 1.4 vs. that in 1.3. I was troubleshooting why delta imports
> are no longer working for me after moving over to solr 1.4 (10/2 nightly) and
> noticed that the format is different.
>
> Michael
> --
> View this message in context:
> http://www.nabble.com/solr-1.4-formats-last_index_time-for-SQL-differently-than-1.3--%21--tp25776496p25776496.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
- Noble Paul | Principal Engineer| AOL | http://aol.com
Re: Weird Facet and KeywordTokenizerFactory Issue
Hello Mr. Hostetter, Thank you for patiently reading through my post; I apologize for being cryptic in my previous messages.

>>when you cut/pasted the facet output, you excluded the field names. based
>>on the schema & solrconfig.xml snippets you posted later, i'm assuming
>>they are usstate, and keyword, but you have to be explicit so that people can help correlate the
>>results you are getting with the schema you posted

I had to be brief as my facets are on the order of 100K over 800K documents, and I was afraid that if I posted the complete schema.xml nobody would read my long message :-) Hence I showed only the relevant pieces of the result, showing different fields having the same problem.

>>i'm assuming they are usstate, and keyword, but you have to be explicit so that people can help correlate the
>>results you are getting with the schema you posted -- for example, you haven't posted anything that would verify that the usstate
>>field actually uses your keywordText field

Yes, you are right. Here is the complete relevant snippet regarding keywordText and associated fields. keyword, keywordlower and keywordformatted are all aggregations of all other fields like person, personformatted, organization, location. location itself is an aggregation of usstate and country. The aggregation is done separately in custom code even before indexing into Solr.

>>A huge gap is in what your synonym files contain ... something weird in
>>there could easily explain superfluous terms getting added to your data.

Here are my synonym entries ---

#Persons
barack obama, barak obama, barack h. obama, barack hussein obama, barak hussein obama
hillary clinton, hillary r. clinton, hillary rodham clinton
timothy geithner, tim geithner, timothy f. geithner, geithner, timothy franz geithner
vladimir putin, putin

#Organizations
U.N, U.N., u.n, un, UN, United Nations => U.N
DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security => D.H.S
USCIS, United States Citizenship and Immigration Services, U.S.C.I.S. => United States Citizenship and Immigration Services, U.S.C.I.S
SEC, Securities and Exchange Commission, S.E.C, S.E.C, SEC. => Securities and Exchange Commission, S.E.C
FCC, Federal Communications Commission, F.C.C, F.C.C. => Federal Communications Commission, F.C.C
GSA, General Services Administration, G.S.A, G.S.A. => General Services Administration, G.S.A
SBA, Small Business Administration, S.B.A, S.B.A. => Small Business Administration, S.B.A.
FEMA, Federal Emergency Management Agency, FEMA. => FEMA
AT&T, ATT, ATT., AT&T., AT&T Wireless => AT&T
BBC, British Broadcasting Corporation, B.B.C, B.B.C. => B.B.C,BBC
Bank of America, BOA, B.O.A, Bank of America Corp, Bank of America Corp. => B.O.A
General Motors, G.M., G.M, GM, General Motors Corp., General Motors Corp => General Motors, G.M
NFL, National Football League, N.F.L, N.F.L. => N.F.L
Exxon Mobil, Exxon Mobil Corp => Exxon Mobil
Google, Google Inc, Google Inc. => Google
AIG, A.I.G, A.I.G., American International Group => American International Group, A.I.G
Goldman Sachs, Goldman Sachs Inc., Goldman Sachs Group Inc, Goldman Sachs Group Inc. => Goldman Sachs
GE, General Electric Co., General Electric Co, G.E, G.E., General Electric => G.E, General Electric
General Dynamics, General Dynamics Corp,General Dynamics Corp., General Dynamics Information Technology, General Dynamics Advanced Information Systems => General Dynamics
HP, Hewlett Packard Co,Hewlett Packard Co., Hewlett Packard, Hewlett-Packard, Hewlett-Packard Corp,H.P, H.P. => Hewlett Packard, H.P
IBM, International Business Machines, I.B.M, International Business Machines Corp => I.B.M
Johns Hopkins University, Johns Hopkins, JHU, J.H.U, J.H.U. => Johns Hopkins University, JHU, J.H.U
J.C. Penney, J.C. Penney Co. => J.C. Penney
JPMorgan Chase, JPMorgan Chase & Co., JPMorgan Chase & Co, JPMorgan => JPMorgan Chase & Co.
Lockheed Martin, Lockheed Martin Corp, Lockheed Martin Corp., Lockheed, Lockheed VH => Lockheed Martin
Merrill Lynch, Merrill Lynch & Co., Merrill, Merrill. => Merrill Lynch
Microsoft, Microsoft Corp., Microsoft Corp, Microsoft. => Microsoft
Northrop Grumman, Northrop Grumman Corp., Northrop Grumman Corp, Northrop, Northrop Corp. => Northrop Grumman
Smyth Co., Smyth Co
Sony, Sony Corp., Sony Corp => Sony Corp.
TJX Companies, TJX, TJX Cos. => TJX Companies
Target Corp., Target Corp, Target Corp stores => Target Corp.
Walmart, WalMart Inc, WalMart Stores, WalMart Stores Inc, WalMart Stores Inc. => WalMart Inc.
Yahoo, Yahoo Inc co, Yahoo Inc. => Yahoo Inc.
AP, AP., A.P, A.P., Associated Press => Associated Press

#Countries
USA,USA.,U.S.A.,u.s.a,u.s.a.,U.S,U.S.,US,US.,u.s, u.s.,United States,United States of America,United States Of America,united states,united states of america,
Re: DataImportHandler problem: Feeding the XPathEntityProcessor with the FieldReaderDataSource
hi Lance. db.blob is the correct field name so that is fine. you can probably open an issue and provide the testcase as a patch. That can help us track this better

On Wed, Oct 7, 2009 at 12:45 AM, Lance Norskog wrote:
> A side note that might help: if I change the dataField from 'db.blob'
> to 'blob', this DIH stack emits no documents.
>
> On 10/5/09, Lance Norskog wrote:
>> I've added a unit test for the problem down below. It feeds document
>> field data into the XPathEntityProcessor via the
>> FieldReaderDataSource, and the XPath EP does not emit unpacked fields.
>>
>> Running this under the debugger, I can see the supplied StringReader,
>> with the XML string, being piped into the XPath EP. But somehow the
>> XPath EP does not pick it apart the right way.
>>
>> Here is the DIH configuration file separately.
>>
>> processor='XPathEntityProcessor'
>> forEach='/names' dataField='db.blob'>
>>
>> Any ideas?
>>
>> ---
>>
>> package org.apache.solr.handler.dataimport;
>>
>> import static org.apache.solr.handler.dataimport.AbstractDataImportHandlerTest.createMap;
>> import junit.framework.TestCase;
>>
>> import java.util.ArrayList;
>> import java.util.HashMap;
>> import java.util.List;
>> import java.util.Map;
>>
>> import org.apache.solr.common.SolrInputDocument;
>> import org.apache.solr.common.SolrInputField;
>> import org.apache.solr.handler.dataimport.TestDocBuilder.SolrWriterImpl;
>> import org.junit.Test;
>>
>> /*
>>  * Demonstrate problem feeding XPathEntity from a FieldReaderDatasource
>>  */
>> public class TestFieldReaderXPath extends TestCase {
>>     static final String KISSINGER = "<names><name>Henry</name></names>";
>>
>>     static final String[][][] DBDOCS = {
>>         {{"dbid", "1"}, {"blob", KISSINGER}},
>>     };
>>
>>     /*
>>      * Receive a row from SQL and fetch a row from Solr - no value matching
>>      * stolen from TestDocBuilder
>>      */
>>     @Test
>>     public void testSolrEmbedded() throws Exception {
>>         try {
>>             DataImporter di = new DataImporter();
>>             di.loadDataConfig(dih_config_FR_into_XP);
>>             DataImporter.RequestParams rp = new DataImporter.RequestParams();
>>             rp.command = "full-import";
>>             rp.requestParams = new HashMap<String, Object>();
>>
>>             DataConfig cfg = di.getConfig();
>>             DataConfig.Entity entity = cfg.document.entities.get(0);
>>             List<Map<String, Object>> l = new ArrayList<Map<String, Object>>();
>>             addDBDocuments(l);
>>             MockDataSource.setIterator("select * from x", l.iterator());
>>             entity.dataSrc = new MockDataSource();
>>             entity.isDocRoot = true;
>>             SolrWriterImpl swi = new SolrWriterImpl();
>>             di.runCmd(rp, swi);
>>
>>             assertEquals(1, swi.docs.size());
>>             SolrInputDocument doc = swi.docs.get(0);
>>             SolrInputField field;
>>             field = doc.getField("dbid");
>>             assertEquals(field.getValue().toString(), "1");
>>             field = doc.getField("blob");
>>             assertEquals(field.getValue().toString(), KISSINGER);
>>             field = doc.getField("name");
>>             assertNotNull(field);
>>             assertEquals(field.getValue().toString(), "Henry");
>>         } finally {
>>             MockDataSource.clearCache();
>>         }
>>     }
>>
>>     private void addDBDocuments(List<Map<String, Object>> l) {
>>         for (String[][] dbdoc : DBDOCS) {
>>             l.add(createMap(dbdoc[0][0], dbdoc[0][1], dbdoc[1][0], dbdoc[1][1]));
>>         }
>>     }
>>
>>     String dih_config_FR_into_XP = "\r\n" +
>>         " \r\n" +
>>         " \r\n" +
>>         " \r\n" +
>>         " dataSource='db'>\r\n" +
>>         " \r\n" +
>>         " \r\n" +
>>         " \r\n" +
>>         " processor='XPathEntityProcessor'\r\n" +
>>         " forEach='/names' dataField='db.blob'>\r\n" +
>>         " \r\n" +
>>         " \r\n" +
>>         " \r\n" +
>>         " \r\n" +
>>         "\r\n";
>> }
>
> --
> Lance Norskog
> goks...@gmail.com

--
-
Noble Paul | Principal Engineer| AOL | http://aol.com
Re: Problems with DIH XPath flatten
send a small sample xml snippet you are trying to index and it may help

On Tue, Oct 6, 2009 at 9:29 PM, Adam Foltzer wrote:
> Hi all,
>
> I'm trying to set up DataImportHandler to index some XML documents available
> over web services. The XML includes both content and metadata, so for the
> indexable content, I'm trying to just index everything under the content
> tag:
>
> <entity name="document" url="resturl" processor="XPathEntityProcessor"
>         forEach="/document" transformer="HTMLStripTransformer" flatten="true">
>   <field column="title" xpath="/document/kbml/kbq" />
>   <field column="content" xpath="/document/kbml/body" flatten="true" stripHTML="true" />
> </entity>
>
> The result of this is that the title field gets populated and indexed (there
> are no child nodes of /document/kbml/kbq), but content does not get indexed
> at all. Since /document/kbml/body has many children, I expected that
> flatten="true" would store all of the body text in the field. Instead, it
> stores nothing at all. I've tried this with many combinations of
> transformers and flatten options, and the result is the same each time.
>
> Here are the relevant field declarations from the schema (the type="text" is
> just the one from the example's schema.xml). I have tried combinations here
> as well of stored= and multiValued=, with the same result each time.
>
> <field name="title" type="text" indexed="true" stored="true" multiValued="true" />
> <field name="content" type="text" indexed="true" stored="true" multiValued="true" />
>
> If it would help troubleshooting, I could send along some sample XML. I
> don't want to spam the list with an attachment unless it's necessary, though
> :)
>
> Thanks in advance for your help,
>
> Adam Foltzer

-- - Noble Paul | Principal Engineer| AOL | http://aol.com
Re: search by some functionality
Maybe I'm missing something, but function queries aren't involved in determining whether a document matches or not, only its score. How is a custom function / value-source going to filter?

~ David Smiley

hossman wrote:
>
> : I read about this chapter before. It did not mention how to create my
> : own customized function.
> : Can you point me to some instructions?
>
> The first step is to figure out how you can code your custom functionality
> as an extension of the ValueSource class...
>
> http://lucene.apache.org/solr/api/org/apache/solr/search/function/ValueSource.html
>
> ...which has to be able to generate a DocValues implementation for an
> IndexReader. DocValues is where you would compute some numeric score for each
> document based on whatever criteria you wanted (including arguments passed
> in when your ValueSource is constructed, like field names and constants)
>
> Then you need a simple ValueSourceParser class to be able to specify when
> to use your ValueSource, and that's what you register in solrconfig.xml...
>
> http://wiki.apache.org/solr/SolrPlugins#ValueSourceParser
>
> -Hoss

--
View this message in context: http://www.nabble.com/search-by-some-functionality-tp25721533p25779702.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Importing CSV file slow/crashes
Hello Yonik,

Thank you for looking into this. Your question about whether I'm using stock Solr put me in the right direction. I am in fact using a patched version of Solr to get hierarchical facet support (http://issues.apache.org/jira/browse/SOLR-64). I took out the 4 hiefacet fields from the schema and the import was back to normal times of less than a minute. This same configuration worked fine with the 5/1 patched build.

Here is the field definition: omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" delimiter="/" /> multiValued="true"/> stored="true" multiValued="true"/> stored="false" multiValued="true"/> stored="false" multiValued="true"/>

CSV file snippet:

category,category_seo
"T-Shirt Mens/Crew Neck/","t-shirt-mens/crew-neck/"

Thanks again!
Nasseam

On Oct 6, 2009, at 3:22 PM, Yonik Seeley wrote:

On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra wrote:
I had a dev build of 1.4 from 5/1/2009 and importing a 20K row file took less than a minute. Updating to the latest as of yesterday, the import is really slow and I had to cancel it after a half hour. This prevented me from upgrading a few months ago as well.

I haven't had any success at replicating this problem. I just tried a 100K row CSV file, consisting of an id and a few text fields. The total size of the file is 79MB.

On trunk (today): 22 seconds to index, another 5-7 seconds to commit
5/21 version: 28 seconds to index, another 8 seconds to commit

Then I modified the 5/1 schema to closer match the trunk schema (removing defaults, copyfields that could slow things down).

Modified 5/1 version: 25 seconds to index, another 8 seconds to commit

I only did 2 runs with trunk and 2 with one from 5/1, so the accuracy is probably low... but good enough to see there wasn't a problem in this test.

We really need more info to help reproduce this. Are you using stock solr? Do you have any custom plugins, analyzers, token filters, etc? You're going to need to provide something so others can reproduce this.

-Yonik
http://www.lucidimagination.com
RE: Need "OR" in DisMax Query
Hi David,

See this thread for how I use OR with Dismax.
http://www.mail-archive.com/solr-user@lucene.apache.org/msg19375.html

-- Dean

-Original Message-
From: Ingo Renner [mailto:i...@typo3.org]
Sent: 06 October 2009 05:00
To: solr-user@lucene.apache.org
Subject: Re: Need "OR" in DisMax Query

Am 05.10.2009 um 20:36 schrieb David Giffin:

Hi David,

> Maybe I'm missing something, but I can't seem to get the dismax
> request handler to perform an OR query. It appears that OR is removed
> by the stop words.

It's not the stop words; Dismax simply doesn't do any boolean operations. The only things you can do are using +searchWord and -searchWord, or changing to the standard request handler.

best
Ingo

--
Ingo Renner
TYPO3 Core Developer, Release Manager TYPO3 4.2
Apache Solr for TYPO3: http://www.typo3-solr.com
Re: Authentication/Authorization with Master-Slave over HTTP
: I want to be able to have SOLR Slave instance on publicly available host
: (accessible via HTTP), and synchronize with Master securely (via HTTP)

HTTP based replication only works with the new ReplicationHandler ... if you set up a proxy in front of your Master (either as a separate daemon, or using a custom ServletFilter, or by running on special settings in your ServletContainer) that can require HTTP Basic Authentication, you can then configure the slave to use an arbitrary username/password of your choice (look for httpBasicAuthUser/httpBasicAuthPassword in the example slave configs)

-Hoss
Re: Why isn't the DateField implementation of ISO 8601 broader?
: I would expect field:2001-03 to be a hit on a partial match such as
: field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z]. I suppose that my
: expectation would be that field:2001-03 would be counted once per day for each
: day in its range. It would follow that a user looking for documents relating

...meanwhile someone else might expect no match unless the ambiguous date is entirely contained within the range being queried on. (your implication of counting once per day would have pretty weird results on faceting, by the way)

with unambiguous dates, you can have exactly what you want just by being a little more verbose when indexing/querying (and someone else can have exactly what they want by being equally verbose using slightly different options/queries).

in your case: i would suggest that you use two fields: date_low and date_high ... when you have an exact date (down to the smallest level of granularity you care about) you put the same value in both fields; when you have an ambiguous value (like 2001-03) you put the largest value possible in date_high and the lowest value possible in date_low (ie: date_low:2001-03-01T00:00:00Z & date_high:2001-03-31T23:59:59.999Z)

then a query for anything *overlapping* the range from feb28 to march 13 would be...

+date_low:[* TO 2001-03-13T00:00:00Z] +date_high:[2001-02-28T00:00:00Z TO *]

...it works for ambiguous dates, and it works for exact dates. (someone else who only wants to see matches if the ranges *completely* overlap would just swap which end point they queried against which field)

-Hoss
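The same two-field trick expressed with SolrJ, for reference -- the document values and server URL are illustrative, and the field names follow the suggestion above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AmbiguousDateDemo {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // "March 2001" indexed as the widest range it could possibly cover
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "issue-2001-03");
        doc.addField("date_low", "2001-03-01T00:00:00Z");
        doc.addField("date_high", "2001-03-31T23:59:59.999Z");
        server.add(doc);
        server.commit();

        // find anything overlapping Feb 28 .. Mar 13
        SolrQuery q = new SolrQuery(
            "+date_low:[* TO 2001-03-13T00:00:00Z] " +
            "+date_high:[2001-02-28T00:00:00Z TO *]");
        System.out.println(server.query(q).getResults().getNumFound());
    }
}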
Re: TermsComponent or auto-suggest with filter
Nice. In comparison, how do you do it with faceting?

> "Two other approaches are to use either the TermsComponent (new in Solr
> 1.4) or faceting."

On Wed, Oct 7, 2009 at 1:51 AM, Jay Hill wrote:
> Have a look at a blog I posted on how to use EdgeNGrams to build an
> auto-suggest tool:
>
> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>
> You could easily add filter queries to this approach. For example, the
> query used in the blog could add filter queries like this:
>
> http://localhost:8983/solr/select/?q=user_query:"i"&wt=json&fl=user_query&indent=on&echoParams=none&rows=10&sort=count desc&fq=yourField:yourQuery&fq=anotherField:anotherQuery
>
> -Jay
> http://www.lucidimagination.com
>
> On Tue, Oct 6, 2009 at 4:40 AM, R. Tan wrote:
>
> > Hello,
> > What's the best way to get auto-suggested terms/keywords that is filtered
> > by one or more fields? TermsComponent should have been the solution but
> > filters are not supported.
> >
> > Thanks,
> > Rihaed
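For reference, the faceting variant of auto-suggest works by faceting on the field that holds the logged queries and constraining facet.prefix to the characters typed so far; the top counts become the suggestions, and filter queries apply as usual. A SolrJ sketch, with field and filter names borrowed from Jay's example and otherwise illustrative:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetSuggest {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);                              // only the facet counts matter
        q.setFacet(true);
        q.addFacetField("user_query");
        q.setFacetPrefix("i");                     // the characters typed so far
        q.setFacetLimit(10);
        q.addFilterQuery("yourField:yourQuery");   // filtering works, unlike TermsComponent

        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getFacetField("user_query").getValues());
    }
}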
Re: Weird Facet and KeywordTokenizerFactory Issue
A few comments about the info you've provided...

when you cut/pasted the facet output, you excluded the field names. based on the schema & solrconfig.xml snippets you posted later, i'm assuming they are usstate, and keyword, but you have to be explicit so that people can help correlate the results you are getting with the schema you posted -- for example, you haven't posted anything that would verify that the usstate field actually uses your keywordText field, for all we know it has a different field type by mistake (which would explain your problem). ... you have to post everything that would let us connect the dots from input to output in order to see where things might be going wrong.

A huge gap is in what your synonym files contain ... something weird in there could easily explain superfluous terms getting added to your data.

all that said: my best guess is that you have old data in your index from an older version of your schema when you had different analyzers configured. if a term is showing up in the facet counts, you can search on it -- find the first doc that matches, verify that the term isn't actually in the data, and then reindex that one doc -- if it stops matching your search (and the facet count drops by one) then i'm right, just reindex everything.

(this is where a timestamp field recording exactly when each doc was added to the index comes in handy, you can compare it with the file modification time on your schema.xml and be certain which docs were indexed prior to your changes)

-Hoss
Re: search by some functionality
: I read about this chapter before. It did not mention how to create my
: own customized function.
: Can you point me to some instructions?

The first step is to figure out how you can code your custom functionality as an extension of the ValueSource class...

http://lucene.apache.org/solr/api/org/apache/solr/search/function/ValueSource.html

...which has to be able to generate a DocValues implementation for an IndexReader. DocValues is where you would compute some numeric score for each document based on whatever criteria you wanted (including arguments passed in when your ValueSource is constructed, like field names and constants)

Then you need a simple ValueSourceParser class to be able to specify when to use your ValueSource, and that's what you register in solrconfig.xml...

http://wiki.apache.org/solr/SolrPlugins#ValueSourceParser

-Hoss
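To make that concrete, here is a minimal sketch of the pair of classes described above: a ValueSource that scores every document with a constant taken from the function's argument, plus the parser that builds it. The class names are made up, and the exact abstract-method signatures vary between Solr versions (this assumes the 1.3-style getValues(IndexReader)):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.solr.search.FunctionQParser;
import org.apache.solr.search.ValueSourceParser;
import org.apache.solr.search.function.DocValues;
import org.apache.solr.search.function.ValueSource;

// Registered in solrconfig.xml as, e.g.:
//   <valueSourceParser name="myconst" class="com.example.ConstParser"/>
// and then used in queries as {!func}myconst(3.5)
public class ConstParser extends ValueSourceParser {
    public ValueSource parse(FunctionQParser fp) throws ParseException {
        return new ConstValueSource(fp.parseFloat());
    }
}

class ConstValueSource extends ValueSource {
    private final float constant;

    ConstValueSource(float constant) {
        this.constant = constant;
    }

    // The DocValues returned here hands back our score for any document.
    public DocValues getValues(IndexReader reader) throws IOException {
        return new DocValues() {
            public float floatVal(int doc) { return constant; }
            public String toString(int doc) { return description(); }
        };
    }

    public String description() { return "myconst(" + constant + ")"; }

    public boolean equals(Object o) {
        return o instanceof ConstValueSource
            && ((ConstValueSource) o).constant == constant;
    }

    public int hashCode() { return Float.floatToIntBits(constant); }
}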
Re: Why isn't the DateField implementation of ISO 8601 broader?
On 6 Oct 09, at 5:31 PM, Chris Hostetter wrote:

...your expectations may be different than everyone else's. by requiring that the dates be explicit there is no ambiguity, you are in control of the behavior.

The power of some of the other formulas in ISO 8601 is that you don't introduce false levels of precision. The "October 2009" issue of a magazine is precisely tagged as "200910" or "2009-10". It doesn't have a day, hour or minute. Most books come with a copyright year: no month, no day ... In the library/book/periodical world these are a common set of expectations.

Walter
Re: Why isn't the DateField implementation of ISO 8601 broader?
Thanks for making me think about this a little bit deeper, Hoss. Comments in-line.

Chris Hostetter wrote:

because those would be ambiguous. if you just indexed field:2001-03 would you expect it to match field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z] ... what about date faceting, what should the counts be if you facet per day?

I would expect field:2001-03 to be a hit on a partial match such as field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z]. I suppose that my expectation would be that field:2001-03 would be counted once per day for each day in its range. It would follow that a user looking for documents relating to 1919 might also be interested in 1910. But conversely a user looking for documents relating to 1919 might really only want documents specifically related to 1919. Maybe the implementation would be smart (or configurable) about precision so that it wouldn't be counted when the precision asked to be represented by facets had more significant figures than the indexed/stored value. Maybe there would be another facet category at each precision for "others" -- the documents that have less precision than the current date facet precision. I'm envisioning a hierarchical system that starts general with century, with click-throughs drilling down eventually to days.

...your expectations may be different than everyone else's. by requiring that the dates be explicit there is no ambiguity, you are in control of the behavior.

I can see your point, but surely there are others out there with non-explicit date data? Does my use case make sense to anyone else?

you can always just index the first date of whatever block of time (month, year, century, etc..) and then facet normally.

Until a better solution presents itself we've gone the route of creating more fields for faceting on different blocks of time. So fields for century, decade, year, month, and day will let us facet on each of these time periods as needed. Documents with dates with less precision will not show up in date facets with more precision. I was hoping there was an elegant hack for faceting on a prefix of a defined number of characters (prefix=*, prefix=**, prefix=***, ...) without having to explicitly specify ..., prefix=188, prefix=189, prefix=190, prefix=191, ...

Regards,
Tricia
Re: Question about PatternReplace filter and automatic Synonym generation
: I'll try to explain with an example. Given the term 'it!' in the title, it
: should match both 'it' and 'it!' in the query as an exact match. Currently,
: this is done by using a synonym entry (and index time SynonymFilter) as
: follows:
:
: it! => it, it!
:
: Now, the above holds true for all cases where you have a title token of the
: form [aA-zZ]*!. Handling all of those cases requires adding synonyms
: manually for each case which is not easy to manage and does not scale.
:
: I am hoping to do the same by using an index time filter that takes in a
: pattern like the PatternReplace filter and adds the newly created token
: instead of replacing the original one. Does this make sense? Am I missing
: something that would break this approach?

something like this would be fairly easy to implement in Lucene, but somewhat confusing to try and configure in Solr.

I was going to suggest that you use something like... ..and then have a subsequent filter that splits the tokens on the whitespace (or any other special character you could use in the replacement) ... but apparently we don't have any built in filters that will just split tokens on a character/pattern for you. that would also be fairly easy to write if someone wants to submit a patch.

-Hoss
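Since no such splitting filter ships out of the box, here is a rough sketch of what one could look like against the Lucene 2.9-era attribute API. The class name is invented, offsets are deliberately left untouched, and the split pieces are stacked at the same position like synonyms:

import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Splits each incoming token on a single character and emits the pieces
// as separate tokens at the same position (offsets are not adjusted).
public final class SplitOnCharFilter extends TokenFilter {
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final PositionIncrementAttribute posAtt =
        addAttribute(PositionIncrementAttribute.class);
    private final LinkedList<String> pending = new LinkedList<String>();
    private final char delim;

    public SplitOnCharFilter(TokenStream input, char delim) {
        super(input);
        this.delim = delim;
    }

    public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            // emit a queued piece, stacked on the previous token's position
            termAtt.setTermBuffer(pending.removeFirst());
            posAtt.setPositionIncrement(0);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.term();
        if (term.indexOf(delim) >= 0) {
            // keep the first piece for this call, queue the rest
            for (String piece : term.split("\\Q" + delim + "\\E")) {
                if (piece.length() > 0) {
                    pending.add(piece);
                }
            }
            if (!pending.isEmpty()) {
                termAtt.setTermBuffer(pending.removeFirst());
            }
        }
        return true;
    }
}

A small factory extending Solr's BaseTokenFilterFactory would then make it configurable from schema.xml right after the pattern-replace step.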
Re: Importing CSV file slow/crashes
On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra wrote:
> I had a dev build of 1.4 from 5/1/2009 and importing a 20K row file took less
> than a minute. Updating to the latest as of yesterday, the import is really
> slow and I had to cancel it after a half hour. This prevented me from
> upgrading a few months ago as well.

I haven't had any success at replicating this problem. I just tried a 100K row CSV file, consisting of an id and a few text fields. The total size of the file is 79MB.

On trunk (today): 22 seconds to index, another 5-7 seconds to commit
5/21 version: 28 seconds to index, another 8 seconds to commit

Then I modified the 5/1 schema to closer match the trunk schema (removing defaults, copyfields that could slow things down).

Modified 5/1 version: 25 seconds to index, another 8 seconds to commit

I only did 2 runs with trunk and 2 with one from 5/1, so the accuracy is probably low... but good enough to see there wasn't a problem in this test.

We really need more info to help reproduce this. Are you using stock solr? Do you have any custom plugins, analyzers, token filters, etc? You're going to need to provide something so others can reproduce this.

-Yonik
http://www.lucidimagination.com
Re: conditional sorting
: I tried to simplify the problem, but the point is that I could have
: really complex requirements. For instance, "if in the first 5 results
: none are older than one year, use sort by X, otherwise sort by Y".

First 5 in what order? X, Y, or something else?

: So, the question is, is there a way to make Solr recognize complex
: situations and apply different sorting criterion.

your question may seem simple to you, but unless you codify all the examples of what you consider a "complex situation" and how you expect those to be specified at run time, it's pretty much impossible to give you an answer as to what the best way to achieve your goal is.

the simplest answer based on the information available: if you can express your requirements in java, and put them in a custom Search Component, then Solr can do it.

-Hoss
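As a sketch of that last suggestion, here is what such a component could look like, choosing the sort in prepare() from the request parameters. The class name, parameter, and field names are illustrative; a result-dependent rule (like inspecting the first 5 hits) would instead have to run the query and re-sort inside process():

import java.io.IOException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class ConditionalSortComponent extends SearchComponent {

    public void prepare(ResponseBuilder rb) throws IOException {
        SolrParams params = rb.req.getParams();
        ModifiableSolrParams mp = new ModifiableSolrParams(params);
        // Illustrative rule: a request flag decides between two sorts.
        if ("true".equals(params.get("recentFirst"))) {
            mp.set(CommonParams.SORT, "date desc");
        } else {
            mp.set(CommonParams.SORT, "score desc");
        }
        rb.req.setParams(mp);
    }

    public void process(ResponseBuilder rb) throws IOException {
        // nothing to do at process time for this simple rule
    }

    public String getDescription() { return "conditional sorting"; }
    public String getSource() { return "$Source$"; }
    public String getSourceId() { return "$Id$"; }
    public String getVersion() { return "1.0"; }
}

It would be registered with a <searchComponent> element in solrconfig.xml and listed in a handler's first-components.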
Re: Weird Facet and KeywordTokenizerFactory Issue
Got it. Sorry for not having an answer for your problem. On 10/06/2009 04:58 PM, Ravi Kiran wrote: You dont see any facet fields in my query because I have configured them in the solrconfig.xml to give specific fields as facets by default in the dismax and standard handlers so that I dont have to specify all those fields individually everytime I query, all I need to do is just set facet=true thats all dismax explicit 0.01 systemid^20.0 headline^20.0 keyword^18.0 person^18.0 organization^18.0 usstate^18.0 country^18.0 subject^18.0 quote^18.0 blurb^15.0 articlesubhead^8.0 byline^7.0 articleblurb^2.0 body^1.5 multimediablurb^1.5 headline^20.5 keyword^18.5 person^18.5 organization^18.5 usstate^18.5 country^18.5 subject^18.5 quote^18.5 blurb^15.5 articlesubhead^8.5 byline^7.5 articleblurb^2.5 body^2.0 multimediablurb^2.0 recip(rord(pubdatetime),1,1000,1000)^1.0 * 2<-1 5<-3 6<90% 100 *:* keyword 0 keyword regex false 1 5 5 5 5 5 5 contenttype keyword keywordlower keywordformatted person personformatted organization usstate country subject On Tue, Oct 6, 2009 at 5:45 PM, Christian Zambranowrote: I am stumped then. I had a similar issue when I was using a field that was being heavily tokenized, but I corrected the issue by using a field(generated using copyField) that doesn't get analyzed at all. On the query you provided before I didn't see the parameters to tell solr for which field it should produce facets. Something like: http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&*facet.field=location* On 10/06/2009 04:09 PM, Ravi Kiran wrote: Yes Exactly the same On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano wrote: And you had the analyzer for that field set-up the same way as shown on your previous e-mail when you indexed the data? 
On 10/06/2009 03:46 PM, Ravi Kiran wrote: I did infact check it out any there is no weirdness in analysis page...see result below Index Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload Query Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote: Have you tried using the Analysis page to see what tokens are generated for the string "New York"? It could be one of the token filter is adding the token 'new' for all strings that start with 'new' On 10/06/2009 02:54 PM, Ravi Kiran wrote: Hello All, Iam getting some ghost facets in solr 1.4. Can anybody kindly help me understand why I get them and how to eliminate them. My schema.xml snippet is given at the end. Iam indexing Named Entities extracted via OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is that it will use all words as a single token, am I right ? for example: "New York" will be indexed as 'New York' and will not be split right??? However I see then splitup in facets as follows when running the query " http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1 "...but when I search with standard handler qt=standard&q=keyword:"New" I dont find any doc which has
Re: Weird Facet and KeywordTokenizerFactory Issue
You dont see any facet fields in my query because I have configured them in the solrconfig.xml to give specific fields as facets by default in the dismax and standard handlers so that I dont have to specify all those fields individually everytime I query, all I need to do is just set facet=true thats all dismax explicit 0.01 systemid^20.0 headline^20.0 keyword^18.0 person^18.0 organization^18.0 usstate^18.0 country^18.0 subject^18.0 quote^18.0 blurb^15.0 articlesubhead^8.0 byline^7.0 articleblurb^2.0 body^1.5 multimediablurb^1.5 headline^20.5 keyword^18.5 person^18.5 organization^18.5 usstate^18.5 country^18.5 subject^18.5 quote^18.5 blurb^15.5 articlesubhead^8.5 byline^7.5 articleblurb^2.5 body^2.0 multimediablurb^2.0 recip(rord(pubdatetime),1,1000,1000)^1.0 * 2<-1 5<-3 6<90% 100 *:* keyword 0 keyword regex false 1 5 5 5 5 5 5 contenttype keyword keywordlower keywordformatted person personformatted organization usstate country subject On Tue, Oct 6, 2009 at 5:45 PM, Christian Zambrano wrote: > I am stumped then. I had a similar issue when I was using a field that was > being heavily tokenized, but I corrected the issue by using a > field(generated using copyField) that doesn't get analyzed at all. > > On the query you provided before I didn't see the parameters to tell solr > for which field it should produce facets. > > Something like: > > > http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&*facet.field=location* > > > > > On 10/06/2009 04:09 PM, Ravi Kiran wrote: > >> Yes Exactly the same >> >> On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano> >wrote: >> >> >> >>> And you had the analyzer for that field set-up the same way as shown on >>> your previous e-mail when you indexed the data? >>> >>> >>> >>> >>> On 10/06/2009 03:46 PM, Ravi Kiran wrote: >>> >>> >>> I did infact check it out any there is no weirdness in analysis page...see result below Index Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload Query Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano>>> > 
wrote: > > > Have you tried using the Analysis page to see what tokens are generated > for > the string "New York"? It could be one of the token filter is adding > the > token 'new' for all strings that start with 'new' > > > On 10/06/2009 02:54 PM, Ravi Kiran wrote: > > > > > >> Hello All, >> Iam getting some ghost facets in solr 1.4. Can anybody >> kindly >> help me understand why I get them and how to eliminate them. My >> schema.xml >> snippet is given at the end. Iam indexing Named Entities extracted via >> OpenNLP into solr. My understanding regarding KeywordTokenizerFactory >> is >> that it will use all words as a single token, am I right ? for >> example: >> "New >> York" will be indexed as 'New York' and will not be split right??? >> However >> I >>
Re: What to set in query.setMaxRows()?
: Sorry about asking this here, but I can't reach wiki.apache.org right now.
: What do I set in query.setMaxRows() to get all the rows?

http://wiki.apache.org/solr/FAQ#How_can_I_get_ALL_the_matching_documents_back.3F_..._How_can_I_return_an_unlimited_number_of_rows.3F

How can I get ALL the matching documents back? ... How can I return an unlimited number of rows?

This is impractical in most cases. People typically only want to do this when they know they are dealing with an index whose size guarantees the result sets will always be small enough that they can feasibly be transmitted in a manageable amount of time -- but if that's the case, just specify what you consider a "manageable amount" as your rows param and get the best of both worlds (all the results when your assumption is right, and a sanity cap on the result size if it turns out your assumptions are wrong)

-Hoss
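If you really do need to walk every match, the usual client-side pattern is to page through with rows/start rather than asking for one unbounded response. A SolrJ sketch (server URL and page size are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class FetchAll {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(500);                         // a manageable page size

        long fetched = 0;
        SolrDocumentList page;
        do {
            query.setStart((int) fetched);
            page = server.query(query).getResults();
            fetched += page.size();
            // process this page of documents here
        } while (page.size() > 0 && fetched < page.getNumFound());
    }
}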
Re: stats page slow in latest nightly
: When I was working on it, I was actually going to default to not show
: the size, and make you click a link that added a param to get the sizes
: in the display too. But I foolishly didn't bring it up when Hoss made my
: life easier with his simpler patch.

we can always turn the size estimator off ... or turn it on only when doing the insanity checks (so normal stats are fast, but if anything is duplicated you'll get info on the size of the discrepancy)

-Hoss
Re: Weird Facet and KeywordTokenizerFactory Issue
I am stumped then. I had a similar issue when I was using a field that was being heavily tokenized, but I corrected the issue by using a field(generated using copyField) that doesn't get analyzed at all. On the query you provided before I didn't see the parameters to tell solr for which field it should produce facets. Something like: http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&*facet.field=location* On 10/06/2009 04:09 PM, Ravi Kiran wrote: Yes Exactly the same On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambranowrote: And you had the analyzer for that field set-up the same way as shown on your previous e-mail when you indexed the data? On 10/06/2009 03:46 PM, Ravi Kiran wrote: I did infact check it out any there is no weirdness in analysis page...see result below Index Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload Query Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote: Have you tried using the Analysis page to see what tokens are generated for the string "New York"? It could be one of the token filter is adding the token 'new' for all strings that start with 'new' On 10/06/2009 02:54 PM, Ravi Kiran wrote: Hello All, Iam getting some ghost facets in solr 1.4. Can anybody kindly help me understand why I get them and how to eliminate them. My schema.xml snippet is given at the end. Iam indexing Named Entities extracted via OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is that it will use all words as a single token, am I right ? for example: "New York" will be indexed as 'New York' and will not be split right??? However I see then splitup in facets as follows when running the query " http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1 "...but when I search with standard handler qt=standard&q=keyword:"New" I dont find any doc which has just "New". After digging in a bit I found that if several keywords have a common starting word it is being pulled out as another facet like the following. 
Any help is greatly appreciated Result 47 >Ghost 7 16 10 147 23 8 5 6 8 10 8 5 7 7 -->Ghost 5 5 7 -->Ghost 6 26 6 27 8 7 12 Schema.xml -
Re: Why isn't the DateField implementation of ISO 8601 broader?
: My question is why isn't the DateField implementation of ISO 8601 broader
: so that it could include YYYY and YYYY-MM as acceptable date strings? What

because those would be ambiguous. if you just indexed field:2001-03 would you expect it to match field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z] ... what about date faceting, what should the counts be if you facet per day?

...your expectations may be different than everyone else's. by requiring that the dates be explicit there is no ambiguity, you are in control of the behavior.

: would it take to do so? Are there any work-arounds for faceting by century,
: year, month without creating new fields in my schema? The last resort would

you can always just index the first date of whatever block of time (month, year, century, etc..) and then facet normally.

-Hoss
Re: Merging multicore indexes
On Wed, Oct 7, 2009 at 2:40 AM, Paul Rosen wrote:

> Shalin Shekhar Mangar wrote:
>
>> The path on the wiki page was wrong. You need to use the adminPath in the
>> url. Look at the adminPath attribute in solr.xml. It is typically
>> /admin/cores
>>
>> So the correct path for you would be:
>>
>> http://localhost:8983/solr/admin/cores?action=mergeindexes&core=merged&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_marc/index&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_rdf/index
>>
>> I've fixed the wiki too.
>
> I think I've got it working. The only difference to the above is that it
> seems to want a relative path, so when I took off the
> "/Users/my/path/solr_1.4/" part I stopped getting errors.

There's no reason why it won't work with an absolute path. Can you post the error? Also, did you correctly urlencode the parameters (if you are using the browser to make such a request, perhaps the '/' character is causing a problem)?

> (Also, I had an insidious problem when using the interface to the browser
> in FF 3.5. It cached my results, so when I queried the core with "*:*" I got
> no results until I cleared my cache. - Hopefully that will save someone else
> a little time.)

Yeah, if you are not using any HTTP caches, you can turn it off by adding the following in the <requestDispatcher> section of solrconfig.xml:

<httpCaching never304="true" />

--
Regards,
Shalin Shekhar Mangar.
Re: Solr Trunk Heap Space Issues
Mark Miller wrote:
> Jeff Newburn wrote:
>
>> So could that potentially explain our use of more ram on indexing? Or is
>> this a rare edge case.
>
> I think it could explain the JVM using more RAM while indexing - but it
> should be fairly easily recoverable from what I can tell - so no
> explanation on the OOM yet. Still looking at that one.
>
> Is your system basically stock, or do you have custom plugins in it?

No matter what I try with however many cores, I can't duplicate your problem.

--
- Mark
http://www.lucidimagination.com
Re: Problems with DIH XPath flatten
Hi Shalin, Good question; sorry I forgot it in the initial post. I have tried with both a nightly build from earlier this month (Oct 2 I believe) as well as a build from the trunk as of yesterday afternoon. Adam On Tue, Oct 6, 2009 at 5:04 PM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: > On Tue, Oct 6, 2009 at 9:29 PM, Adam Foltzer wrote: > > > Hi all, > > > > I'm trying to set up DataImportHandler to index some XML documents > > available > > over web services. The XML includes both content and metadata, so for the > > indexable content, I'm trying to just index everything under the content > > tag: > > > > >url="resturl" processor="XPathEntityProcessor" > >forEach="/document" transformer="HTMLStripTransformer" > > flatten="true"> > > > flatten="true" stripHTML="true" /> > > > > > > > > The result of this is that the title field gets populated and indexed > > (there > > are no child nodes of /document/kbml/kbq), but content does not get > indexed > > at all. Since /document/kbml/body has many children, I expected that > > flatten="true" would store all of the body text in the field. Instead, it > > stores nothing at all. I've tried this with many combinations of > > transformers and flatten options, and the result is the same each time. > > > > > Which Solr version are you using? The flatten attribute was introduced > after > 1.3 released. > > -- > Regards, > Shalin Shekhar Mangar. >
Re: Merging multicore indexes
Shalin Shekhar Mangar wrote: The path on the wiki page was wrong. You need to use the adminPath in the url. Look at the adminPath attribute in solr.xml. It is typically /admin/cores So the correct path for you would be: http://localhost:8983/solr/admin/cores?action=mergeindexes&core=merged&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_marc/index&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_rdf/index
Re: Weird Facet and KeywordTokenizerFactory Issue
Yes Exactly the same On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano wrote: > And you had the analyzer for that field set-up the same way as shown on > your previous e-mail when you indexed the data? > > > > > On 10/06/2009 03:46 PM, Ravi Kiran wrote: > >> I did infact check it out any there is no weirdness in analysis page...see >> result below >> >> Index Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term >> position 1 term text New York term type word source start,end 0,8 payload >> org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text >> New >> York term type word source start,end 0,8 payload >> org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, >> ignoreCase=true, enablePositionIncrements=true} term position 1 term text >> New >> York term type word source start,end 0,8 payload >> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, >> expand=false, ignoreCase=true} term position 1 term text New York term >> type >> word source start,end 0,8 payload >> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term >> position 1 term text New York term type word source start,end 0,8 payload >> Query Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term >> position 1 term text New York term type word source start,end 0,8 payload >> org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text >> New >> York term type word source start,end 0,8 payload >> org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, >> ignoreCase=true, enablePositionIncrements=true} term position 1 term text >> New >> York term type word source start,end 0,8 payload >> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, >> expand=false, ignoreCase=true} term position 1 term text New York term >> type >> word source start,end 0,8 payload >> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term >> position 1 term text New York term type word source start,end 0,8 payload >> >> >> On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano> >wrote: >> >> >> >>> Have you tried using the Analysis page to see what tokens are generated >>> for >>> the string "New York"? It could be one of the token filter is adding the >>> token 'new' for all strings that start with 'new' >>> >>> >>> On 10/06/2009 02:54 PM, Ravi Kiran wrote: >>> >>> >>> Hello All, Iam getting some ghost facets in solr 1.4. Can anybody kindly help me understand why I get them and how to eliminate them. My schema.xml snippet is given at the end. Iam indexing Named Entities extracted via OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is that it will use all words as a single token, am I right ? for example: "New York" will be indexed as 'New York' and will not be split right??? However I see then splitup in facets as follows when running the query " http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1 "...but when I search with standard handler qt=standard&q=keyword:"New" I dont find any doc which has just "New". After digging in a bit I found that if several keywords have a common starting word it is being pulled out as another facet like the following. 
Any help is greatly appreciated Result 47 > Ghost 7 16 10 147 23 8 5 6 8 10 8 5 7 7 --> Ghost 5 5 7--> Ghost 6 26 6 27 8 7 12 Schema.xml - >>> sortMissingLast="true" omitNorms="true" positionIncrementGap="100"> >>> words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/> >>> synonyms="synonyms.txt" ignoreCase="true" expand="false" /> >>> words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true" /> >>> synonyms="synonyms.txt" ignoreCase="true" expand="false" /> >>> multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/> >>> stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/> >>> stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/> >>> stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/> >>> >>> >> >> >
Re: Problems with DIH XPath flatten
On Tue, Oct 6, 2009 at 9:29 PM, Adam Foltzer wrote:

> Hi all,
>
> I'm trying to set up DataImportHandler to index some XML documents available
> over web services. The XML includes both content and metadata, so for the
> indexable content, I'm trying to just index everything under the content
> tag:
>
> <entity name="document" url="resturl" processor="XPathEntityProcessor"
>         forEach="/document" transformer="HTMLStripTransformer" flatten="true">
>   <field column="title" xpath="/document/kbml/kbq" />
>   <field column="content" xpath="/document/kbml/body" flatten="true" stripHTML="true" />
> </entity>
>
> The result of this is that the title field gets populated and indexed (there
> are no child nodes of /document/kbml/kbq), but content does not get indexed
> at all. Since /document/kbml/body has many children, I expected that
> flatten="true" would store all of the body text in the field. Instead, it
> stores nothing at all. I've tried this with many combinations of
> transformers and flatten options, and the result is the same each time.

Which Solr version are you using? The flatten attribute was introduced after 1.3 was released.

--
Regards,
Shalin Shekhar Mangar.
solr 1.4 formats last_index_time for SQL differently than 1.3 ?!?
Just looking for confirmation from others, but it appears that the formatting of last_index_time from dataimport.properties (using DataImportHandler) is different in 1.4 vs. that in 1.3. I was troubleshooting why delta imports are no longer working for me after moving over to solr 1.4 (10/2 nightly) and noticed that the format is different.

Michael
--
View this message in context: http://www.nabble.com/solr-1.4-formats-last_index_time-for-SQL-differently-than-1.3--%21--tp25776496p25776496.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Weird Facet and KeywordTokenizerFactory Issue
And you had the analyzer for that field set up the same way as shown in your previous e-mail when you indexed the data?

On 10/06/2009 03:46 PM, Ravi Kiran wrote:

I did in fact check it out and there is no weirdness in the analysis page... see the result below

Index Analyzer
org.apache.solr.analysis.KeywordTokenizerFactory {}: position 1, text "New York", type word, start,end 0,8, payload (empty)
org.apache.solr.analysis.TrimFilterFactory {}: position 1, text "New York", type word, start,end 0,8, payload (empty)
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}: position 1, text "New York", type word, start,end 0,8, payload (empty)
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}: position 1, text "New York", type word, start,end 0,8, payload (empty)
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}: position 1, text "New York", type word, start,end 0,8, payload (empty)

Query Analyzer
(identical output at every stage: KeywordTokenizerFactory, TrimFilterFactory, StopFilterFactory, SynonymFilterFactory, and RemoveDuplicatesTokenFilterFactory all show the single token "New York", position 1, start,end 0,8)

On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote:

Have you tried using the Analysis page to see what tokens are generated for the string "New York"? It could be that one of the token filters is adding the token 'new' for all strings that start with 'new'

On 10/06/2009 02:54 PM, Ravi Kiran wrote:

Hello All,
I am getting some ghost facets in solr 1.4. Can anybody kindly help me understand why I get them and how to eliminate them. My schema.xml snippet is given at the end. I am indexing Named Entities extracted via OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is that it will use all words as a single token, am I right? for example: "New York" will be indexed as 'New York' and will not be split right??? However I see them split up in facets as follows when running the query "http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1" ...but when I search with the standard handler qt=standard&q=keyword:"New" I don't find any doc which has just "New". After digging in a bit I found that if several keywords have a common starting word it is being pulled out as another facet like the following.

Any help is greatly appreciated

Result 47 > Ghost 7 16 10 147 23 8 5 6 8 10 8 5 7 7 --> Ghost 5 5 7 --> Ghost 6 26 6 27 8 7 12

Schema.xml -
Re: stats page slow in latest nightly
thx much guys, no biggie for me, i just wanted to get to the bottom of it in case i had screwed something else up.. --joe On Tue, Oct 6, 2009 at 1:19 PM, Mark Miller wrote: > I was worried about that actually. I havn't tested how fast the RAM > estimator is on huge String FieldCaches - it will be fast on everything > else, but it checks the size of each String in the array. > > When I was working on it, I was actually going to default to not show > the size, and make you click a link that added a param to get the sizes > in the display too. But I foolishly didn't bring it up when Hoss made my > life easier with his simpler patch. > > Yonik Seeley wrote: >> Might be the new Lucene fieldCache stats stuff that was recently added? >> >> -Yonik >> http://www.lucidimagination.com >> >> >> On Tue, Oct 6, 2009 at 3:56 PM, Joe Calderon wrote: >> >>> hello *, ive been noticing that /admin/stats.jsp is really slow in the >>> recent builds, has anyone else encountered this? >>> >>> >>> --joe >>> > > > -- > - Mark > > http://www.lucidimagination.com > > > >
RE: Solr Timeouts
Yeah, that's exactly right, Mark. What does the "maxCommitsToKeep" parameter (from SolrDeletionPolicy in solrconfig.xml) actually do? Increasing this value seems to have helped a little, but I'm wary of cranking it without having a better understanding of what it does.

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Tuesday, October 06, 2009 4:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Timeouts

It sounds like he is indexing on a local disk, but reading the files to be indexed from NFS - which would be fine. You can get Lucene indexes to work on NFS (though it is still not recommended), but you need to use a custom IndexDeletionPolicy to keep older commit points around longer, and be sure not to use NIOFSDirectory.

Feak, Todd wrote:
> I seem to recall hearing something about *not* putting a Solr index directory
> on an NFS mount. Might want to search on that.
>
> That, of course, doesn't have anything to do with commits showing up
> unexpectedly in stack traces, per your original email.
>
> -Todd
>
> -Original Message-
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Tuesday, October 06, 2009 12:39 PM
> To: solr-user@lucene.apache.org; yo...@lucidimagination.com
> Subject: RE: Solr Timeouts
>
> That thread was blocking for an hour while all other threads were idle or
> blocked.
>
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Tuesday, October 06, 2009 3:07 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Timeouts
>
> This specific thread was blocked for an hour?
> If so, I'd echo Lance... this is a local disk right?
>
> -Yonik
> http://www.lucidimagination.com
>
>
> On Mon, Oct 5, 2009 at 2:11 PM, Giovanni Fernandez-Kincade wrote:
>
>> I just grabbed another stack trace for a thread that has been similarly
>> blocking for over an hour. 
Notice that there is no Commit in this one: >> >> http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 >> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) >> org.apache.lucene.index.SegmentTermEnum.next() >> org.apache.lucene.index.SegmentTermEnum.scanTo(Term) >> org.apache.lucene.index.TermInfosReader.get(Term, boolean) >> org.apache.lucene.index.TermInfosReader.get(Term) >> org.apache.lucene.index.SegmentTermDocs.seek(Term) >> org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) >> org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) >> org.apache.lucene.index.IndexWriter.applyDeletes() >> org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) >> org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) >> org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) >> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) >> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) >> org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) >> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) >> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, >> AddUpdateCommand) >> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) >> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, >> SolrQueryResponse, ContentStream) >> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, >> SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) >> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, >> ServletResponse, FilterChain) >> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, >> ServletResponse) >> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, >> ServletResponse) >> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) >> org.apache.catalina.core.StandardContextValve.invoke(Request, Response) >> org.apache.catalina.core.StandardHostValve.invoke(Request, Response) >> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) >> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) >> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) >> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) >> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, >> Object[]) >> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Sock
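On the maxCommitsToKeep question at the top of this message: SolrDeletionPolicy decides which Lucene commit points (and the index files they reference) survive each new commit. The Solr 1.4 example solrconfig.xml configures it like this (the values shown are the stock ones there):

    <deletionPolicy class="solr.SolrDeletionPolicy">
      <!-- number of commit points to keep -->
      <str name="maxCommitsToKeep">1</str>
      <!-- number of optimized commit points to keep -->
      <str name="maxOptimizedCommitsToKeep">0</str>
    </deletionPolicy>

Increasing maxCommitsToKeep keeps older commit points - and their files - alive longer instead of deleting them as soon as a new commit supersedes them. That costs disk space, but it is what lets a reader (an NFS client, a snapshooter, a long-running searcher) keep using files from a superseded commit, which is why raising it helps here.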
Re: Weird Facet and KeywordTokenizerFactory Issue
I did in fact check it out and there is no weirdness in the analysis page... see the result below:

Index Analyzer
org.apache.solr.analysis.KeywordTokenizerFactory {}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.TrimFilterFactory {}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  term position 1 | term text: New York | term type: word | source start,end: 0,8

Query Analyzer
org.apache.solr.analysis.KeywordTokenizerFactory {}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.TrimFilterFactory {}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  term position 1 | term text: New York | term type: word | source start,end: 0,8

On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote:
> Have you tried using the Analysis page to see what tokens are generated for
> the string "New York"? It could be that one of the token filters is adding the
> token 'new' for all strings that start with 'new'.
>
> On 10/06/2009 02:54 PM, Ravi Kiran wrote:
>> Hello All,
>> I am getting some ghost facets in solr 1.4. Can anybody kindly
>> help me understand why I get them and how to eliminate them. My schema.xml
>> snippet is given at the end. I am indexing Named Entities extracted via
>> OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is
>> that it will keep the whole field value as a single token, am I right? For example,
>> "New York" will be indexed as 'New York' and will not be split, right? However,
>> I see them split up in facets as follows when running the query
>> http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
>> ... but when I search with the standard handler (qt=standard&q=keyword:"New")
>> I don't find any doc which has just "New". After digging in a bit I found that if
>> several keywords have a common starting word, it is being pulled out as another
>> facet, like the following. 
>> Any help is greatly appreciated
>>
>> Result
>>
>> 47 > Ghost
>> 7
>> 16
>> 10
>> 147
>> 23
>> 8
>> 5
>> 6
>> 8
>> 10
>> 8
>> 5
>> 7
>>
>> 7 --> Ghost
>> 5
>> 5
>>
>> 7 --> Ghost
>> 6
>> 26
>> 6
>>
>> 27
>> 8
>> 7
>> 12
>>
>> Schema.xml
>> -
>>
>> sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>> words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
>> ignoreCase="true" expand="false" />
>> words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true" />
>> ignoreCase="true" expand="false" />
>> multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
>> stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
>> multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
>> multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
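The XML element names in the schema snippet above were eaten by the list archive; judging from the attribute fragments that survived and the factory names shown on the analysis page, the fieldType was presumably along these lines (a reconstruction, not the poster's verbatim schema):

    <fieldType name="keywordText" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

The four field definitions whose fragments follow evidently all used this type with multiValued="true" and term vectors/positions/offsets disabled.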
Re: Solr Timeouts
It sounds like he is indexing on a local disk, but reading the files to be index from NFS - which would be fine. You can get Lucene indexes to work on NFS (though still not recommended) , but you need to use a custom IndexDeletionPolicy to keep older commit points around longer and be sure not to use NIOFSDirectory. Feak, Todd wrote: > I seem to recall hearing something about *not* putting a Solr index directory > on an NFS mount. Might want to search on that. > > That, of course, doesn't have anything to do with commits showing up > unexpectedly in stack traces, per your original email. > > -Todd > > -Original Message- > From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] > Sent: Tuesday, October 06, 2009 12:39 PM > To: solr-user@lucene.apache.org; yo...@lucidimagination.com > Subject: RE: Solr Timeouts > > That thread was blocking for an hour while all other threads were idle or > blocked. > > -Original Message- > From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley > Sent: Tuesday, October 06, 2009 3:07 PM > To: solr-user@lucene.apache.org > Subject: Re: Solr Timeouts > > This specific thread was blocked for an hour? > If so, I'd echo Lance... this is a local disk right? > > -Yonik > http://www.lucidimagination.com > > > On Mon, Oct 5, 2009 at 2:11 PM, Giovanni Fernandez-Kincade > wrote: > >> I just grabbed another stack trace for a thread that has been similarly >> blocking for over an hour. Notice that there is no Commit in this one: >> >> http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 >> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) >> org.apache.lucene.index.SegmentTermEnum.next() >> org.apache.lucene.index.SegmentTermEnum.scanTo(Term) >> org.apache.lucene.index.TermInfosReader.get(Term, boolean) >> org.apache.lucene.index.TermInfosReader.get(Term) >> org.apache.lucene.index.SegmentTermDocs.seek(Term) >> org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) >> org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) >> org.apache.lucene.index.IndexWriter.applyDeletes() >> org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) >> org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) >> org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) >> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) >> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) >> org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) >> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) >> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, >> AddUpdateCommand) >> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) >> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, >> SolrQueryResponse, ContentStream) >> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, >> SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) >> 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, >> ServletResponse, FilterChain) >> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, >> ServletResponse) >> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, >> ServletResponse) >> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) >> org.apache.catalina.core.StandardContextValve.invoke(Request, Response) >> org.apache.catalina.core.StandardHostValve.invoke(Request, Response) >> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) >> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) >> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) >> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) >> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, >> Object[]) >> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, >> TcpConnection, Object[]) >> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[]) >> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() >> java.lang.Thread.run() >> >> >> -Original Message- >> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley >> Sent: Monday, October 05, 2009 1:18 PM >> To: solr-user@lucene.apache.org >> Subject: Re: Solr Timeouts >> >> OK... next step
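A minimal sketch of the custom IndexDeletionPolicy Mark mentions, against the Lucene 2.9-era API (the class name and the keep-last-N strategy are illustrative, not anyone's actual code):

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.index.IndexCommit;
    import org.apache.lucene.index.IndexDeletionPolicy;

    /**
     * Keeps the N most recent commit points instead of only the latest,
     * so readers (e.g. NFS clients) that still hold files from a recent
     * commit do not see them deleted out from under them.
     */
    public class KeepLastNCommitsPolicy implements IndexDeletionPolicy {
        private final int numToKeep;

        public KeepLastNCommitsPolicy(int numToKeep) {
            this.numToKeep = numToKeep;
        }

        public void onInit(List<? extends IndexCommit> commits) throws IOException {
            prune(commits);
        }

        public void onCommit(List<? extends IndexCommit> commits) throws IOException {
            prune(commits);
        }

        // Lucene passes the commits sorted oldest-first; delete everything
        // except the newest numToKeep commit points.
        private void prune(List<? extends IndexCommit> commits) {
            for (int i = 0; i < commits.size() - numToKeep; i++) {
                commits.get(i).delete();
            }
        }
    }

Solr users would normally get the same effect declaratively via maxCommitsToKeep in SolrDeletionPolicy (see the reply above) rather than writing a policy by hand.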
RE: Solr Timeouts
I seem to recall hearing something about *not* putting a Solr index directory on an NFS mount. Might want to search on that. That, of course, doesn't have anything to do with commits showing up unexpectedly in stack traces, per your original email. -Todd -Original Message- From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] Sent: Tuesday, October 06, 2009 12:39 PM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Subject: RE: Solr Timeouts That thread was blocking for an hour while all other threads were idle or blocked. -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Tuesday, October 06, 2009 3:07 PM To: solr-user@lucene.apache.org Subject: Re: Solr Timeouts This specific thread was blocked for an hour? If so, I'd echo Lance... this is a local disk right? -Yonik http://www.lucidimagination.com On Mon, Oct 5, 2009 at 2:11 PM, Giovanni Fernandez-Kincade wrote: > I just grabbed another stack trace for a thread that has been similarly > blocking for over an hour. Notice that there is no Commit in this one: > > http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 > org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) > org.apache.lucene.index.SegmentTermEnum.next() > org.apache.lucene.index.SegmentTermEnum.scanTo(Term) > org.apache.lucene.index.TermInfosReader.get(Term, boolean) > org.apache.lucene.index.TermInfosReader.get(Term) > org.apache.lucene.index.SegmentTermDocs.seek(Term) > org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) > org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) > org.apache.lucene.index.IndexWriter.applyDeletes() > org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) > org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) > org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) > org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, > AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, > SolrQueryResponse, ContentStream) > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, > SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, > ServletResponse, FilterChain) > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) > org.apache.catalina.core.StandardContextValve.invoke(Request, Response) > 
org.apache.catalina.core.StandardHostValve.invoke(Request, Response) > org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) > org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) > org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) > org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, > Object[]) > org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, > TcpConnection, Object[]) > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[]) > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() > java.lang.Thread.run() > > > -Original Message- > From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley > Sent: Monday, October 05, 2009 1:18 PM > To: solr-user@lucene.apache.org > Subject: Re: Solr Timeouts > > OK... next step is to verify that SolrCell doesn't have a bug that > causes it to commit. > I'll try and verify today unless someone else beats me to it. > > -Yonik > http://www.lucidimagination.com > > On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade > wrote: >> I'm fairly certain that all of the indexing jobs are calling SOLR with >> commit=false. They all construct the indexing URLs using a CLR function I >> wrote, which takes in a Commit parameter, which is
Re: Weird Facet and KeywordTokenizerFactory Issue
Have you tried using the Analysis page to see what tokens are generated for the string "New York"? It could be that one of the token filters is adding the token 'new' for all strings that start with 'new'.

On 10/06/2009 02:54 PM, Ravi Kiran wrote:
> Hello All,
> I am getting some ghost facets in solr 1.4. Can anybody kindly help me
> understand why I get them and how to eliminate them. My schema.xml snippet
> is given at the end. I am indexing Named Entities extracted via OpenNLP into
> solr. My understanding regarding KeywordTokenizerFactory is that it will keep
> the whole field value as a single token, am I right? For example, "New York"
> will be indexed as 'New York' and will not be split, right? However, I see them
> split up in facets as follows when running the query
> http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
> ... but when I search with the standard handler (qt=standard&q=keyword:"New")
> I don't find any doc which has just "New". After digging in a bit I found that if
> several keywords have a common starting word, it is being pulled out as another
> facet, like the following. Any help is greatly appreciated.
>
> Result
> 47 > Ghost 7 16 10 147 23 8 5 6 8 10 8 5 7 7--> Ghost 5 5 7 --> Ghost 6 26 6 27 8 7 12
>
> Schema.xml
> -
Re: stats page slow in latest nightly
I was worried about that, actually. I haven't tested how fast the RAM estimator is on huge String FieldCaches - it will be fast on everything else, but it checks the size of each String in the array.

When I was working on it, I was actually going to default to not showing the size, and make you click a link that added a param to get the sizes in the display too. But I foolishly didn't bring it up when Hoss made my life easier with his simpler patch.

Yonik Seeley wrote:
> Might be the new Lucene fieldCache stats stuff that was recently added?
>
> -Yonik
> http://www.lucidimagination.com
>
>
> On Tue, Oct 6, 2009 at 3:56 PM, Joe Calderon wrote:
>
>> hello *, ive been noticing that /admin/stats.jsp is really slow in the
>> recent builds, has anyone else encountered this?
>>
>> --joe

--
- Mark

http://www.lucidimagination.com
RE: Solr and Garbage Collection
Master-Slave replica: new caches will be warmed & prepopulated _before_ the new IndexReader is made available for _new_ requests and _before_ the old one is discarded - it means that the theoretical sizing for the FieldCache (which is defined by the number of docs in an index and the cardinality of a field) should be doubled... and of course we need to play with GC options too for performance tuning (mostly).

> > I read pretty much all posts on this thread (before and after this one). Looks
> > like the main suggestion from you and others is to keep max heap size (-Xmx)
> > as small as possible (as long as you don't see OOM exception).
>
> I suggested the absolute opposite; please note also that "as small as possible"
> does not have any meaning in the multiuser environment of Tomcat. It depends on
> query types (10 documents per request? Or maybe 1?) AND it depends on
> average server loading (one concurrent request? Or maybe 200 threads trying to
> deal with 2000 concurrent requests?) AND it depends on whether it is the Master
> (used for updates - parses tons of docs in a single file?) - and it depends on
> unpredictable memory fragmentation - it all depends on the use case too(!!!),
> in addition to schema / index size.
>
> Please note also, such stuff depends on the JVM vendor too: what if it
> precompiles everything into CPU-native code (including memory dealloc after
> each call)? Some do!
>
> -Fuad
> http://www.linkedin.com/in/liferay
>
> ...but 'core' constantly disagrees with me :)
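A rough back-of-the-envelope for why that doubling matters (illustrative numbers, not from this thread): a sorted String field over a 10M-document index costs, per reader, on the order of

    ord array:    10,000,000 docs x 4 bytes           ~  40 MB
    term values:  say 1,000,000 unique terms x ~60 B  ~  60 MB
                                           per reader ~ 100 MB

and while the new searcher warms, the old and new IndexReaders each hold their own FieldCache entry for the field, so the transient peak for that one field is roughly 200 MB. Heap sizing has to accommodate the peak, not the steady state.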
Re: stats page slow in latest nightly
Might be the new Lucene fieldCache stats stuff that was recently added? -Yonik http://www.lucidimagination.com On Tue, Oct 6, 2009 at 3:56 PM, Joe Calderon wrote: > hello *, ive been noticing that /admin/stats.jsp is really slow in the > recent builds, has anyone else encountered this? > > > --joe
Re: Solr Trunk Heap Space Issues
Jeff Newburn wrote:
> So could that potentially explain our use of more RAM on indexing? Or is
> this a rare edge case?
>
I think it could explain the JVM using more RAM while indexing - but it should be fairly easily recoverable from what I can tell - so no explanation for the OOM yet. Still looking at that one. Is your system basically stock, or do you have custom plugins in it?

--
- Mark

http://www.lucidimagination.com
stats page slow in latest nightly
hello *, ive been noticing that /admin/stats.jsp is really slow in the recent builds, has anyone else encountered this? --joe
Weird Facet and KeywordTokenizerFactory Issue
Hello All,
I am getting some ghost facets in solr 1.4. Can anybody kindly help me understand why I get them and how to eliminate them. My schema.xml snippet is given at the end. I am indexing Named Entities extracted via OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is that it will keep the whole field value as a single token, am I right? For example, "New York" will be indexed as 'New York' and will not be split, right? However, I see them split up in facets as follows when running the query http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1 ... but when I search with the standard handler (qt=standard&q=keyword:"New") I don't find any doc which has just "New". After digging in a bit I found that if several keywords have a common starting word, it is being pulled out as another facet, like the following. Any help is greatly appreciated.

Result
47> Ghost 7 16 10 147 23 8 5 6 8 10 8 5 7 7--> Ghost 5 5 7 --> Ghost 6 26 6 27 8 7 12

Schema.xml
-
Re: Solr Trunk Heap Space Issues
So could that potentially explain our use of more ram on indexing? Or is this a rare edge case. -- Jeff Newburn Software Engineer, Zappos.com jnewb...@zappos.com - 702-943-7562 > From: Mark Miller > Reply-To: > Date: Tue, 06 Oct 2009 15:30:50 -0400 > To: > Subject: Re: Solr Trunk Heap Space Issues > > This is looking like its just a Lucene oddity you get when adding a > single doc due to some changes with the NRT stuff. > > Mark Miller wrote: >> Okay - I'm sorry - serves me right for working sick. >> >> Now that I have put on my glasses and correctly tagged my two eclipse tests: >> >> It still appears that trunk likes to use more RAM. >> >> I switched both tests to one million iterations and watched the heap. >> >> The test from the build around may 5th (I promise :) ) regularly GC's >> down to about 70-80MB after a fair time >> of running. It doesn't appear to climb - keeps GC'ing back to 70-80 >> (after starting at by GC'ing down to 40 for a bit). >> >> The test from trunk, after a fair time of running, keeps GC'ing down to >> about 120-150MB - 150 at the end, slowly working its >> way up from 90-110 at the beginning. >> >> Don't know what that means yet - but it appears trunk likes to use a bit >> more RAM while indexing. Odd that its so much more because these docs >> are tiny: >> >> String[] fields = {"text","simple" >> ,"text","test" >> ,"text","how now brown cow" >> ,"text","what's that?" >> ,"text","radical!" >> ,"text","what's all this about, anyway?" >> ,"text","just how fast is this text indexing?" >> }; >> >> Mark Miller wrote: >> >>> Okay, I juggled the tests in eclipse and flipped the results. So they >>> make sense. >>> >>> Sorry - goose chase on this one. >>> >>> Yonik Seeley wrote: >>> >>> I don't see this with trunk... I just tried TestIndexingPerformance with 1M docs, and it seemed to work fine. Memory use stabilized at 40MB. Most memory use was for indexing (not analysis). char[] topped out at 4.5MB -Yonik http://www.lucidimagination.com On Tue, Oct 6, 2009 at 12:31 PM, Mark Miller wrote: > Yeah - I was wondering about that ... not sure how these guys are > stacking up ... > > Yonik Seeley wrote: > > > >> TestIndexingPerformance? >> What the heck... that's not even multi-threaded! >> >> -Yonik >> http://www.lucidimagination.com >> >> >> >> On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller >> wrote: >> >> >> >> >>> Darnit - didn't finish that email. This is after running your old short >>> doc perf test for 10,000 iterations. You see the same thing with 1000 >>> iterations but much less pronounced eg gettin' worse with more >>> iterations. >>> >>> Mark Miller wrote: >>> >>> >>> >>> A little before and after. The before is around may 5th'is - the after is trunk. http://myhardshadow.com/memanalysis/before.png http://myhardshadow.com/memanalysis/after.png Mark Miller wrote: > Took a peak at the checkout around the time he says he's using. > > CharTokenizer appears to be holding onto much large char[] arrays now > than before. Same with snowball.Among - used to be almost nothing, now > its largio. > > The new TokenStream stuff appears to be clinging. Needs to find some > inner peace. > > > >>> >>> >> >> >> > > > -- > - Mark > > http://www.lucidimagination.com > > >
RE: Solr Timeouts
That thread was blocking for an hour while all other threads were idle or blocked. -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Tuesday, October 06, 2009 3:07 PM To: solr-user@lucene.apache.org Subject: Re: Solr Timeouts This specific thread was blocked for an hour? If so, I'd echo Lance... this is a local disk right? -Yonik http://www.lucidimagination.com On Mon, Oct 5, 2009 at 2:11 PM, Giovanni Fernandez-Kincade wrote: > I just grabbed another stack trace for a thread that has been similarly > blocking for over an hour. Notice that there is no Commit in this one: > > http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 > org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) > org.apache.lucene.index.SegmentTermEnum.next() > org.apache.lucene.index.SegmentTermEnum.scanTo(Term) > org.apache.lucene.index.TermInfosReader.get(Term, boolean) > org.apache.lucene.index.TermInfosReader.get(Term) > org.apache.lucene.index.SegmentTermDocs.seek(Term) > org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) > org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) > org.apache.lucene.index.IndexWriter.applyDeletes() > org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) > org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) > org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) > org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, > AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, > SolrQueryResponse, ContentStream) > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, > SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, > ServletResponse, FilterChain) > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) > org.apache.catalina.core.StandardContextValve.invoke(Request, Response) > org.apache.catalina.core.StandardHostValve.invoke(Request, Response) > org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) > org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) > org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) > org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, > Object[]) > 
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, > TcpConnection, Object[]) > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[]) > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() > java.lang.Thread.run() > > > -Original Message- > From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley > Sent: Monday, October 05, 2009 1:18 PM > To: solr-user@lucene.apache.org > Subject: Re: Solr Timeouts > > OK... next step is to verify that SolrCell doesn't have a bug that > causes it to commit. > I'll try and verify today unless someone else beats me to it. > > -Yonik > http://www.lucidimagination.com > > On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade > wrote: >> I'm fairly certain that all of the indexing jobs are calling SOLR with >> commit=false. They all construct the indexing URLs using a CLR function I >> wrote, which takes in a Commit parameter, which is always set to false. >> >> Also, I don't see any calls to commit in the Tomcat logs (whereas normally >> when I make a commit call I do). >> >> This suggests that Solr is doing it automatically, but the extract handler >> doesn't seem to be the problem: >> > class="org.apache.solr.handler.extraction.ExtractingRequestHandler" >> startup="lazy"> >> >> ignored_ >> fileData >> >> >> >> >> There is no external config file specified, and I don't see anything about >>
RE: Solr Timeouts
Yeah this is Java 1.6. The indexes are being written to a local disk, but they files being indexed live on a NFS. -Original Message- From: Lance Norskog [mailto:goks...@gmail.com] Sent: Tuesday, October 06, 2009 2:59 PM To: solr-user@lucene.apache.org Subject: Re: Solr Timeouts Is this Java 1.5? There are known threading bugs in 1.5 that were fixed in Java 1.6. Also, there was one short series of 1.6 releases that wrote bogus Lucene index files. So, make sure you use the latest Java 1.6 release. Also, I hope this is a local disk. Some shops try running over NFS or Windows file sharing and this often does not work well. Lance On 10/6/09, Giovanni Fernandez-Kincade wrote: > Is it possible that deletions are triggering these commits? Some of the > documents that I'm making indexing requests for already exist in the index, > so they would result in deletions. I tried messing with some of these > parameters but I'm still running into the same problem: > > > > false > > 100 > > > > This is happening like every 30-40minutes and it's really hampering the > indexing progress... > > > -Original Message- > From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] > Sent: Monday, October 05, 2009 2:11 PM > To: solr-user@lucene.apache.org; yo...@lucidimagination.com > Subject: RE: Solr Timeouts > > I just grabbed another stack trace for a thread that has been similarly > blocking for over an hour. Notice that there is no Commit in this one: > > http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 > org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) > org.apache.lucene.index.SegmentTermEnum.next() > org.apache.lucene.index.SegmentTermEnum.scanTo(Term) > org.apache.lucene.index.TermInfosReader.get(Term, boolean) > org.apache.lucene.index.TermInfosReader.get(Term) > org.apache.lucene.index.SegmentTermDocs.seek(Term) > org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) > org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) > org.apache.lucene.index.IndexWriter.applyDeletes() > org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) > org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) > org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) > org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, > AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, > SolrQueryResponse, ContentStream) > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, > SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, > ServletResponse, FilterChain) > 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) > org.apache.catalina.core.StandardContextValve.invoke(Request, Response) > org.apache.catalina.core.StandardHostValve.invoke(Request, Response) > org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) > org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) > org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) > org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, > Object[]) > org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, > TcpConnection, Object[]) > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[]) > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() > java.lang.Thread.run() > > > -Original Message- > From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley > Sent: Monday, October 05, 2009 1:18 PM > To: solr-user@lucene.apache.org > Subject: Re: Solr Timeouts > > OK... next step is to verify that SolrCell doesn't have a bug that > causes it to commit. >
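A note on the knob relevant to the applyDeletes stalls in the traces above: every add of a document whose uniqueKey already exists buffers a delete term, and Lucene applies the buffered deletes when it flushes - these are flushes, not commits, so nothing becomes searchable and no commit shows in the logs. In a Solr 1.3/1.4-style solrconfig.xml, flush frequency is controlled under <indexDefaults>; a sketch with an illustrative value (not the poster's config):

    <indexDefaults>
      <!-- flush (and apply buffered delete terms) when the RAM buffer
           fills, rather than every N documents: fewer, but individually
           longer, applyDeletes passes -->
      <ramBufferSizeMB>128</ramBufferSizeMB>
      <mergeFactor>10</mergeFactor>
    </indexDefaults>

The trade-off is latency shape, not total work: a larger buffer makes the pauses rarer but each one scans more buffered delete terms.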
Re: Solr Trunk Heap Space Issues
This is looking like its just a Lucene oddity you get when adding a single doc due to some changes with the NRT stuff. Mark Miller wrote: > Okay - I'm sorry - serves me right for working sick. > > Now that I have put on my glasses and correctly tagged my two eclipse tests: > > It still appears that trunk likes to use more RAM. > > I switched both tests to one million iterations and watched the heap. > > The test from the build around may 5th (I promise :) ) regularly GC's > down to about 70-80MB after a fair time > of running. It doesn't appear to climb - keeps GC'ing back to 70-80 > (after starting at by GC'ing down to 40 for a bit). > > The test from trunk, after a fair time of running, keeps GC'ing down to > about 120-150MB - 150 at the end, slowly working its > way up from 90-110 at the beginning. > > Don't know what that means yet - but it appears trunk likes to use a bit > more RAM while indexing. Odd that its so much more because these docs > are tiny: > > String[] fields = {"text","simple" > ,"text","test" > ,"text","how now brown cow" > ,"text","what's that?" > ,"text","radical!" > ,"text","what's all this about, anyway?" > ,"text","just how fast is this text indexing?" > }; > > Mark Miller wrote: > >> Okay, I juggled the tests in eclipse and flipped the results. So they >> make sense. >> >> Sorry - goose chase on this one. >> >> Yonik Seeley wrote: >> >> >>> I don't see this with trunk... I just tried TestIndexingPerformance >>> with 1M docs, and it seemed to work fine. >>> Memory use stabilized at 40MB. >>> Most memory use was for indexing (not analysis). >>> char[] topped out at 4.5MB >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >>> >>> On Tue, Oct 6, 2009 at 12:31 PM, Mark Miller wrote: >>> >>> >>> Yeah - I was wondering about that ... not sure how these guys are stacking up ... Yonik Seeley wrote: > TestIndexingPerformance? > What the heck... that's not even multi-threaded! > > -Yonik > http://www.lucidimagination.com > > > > On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller > wrote: > > > > >> Darnit - didn't finish that email. This is after running your old short >> doc perf test for 10,000 iterations. You see the same thing with 1000 >> iterations but much less pronounced eg gettin' worse with more >> iterations. >> >> Mark Miller wrote: >> >> >> >> >>> A little before and after. The before is around may 5th'is - the after >>> is trunk. >>> >>> http://myhardshadow.com/memanalysis/before.png >>> http://myhardshadow.com/memanalysis/after.png >>> >>> Mark Miller wrote: >>> >>> >>> >>> >>> Took a peak at the checkout around the time he says he's using. CharTokenizer appears to be holding onto much large char[] arrays now than before. Same with snowball.Among - used to be almost nothing, now its largio. The new TokenStream stuff appears to be clinging. Needs to find some inner peace. >> >> > > > -- - Mark http://www.lucidimagination.com
RE: Solr and Garbage Collection
> I read pretty much all posts on this thread (before and after this one). Looks
> like the main suggestion from you and others is to keep max heap size (-Xmx)
> as small as possible (as long as you don't see OOM exception).

I suggested the absolute opposite; please note also that "as small as possible" does not have any meaning in the multiuser environment of Tomcat. It depends on query types (10 documents per request? Or maybe 1?) AND it depends on average server loading (one concurrent request? Or maybe 200 threads trying to deal with 2000 concurrent requests?) AND it depends on whether it is the Master (used for updates - parses tons of docs in a single file?) - and it depends on unpredictable memory fragmentation - it all depends on the use case too(!!!), in addition to schema / index size.

Please note also, such stuff depends on the JVM vendor too: what if it precompiles everything into CPU-native code (including memory dealloc after each call)? Some do!

-Fuad
http://www.linkedin.com/in/liferay

...but 'core' constantly disagrees with me :)
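For the concrete option side of this, a common starting point for a search workload on a recent 1.6 JVM looks like the following (the sizes are purely illustrative and, as noted above, must be derived from your own query mix and concurrency):

    java -Xms4g -Xmx4g \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -XX:CMSInitiatingOccupancyFraction=70 \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -jar start.jar

Setting -Xms equal to -Xmx avoids heap resizing, the CMS/ParNew pair trades throughput for shorter pauses, and the GC logging flags give the data needed to tune rather than guess.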
RE: Geo Coding Service
If you are looking for a (simplified) ZIP/postal-code -> longitude/latitude mapping (North America), check this: http://www.zipcodedownload.com

I am using it for service-area calculations for casaGURU renovation professionals at http://www.casaguru.com

They even have an API library (including stored procedures for MySQL, Oracle, etc., plus a Java API) to calculate the distance between two postal codes, execute queries, etc.

-Fuad
http://www.linkedin.com/in/liferay

> -----Original Message-----
> From: ram_sj [mailto:rpachaiyap...@gmail.com]
> Sent: October-06-09 2:33 PM
> To: solr-user@lucene.apache.org
> Subject: Geo Coding Service
>
> Hi,
>
> Can someone suggest a good geo-coding service or software for commercial
> use? I want to find geocodes for a large collection of addresses. I'm looking for
> a good long-term service.
>
> Thanks
> Ram
> --
> View this message in context: http://www.nabble.com/Geo-Coding-Service-tp25774277p25774277.html
> Sent from the Solr - User mailing list archive at Nabble.com.
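For anyone rolling their own distance check on top of such a latitude/longitude table, the usual great-circle (haversine) formula is short enough to inline; a minimal sketch in Java (class and method names are made up for illustration):

    import static java.lang.Math.*;

    public final class GeoDistance {
        private static final double EARTH_RADIUS_KM = 6371.0;

        /** Great-circle distance between two lat/lon points, in kilometers. */
        public static double haversineKm(double lat1, double lon1,
                                         double lat2, double lon2) {
            double dLat = toRadians(lat2 - lat1);
            double dLon = toRadians(lon2 - lon1);
            // a = sin^2(dLat/2) + cos(lat1) * cos(lat2) * sin^2(dLon/2)
            double a = sin(dLat / 2) * sin(dLat / 2)
                     + cos(toRadians(lat1)) * cos(toRadians(lat2))
                     * sin(dLon / 2) * sin(dLon / 2);
            return 2 * EARTH_RADIUS_KM * asin(sqrt(a));
        }
    }

For example, haversineKm(40.7128, -74.0060, 34.0522, -118.2437) returns roughly 3,940 km (New York to Los Angeles).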
Re: Importing CSV file slow/crashes
: Is it possible to narrow down what fields/field-types are causing the problems? : Or perhaps profile and see what's taking up time compared to the older version? Or: could you post your solrconfig + schema + csv files online so other people could help debug the problem? : : -Yonik : http://www.lucidimagination.com : : : : On Tue, Oct 6, 2009 at 1:48 PM, Nasseam Elkarra wrote: : > Hello Erick, : > : > Sorry about that. I'm using the CSV update handler. Uploading a local CSV : > using the stream.file parameter. There are 94 fields and 36 copyFields. : > : > Thank you, : > Nasseam : > : > On Oct 6, 2009, at 10:09 AM, Erick Erickson wrote: : > : >> Well, without some better idea of *how* you're doing the import, it's a : >> little hard to say anything meaningful (hint, hint). : >> Best : >> Erick : >> : >> On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra : >> wrote: : >> : >>> Hello all, : >>> : >>> I had a dev build of 1.4 from 5/1/2009 and importing a 20K row took less : >>> than a minute. Updating to the latest as of yesterday, the import is : >>> really : >>> slow and I had to cancel it after a half hour. This prevented me from : >>> upgrading a few months ago as well. : >>> : >>> Any ideas as to the cause of this? : >>> : >>> Thank you, : >>> : >>> Nasseam Elkarra : >>> http://bodukai.com/boutique/ : >>> The fastest possible shopping experience. : >>> : >>> : > : > : -Hoss
Re: DataImportHandler problem: Feeding the XPathEntityProcessor with the FieldReaderDataSource
A side note that might help: if I change the dataField from 'db.blob' to 'blob', this DIH stack emits no documents. On 10/5/09, Lance Norskog wrote: > I've added a unit test for the problem down below. It feeds document > field data into the XPathEntityProcessor via the > FieldReaderDataSource, and the XPath EP does not emit unpacked fields. > > Running this under the debugger, I can see the supplied StringReader, > with the XML string, being piped into the XPath EP. But somehow the > XPath EP does not pick it apart the right way. > > Here is the DIH configuration file separately. > > > > > > > > > > processor='XPathEntityProcessor' > forEach='/names' dataField='db.blob'> > > > > > > > Any ideas? > > --- > > package org.apache.solr.handler.dataimport; > > import static > org.apache.solr.handler.dataimport.AbstractDataImportHandlerTest.createMap; > import junit.framework.TestCase; > > import java.util.ArrayList; > import java.util.HashMap; > import java.util.List; > import java.util.Map; > > import org.apache.solr.common.SolrInputDocument; > import org.apache.solr.common.SolrInputField; > import org.apache.solr.handler.dataimport.TestDocBuilder.SolrWriterImpl; > import org.junit.Test; > > /* > * Demonstrate problem feeding XPathEntity from a FieldReaderDatasource > */ > > public class TestFieldReaderXPath extends TestCase { > static final String KISSINGER = "Henry"; > > static final String[][][] DBDOCS = { > {{"dbid", "1"}, {"blob", KISSINGER}}, > }; > > /* >* Receive a row from SQL and fetch a row from Solr - no value matching >* stolen from TestDocBuilder >* */ > > @Test > public void testSolrEmbedded() throws Exception { > try { > DataImporter di = new DataImporter(); > di.loadDataConfig(dih_config_FR_into_XP); > DataImporter.RequestParams rp = new > DataImporter.RequestParams(); > rp.command = "full-import"; > rp.requestParams = new HashMap(); > > DataConfig cfg = di.getConfig(); > DataConfig.Entity entity = cfg.document.entities.get(0); > List> l = new > ArrayList>(); > addDBDocuments(l); > MockDataSource.setIterator("select * from x", > l.iterator()); > entity.dataSrc = new MockDataSource(); > entity.isDocRoot = true; > SolrWriterImpl swi = new SolrWriterImpl(); > di.runCmd(rp, swi); > > assertEquals(1, swi.docs.size()); > SolrInputDocument doc = swi.docs.get(0); > SolrInputField field; > field = doc.getField("dbid"); > assertEquals(field.getValue().toString(), "1"); > field = doc.getField("blob"); > assertEquals(field.getValue().toString(), KISSINGER); > field = doc.getField("name"); > assertNotNull(field); > assertEquals(field.getValue().toString(), "Henry"); > } finally { > MockDataSource.clearCache(); > } > } > > > private void addDBDocuments(List> l) { > for(String[][] dbdoc: DBDOCS) { > l.add(createMap(dbdoc[0][0], dbdoc[0][1], dbdoc[1][0], > dbdoc[1][1])); > } > } > >String dih_config_FR_into_XP = "\r\n" + >" \r\n" + >" \r\n" + >" \r\n" + >" \r\n" > + >"\r\n" + >"\r\n" + >"\r\n" + >" processor='XPathEntityProcessor'\r\n" + >" forEach='/names' dataField='db.blob'>\r\n" + >" \r\n" + >"\r\n" + >" \r\n" + >" \r\n" + >"\r\n" >; > > > } > -- Lance Norskog goks...@gmail.com
Re: Different sort behavior on same code
Lucene's test for multi-valued fields is crude... it's essentially if the number of values (un-inverted term instances) becomes greater than the number of documents. -Yonik http://www.lucidimagination.com On Tue, Oct 6, 2009 at 3:04 PM, wojtekpia wrote: > > Hi, > > I'm running Solr version 1.3.0.2009.07.08.08.05.45 in 2 environments. I have > a field defined as: > > multiValued="true"/> > > The two environments have different data, but both have single and multi > valued entries for myDate. > > On one environment sorting by myDate works (sort seems to be by the 'last' > value if multi valued). > > On the other environment I get: > HTTP Status 500 - there are more terms than documents in field "myDate", but > it's impossible to sort on tokenized fields java.lang.RuntimeException: > there are more terms than documents in field > > I've read that I shouldn't sort by multi-valued fields, so my solution will > be to add a single-valued date field for sorting. But I don't understand why > my two environments behave differently, and it doesn't seem like the error > message makes sense (are date fields tokenized?). Any thoughts? > > Thanks, > > Wojtek > -- > View this message in context: > http://www.nabble.com/Different-sort-behavior-on-same-code-tp25774769p25774769.html > Sent from the Solr - User mailing list archive at Nabble.com. > >
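Concretely (illustrative numbers): with 3 documents where doc1 has myDate values {2009-01-01, 2009-02-01} and docs 2 and 3 have one value each, un-inverting the field yields 4 term instances against 3 documents, so the check trips and the sort throws. In an index where every document happens to carry a single myDate value, the counts stay equal and the same sort works - which is consistent with the two environments behaving differently on identical code.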
Re: Solr Timeouts
This specific thread was blocked for an hour? If so, I'd echo Lance... this is a local disk right? -Yonik http://www.lucidimagination.com On Mon, Oct 5, 2009 at 2:11 PM, Giovanni Fernandez-Kincade wrote: > I just grabbed another stack trace for a thread that has been similarly > blocking for over an hour. Notice that there is no Commit in this one: > > http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 > org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) > org.apache.lucene.index.SegmentTermEnum.next() > org.apache.lucene.index.SegmentTermEnum.scanTo(Term) > org.apache.lucene.index.TermInfosReader.get(Term, boolean) > org.apache.lucene.index.TermInfosReader.get(Term) > org.apache.lucene.index.SegmentTermDocs.seek(Term) > org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) > org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) > org.apache.lucene.index.IndexWriter.applyDeletes() > org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) > org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) > org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) > org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, > AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, > SolrQueryResponse, ContentStream) > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, > SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, > ServletResponse, FilterChain) > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) > org.apache.catalina.core.StandardContextValve.invoke(Request, Response) > org.apache.catalina.core.StandardHostValve.invoke(Request, Response) > org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) > org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) > org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) > org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, > Object[]) > org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, > TcpConnection, Object[]) > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[]) > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() > java.lang.Thread.run() > > > -Original Message- > From: ysee...@gmail.com 
[mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley > Sent: Monday, October 05, 2009 1:18 PM > To: solr-user@lucene.apache.org > Subject: Re: Solr Timeouts > > OK... next step is to verify that SolrCell doesn't have a bug that > causes it to commit. > I'll try and verify today unless someone else beats me to it. > > -Yonik > http://www.lucidimagination.com > > On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade > wrote: >> I'm fairly certain that all of the indexing jobs are calling SOLR with >> commit=false. They all construct the indexing URLs using a CLR function I >> wrote, which takes in a Commit parameter, which is always set to false. >> >> Also, I don't see any calls to commit in the Tomcat logs (whereas normally >> when I make a commit call I do). >> >> This suggests that Solr is doing it automatically, but the extract handler >> doesn't seem to be the problem: >> > class="org.apache.solr.handler.extraction.ExtractingRequestHandler" >> startup="lazy"> >> >> ignored_ >> fileData >> >> >> >> >> There is no external config file specified, and I don't see anything about >> commits here. >> >> I've tried setting up more detailed indexer logging but haven't been able to >> get it to work: >> true >> >> I tried relative and absolute paths, but no dice so far. >> >> Any other ideas? >> >> -Gio. >> >> -Original Message- >> From: ysee...@gmail.com [mail
Different sort behavior on same code
Hi,

I'm running Solr version 1.3.0.2009.07.08.08.05.45 in 2 environments. I have a field defined as:

<field name="myDate" type="date" indexed="true" stored="true" multiValued="true"/>

The two environments have different data, but both have single- and multi-valued entries for myDate.

On one environment, sorting by myDate works (the sort seems to be by the 'last' value if multi-valued).

On the other environment I get:
HTTP Status 500 - there are more terms than documents in field "myDate", but it's impossible to sort on tokenized fields java.lang.RuntimeException: there are more terms than documents in field

I've read that I shouldn't sort by multi-valued fields, so my solution will be to add a single-valued date field for sorting. But I don't understand why my two environments behave differently, and it doesn't seem like the error message makes sense (are date fields tokenized?). Any thoughts?

Thanks,

Wojtek
--
View this message in context: http://www.nabble.com/Different-sort-behavior-on-same-code-tp25774769p25774769.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Timeouts
Is this Java 1.5? There are known threading bugs in 1.5 that were fixed in Java 1.6. Also, there was one short series of 1.6 releases that wrote bogus Lucene index files. So, make sure you use the latest Java 1.6 release. Also, I hope this is a local disk. Some shops try running over NFS or Windows file sharing and this often does not work well. Lance On 10/6/09, Giovanni Fernandez-Kincade wrote: > Is it possible that deletions are triggering these commits? Some of the > documents that I'm making indexing requests for already exist in the index, > so they would result in deletions. I tried messing with some of these > parameters but I'm still running into the same problem: > > > > false > > 100 > > > > This is happening like every 30-40minutes and it's really hampering the > indexing progress... > > > -Original Message- > From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] > Sent: Monday, October 05, 2009 2:11 PM > To: solr-user@lucene.apache.org; yo...@lucidimagination.com > Subject: RE: Solr Timeouts > > I just grabbed another stack trace for a thread that has been similarly > blocking for over an hour. Notice that there is no Commit in this one: > > http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 > org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) > org.apache.lucene.index.SegmentTermEnum.next() > org.apache.lucene.index.SegmentTermEnum.scanTo(Term) > org.apache.lucene.index.TermInfosReader.get(Term, boolean) > org.apache.lucene.index.TermInfosReader.get(Term) > org.apache.lucene.index.SegmentTermDocs.seek(Term) > org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) > org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) > org.apache.lucene.index.IndexWriter.applyDeletes() > org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) > org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) > org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) > org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, > AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, > SolrQueryResponse, ContentStream) > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, > SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, > ServletResponse, FilterChain) > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) > 
org.apache.catalina.core.StandardContextValve.invoke(Request, Response) > org.apache.catalina.core.StandardHostValve.invoke(Request, Response) > org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) > org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) > org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) > org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, > Object[]) > org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, > TcpConnection, Object[]) > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[]) > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() > java.lang.Thread.run() > > > -Original Message- > From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley > Sent: Monday, October 05, 2009 1:18 PM > To: solr-user@lucene.apache.org > Subject: Re: Solr Timeouts > > OK... next step is to verify that SolrCell doesn't have a bug that > causes it to commit. > I'll try and verify today unless someone else beats me to it. > > -Yonik > http://www.lucidimagination.com > > On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade > wrote: >> I'm fairly certain that all of the indexing jobs are calling SOLR with >> commit=false. They all construct
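The applyDeletes frames in both traces show buffered deletions being applied during a flush, which matches the suspicion above. One related knob from the example solrconfig.xml of this era caps how many deletions Solr buffers before applying them; a sketch (whether tuning it helps in this particular case is unverified):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- limit the number of deletions Solr will buffer during doc updating;
       lower values bound memory but pay the applyDeletes cost more often -->
  <maxPendingDeletes>100000</maxPendingDeletes>
</updateHandler>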
Re: Importing CSV file slow/crashes
Is it possible to narrow down what fields/field-types are causing the problems? Or perhaps profile and see what's taking up time compared to the older version? -Yonik http://www.lucidimagination.com On Tue, Oct 6, 2009 at 1:48 PM, Nasseam Elkarra wrote: > Hello Erick, > > Sorry about that. I'm using the CSV update handler. Uploading a local CSV > using the stream.file parameter. There are 94 fields and 36 copyFields. > > Thank you, > Nasseam > > On Oct 6, 2009, at 10:09 AM, Erick Erickson wrote: > >> Well, without some better idea of *how* you're doing the import, it's a >> little hard to say anything meaningful (hint, hint). >> Best >> Erick >> >> On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra >> wrote: >> >>> Hello all, >>> >>> I had a dev build of 1.4 from 5/1/2009 and importing a 20K-row file took less >>> than a minute. Updating to the latest as of yesterday, the import is >>> really >>> slow and I had to cancel it after a half hour. This prevented me from >>> upgrading a few months ago as well. >>> >>> Any ideas as to the cause of this? >>> >>> Thank you, >>> >>> Nasseam Elkarra >>> http://bodukai.com/boutique/ >>> The fastest possible shopping experience. >>> >>> > >
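For reference, the setup being described boils down to an invocation like this (path hypothetical; stream.file only works with remote streaming enabled via enableRemoteStreaming="true" in solrconfig.xml's requestParsers element):

curl 'http://localhost:8983/solr/update/csv?stream.file=/tmp/products.csv&commit=true'

With 94 fields and 36 copyFields per row, per-document analysis cost is a plausible suspect, which is why profiling against the older build was suggested.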
Geo Coding Service
Hi, Can someone suggest a good geocoding service or software for commercial use? I want to find geocodes for a large collection of addresses. I'm looking for a good long-term service. Thanks Ram -- View this message in context: http://www.nabble.com/Geo-Coding-Service-tp25774277p25774277.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: TermsComponent or auto-suggest with filter
Have a look at a blog I posted on how to use EdgeNGrams to build an auto-suggest tool: http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ You could easily add filter queries to this approach. For example, the query used in the blog could add filter queries like this: http://localhost:8983/solr/select/?q=user_query:"i"&wt=json&fl=user_query&indent=on&echoParams=none&rows=10&sort=count desc&fq=yourField:yourQuery&fq=anotherField:anotherQuery -Jay http://www.lucidimagination.com On Tue, Oct 6, 2009 at 4:40 AM, R. Tan wrote: > Hello, > What's the best way to get auto-suggested terms/keywords that are filtered by > one or more fields? TermsComponent should have been the solution, but filters > are not supported. > > Thanks, > Rihaed >
Re: Importing CSV file slow/crashes
Hello Erick, Sorry about that. I'm using the CSV update handler. Uploading a local CSV using the stream.file parameter. There are 94 fields and 36 copyFields. Thank you, Nasseam On Oct 6, 2009, at 10:09 AM, Erick Erickson wrote: Well, without some better idea of *how* you're doing the import, it's a little hard to say anything meaningful (hint, hint). Best Erick On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra wrote: Hello all, I had a dev build of 1.4 from 5/1/2009 and importing a 20K-row file took less than a minute. Updating to the latest as of yesterday, the import is really slow and I had to cancel it after a half hour. This prevented me from upgrading a few months ago as well. Any ideas as to the cause of this? Thank you, Nasseam Elkarra http://bodukai.com/boutique/ The fastest possible shopping experience.
Re: Solr Trunk Heap Space Issues
Okay - I'm sorry - serves me right for working sick. Now that I have put on my glasses and correctly tagged my two eclipse tests: It still appears that trunk likes to use more RAM. I switched both tests to one million iterations and watched the heap. The test from the build around may 5th (I promise :) ) regularly GC's down to about 70-80MB after a fair time of running. It doesn't appear to climb - keeps GC'ing back to 70-80 (after starting at by GC'ing down to 40 for a bit). The test from trunk, after a fair time of running, keeps GC'ing down to about 120-150MB - 150 at the end, slowly working its way up from 90-110 at the beginning. Don't know what that means yet - but it appears trunk likes to use a bit more RAM while indexing. Odd that its so much more because these docs are tiny: String[] fields = {"text","simple" ,"text","test" ,"text","how now brown cow" ,"text","what's that?" ,"text","radical!" ,"text","what's all this about, anyway?" ,"text","just how fast is this text indexing?" }; Mark Miller wrote: > Okay, I juggled the tests in eclipse and flipped the results. So they > make sense. > > Sorry - goose chase on this one. > > Yonik Seeley wrote: > >> I don't see this with trunk... I just tried TestIndexingPerformance >> with 1M docs, and it seemed to work fine. >> Memory use stabilized at 40MB. >> Most memory use was for indexing (not analysis). >> char[] topped out at 4.5MB >> >> -Yonik >> http://www.lucidimagination.com >> >> >> On Tue, Oct 6, 2009 at 12:31 PM, Mark Miller wrote: >> >> >>> Yeah - I was wondering about that ... not sure how these guys are >>> stacking up ... >>> >>> Yonik Seeley wrote: >>> >>> TestIndexingPerformance? What the heck... that's not even multi-threaded! -Yonik http://www.lucidimagination.com On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller wrote: > Darnit - didn't finish that email. This is after running your old short > doc perf test for 10,000 iterations. You see the same thing with 1000 > iterations but much less pronounced eg gettin' worse with more iterations. > > Mark Miller wrote: > > > >> A little before and after. The before is around may 5th'is - the after >> is trunk. >> >> http://myhardshadow.com/memanalysis/before.png >> http://myhardshadow.com/memanalysis/after.png >> >> Mark Miller wrote: >> >> >> >> >>> Took a peak at the checkout around the time he says he's using. >>> >>> CharTokenizer appears to be holding onto much large char[] arrays now >>> than before. Same with snowball.Among - used to be almost nothing, now >>> its largio. >>> >>> The new TokenStream stuff appears to be clinging. Needs to find some >>> inner peace. >>> >>> > > > -- - Mark http://www.lucidimagination.com
Re: Importing CSV file slow/crashes
Well, without some better idea of *how* you're doing the import, it's a little hard to say anything meaningful (hint, hint). Best Erick On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra wrote: > Hello all, > > I had a dev build of 1.4 from 5/1/2009 and importing a 20K-row file took less > than a minute. Updating to the latest as of yesterday, the import is really > slow and I had to cancel it after a half hour. This prevented me from > upgrading a few months ago as well. > > Any ideas as to the cause of this? > > Thank you, > > Nasseam Elkarra > http://bodukai.com/boutique/ > The fastest possible shopping experience. > >
Re: Solr Trunk Heap Space Issues
Okay, I juggled the tests in eclipse and flipped the results. So they make sense. Sorry - goose chase on this one. Yonik Seeley wrote: > I don't see this with trunk... I just tried TestIndexingPerformance > with 1M docs, and it seemed to work fine. > Memory use stabilized at 40MB. > Most memory use was for indexing (not analysis). > char[] topped out at 4.5MB > > -Yonik > http://www.lucidimagination.com > > > On Tue, Oct 6, 2009 at 12:31 PM, Mark Miller wrote: > >> Yeah - I was wondering about that ... not sure how these guys are >> stacking up ... >> >> Yonik Seeley wrote: >> >>> TestIndexingPerformance? >>> What the heck... that's not even multi-threaded! >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >>> >>> >>> On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller wrote: >>> >>> Darnit - didn't finish that email. This is after running your old short doc perf test for 10,000 iterations. You see the same thing with 1000 iterations but much less pronounced eg gettin' worse with more iterations. Mark Miller wrote: > A little before and after. The before is around may 5th'is - the after > is trunk. > > http://myhardshadow.com/memanalysis/before.png > http://myhardshadow.com/memanalysis/after.png > > Mark Miller wrote: > > > >> Took a peak at the checkout around the time he says he's using. >> >> CharTokenizer appears to be holding onto much large char[] arrays now >> than before. Same with snowball.Among - used to be almost nothing, now >> its largio. >> >> The new TokenStream stuff appears to be clinging. Needs to find some >> inner peace. >> -- - Mark http://www.lucidimagination.com
Re: Solr Trunk Heap Space Issues
I don't see this with trunk... I just tried TestIndexingPerformance with 1M docs, and it seemed to work fine. Memory use stabilized at 40MB. Most memory use was for indexing (not analysis). char[] topped out at 4.5MB -Yonik http://www.lucidimagination.com On Tue, Oct 6, 2009 at 12:31 PM, Mark Miller wrote: > Yeah - I was wondering about that ... not sure how these guys are > stacking up ... > > Yonik Seeley wrote: >> TestIndexingPerformance? >> What the heck... that's not even multi-threaded! >> >> -Yonik >> http://www.lucidimagination.com >> >> >> >> On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller wrote: >> >>> Darnit - didn't finish that email. This is after running your old short >>> doc perf test for 10,000 iterations. You see the same thing with 1000 >>> iterations but much less pronounced eg gettin' worse with more iterations. >>> >>> Mark Miller wrote: >>> A little before and after. The before is around may 5th'is - the after is trunk. http://myhardshadow.com/memanalysis/before.png http://myhardshadow.com/memanalysis/after.png Mark Miller wrote: > Took a peak at the checkout around the time he says he's using. > > CharTokenizer appears to be holding onto much large char[] arrays now > than before. Same with snowball.Among - used to be almost nothing, now > its largio. > > The new TokenStream stuff appears to be clinging. Needs to find some > inner peace.
De-basing / re-basing docIDs, or how to effectively pass calculated values from a Scorer or Filter up to (Solr's) QueryComponent.process
(Posted here, per Yonik's suggestion) In the code I'm working with, I generate a cache of calculated values as a by-product within a Filter.getDocIdSet implementation (and within a Query-ized version of the filter and its Scorer method). These values are keyed off the IndexReader's docID values, since that's all that's accessible at that level. Ultimately, however, I need to be able to access these values much higher up in the stack (Solr's QueryComponent.process method), so that I can inject the dynamic values into the response as a fake field. The IDs available there, however, are for the entire index and not just relative to the current IndexReader. I'm still fairly new to Lucene and I've been scratching my head a bit trying to find a reliable way to map these values into the same space, without having to hack up too many base classes. I noticed that there was a related discussion at: http://issues.apache.org/jira/browse/LUCENE-1821?focusedCommentId=12745041#action_12745041 ... but also a bit of disagreement on the suggested strategies. Ideally, I'm also hoping there's a strategy that won't require me to hack up too much of the core product; subclassing IndexSearcher in the way suggested would basically require me to change all of the various SearchComponents I use in Solr, and that sounds like it'd end up a real maintenance nightmare. I was looking at the Collector class as a possible solution, since it has knowledge of the docBase, but it looks like I'd then need to change every derived collector the code ultimately uses, including the various anonymous Collectors in Solr, and that also looks like a fairly ghoulish solution. I suppose I'm being wishful, or lazy, but is there a reasonable and reliable way to do this without having to fork the core code? If not, any suggestions on the best strategy to accomplish this without adding too much overhead every time I want to up-rev the core Lucene and/or Solr code to the latest version? Thanks a ton, Aaron
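For what it's worth, a sketch of the usual re-basing trick against the Lucene 2.9 API, assuming the top-level reader is visible when the cache is built. It does not handle nested MultiReaders, and the class and field names are illustrative:

import java.util.IdentityHashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;

public class DocBaseMap {
    private final Map<IndexReader, Integer> bases = new IdentityHashMap<IndexReader, Integer>();

    public DocBaseMap(IndexReader top) {
        IndexReader[] leaves = top.getSequentialSubReaders();
        if (leaves == null) {
            // atomic reader: a single "segment" starting at base 0
            bases.put(top, 0);
            return;
        }
        int base = 0;
        for (IndexReader leaf : leaves) {
            bases.put(leaf, base);
            base += leaf.maxDoc();
        }
    }

    // call from Filter.getDocIdSet(IndexReader) when caching a value
    public int toGlobal(IndexReader segmentReader, int segmentDocId) {
        return bases.get(segmentReader) + segmentDocId;
    }
}

Built once per top-level reader, this lets values cached under per-segment docIDs be re-keyed to the index-wide docIDs that QueryComponent.process sees, without subclassing IndexSearcher or the Collectors.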
Re: Highlighting exact phrases with Solr
Please try the hl.usePhraseHighlighter=true parameter. (It should be true by default if you use the latest nightly, but I think you don't.) Koji Antonio Calò wrote: Hi Guys I'm getting crazy with the highlighting in Solr. The problem is the following: when I submit an exact phrase query, I get the related results and the related snippets with highlighting. But I've noticed that the *single terms of the phrase are highlighted too*. Here is an example: If I run a search for "quick brown fox", I obtain the correct result with the doc which contains the phrase, but the snippets come back to me like this: The quick brown fox jumps over the lazy dog. The fox is a nice animal. Also, with some documents, only single terms are highlighted instead of the exact sentence, even if the exact phrase is contained in the document, i.e.: The fox is a nice animal. My understanding of highlighting is that if I search for an exact phrase, only the exact phrase should be highlighted. Here is an extract of my solrconfig.xml & schema.xml solrconfig.xml: 500 700 0.5 [-\w ,/\n\"']{20,200} true true schema.xml: Maybe I'm missing something, or my understanding of the highlighting feature is not correct. Any idea? As always, thanks for your support! Regards, Antonio
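For reference, a request of the shape Koji suggests (field name hypothetical):

http://localhost:8983/solr/select?q=content:"quick brown fox"&hl=true&hl.fl=content&hl.usePhraseHighlighter=true

With the phrase highlighter enabled, only terms that take part in the phrase match get highlighted, which is the behavior Antonio expects.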
RE: Solr Timeouts
Is it possible that deletions are triggering these commits? Some of the documents that I'm making indexing requests for already exist in the index, so they would result in deletions. I tried messing with some of these parameters but I'm still running into the same problem: false 100 This is happening like every 30-40minutes and it's really hampering the indexing progress... -Original Message- From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] Sent: Monday, October 05, 2009 2:11 PM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Subject: RE: Solr Timeouts I just grabbed another stack trace for a thread that has been similarly blocking for over an hour. Notice that there is no Commit in this one: http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) org.apache.lucene.index.SegmentTermEnum.next() org.apache.lucene.index.SegmentTermEnum.scanTo(Term) org.apache.lucene.index.TermInfosReader.get(Term, boolean) org.apache.lucene.index.TermInfosReader.get(Term) org.apache.lucene.index.SegmentTermDocs.seek(Term) org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) org.apache.lucene.index.IndexWriter.applyDeletes() org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, AddUpdateCommand) org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, SolrQueryResponse, ContentStream) org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, SolrQueryResponse) org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse) org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, SolrQueryResponse) org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, ServletResponse, FilterChain) org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse) org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse) org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) org.apache.catalina.core.StandardContextValve.invoke(Request, Response) org.apache.catalina.core.StandardHostValve.invoke(Request, Response) org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, Object[]) 
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, Object[]) org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[]) org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() java.lang.Thread.run() -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Monday, October 05, 2009 1:18 PM To: solr-user@lucene.apache.org Subject: Re: Solr Timeouts OK... next step is to verify that SolrCell doesn't have a bug that causes it to commit. I'll try and verify today unless someone else beats me to it. -Yonik http://www.lucidimagination.com On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade wrote: > I'm fairly certain that all of the indexing jobs are calling SOLR with > commit=false. They all construct the indexing URLs using a CLR function I > wrote, which takes in a Commit parameter, which is always set to false. > > Also, I don't see any calls to commit in the Tomcat logs (whereas normally > when I make a commit call I do). > > This suggests that Solr is doing it automatically, but the extract handler > doesn't seem to be the problem: > class="org.apache.solr.handler.extraction.ExtractingRequestHandler" > startup="lazy"> > > ignored_ > fileData > > > > > There is no external config file specified, and I don't see anything about > co
Re: Solr Trunk Heap Space Issues
Yeah - I was wondering about that ... not sure how these guys are stacking up ... Yonik Seeley wrote: > TestIndexingPerformance? > What the heck... that's not even multi-threaded! > > -Yonik > http://www.lucidimagination.com > > > > On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller wrote: > >> Darnit - didn't finish that email. This is after running your old short >> doc perf test for 10,000 iterations. You see the same thing with 1000 >> iterations but much less pronounced eg gettin' worse with more iterations. >> >> Mark Miller wrote: >> >>> A little before and after. The before is around may 5th'is - the after >>> is trunk. >>> >>> http://myhardshadow.com/memanalysis/before.png >>> http://myhardshadow.com/memanalysis/after.png >>> >>> Mark Miller wrote: >>> >>> Took a peak at the checkout around the time he says he's using. CharTokenizer appears to be holding onto much large char[] arrays now than before. Same with snowball.Among - used to be almost nothing, now its largio. The new TokenStream stuff appears to be clinging. Needs to find some inner peace. Yonik Seeley wrote: > On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn wrote: > > > > >> Ok we have done some more testing on this issue. When I only have the 1 >> core the reindex completes fine. However, when I added a second core >> with >> no documents it runs out of heap again. This time the heap was 322Mb of >> LRUCache. The 1 query that warms returns exactly 2 documents so I have >> no >> idea where the LRUCache is getting its information or what is even in >> there. >> >> >> >> > I guess the obvious thing to check would be the custom search component. > Does it access documents? I don't see how else the document cache > could self populate with so many entries (assuming it is the document > cache again). > > -Yonik > http://www.lucidimagination.com > > > > > > > > >> -- >> Jeff Newburn >> Software Engineer, Zappos.com >> jnewb...@zappos.com - 702-943-7562 >> >> >> >> >> >> >>> From: Yonik Seeley >>> Reply-To: >>> Date: Mon, 5 Oct 2009 13:32:32 -0400 >>> To: >>> Subject: Re: Solr Trunk Heap Space Issues >>> >>> On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn >>> wrote: >>> >>> >>> >>> Ok I have eliminated all queries for warming and am still getting the heap space dump. Any ideas at this point what could be wrong? This seems like a huge increase in memory to go from indexing without issues to not being able to even with warming off. >>> Do you have any custom Analyzers, Tokenizers, TokenFilters? >>> Another change is that token streams are reused by caching in a >>> thread-local, so every thread in your server could potentially have a >>> copy of an analysis chain (token stream) per field that you have used. >>> This normally shouldn't be an issue since these will be small. Also, >>> how many unique fields do you have? >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >>> >>> >>> >>> >>> >>> Jeff Newburn Software Engineer, Zappos.com jnewb...@zappos.com - 702-943-7562 > From: Jeff Newburn > Reply-To: > Date: Thu, 01 Oct 2009 08:41:18 -0700 > To: "solr-user@lucene.apache.org" > Subject: Solr Trunk Heap Space Issues > > I am trying to update to the newest version of solr from trunk as of > May > 5th. I updated and compiled from trunk as of yesterday (09/30/2009). > When > I try to do a full import I am receiving a GC heap error after > changing > nothing in the configuration files. Why would this happen in the most > recent versions but not in the version from a few months ago. The > stack > trace is below. 
> > Oct 1, 2009 8:34:32 AM > org.apache.solr.update.processor.LogUpdateProcessor > finish > INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, > 167353, > ...(83 more)]} 0 35991 > Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded > at java.util.Arrays.copyOfRange(Arrays.java:3209) > at java.lang.String.(String.java:215) > at > com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384) > at > com.ctc.wstx.s
Re: Solr Trunk Heap Space Issues
TestIndexingPerformance? What the heck... that's not even multi-threaded! -Yonik http://www.lucidimagination.com On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller wrote: > Darnit - didn't finish that email. This is after running your old short > doc perf test for 10,000 iterations. You see the same thing with 1000 > iterations but much less pronounced eg gettin' worse with more iterations. > > Mark Miller wrote: >> A little before and after. The before is around may 5th'is - the after >> is trunk. >> >> http://myhardshadow.com/memanalysis/before.png >> http://myhardshadow.com/memanalysis/after.png >> >> Mark Miller wrote: >> >>> Took a peak at the checkout around the time he says he's using. >>> >>> CharTokenizer appears to be holding onto much large char[] arrays now >>> than before. Same with snowball.Among - used to be almost nothing, now >>> its largio. >>> >>> The new TokenStream stuff appears to be clinging. Needs to find some >>> inner peace. >>> >>> Yonik Seeley wrote: >>> >>> On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn wrote: > Ok we have done some more testing on this issue. When I only have the 1 > core the reindex completes fine. However, when I added a second core with > no documents it runs out of heap again. This time the heap was 322Mb of > LRUCache. The 1 query that warms returns exactly 2 documents so I have no > idea where the LRUCache is getting its information or what is even in > there. > > > I guess the obvious thing to check would be the custom search component. Does it access documents? I don't see how else the document cache could self populate with so many entries (assuming it is the document cache again). -Yonik http://www.lucidimagination.com > -- > Jeff Newburn > Software Engineer, Zappos.com > jnewb...@zappos.com - 702-943-7562 > > > > > >> From: Yonik Seeley >> Reply-To: >> Date: Mon, 5 Oct 2009 13:32:32 -0400 >> To: >> Subject: Re: Solr Trunk Heap Space Issues >> >> On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn wrote: >> >> >> >>> Ok I have eliminated all queries for warming and am still getting the >>> heap >>> space dump. Any ideas at this point what could be wrong? This seems >>> like a >>> huge increase in memory to go from indexing without issues to not being >>> able >>> to even with warming off. >>> >>> >>> >> Do you have any custom Analyzers, Tokenizers, TokenFilters? >> Another change is that token streams are reused by caching in a >> thread-local, so every thread in your server could potentially have a >> copy of an analysis chain (token stream) per field that you have used. >> This normally shouldn't be an issue since these will be small. Also, >> how many unique fields do you have? >> >> -Yonik >> http://www.lucidimagination.com >> >> >> >> >> >> >>> Jeff Newburn >>> Software Engineer, Zappos.com >>> jnewb...@zappos.com - 702-943-7562 >>> >>> >>> >>> >>> From: Jeff Newburn Reply-To: Date: Thu, 01 Oct 2009 08:41:18 -0700 To: "solr-user@lucene.apache.org" Subject: Solr Trunk Heap Space Issues I am trying to update to the newest version of solr from trunk as of May 5th. I updated and compiled from trunk as of yesterday (09/30/2009). When I try to do a full import I am receiving a GC heap error after changing nothing in the configuration files. Why would this happen in the most recent versions but not in the version from a few months ago. The stack trace is below. 
Oct 1, 2009 8:34:32 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, 167353, ...(83 more)]} 0 35991 Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded at java.util.Arrays.copyOfRange(Arrays.java:3209) at java.lang.String.(String.java:215) at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384) at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821) at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280) at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139) at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt reamHandlerBase.java:54) at org.apache.solr.handler.RequestHan
Re: Solr Trunk Heap Space Issues
Darnit - didn't finish that email. This is after running your old short doc perf test for 10,000 iterations. You see the same thing with 1000 iterations but much less pronounced eg gettin' worse with more iterations. Mark Miller wrote: > A little before and after. The before is around may 5th'is - the after > is trunk. > > http://myhardshadow.com/memanalysis/before.png > http://myhardshadow.com/memanalysis/after.png > > Mark Miller wrote: > >> Took a peak at the checkout around the time he says he's using. >> >> CharTokenizer appears to be holding onto much large char[] arrays now >> than before. Same with snowball.Among - used to be almost nothing, now >> its largio. >> >> The new TokenStream stuff appears to be clinging. Needs to find some >> inner peace. >> >> Yonik Seeley wrote: >> >> >>> On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn wrote: >>> >>> >>> Ok we have done some more testing on this issue. When I only have the 1 core the reindex completes fine. However, when I added a second core with no documents it runs out of heap again. This time the heap was 322Mb of LRUCache. The 1 query that warms returns exactly 2 documents so I have no idea where the LRUCache is getting its information or what is even in there. >>> I guess the obvious thing to check would be the custom search component. >>> Does it access documents? I don't see how else the document cache >>> could self populate with so many entries (assuming it is the document >>> cache again). >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >>> >>> >>> >>> >>> >>> -- Jeff Newburn Software Engineer, Zappos.com jnewb...@zappos.com - 702-943-7562 > From: Yonik Seeley > Reply-To: > Date: Mon, 5 Oct 2009 13:32:32 -0400 > To: > Subject: Re: Solr Trunk Heap Space Issues > > On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn wrote: > > > >> Ok I have eliminated all queries for warming and am still getting the >> heap >> space dump. Any ideas at this point what could be wrong? This seems >> like a >> huge increase in memory to go from indexing without issues to not being >> able >> to even with warming off. >> >> >> > Do you have any custom Analyzers, Tokenizers, TokenFilters? > Another change is that token streams are reused by caching in a > thread-local, so every thread in your server could potentially have a > copy of an analysis chain (token stream) per field that you have used. > This normally shouldn't be an issue since these will be small. Also, > how many unique fields do you have? > > -Yonik > http://www.lucidimagination.com > > > > > > >> Jeff Newburn >> Software Engineer, Zappos.com >> jnewb...@zappos.com - 702-943-7562 >> >> >> >> >> >>> From: Jeff Newburn >>> Reply-To: >>> Date: Thu, 01 Oct 2009 08:41:18 -0700 >>> To: "solr-user@lucene.apache.org" >>> Subject: Solr Trunk Heap Space Issues >>> >>> I am trying to update to the newest version of solr from trunk as of May >>> 5th. I updated and compiled from trunk as of yesterday (09/30/2009). >>> When >>> I try to do a full import I am receiving a GC heap error after changing >>> nothing in the configuration files. Why would this happen in the most >>> recent versions but not in the version from a few months ago. The stack >>> trace is below. 
>>> >>> Oct 1, 2009 8:34:32 AM >>> org.apache.solr.update.processor.LogUpdateProcessor >>> finish >>> INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, >>> 167353, >>> ...(83 more)]} 0 35991 >>> Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log >>> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded >>> at java.util.Arrays.copyOfRange(Arrays.java:3209) >>> at java.lang.String.(String.java:215) >>> at >>> com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384) >>> at >>> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821) >>> at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280) >>> at >>> org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139) >>> at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69) >>> at >>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt >>> reamHandlerBase.java:54) >>> at >>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase. >>> java:131) >>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316
Re: Solr Trunk Heap Space Issues
A little before and after. The before is around may 5th'is - the after is trunk. http://myhardshadow.com/memanalysis/before.png http://myhardshadow.com/memanalysis/after.png Mark Miller wrote: > Took a peak at the checkout around the time he says he's using. > > CharTokenizer appears to be holding onto much large char[] arrays now > than before. Same with snowball.Among - used to be almost nothing, now > its largio. > > The new TokenStream stuff appears to be clinging. Needs to find some > inner peace. > > Yonik Seeley wrote: > >> On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn wrote: >> >> >>> Ok we have done some more testing on this issue. When I only have the 1 >>> core the reindex completes fine. However, when I added a second core with >>> no documents it runs out of heap again. This time the heap was 322Mb of >>> LRUCache. The 1 query that warms returns exactly 2 documents so I have no >>> idea where the LRUCache is getting its information or what is even in there. >>> >>> >> I guess the obvious thing to check would be the custom search component. >> Does it access documents? I don't see how else the document cache >> could self populate with so many entries (assuming it is the document >> cache again). >> >> -Yonik >> http://www.lucidimagination.com >> >> >> >> >> >> >>> -- >>> Jeff Newburn >>> Software Engineer, Zappos.com >>> jnewb...@zappos.com - 702-943-7562 >>> >>> >>> >>> From: Yonik Seeley Reply-To: Date: Mon, 5 Oct 2009 13:32:32 -0400 To: Subject: Re: Solr Trunk Heap Space Issues On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn wrote: > Ok I have eliminated all queries for warming and am still getting the heap > space dump. Any ideas at this point what could be wrong? This seems > like a > huge increase in memory to go from indexing without issues to not being > able > to even with warming off. > > Do you have any custom Analyzers, Tokenizers, TokenFilters? Another change is that token streams are reused by caching in a thread-local, so every thread in your server could potentially have a copy of an analysis chain (token stream) per field that you have used. This normally shouldn't be an issue since these will be small. Also, how many unique fields do you have? -Yonik http://www.lucidimagination.com > Jeff Newburn > Software Engineer, Zappos.com > jnewb...@zappos.com - 702-943-7562 > > > > >> From: Jeff Newburn >> Reply-To: >> Date: Thu, 01 Oct 2009 08:41:18 -0700 >> To: "solr-user@lucene.apache.org" >> Subject: Solr Trunk Heap Space Issues >> >> I am trying to update to the newest version of solr from trunk as of May >> 5th. I updated and compiled from trunk as of yesterday (09/30/2009). >> When >> I try to do a full import I am receiving a GC heap error after changing >> nothing in the configuration files. Why would this happen in the most >> recent versions but not in the version from a few months ago. The stack >> trace is below. 
>> >> Oct 1, 2009 8:34:32 AM >> org.apache.solr.update.processor.LogUpdateProcessor >> finish >> INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, >> 167353, >> ...(83 more)]} 0 35991 >> Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log >> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded >> at java.util.Arrays.copyOfRange(Arrays.java:3209) >> at java.lang.String.(String.java:215) >> at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384) >> at >> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821) >> at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280) >> at >> org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139) >> at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69) >> at >> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt >> reamHandlerBase.java:54) >> at >> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase. >> java:131) >> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) >> at >> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3 >> 38) >> at >> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java: >> 241) >> at >> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application >> FilterChain.java:235) >> at >> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh >> ain.java:206) >> at >> org.apache.catalina.core.StandardWrappe
How to retrieve the index of a string within a field?
Hi, I have a field that holds a sentence. If the user types in a word or a phrase, how can I return the index of this word, or the index of the first word of the phrase? I tried to use &bf=ord..., but it does not work as I expected. Thanks. Elaine
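Solr itself doesn't return term positions in query responses in this era, and bf only influences scoring, it cannot surface a computed value in the response, which is why the ord() attempt didn't behave as hoped. One hedged option is to compute the position client-side from the stored field value. A small Java sketch (simple whitespace word counting; analysis differences between index and query are ignored):

// Returns the 0-based word index of the first occurrence of 'needle'
// in the stored field text, or -1 if it is absent.
static int wordIndex(String storedText, String needle) {
    String[] words = storedText.toLowerCase().split("\\s+");
    String target = needle.toLowerCase();
    for (int i = 0; i < words.length; i++) {
        if (words[i].equals(target)) {
            return i;
        }
    }
    return -1;
}

For a phrase, pass its first word as the needle and verify the remaining words follow, starting from the returned index.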
Re: Solr Trunk Heap Space Issues
Took a peek at the checkout around the time he says he's using. CharTokenizer appears to be holding onto much larger char[] arrays now than before. Same with snowball.Among - used to be almost nothing, now it's largio. The new TokenStream stuff appears to be clinging. Needs to find some inner peace. Yonik Seeley wrote: > On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn wrote: > >> Ok we have done some more testing on this issue. When I only have the 1 >> core the reindex completes fine. However, when I added a second core with >> no documents it runs out of heap again. This time the heap was 322Mb of >> LRUCache. The 1 query that warms returns exactly 2 documents so I have no >> idea where the LRUCache is getting its information or what is even in there. >> > > I guess the obvious thing to check would be the custom search component. > Does it access documents? I don't see how else the document cache > could self populate with so many entries (assuming it is the document > cache again). > > -Yonik > http://www.lucidimagination.com > > > > > >> -- >> Jeff Newburn >> Software Engineer, Zappos.com >> jnewb...@zappos.com - 702-943-7562 >> >> >> >>> From: Yonik Seeley >>> Reply-To: >>> Date: Mon, 5 Oct 2009 13:32:32 -0400 >>> To: >>> Subject: Re: Solr Trunk Heap Space Issues >>> >>> On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn wrote: >>> Ok I have eliminated all queries for warming and am still getting the heap space dump. Any ideas at this point what could be wrong? This seems like a huge increase in memory to go from indexing without issues to not being able to even with warming off. >>> Do you have any custom Analyzers, Tokenizers, TokenFilters? >>> Another change is that token streams are reused by caching in a >>> thread-local, so every thread in your server could potentially have a >>> copy of an analysis chain (token stream) per field that you have used. >>> This normally shouldn't be an issue since these will be small. Also, >>> how many unique fields do you have? >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >>> >>> >>> Jeff Newburn Software Engineer, Zappos.com jnewb...@zappos.com - 702-943-7562 > From: Jeff Newburn > Reply-To: > Date: Thu, 01 Oct 2009 08:41:18 -0700 > To: "solr-user@lucene.apache.org" > Subject: Solr Trunk Heap Space Issues > > I am trying to update to the newest version of solr from trunk as of May > 5th. I updated and compiled from trunk as of yesterday (09/30/2009). > When > I try to do a full import I am receiving a GC heap error after changing > nothing in the configuration files. Why would this happen in the most > recent versions but not in the version from a few months ago? The stack > trace is below.
> > Oct 1, 2009 8:34:32 AM org.apache.solr.update.processor.LogUpdateProcessor > finish > INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, > 167353, > ...(83 more)]} 0 35991 > Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded > at java.util.Arrays.copyOfRange(Arrays.java:3209) > at java.lang.String.(String.java:215) > at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384) > at > com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821) > at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280) > at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139) > at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69) > at > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt > reamHandlerBase.java:54) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase. > java:131) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) > at > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3 > 38) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java: > 241) > at > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application > FilterChain.java:235) > at > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh > ain.java:206) > at > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja > va:233) > at > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja > va:175) > at > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128 > ) > at > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102 > ) > at > org.apache.catalina.core.StandardEngineValve.invoke(S
Problems with DIH XPath flatten
Hi all, I'm trying to set up DataImportHandler to index some XML documents available over web services. The XML includes both content and metadata, so for the indexable content, I'm trying to just index everything under the content tag: The result of this is that the title field gets populated and indexed (there are no child nodes of /document/kbml/kbq), but content does not get indexed at all. Since /document/kbml/body has many children, I expected that flatten="true" would store all of the body text in the field. Instead, it stores nothing at all. I've tried this with many combinations of transformers and flatten options, and the result is the same each time. Here are the relevant field declarations from the schema (the type="text" is just the one from the example's schema.xml). I have tried combinations here as well of stored= and multiValued=, with the same result each time. If it would help troubleshooting, I could send along some sample XML. I don't want to spam the list with an attachment unless it's necessary, though :) Thanks in advance for your help, Adam Foltzer
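The DIH configuration itself was stripped by the list archive, so here is a sketch of the shape being described (the xpaths come from the post; the entity attributes around them are illustrative):

<entity name="doc"
        processor="XPathEntityProcessor"
        url="http://example.com/kb/doc.xml"
        forEach="/document">
  <field column="title" xpath="/document/kbml/kbq"/>
  <field column="content" xpath="/document/kbml/body" flatten="true"/>
</entity>

flatten="true" is supposed to concatenate the text of all child nodes under the matched node into the column, which is what makes the empty result surprising here.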
Fwd: Questions about synonyms and highlighting
Hello, Even short/partial answers could satisfy me :) Nourredine. >Hi, >Can you please give me some answers to these questions: > >1 - How can I get the synonyms found for a keyword? > >I mean I search "foo" and I have in my synonyms.txt file the following tokens: "foo, foobar, fee" (with expand = true) >My index contains "foo" and "foobar". I want to display a message in a result page, on the header for example, with only the 2 matched tokens and not >"fee", >like "Results found for foo and foobar" > >2 - Can Solr run analysis on an index to extract associations between tokens? > >For example, if "foo" often appears with "fee" in a field, it will associate the 2 tokens. > >3 - Is it possible, and if so how can I configure Solr, to enable or disable highlighting for tokens with diacritics? > >Settings for "vélo" (all highlighted) ==> the two words "vélo" and >"velo" are highlighted >Settings for "vélo" ==> the first word "vélo" is highlighted but not >the second: "velo" > >4 - The same question for highlighting with lemmatisation? > >Settings for "manage" (all highlighted) ==> the two words "manage" and >"management" are highlighted >Settings for "manage" ==> the first word "manage" is highlighted but >not the second: "management" > > >Thanks in advance. > >Regards > >Nourredine.
Re: FACET_SORT_INDEX descending?
Reverse alphabetical ordering. The option "index" provides alphabetical ordering. I have a year_facet field that I would like to display in reverse order (most recent years first). Perhaps there is some other way to accomplish this. Thanks. --Gerald Chris Hostetter wrote: : Is there any value for the "f.my_year_facet.facet.sort" parameter that will : return the facet values in descending order? So far I only see "index" and : "count" as the choices. descending what? (count is descending order by count) -Hoss
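There is no built-in descending index sort at this point, so a common workaround is to fetch the values in index order and reverse them client-side. A SolrJ sketch (assumes an initialized SolrServer named server; the field name is from the post):

// request facet values in index (alphabetical) order, then flip them
SolrQuery q = new SolrQuery("*:*");
q.setFacet(true);
q.addFacetField("year_facet");
q.set("f.year_facet.facet.sort", "index");
QueryResponse rsp = server.query(q);
List<FacetField.Count> years =
    new ArrayList<FacetField.Count>(rsp.getFacetField("year_facet").getValues());
Collections.reverse(years); // most recent year first

Since years sort the same lexicographically and numerically, the reversed list is exactly "most recent first".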
Re: Creating cores using SolrJ
Yeah, that is missing. I've just committed a setter/getter for dataDir in the create command. Do this:

CoreAdminRequest.Create req = new CoreAdminRequest.Create();
req.setCoreName(name);
req.setInstanceDir(instanceDir);
req.setDataDir(dataDir);
return req.process(solrServer);

2009/10/6 Licinio Fernández Maurelo : > Hi there, > > I want to create cores using SolrJ, but I also want to create them in a given dataDir. How can I do this? Looking at the CoreAdminRequest methods, I only found: > > > - createCore(name, instanceDir, server) > - createCore(name, instanceDir, server, configFile, schemaFile) > > None of the above methods allows a dataDir param. > > Thx > > -- > Lici > -- - Noble Paul | Principal Engineer| AOL | http://aol.com
Creating cores using SolrJ
Hi there, I want to create cores using SolrJ, but I also want to create them in a given dataDir. How can I do this? Looking at the CoreAdminRequest methods, I only found: - createCore(name, instanceDir, server) - createCore(name, instanceDir, server, configFile, schemaFile) None of the above methods allows a dataDir param. Thx -- Lici
RE: using regular expressions in solr query
Any particular reason for the double quotes in the 2nd and 3rd query examples, but not the 1st, or is this just an artifact of your email? -Todd -Original Message- From: Rakhi Khatwani [mailto:rkhatw...@gmail.com] Sent: Tuesday, October 06, 2009 2:26 AM To: solr-user@lucene.apache.org Subject: using regular expressions in solr query Hi, I have an example in which I want to use a regular expression in my Solr query. For example, suppose I want to search on a sample: raakhi rajnish ninad goureya sheetal ritesh rajnish ninad goureya sheetal where my content field is of type text. When I type in QUERY: content:raa* RESPONSE: raakhi rajnish ninad goureya sheetal QUERY: content:"ra*" RESPONSE: 0 results Because of this I am facing problems with the next query: QUERY: content:"r* rajnish" RESPONSE: 0 results which should ideally return both the results. Any pointers? Regards, Raakhi
Re: search by some functionality
Hi Elaine, You can implement a function query in Solr in two ways: 1. Using the dismax request handler (with the bf parameter). 2. Using the standard request handler (with the _val_ pseudo-field). I recommend the first option. Sandeep Elaine Li wrote: > > Hi Sandeep, > > I read about this chapter before. It did not mention how to create my > own customized function. > Can you point me to some instructions? > Thanks. > Elaine > > -- View this message in context: http://www.nabble.com/search-by-some-functionality-tp25721533p25767741.html Sent from the Solr - User mailing list archive at Nabble.com.
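For illustration, with a hypothetical numeric field named popularity, the two flavors look like:

...&q=laptop&defType=dismax&qf=name&bf=log(popularity)
...&q=laptop _val_:"log(popularity)"

As for Elaine's follow-up about custom functions: Solr 1.4 adds a ValueSourceParser plugin point, registered in solrconfig.xml, for exposing your own functions; earlier versions have no supported hook and effectively require patching the built-in function parser.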
solr optimize - no space left on device
I am attempting to optimize a large shard on solr 1.4 and repeatedly get java.io.IOException: No space left on device. The shard, after a final commit before optimize, shows a size of about 192GB on a 400GB volume. I have successfully optimized 2 other shards that were similarly large without this problem on identical hardware boxes. Before the optimize I see: % df -B1 . Filesystem 1B-blocks Used Available Use% Mounted on /dev/mapper/internal-solr--build--2 435440427008 205681356800 225335255040 48% /l/solrs/build-2 slurm-4:/l/solrs/build-2/data/index % du -B1 205441486848 . There's a slight discrepancy between the du and df which appears to be orphaned inodes. But the du says there should be enough space to handle the doubling in size during optimization. However, for the second time we run out of space and the du and df are wildly different at that point and the volume is at 100% % df -B1 . Filesystem 1B-blocks Used Available Use% Mounted on /dev/mapper/internal-solr--build--2 435440427008 430985760768 30851072 100% /l/solrs/build-2 slurm-4:/l/solrs/build-2/data/index % du -B1 252552298496. At this point it appears orphaned inodes are consuming space and not being freed-up. Any clue as to whether this is a lucene bug a solr bug or some other problem. Error traces follow. Thanks! Phil --- Oct 6, 2009 2:12:37 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {} 0 9110523 Oct 6, 2009 2:12:37 AM org.apache.solr.common.SolrException log SEVERE: java.io.IOException: background merge hit exception: _ojl:C151080 _169w:C141302 _1j36:C80405 _1j35:C2043 _1j34:C192 into _1j37 [optimize] at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2737) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2658) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:401) at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85) at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:168) at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875) at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665) at 
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528) at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81) at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689) at java.lang.Thread.run(Thread.java:619) Caused by: java.io.IOException: No space left on device at java.io.RandomAccessFile.writeBytes(Native Method) at java.io.RandomAccessFile.write(RandomAccessFile.java:466) at org.apache.lucene.store.FSDirectory$FSIndexOutput.flushBuffer(FSDirectory.java:719) at org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:96) at org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:85) at org.apache.lucene.store.BufferedIndexOutput.seek(BufferedIndexOutput.java:124) at org.apache.lucene.store.FSDirectory$FSIndexOutput.seek(FSDirectory.java:744) at org.apache.lucene.index.TermInfosWriter.close(TermInfosWriter.java:220) at org.apache.lucene.index.FormatPostingsFieldsWriter.finish(FormatPostingsFieldsWriter.java:70) at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:493) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:140) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWri
TermsComponent or auto-suggest with filter
Hello, What's the best way to get auto-suggested terms/keywords that are filtered by one or more fields? TermsComponent should have been the solution, but filters are not supported. Thanks, Rihaed
ISOLatin1AccentFilter before or after Snowball?
Hi all, from reading through previous posts on that subject, it seems like the accent filter has to come before the snowball filter. I'd just like to make sure this is so. If it is the case, I'm wondering whether snowball filters for, e.g., French process accented language correctly at all, or whether they remove accents anyway... Or whether accents should be removed whenever making use of snowball filters. And also: it really is meant to take UTF-8 as input, even though it is named ISOLatin1AccentFilter, isn't it? Thanks in advance! Chantal -- Chantal Ackermann
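For concreteness, the ordering being asked about looks like this in schema.xml (a sketch; whether the French Snowball stemmer does better with accents intact is exactly the open question here):

<fieldType name="text_fr" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- accent folding placed before stemming, per the earlier posts -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>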
Re: Re : Re : wildcard searches
You are right, Angel. The problem would still persist. Why don't you consider putting the original data in some field? While querying, you can query on both the fields - the analyzed and the original one. Wildcard queries will not give you any results from the analyzed field but would match the data in your original field. Works? Cheers Avlesh On Tue, Oct 6, 2009 at 2:27 PM, Angel Ice wrote: > Ah yes, got it. > But I'm not sure this will solve my problem. > Because I'm also using the IsoLatin1 filter, which removes accented > characters. > So I will have the same problem with accented characters, because the > original token is not stored with this filter. > > Laurent > > From: Avlesh Singh > To: solr-user@lucene.apache.org > Sent: Tuesday, October 6, 2009, 10:41:56 > Subject: Re: Re: wildcard searches > > You are processing your tokens in the filter that you wrote. I am assuming > it is the first filter being applied and removes the character 'h' from > tokens. When you are doing that, you can preserve the original token in the > same field as well. Because as of now, you are simply removing the > character. Subsequent filters don't even know that there was an 'h' > character in the original token. > > Since wildcard queries are not analyzed, the 'h' character in the query > "hésita*" does NOT get removed during query time. This means that unless the > original token was preserved in the field it wouldn't find any matches. > > This helps? > > Cheers > Avlesh > > On Tue, Oct 6, 2009 at 2:02 PM, Angel Ice wrote: > > > Hi. > > > > Thanks for your answers Christian and Avlesh. > > > > But I don't understand what you mean by: > > "If you want to enable wildcard queries, preserving the original token > > (while processing each token in your filter) might work." > > > > Could you explain this point please? > > > > Laurent > > > > From: Avlesh Singh > > To: solr-user@lucene.apache.org > > Sent: Monday, October 5, 2009, 20:30:54 > > Subject: Re: wildcard searches > > > > Zambrano is right, Laurent. The analyzers for a field are not invoked for > > wildcard queries. Your custom filter is not even getting executed at > > query-time. > > If you want to enable wildcard queries, preserving the original token > > (while processing each token in your filter) might work. > > > > Cheers > > Avlesh > > > > On Mon, Oct 5, 2009 at 10:39 PM, Angel Ice wrote: > > > Hi everyone, > > > > I have a little question regarding the search engine when a wildcard > > character is used in the query. > > Let's take the following example: > > > > - I have sent in indexation the word Hésitation (with an accent on the "e") > > - The filters applied to the field that will handle this word result in > > the indexation of "esit" (the mute H is suppressed (home-made filter), the > > accent too (IsoLatin1Filter), and the SnowballPorterFilter suppresses the > > "ation"). > > > > When I search for "hesitation", "esitation", "ésitation", etc., all is OK; > > the document is returned. > > But as soon as I use a wildcard, like "hésita*", the document is not > > returned. In fact, I have to put the wildcard in a manner that matches the > > indexed term exactly (example "esi*"). > > > > Does the search engine apply the filters to the word that prefixes the > > wildcard? Or does it use this prefix verbatim? > > > > Thanks for your help. > > > > Laurent
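A schema-level sketch of this suggestion (all names illustrative): keep a second copy of the field whose analysis is just tokenization plus lowercasing, and send wildcard queries at both fields. Note that wildcard terms are not analyzed, so the client should lowercase them before querying:

<field name="content" type="text_fr" indexed="true" stored="true"/>
<!-- same tokens, but no accent stripping, h-removal, or stemming -->
<field name="content_orig" type="text_ws_lower" indexed="true" stored="false"/>
<copyField source="content" dest="content_orig"/>

A query like content:esit* OR content_orig:hésita* would then match the document in Laurent's example, since content_orig still holds "hésitation".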
using regular expressions in solr query
Hi, I have an example in which I want to use a regular expression in my Solr query. For example, suppose I want to search on a sample: raakhi rajnish ninad goureya sheetal ritesh rajnish ninad goureya sheetal where my content field is of type text. When I type in QUERY: content:raa* RESPONSE: raakhi rajnish ninad goureya sheetal QUERY: content:"ra*" RESPONSE: 0 results Because of this I am facing problems with the next query: QUERY: content:"r* rajnish" RESPONSE: 0 results which should ideally return both the results. Any pointers? Regards, Raakhi
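Neither the standard nor the dismax query parser of this era supports wildcards inside a quoted phrase, which is why the second and third queries return nothing rather than erroring. A hedged approximation that drops the adjacency requirement:

content:r* AND content:rajnish

This matches both sample documents, but it would also match documents where the r* term and "rajnish" are far apart.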
Re: Need "OR" in DisMax Query
On 05.10.2009 at 20:36, David Giffin wrote:
> Maybe I'm missing something, but I can't seem to get the dismax request handler to perform an OR query. It appears that OR is removed by the stop words.

Hi David,

It's not the stop words. Dismax simply doesn't do any boolean operations; the only things you can do are using +searchWord and -searchWord, or changing to the standard request handler.

best
Ingo

--
Ingo Renner
TYPO3 Core Developer, Release Manager TYPO3 4.2
Apache Solr for TYPO3: http://www.typo3-solr.com
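A sketch of the two options Ingo mentions, assuming Solr 1.3 or later (the query words here are only examples). With dismax, the mm ("minimum should match") parameter set to 1 requires only one of the query words to match, which behaves like an OR across all terms:

  q=ipod+phone&defType=dismax&mm=1

For genuine per-clause boolean control, the standard request handler accepts Lucene query syntax directly:

  q=name:(ipod OR phone)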
Re : Re : wildcard searches
Ah yes, got it.
But I'm not sure this will solve my problem, because I'm also using the IsoLatin1 filter, which removes accented characters. So I will have the same problem with accented characters, since the original token is not stored with this filter.

Laurent

From: Avlesh Singh
To: solr-user@lucene.apache.org
Sent: Tuesday, 6 October 2009, 10:41:56
Subject: Re: Re : wildcard searches

You are processing your tokens in the filter that you wrote. I am assuming it is the first filter being applied and that it removes the character 'h' from tokens. When you are doing that, you can preserve the original token in the same field as well, because as of now you are simply removing the character. Subsequent filters don't even know that there was an 'h' character in the original token.

Since wildcard queries are not analyzed, the 'h' character in the query "hésita*" does NOT get removed at query time. This means that unless the original token was preserved in the field, it wouldn't find any matches.

This helps?

Cheers
Avlesh

On Tue, Oct 6, 2009 at 2:02 PM, Angel Ice wrote:
> [...]
Re: Date field being null
> I am defining a field:

indexed="false" and stored="false"? Really? This field is as good as nothing. What would you use it for?

> Can I have a null for such a field?

Yes, you can. Moreover, as you have sortMissingLast="true" specified in your field type definition, documents having null values in this field will appear at the end for any kind of sorting.

Cheers
Avlesh

On Tue, Oct 6, 2009 at 1:16 PM, Pooja Verlani wrote:
> Hi,
> My fieldtype definition is like:
> [field type definition stripped by the mailing list; it included sortMissingLast="true" and omitNorms="true"]
>
> I am defining a field:
> [field definition stripped by the mailing list]
>
> Can I have a null for such a field? Or is there a way I can use it as a date field only if the value is null? I can't put the field as a string type, as I have to apply a recency sort and some filters on that field.
> Regards,
> Pooja
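A minimal sketch of what Avlesh describes (the type and field names here are hypothetical, since the original XML was stripped by the mailing list): a date field whose type declares sortMissingLast="true", so documents that omit the field sort after all documents that have a value, for both ascending and descending sorts:

  <fieldType name="rdate" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>
  <field name="publish_date" type="rdate" indexed="true" stored="true"/>

  Recency sort, with null dates last:
  q=*:*&sort=publish_date desc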
Re: Re : wildcard searches
You are processing your tokens in the filter that you wrote. I am assuming it is the first filter being applied and that it removes the character 'h' from tokens. When you are doing that, you can preserve the original token in the same field as well, because as of now you are simply removing the character. Subsequent filters don't even know that there was an 'h' character in the original token.

Since wildcard queries are not analyzed, the 'h' character in the query "hésita*" does NOT get removed at query time. This means that unless the original token was preserved in the field, it wouldn't find any matches.

This helps?

Cheers
Avlesh

On Tue, Oct 6, 2009 at 2:02 PM, Angel Ice wrote:
> Hi.
>
> Thanks for your answers Christian and Avlesh.
>
> But I don't understand what you mean by:
> "If you want to enable wildcard queries, preserving the original token (while processing each token in your filter) might work."
>
> Could you explain this point please?
>
> Laurent
>
> [...]
Re : wildcard searches
Hi.

Thanks for your answers Christian and Avlesh.

But I don't understand what you mean by:
"If you want to enable wildcard queries, preserving the original token (while processing each token in your filter) might work."

Could you explain this point please?

Laurent

From: Avlesh Singh
To: solr-user@lucene.apache.org
Sent: Monday, 5 October 2009, 20:30:54
Subject: Re: wildcard searches

Zambrano is right, Laurent. The analyzers for a field are not invoked for wildcard queries. Your custom filter is not even getting executed at query time.
If you want to enable wildcard queries, preserving the original token (while processing each token in your filter) might work.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 10:39 PM, Angel Ice wrote:
> Hi everyone,
>
> I have a little question regarding the search engine when a wildcard character is used in the query. Let's take the following example:
>
> - I have sent for indexing the word "Hésitation" (with an accent on the "e").
> - The filters applied to the field that handles this word result in the indexation of "esit" (the mute H is suppressed by a home-made filter, the accent by the IsoLatin1 filter, and the SnowballPorterFilter suppresses the "ation").
>
> When I search for "hesitation", "esitation", "ésitation" etc., all is OK, the document is returned. But as soon as I use a wildcard, like "hésita*", the document is not returned. In fact, I have to put the wildcard in a manner that matches the indexed term exactly (for example "esi*").
>
> Does the search engine apply the filters to the word that prefixes the wildcard? Or does it use this prefix verbatim?
>
> Thanks for your help.
>
> Laurent
Date field being null
Hi,
My fieldtype definition is like:
[field type definition stripped by the mailing list]

I am defining a field:
[field definition stripped by the mailing list]

Can I have a null for such a field? Or is there a way I can use it as a date field only if the value is null? I can't put the field as a string type, as I have to apply a recency sort and some filters on that field.

Regards,
Pooja
solr reporting tool adapter
Hi,
I wanted to query Solr and send the output to some reporting tool. Has anyone done something like that? Moreover, which reporting tool is good? Any suggestions?

Regards,
Raakhi
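One simple way to feed Solr results into an external reporting tool, assuming the stock response writers (the URL, fields and parameters here are only an example), is to request the output in a machine-readable format and let the tool consume it:

  http://localhost:8983/solr/select?q=*:*&wt=json&rows=100&fl=id,name

The wt parameter selects the response format (xml, json, php, etc. in Solr 1.4), and fl limits the returned fields to the columns the report actually needs.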