Re: How to make Relationships work for Multi-valued Index Fields?
I thought 1.3 supported dynamic fields in schema.xml? Guna On Jan 22, 2009, at 11:54 PM, Shalin Shekhar Mangar wrote: Oops, one more gotcha. The dynamic field support is only in 1.4 trunk. On Fri, Jan 23, 2009 at 1:24 PM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: On Fri, Jan 23, 2009 at 1:08 PM, Gunaranjan Chandraraju < chandrar...@apple.com> wrote: I have set up my DIH to treat these as entities as below I think the only way is to create a dynamic field for each attribute (street, state etc.). Write a transformer to copy the fields from your data config to appropriately named dynamic fields (e.g. street_1, state_1, etc.). To maintain this counter you will need to get/store it with Context#getSessionAttribute(name, val, Context.SCOPE_DOC) and Context#setSessionAttribute(name, val, Context.SCOPE_DOC). I can't think of an easier way. -- Regards, Shalin Shekhar Mangar. -- Regards, Shalin Shekhar Mangar.
Re: How to make Relationships work for Multi-valued Index Fields?
Yes Solr does. But DataImportHandler with the 1.3 release does not support it. However, you can use the trunk data import handler jar with Solr 1.3 if you do not feel comfortable using Solr 1.4 trunk. On Fri, Jan 23, 2009 at 1:36 PM, Gunaranjan Chandraraju < chandrar...@apple.com> wrote: > > I thought 1.3 supported dynamic fields in schema.xml? > > Guna > > > On Jan 22, 2009, at 11:54 PM, Shalin Shekhar Mangar wrote: > > Oops, one more gotcha. The dynamic field support is only in 1.4 trunk. >> >> On Fri, Jan 23, 2009 at 1:24 PM, Shalin Shekhar Mangar < >> shalinman...@gmail.com> wrote: >> >> On Fri, Jan 23, 2009 at 1:08 PM, Gunaranjan Chandraraju < >>> chandrar...@apple.com> wrote: >>> >>> I have setup my DIH to treat these as entities as below >>> baseDir="***" fileName=".*xml" rootEntity="false" dataSource="null" > >>> name="record" processor="XPathEntityProcessor" stream="false" forEach="/record" url="${f.fileAbsolutePath}"> >>> name="record_adr" processor="XPathEntityProcessor" stream="false" forEach="/record/address" url="${f.fileAbsolutePath}"> >>> xpath="/record/address/@street" /> >>> xpath="/record/address//@state" /> >>> xpath="/record/address//@type" /> >>> I think the only way is to create a dynamic field for each attribute >>> (street, state etc.). Write a transformer to copy the fields from your >>> data >>> config to appropriately named dynamic field (e.g. street_1, state_1, >>> etc). >>> To maintain this counter you will need to get/store it with >>> Context#getSessionAttribute(name, val, Context.SCOPE_DOC) and >>> Context#setSessionAttribute(name, val, Context.SCOPE_DOC). >>> >>> I cant't think of an easier way. >>> -- >>> Regards, >>> Shalin Shekhar Mangar. >>> >>> >> >> >> -- >> Regards, >> Shalin Shekhar Mangar. >> > > -- Regards, Shalin Shekhar Mangar.
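For what it's worth, here is a stripped-down sketch of the suffix-counter idea from the thread. The class and method names (AddressSuffixer, flatten) are made up for illustration and this deliberately avoids the real DIH API: a real Transformer would keep the counter with Context#getSessionAttribute / Context#setSessionAttribute using Context.SCOPE_DOC instead of a local variable, and would write the suffixed keys back into the row map.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AddressSuffixer {
    // Flatten child rows (one per address) into per-document dynamic-field
    // names: street_1, state_1, street_2, state_2, ...
    public static Map<String, Object> flatten(List<Map<String, Object>> addressRows) {
        Map<String, Object> doc = new LinkedHashMap<>();
        int counter = 0; // stands in for the SCOPE_DOC session attribute
        for (Map<String, Object> row : addressRows) {
            counter++;
            for (Map.Entry<String, Object> e : row.entrySet()) {
                doc.put(e.getKey() + "_" + counter, e.getValue());
            }
        }
        return doc;
    }
}
```

The resulting field names then only need a matching dynamicField pattern (e.g. *_1, or a wildcard per attribute) in schema.xml.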
stats.jsp - maxDoc and numDoc-help
Hi all, I am new to Solr. I have posted nearly 10 lakh XML docs over the last few months. Now I want to find out the total number of duplicate posts until now. Is stats.jsp's maxDocs minus numDocs the appropriate way to find the total duplicate posts so far? Please guide me to the solution. -- Yours, S.Selvam
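Not an authoritative answer, but the arithmetic comes with a caveat worth stating: maxDoc counts deleted (i.e. overwritten-by-uniqueKey) documents while numDocs does not, so the difference only counts duplicates whose deletions are still present in the index — an optimize purges deletions and resets the gap to zero. The class name below is just for illustration:

```java
public class DuplicateEstimate {
    // Duplicates approximated as deleted docs still in the index:
    // maxDoc (includes deleted) minus numDocs (excludes deleted).
    // Only meaningful since the last optimize.
    public static long duplicates(long maxDoc, long numDocs) {
        return maxDoc - numDocs;
    }
}
```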
Re: Solr Replication: disk space consumed on slave much higher than on master
Hi, I applied the patch and did some more tests - also adding some LOG.info() calls in delTree to see if it actually gets invoked (LOG.info("START: delTree: "+dir.getName()); at the start of that method). I don't see any entries of this showing up in the log file at all, so it looks like delTree doesn't get invoked at all. To be sure, explaining the issue to prevent misunderstanding: - The number of files in the index directory on the slave keeps increasing (in my very small test core, there are now 128 files in the slave's index directory, and only 73 files in the master's index directory) - The directories index.x are still there after replication, but they are empty Are there any other things I can do to check, or more info that I can provide to help fix this? Thanks, bye, Jaco. 2009/1/22 Shalin Shekhar Mangar > On Fri, Jan 23, 2009 at 12:15 AM, Noble Paul നോബിള് नोब्ळ् < > noble.p...@gmail.com> wrote: > > > I have attached a patch which logs the names of the files which could > > not get deleted (which may help us diagnose the problem). If you are > > comfortable applying a patch you may try it out. > > > > I've committed this patch to trunk. > > -- > Regards, > Shalin Shekhar Mangar. >
Re: URL-import field type?
Well, the idea is that the solr engine indexes the contents of a web platform. Each document is a user-side-URL out of which several fields would be fetched through various URL-get-documents (e.g. the full-text-view, e.g. the future openmath representation, e.g. the topics (URIs in an ontology), ...). Would the alternate (and maybe equivalent) way be to stream all documents into one XML document and let the XPath triage act through all fields? That would also work and would take advantage of the XPathEntityProcessor's nice configuration. What bothers me with the HttpDataSource example is that, for now, at least, it is configured to pull a single URL while what is needed (and would provide delta ability) is really to index a list of URLs (for which one would pull regularly the list of recently updated URLs or simply use GET-if-modified-since on all of them). I didn't think that modifying the XPathEntityProcessor was the right thing since it seems based on a single stream. Hints for alternatives eagerly welcome. paul On 23-Jan-09 at 05:45, Noble Paul നോബിള് नोब्ळ् wrote: where is this url coming from? what is the content type of the stream? is it plain text or html? if yes, this is a possible enhancement to DIH On Fri, Jan 23, 2009 at 4:39 AM, Paul Libbrecht wrote: Hello list, after searching around for quite a while, including in the DataImportHandler documentation on the wiki (which looks amazing), I couldn't find a way to indicate to solr that the tokens of that field should be the result of analyzing the tokens of the stream at URL-xxx. I know I was able to imitate that in plain-lucene by crafting a particular analyzer-filter which was only given the URL as content and which gave further the tokens of the stream. Is this the right way in solr? thanks in advance. paul -- --Noble Paul
Re: URL-import field type?
On Fri, Jan 23, 2009 at 2:28 PM, Paul Libbrecht wrote: > Well, > > the idea is that the solr engine indexes the contents of a web platform. > > Each document is a user-side-URL out of which several fields would be > fetched through various URL-get-documents (e.g. the full-text-view, e.g. the > future openmath representation, e.g. the topics (URIs in an ontology), ...). if the responses of these URLs are well-formed XML, they can be channeled to an XPathEntityProcessor (one per field) and they can be processed if the response is not XML, then there is no EntityProcessor that can consume this. We may need to add one. > > Would the alternate (and maybe equivalent) way to stream all documents into > one XML document and let the XPath triage act through all fields? That would > also work would take advantage of the XPathEntityProcessor's nice > configuration. > > What bothers me with the HttpDataSource example is that, for now, at least, > it is configured to pull a single URL while what is needed (and would > provide delta ability) is really to index a list of URLs (for which one > would pull regularly the list of recently update URLs or simply use > GET-if-modified-since on all of them). If-Modified-Since is not supported by HttpDataSource. However you can write a transformer which pings the URL w/ an If-Modified-Since header and skip the document using the $skipDoc option > > I didn't think that modifying the XPathEntityProcessor was the right thing > since it seems based on a single stream. > > Hints for altnernative eagerly welcome. > > paul > > > Le 23-janv.-09 à 05:45, Noble Paul നോബിള് नोब्ळ् a écrit : > >> where is this url coming from? what is the content type of the stream? >> is it plain text or html? 
>> >> if yes, this is a possible enhancement to DIH >> >> >> >> On Fri, Jan 23, 2009 at 4:39 AM, Paul Libbrecht >> wrote: >>> >>> Hello list, >>> >>> after searching around for quite a while, including in the >>> DataImportHandler >>> documentation on the wiki (which looks amazing), I couldn't find a way to >>> indicate to solr that the tokens of that field should be the result of >>> analyzing the tokens of the stream at URL-xxx. >>> >>> I know I was able to imitate that in plain-lucene by crafting a >>> particular >>> analyzer-filter who was only given the URL as content and who gave >>> further >>> the tokens of the stream. >>> >>> Is this the right way in solr? >>> >>> thanks in advance. >>> >>> paul >> >> >> >> -- >> --Noble Paul > > -- --Noble Paul
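A minimal sketch of the transformer idea Noble describes (ping the URL with an If-Modified-Since header, skip on 304). The class and method names are made up for illustration and this is not the DIH Transformer API: a real transformer would set the If-Modified-Since header on the connection, then put "$skipDoc" -> "true" into the row map when the server answers 304 Not Modified.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class IfModifiedSinceCheck {
    // HTTP dates are RFC 1123 format in GMT, e.g. "Fri, 23 Jan 2009 10:10:00 GMT".
    static final SimpleDateFormat HTTP_DATE =
        new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US);
    static { HTTP_DATE.setTimeZone(TimeZone.getTimeZone("GMT")); }

    // Format the last successful fetch time for the If-Modified-Since header.
    public static String headerValue(Date lastFetch) {
        return HTTP_DATE.format(lastFetch);
    }

    // 304 Not Modified means the document is unchanged since lastFetch,
    // so the transformer would set $skipDoc for this row.
    public static boolean skipDoc(int httpStatus) {
        return httpStatus == 304;
    }
}
```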
Re: DIH XPathEntityProcessor fails with docs containing
Seems to work fine on this morning's 23-Jan-2009 nightly. Thanks very much. >On Wed, Jan 21, 2009 at 6:05 PM, Fergus McMenemie wrote: > >> >> After looking at http://issues.apache.org/jira/browse/SOLR-964, >> where >> it seems this issue has been addressed, I had another go at indexing >> documents >> containing DOCTYPE. It failed as follows. >> >> >That patch has not been committed to the trunk yet. I'll take it up. > >-- >Regards, >Shalin Shekhar Mangar. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: URL-import field type?
On 23-Jan-09 at 10:10, Noble Paul നോബിള് नोब्ळ् wrote: if the response is not XML, then there is no EntityProcessor that can consume this. We may need to add one. well, even binary data such as word documents (base64-encoded for example) run the risk of appearing here. They sure need a pile of filters! What bothers me with the HttpDataSource example is that, for now, at least, it is configured to pull a single URL while what is needed (and would provide delta ability) is really to index a list of URLs (for which one would pull regularly the list of recently updated URLs or simply use GET-if-modified-since on all of them). If-Modified-Since is not supported by HttpDataSource. However you can write a transformer which pings the URL w/ an If-Modified-Since header and skip the document using the $skipDoc option I still don't understand how you give several documents to the HttpDataSource. The configuration seems only to allow a single URL. Am I missing something? paul PS: would it be worth chatting about that on irc.freenode.net#solr ?
QTime in microsecond
Is there a way to get QTime in microseconds from Solr? I have a small collection and my response time (QTime) is 0 or 1 milliseconds. I am running benchmark tests and I need more precise timings for comparison. Thanks for your help.
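Since QTime is reported in whole milliseconds, one common workaround (not a Solr feature) is to measure elapsed time on the client with System.nanoTime() around each request, or average many repeated queries. The class name is illustrative and runQuery is a stand-in for your actual Solr call:

```java
public class MicroTimer {
    // Measure one query's wall-clock time in microseconds on the client side.
    public static long elapsedMicros(Runnable runQuery) {
        long start = System.nanoTime();
        runQuery.run();
        return (System.nanoTime() - start) / 1_000L; // ns -> us
    }
}
```

Note that a client-side measurement includes network and serialization overhead, so it is an upper bound on the server-side query time, not a replacement for QTime.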
Re: What can be the reason for stopping solr work after some time?
Hi, thanks for your reply. Sorry for the sparse information in my first post; I just didn't know what to share. Yes, the Java process is still working, but search on the site does not work and I cannot see any HTTP requests at this time in the logs. I have not tested the admin page; this is something that I should test. How can I enable debug mode in Solr? I'm sending you a private message only because I have unsubscribed from the Solr mailing list; I need to subscribe again. On Wed, 2009-01-21 at 22:00 -0800, Chris Hostetter wrote: > : i'm newbie with solr. We have installed with together with ezfind from > : EZ Publish web sites and it is working. But in one of the servers we > : have this kind of problem. It works for example for 3 hours, and then in > : one moment it stop to work, searching and indexing does not work. > > it's pretty hard to make any sort of guess as to what your problem might > be without more information. is your java process still running? does it > responsed to any HTTP requests (ie: do the admin pages work?) what do the > logs say? > > > -Hoss >
facet dates and distributed search
Hey there, I would like to understand why distributed search doesn't support facet dates. As I understand it, there would be problems because if the time of the servers is not synchronized, the results would not be exact, but... in my case I wouldn't mind if the results are not completely exact... would it be possible to use facet dates on distributed search? In case I am completely wrong with this explanation... can someone explain the reason why it's not supported? If I understand it, maybe I could try to do a patch... Thanks in advance. -- View this message in context: http://www.nabble.com/facet-dates-and-distributed-search-tp21621576p21621576.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Intermittent high response times
Hi wojtekpia, That's interesting, I shall be looking into this over the weekend so I shall look at the GC also. I was briefly reading about GC last night, am I right in thinking it could be affected by what version of the jvm I'm using (1.5.0.8), and also what type of Collector is set? What collector is the default, and what would people recommend for an application like Solr? Thanks Waseem On Thu, Jan 22, 2009 at 5:24 PM, wojtekpia wrote: > > I'm experiencing similar issues. Mine seem to be related to old generation > garbage collection. Can you monitor your garbage collection activity? (I'm > using JConsole to monitor it: > http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html). > > In my system, garbage collection usually doesn't cause any trouble. But > once > in a while, the size of the old generation flat-lines for some time > (~dozens > of seconds). When this happens, I see really bad response times from Solr > (not quite as bad as you're seeing, but almost). The old-gen flat-lines > always seem to be right before, or right after the old-gen is garbage > collected. > -- > View this message in context: > http://www.nabble.com/Intermittent-high-response-times-tp21602475p21608986.html > Sent from the Solr - User mailing list archive at Nabble.com. > >
Re: Any advice for facet.prefix for suggestions
Ian, A new field is indeed needed and warranted for this case. Facets only work off indexed terms, not stored. Erik On Jan 22, 2009, at 11:48 PM, Ian Connor wrote: The facet prefix method to get suggestions for search terms really helps. However, it seems to show the indexed rather than the stored terms. For instance, if you have a "word-with-hyphen", it will show the "wordwithhyphen" as a suggestion in fields where I have asked it to strip out these characters (this is a valid facet based on the index but confusing to the user). Here is an example: http://pubget.com/search?suggest=true type "kirschne" and wait for the suggestion of kirschnerwir based on a index of kirschner-wire. Is there a way to have it show the stored version of the words or do I need to mirror a field that does the indexing but without the filters? I am hoping there might be something I am missing here and a new field is not needed. -- Regards, Ian Connor
Re: URL-import field type?
On Fri, Jan 23, 2009 at 2:55 PM, Paul Libbrecht wrote: > > Le 23-janv.-09 à 10:10, Noble Paul നോബിള് नोब्ळ् a écrit : >> >> if the response is not XML ,then there is no EntityProcessor that can >> consume this. We may need to add one. > > well, even binary data such as word documents (base64-encoded for example) > run the risk of appearing here. They sure need a pile of filters! > >>> What bothers me with the HttpDataSource example is that, for now, at >>> least, >>> it is configured to pull a single URL while what is needed (and would >>> provide delta ability) is really to index a list of URLs (for which one >>> would pull regularly the list of recently update URLs or simply use >>> GET-if-modified-since on all of them). >> >> The if-modified since is not supported by HttpdataSource. However you >> can write a transformer which pings the URL w/ a if-modified-since >> header an skip the document using the $skipDoc option > > I still don't understand how you give several documents to the > HttpDataSource. > The configuration seems only to allow a single URL. > Am I missing something? The DataSource is like a helper class. The only intelligent piece here is an EntityProcessor. > > paul > > PS: would it be worth chatting about that on irc.freenode.net#solr ? -- --Noble Paul
Re: Solr Replication: disk space consumed on slave much higher than on master
On Fri, Jan 23, 2009 at 2:12 PM, Jaco wrote: > Hi, > > I applied the patch and did some more tests - also adding some LOG.info() > calls in delTree to see if it actually gets invoked (LOG.info("START: > delTree: "+dir.getName()); at the start of that method). I don't see any > entries of this showing up in the log file at all, so it looks like delTree > doesn't get invoked at all. > > To be sure, explaining the issue to prevent misunderstanding: > - The number of files in the index directory on the slave keeps increasing > (in my very small test core, there are now 128 files in the slave's index > directory, and only 73 files in the master's index directory) > - The directories index.x are still there after replication, but they > are empty > > Are there any other things I can do to check, or more info that I can provide > to help fix this? > The problem is that when we do a commit on the slave after replication is done, the commit does not re-open the IndexWriter. Therefore, the deletion policy does not take effect and older files are left as is. This can keep on building up. The only solution is to re-open the index writer. I think the attached patch can solve this problem. Can you try this and let us know? Thank you for your patience. -- Regards, Shalin Shekhar Mangar. 
Index: src/java/org/apache/solr/handler/SnapPuller.java
===================================================================
--- src/java/org/apache/solr/handler/SnapPuller.java  (revision 736746)
+++ src/java/org/apache/solr/handler/SnapPuller.java  Fri Jan 23 16:47:41 IST 2009
@@ -27,6 +27,7 @@
 import static org.apache.solr.handler.ReplicationHandler.*;
 import org.apache.solr.search.SolrIndexSearcher;
 import org.apache.solr.update.CommitUpdateCommand;
+import org.apache.solr.update.DirectUpdateHandler2;
 import org.apache.solr.util.RefCounted;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -281,14 +282,14 @@
       replicationStartTime = 0;
       return successfulInstall;
     } catch (ReplicationHandlerException e) {
-      delTree(tmpIndexDir);
       LOG.error("User aborted Replication");
     } catch (SolrException e) {
-      delTree(tmpIndexDir);
       throw e;
     } catch (Exception e) {
       delTree(tmpIndexDir);
       throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Snappull failed : ", e);
+    } finally {
+      delTree(tmpIndexDir);
     }
     return successfulInstall;
   } finally {
@@ -349,7 +350,15 @@
     cmd.waitFlush = true;
     cmd.waitSearcher = true;
     solrCore.getUpdateHandler().commit(cmd);
+    if (solrCore.getUpdateHandler() instanceof DirectUpdateHandler2) {
+      LOG.info("Re-opening index writer to make sure older index files get deleted");
+      DirectUpdateHandler2 handler = (DirectUpdateHandler2) solrCore.getUpdateHandler();
+      handler.reOpenWriter();
+    } else {
+      LOG.warn("The update handler is not an instance or sub-class of DirectUpdateHandler2. "
+          + "ReplicationHandler may not be able to cleanup un-used index files.");
-    }
+    }
+  }

 /**
Index: src/java/org/apache/solr/update/DirectUpdateHandler2.java
===================================================================
--- src/java/org/apache/solr/update/DirectUpdateHandler2.java  (revision 736614)
+++ src/java/org/apache/solr/update/DirectUpdateHandler2.java  Fri Jan 23 16:23:36 IST 2009
@@ -187,7 +187,7 @@
     addCommands.incrementAndGet();
     addCommandsCumulative.incrementAndGet();
     int rc=-1;
-
+
     // if there is no ID field, use allowDups
     if( idField == null ) {
       cmd.allowDups = true;
@@ -259,7 +259,7 @@
     } finally {
       iwCommit.unlock();
     }
-
+
     if( tracker.timeUpperBound > 0 ) {
       tracker.scheduleCommitWithin( tracker.timeUpperBound );
     }
@@ -294,7 +294,7 @@
       deleteAll();
     } else {
       openWriter();
-      writer.deleteDocuments(q);
+      writer.deleteDocuments(q);
     }
   } finally {
     iwCommit.unlock();
@@ -313,8 +313,15 @@
     }
   }

+  public void reOpenWriter() throws IOException {
+    iwCommit.lock();
+    try {
+      openWriter();
+    } finally {
+      iwCommit.unlock();
+    }
+  }
-
   public void commit(CommitUpdateCommand cmd) throws IOException {
     if (cmd.optimize) {
@@ -419,14 +426,14 @@
       tracker.pending.cancel( true );
       tracker.pending = null;
     }
-    tracker.scheduler.shutdown();
+    tracker.scheduler.shutdown();
     closeWriter();
   } finally {
     iwCommit.unlock();
   }
   log.info("closed " + this);
 }
-
+
 /** Helper class for tracking autoCommit state.
  *
  * Note: This is purely an implementation detail of autoCommit and will
@@ -435,8 +442,8 @@
  *
  * Note: all access must be synchronized.
  */
- class CommitTracker implements Runnable
- {
+ class CommitTracker implements Runnable
+ {
   // scheduler delay for maxDoc-trigger
Re: Solr Replication: disk space consumed on slave much higher than on master
I tested with the patch it has solved both the issues On Fri, Jan 23, 2009 at 5:00 PM, Shalin Shekhar Mangar wrote: > > > On Fri, Jan 23, 2009 at 2:12 PM, Jaco wrote: >> >> Hi, >> >> I applied the patch and did some more tests - also adding some LOG.info() >> calls in delTree to see if it actually gets invoked (LOG.info("START: >> delTree: "+dir.getName()); at the start of that method). I don't see any >> entries of this showing up in the log file at all, so it looks like >> delTree >> doesn't get invoked at all. >> >> To be sure, explaining the issue to prevent misunderstanding: >> - The number of files in the index directory on the slave keeps increasing >> (in my very small test core, there are now 128 files in the slave's index >> directory, and only 73 files in the master's index directory) >> - The directories index.x are still there after replication, but they >> are empty >> >> Are there any other things I can do check, or more info that I can provide >> to help fix this? > > The problem is that when we do a commit on the slave after replication is > done. The commit does not re-open the IndexWriter. Therefore, the deletion > policy does not take affect and older files are left as is. This can keep on > building up. The only solution is to re-open the index writer. > > I think the attached patch can solve this problem. Can you try this and let > us know? Thank you for your patience. > > -- > Regards, > Shalin Shekhar Mangar. > -- --Noble Paul
Re: Solr Replication: disk space consumed on slave much higher than on master
I have opened an issue to track this https://issues.apache.org/jira/browse/SOLR-978 On Fri, Jan 23, 2009 at 5:22 PM, Noble Paul നോബിള് नोब्ळ् wrote: > I tested with the patch > it has solved both the issues > > On Fri, Jan 23, 2009 at 5:00 PM, Shalin Shekhar Mangar > wrote: >> >> >> On Fri, Jan 23, 2009 at 2:12 PM, Jaco wrote: >>> >>> Hi, >>> >>> I applied the patch and did some more tests - also adding some LOG.info() >>> calls in delTree to see if it actually gets invoked (LOG.info("START: >>> delTree: "+dir.getName()); at the start of that method). I don't see any >>> entries of this showing up in the log file at all, so it looks like >>> delTree >>> doesn't get invoked at all. >>> >>> To be sure, explaining the issue to prevent misunderstanding: >>> - The number of files in the index directory on the slave keeps increasing >>> (in my very small test core, there are now 128 files in the slave's index >>> directory, and only 73 files in the master's index directory) >>> - The directories index.x are still there after replication, but they >>> are empty >>> >>> Are there any other things I can do check, or more info that I can provide >>> to help fix this? >> >> The problem is that when we do a commit on the slave after replication is >> done. The commit does not re-open the IndexWriter. Therefore, the deletion >> policy does not take affect and older files are left as is. This can keep on >> building up. The only solution is to re-open the index writer. >> >> I think the attached patch can solve this problem. Can you try this and let >> us know? Thank you for your patience. >> >> -- >> Regards, >> Shalin Shekhar Mangar. >> > > > > -- > --Noble Paul > -- --Noble Paul
Maximum size of document indexed
Hi, I am trying to index a 25 MB word document. I am not able to search all the keywords. Looks like only a certain number of initial words are getting indexed. Is there any limit to the size of document getting indexed? Or is there any word count limit per field? Thanks, Siddharth
Re: Solr Replication: disk space consumed on slave much higher than on master
Hi, I have tested this as well, looking fine! Both issues are indeed fixed, and the index directory of the slaves gets cleaned up nicely. I will apply the changes to all systems I've got running and report back in this thread in case any issues are found. Thanks for the very fast help! I usually need much, much more patience with commercial software vendors.. Cheers, Jaco. 2009/1/23 Noble Paul നോബിള് नोब्ळ् > I have opened an issue to track this > https://issues.apache.org/jira/browse/SOLR-978 > > On Fri, Jan 23, 2009 at 5:22 PM, Noble Paul നോബിള് नोब्ळ् > wrote: > > I tested with the patch > > it has solved both the issues > > > > On Fri, Jan 23, 2009 at 5:00 PM, Shalin Shekhar Mangar > > wrote: > >> > >> > >> On Fri, Jan 23, 2009 at 2:12 PM, Jaco wrote: > >>> > >>> Hi, > >>> > >>> I applied the patch and did some more tests - also adding some > LOG.info() > >>> calls in delTree to see if it actually gets invoked (LOG.info("START: > >>> delTree: "+dir.getName()); at the start of that method). I don't see > any > >>> entries of this showing up in the log file at all, so it looks like > >>> delTree > >>> doesn't get invoked at all. > >>> > >>> To be sure, explaining the issue to prevent misunderstanding: > >>> - The number of files in the index directory on the slave keeps > increasing > >>> (in my very small test core, there are now 128 files in the slave's > index > >>> directory, and only 73 files in the master's index directory) > >>> - The directories index.x are still there after replication, but > they > >>> are empty > >>> > >>> Are there any other things I can do check, or more info that I can > provide > >>> to help fix this? > >> > >> The problem is that when we do a commit on the slave after replication > is > >> done. The commit does not re-open the IndexWriter. Therefore, the > deletion > >> policy does not take affect and older files are left as is. This can > keep on > >> building up. The only solution is to re-open the index writer. 
> >> > >> I think the attached patch can solve this problem. Can you try this and > let > >> us know? Thank you for your patience. > >> > >> -- > >> Regards, > >> Shalin Shekhar Mangar. > >> > > > > > > > > -- > > --Noble Paul > > > > > > -- > --Noble Paul >
search/query issue. sorting, match exact, match first etc
Hi, I am trying to utilize solr into an autocomplete thingy. Let's assume I query for 'foo'. Assuming we work with case insensitive here. I would like to have records returned in specific order. First all that have exact match, then all that start with Foo in alphabetical order, then all that contain the exact word (but not necessarily first) and lastly all matches where foo is anywhere within words. Any pointers are more than welcome. I am trying to find something in archives as well but no luck so far. Example response when searching 'foo' or 'Foo': Foo Foo AAA Foo BBB Gooo Foo Moo Foo xxxfoox Boo Foos
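One client-side way to get the ordering asked for — bucket each candidate (exact = 0, starts-with = 1, contains the whole word = 2, substring = 3) and sort by bucket, then alphabetically. This is only a sketch done outside Solr with made-up names (SuggestRanker, bucket, rank); inside Solr the usual route is several copies of the field with different analysis, queried via dismax with decreasing boosts.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

public class SuggestRanker {
    // Lower bucket = better match. Case-insensitive throughout.
    static int bucket(String term, String candidate) {
        String t = term.toLowerCase(), c = candidate.toLowerCase();
        if (c.equals(t)) return 0;                                        // exact match
        if (c.startsWith(t)) return 1;                                    // starts with term
        if (c.matches(".*\\b" + Pattern.quote(t) + "\\b.*")) return 2;    // contains whole word
        if (c.contains(t)) return 3;                                      // term anywhere within words
        return 4;
    }

    public static List<String> rank(String term, List<String> candidates) {
        List<String> out = new ArrayList<>(candidates);
        out.sort(Comparator.comparingInt((String c) -> bucket(term, c))
                           .thenComparing(String::compareToIgnoreCase));
        return out;
    }
}
```

For 'foo' this yields Foo, then Foo AAA / Foo BBB, then Gooo Foo / Moo Foo, then the substring matches — close to the example response above, though the last bucket comes out alphabetical rather than in the order shown.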
Re: Maximum size of document indexed
Try: http://wiki.apache.org/solr/SolrConfigXml?highlight=(maxfieldlength) Best Erick On Fri, Jan 23, 2009 at 7:29 AM, Gargate, Siddharth wrote: > Hi, > I am trying to index a 25 MB word document. I am not able to search all > the keywords. Looks like only certain number of initial words are > getting indexed. > Is there any limit to the size of document getting indexed? Or is there > any word count limit per field? > > Thanks, > Siddharth >
Re: Master failover - seeking comments
Thanks for the response. Let me clarify things a bit. Regarding the Slaves: Our project is a web application. It is our desire to embed Solr into the web application. The web applications are configured with a local embedded Solr instance configured as a slave, and a remote Solr instance configured as a master. We have a requirement for real-time updates to the Solr indexes. Our strategy is to use the local embedded Solr instance as a read-only repository. Any time a write is made, we will send it to the remote Master. Once a user pushes a write operation to the remote Master, all subsequent read operations for this user now are made against the Master for the duration of the session. This approximates "realtime" updates and seems to work for our purposes. Writes to our system are a small percentage of Read operations. Now, back to the original question. We're simply looking for a failover solution if the Master server goes down. Oh, and we are using the replication scripts to sync the servers. > It seems like you are trying to write to Solr directly from your front end > application. This is why you are thinking of multiple masters. I'll let > others comment on how easy/hard/correct the solution would be. > Well, yes. We have business requirements that want updates to Solr to be realtime, or as close to that as possible, so when a user changes something, our strategy was to save it to the DB and push it to the Solr Master as well. Although, we will have a background application that will help ensure that Solr is in sync with the DB for times that Solr is down and the DB is not. > But, do you really need to have live writes? Can they be channeled through > a > background process? Since you anyway cannot do a commit per-write, the > advantage of live writes is minimal. Moreover you would need to invest a > lot > of time in handling availability concerns to avoid losing updates. 
If you > log/record the write requests to an intermediate store (or queue), you can > do with one master (with another host on standby acting as a slave). > We do need to have live writes, as I mentioned above. The concern you mention about losing live writes is exactly why we are looking at a Master Solr server failover strategy. We thought about having a backup Solr server that is a Slave to the Master and could be easily reconfigured as a new Master in a pinch. Our operations team has pushed us to come up with a solution that would be more seamless. This is why we came up with a Master/Master solution where both Masters are also slaves to each other. >> >> To test this, I ran the following scenario. >> >> 1) Slave 1 (S1) is configured to use M2 as it's master. >> 2) We push an update to M2. >> 3) We restart S1, now pointing to M1. >> 4) We wait for M1 to sync from M2 >> 5) We then sync S1 to M1. >> 6) Success! >> > > How do you co-ordinate all this? > This was just a test scenario I ran manually to see if the setup I described above would even work. Is there a Wiki page that outlines typical web application Solr deployment strategies? There are a lot of questions on the forum about this type of thing (including this one). For those who have expertise in this area, I'm sure there are many who could benefit from this (hint hint). As before, any comments or suggestions on the above would be much appreciated. Thanks, Erik -- View this message in context: http://www.nabble.com/Master-failover---seeking-comments-tp21614750p21625324.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: how can solr search angainst group of field
I think you could use dismax and restrict the results with a filter query. Supposing you're using the dismax query parser, it should look like:

http://localhost:8080/solr/select?q=whatever&fq=category:3

I think this would solve your case.

surfer10 wrote:
>
> definitely dismax does the thing by searching one term against multiple fields,
> but what if my index contains two additional multivalued fields like
> category id?
>
> i need to search against terms in particular fields of documents and
> dismax does this well thru "qf=field1,field2".
> how can i filter results which have only "1" or "2" or "3" in the categoryID
> field?
>
> could you please help me to figure this out?
>
> update: i've found a discussion about that at
> http://www.nabble.com/using-dismax-with-additional-query--td18178512.html#a18178512
> there is a suggestion to use filterquery. i'll check it out
>

--
View this message in context: http://www.nabble.com/how-can-solr-search-angainst-group-of-field-tp21557783p21625476.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: I get SEVERE: Lock obtain timed out
Julian Davchev wrote on 01/20/2009 10:07:48 AM:

> I get SEVERE: Lock obtain timed out
>
> Hi,
> Any documents or something I can read on how locks work and how I can
> control them? When do they occur, etc.?
> Cause the only way I got out of this mess was restarting tomcat.
>
> SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
> timed out: SingleInstanceLock: write.lock

I've seen this with my customized setup. Before I saw the write.lock messages, I had an OutOfMemoryError, but the container didn't shut down. After that, Solr spewed write lock messages and I had to restart. So you might want to search backwards in your logs and see if you can find when the write lock problems began, and whether there is some identifiable problem preceding them.

Jerry Quinn
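For reference, write-lock behaviour is configurable in solrconfig.xml; a hedged sketch, with illustrative values and element names taken from the stock Solr 1.x example config:

```xml
<indexDefaults>
  <!-- how long (ms) to wait for the write lock before
       LockObtainFailedException is thrown -->
  <writeLockTimeout>1000</writeLockTimeout>
  <!-- single | simple | native: which Lucene LockFactory to use -->
  <lockType>single</lockType>
</indexDefaults>

<mainIndex>
  <!-- if true, a lingering write lock is removed at startup, which can
       help after a crash that left a stale lock behind -->
  <unlockOnStartup>false</unlockOnStartup>
</mainIndex>
```

Raising the timeout only masks a stuck writer; as noted above, finding the original failure (e.g. an OutOfMemoryError) is the real fix.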
Fwd: [Travel Assistance] Applications for ApacheCon EU 2009 - Now Open
Begin forwarded message:

From: Tony Stevenson
Date: January 23, 2009 8:28:19 AM EST
To: travel-assista...@apache.org
Subject: [Travel Assistance] Applications for ApacheCon EU 2009 - Now Open

The Travel Assistance Committee is now accepting applications for those wanting to attend ApacheCon EU 2009 between the 23rd and 27th March 2009 in Amsterdam.

The Travel Assistance Committee is looking for people who would like to be able to attend ApacheCon EU 2009 but who need some financial support in order to get there. There are very few places available and the criteria are high; that aside, applications are open to all open source developers who feel that their attendance would benefit themselves, their project(s), the ASF or open source in general.

Financial assistance is available for travel, accommodation and entrance fees, either in full or in part, depending on circumstances. It is intended that all our ApacheCon events are covered, so it may be prudent for those in the United States or Asia to wait until an event closer to them comes up. You are all welcome to apply for ApacheCon EU of course, but there must be compelling reasons for you to attend an event further away than your home location for your application to be considered above those closer to the event location.

More information can be found on the main Apache website at http://www.apache.org/travel/index.html - where you will also find a link to the online application form.

Time is very tight for this event, so applications are open now and will end on the 4th February 2009 - to give enough time for travel arrangements to be made.

Good luck to all those that apply.

Regards,
The Travel Assistance Committee

--
Tony Stevenson
t...@pc-tony.com // pct...@apache.org // pct...@freenode.net
http://blog.pc-tony.com/
1024D/51047D66
ECAF DC55 C608 5E82 0B5E 3359 C9C7 924E 5104 7D66
Method toMultiMap(NamedList params) in SolrParams
Hi,

I'm getting confused about the method toMultiMap(NamedList params) in the SolrParams class. When one of your parameters is an instance of String[], it's converted to a String using the toString() method, which seems wrong to me. It is probably assuming that the values in the NamedList are all String, but when you look at the method toNamedList(), it clearly adds a String[] when the parameter has more than one value. So my question is whether it is a bug or I'm getting something wrong.

public static Map<String,String[]> toMultiMap(NamedList params) {
  HashMap<String,String[]> map = new HashMap<String,String[]>();
  for (int i=0; i<params.size(); i++) {
    String name = params.getName(i);
    String val = params.getVal(i).toString();
    MultiMapSolrParams.addParam(name, val, map);
  }
  return map;
}

public NamedList<Object> toNamedList() {
  final SimpleOrderedMap<Object> result = new SimpleOrderedMap<Object>();
  for (Iterator<String> it=getParameterNamesIterator(); it.hasNext(); ) {
    final String name = it.next();
    final String [] values = getParams(name);
    if (values.length==1) {
      result.add(name,values[0]);
    } else {
      // currently no reason not to use the same array
      result.add(name,values);
    }
  }
  return result;
}

Cheers
Hana

--
View this message in context: http://www.nabble.com/Method-toMultiMap%28NamedList-params%29-in-SolrParams-tp21626588p21626588.html
Sent from the Solr - User mailing list archive at Nabble.com.
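A plain-JDK sketch (not Solr code) of why that toString() call loses data: Java arrays inherit Object.toString(), so a String[] renders as a type tag plus hash code rather than its contents.

```java
// Demonstrates the String[]-to-String pitfall discussed above: calling
// toString() on an array does not render the array's elements.
public class ArrayToStringDemo {
    public static void main(String[] args) {
        String[] values = { "red", "blue" };
        // values.toString() gives something like [Ljava.lang.String;@1b6d3586,
        // so the individual parameter values are gone
        System.out.println(values.toString().startsWith("[Ljava.lang.String;"));
        // Rendering the contents requires e.g. java.util.Arrays.toString()
        System.out.println(java.util.Arrays.toString(values));
    }
}
```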
RE: Performance "dead-zone" due to garbage collection
Can you share your experience with the IBM JDK once you've evaluated it? You are working with a heavy load; I think many would benefit from the feedback.

-Todd Feak

-----Original Message-----
From: wojtekpia [mailto:wojte...@hotmail.com]
Sent: Thursday, January 22, 2009 3:46 PM
To: solr-user@lucene.apache.org
Subject: Re: Performance "dead-zone" due to garbage collection

I'm not sure if you suggested it, but I'd like to try the IBM JVM. Aside from setting my JRE paths, is there anything else I need to do to run inside the IBM JVM? (e.g. re-compiling?)

Walter Underwood wrote:
>
> What JVM and garbage collector setting? We are using the IBM JVM with
> their concurrent generational collector. I would strongly recommend
> trying a similar collector on your JVM. Hint: how much memory is in
> use after a full GC? That is a good approximation to the working set.
>

--
View this message in context: http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21616078.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: QTime in microsecond
The easiest way is to run maybe 100,000 or more queries and take an average. A single microsecond value for a query would be incredibly inaccurate.

-Todd Feak

-----Original Message-----
From: AHMET ARSLAN [mailto:iori...@yahoo.com]
Sent: Friday, January 23, 2009 1:33 AM
To: solr-user@lucene.apache.org
Subject: QTime in microsecond

Is there a way to get QTime in microseconds from Solr? I have a small collection and my response time (QTime) is 0 or 1 milliseconds. I am running benchmark tests and I need more precise running times for comparison.

Thanks for your help.
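One way to act on this advice without changing Solr: time a large batch and divide. A minimal sketch; the loop body here is a placeholder where you would issue the same Solr query repeatedly (e.g. via SolrJ or plain HTTP):

```java
// Sketch: derive an average per-operation time in microseconds by timing
// a large batch, since a single sub-millisecond measurement is too noisy.
public class AvgQueryTime {
    public static void main(String[] args) {
        final int n = 100000;   // number of repetitions to average over
        long sink = 0;          // keeps the loop from being optimized away
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            sink += i;          // placeholder: issue the real query here
        }
        long elapsedNanos = System.nanoTime() - start;
        double avgMicros = (elapsedNanos / 1000.0) / n;
        System.out.println("sink=" + sink);
        System.out.println("avg micros/op = " + avgMicros);
    }
}
```

Note that this measures the client-side round trip, not QTime itself, so it includes network and serialization overhead.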
Re: Intermittent high response times
The type of garbage collector definitely affects performance, but there are other settings as well. There's a related thread currently discussing this:

http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-td21588427.html

hbi dev wrote:
>
> Hi wojtekpia,
>
> That's interesting, I shall be looking into this over the weekend, so I shall
> look at the GC also. I was briefly reading about GC last night; am I right
> in thinking it could be affected by which version of the JVM I'm using
> (1.5.0.8), and also which type of collector is set? What collector is the
> default, and what would people recommend for an application like Solr?
> Thanks
> Waseem
>

--
View this message in context: http://www.nabble.com/Intermittent-high-response-times-tp21602475p21628769.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: stats.jsp - maxDoc and numDoc-help
Hello,

Those two numbers won't necessarily give you the number of duplicates, as they reflect the number of deletes in the index, and those deletes were not necessarily caused by Solr detecting a duplicate insert.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: S.Selvam Siva
> To: solr-user@lucene.apache.org
> Sent: Friday, January 23, 2009 3:33:56 AM
> Subject: stats.jsp - maxDoc and numDoc-help
>
> Hi all,
>
> i am new to solr. I have posted nearly 10 lakh xml docs over the last few
> months.
>
> Now i want to find out the total number of duplicate posts until now.
>
> are stats.jsp's numDocs and maxDocs the appropriate way to find
> out the total number of duplicate posts (maxDocs - numDocs) so far?
> please guide me to the solution.
> --
> Yours,
> S.Selvam
Solr schema causing an error
Hi there,

I just configured my Solr schema file to support the data types I wish to submit for indexing. However, as soon as I try to start the Solr server I get an error trying to reach the admin page.

I know this only has something to do with my definitions in the schema, because when I tried to revert back to the default schema it worked again.

In my new schema I took out only the example definitions I was told to and input the below. Can someone tell me what's wrong?

Also, what's the difference between text/string (I tried with both)? And am I right in thinking that I could set the type to "StrField" to prevent any analysis pre-index?

Cheers for the help!

--
View this message in context: http://www.nabble.com/Solr-schema-causing-an-error-tp21629485p21629485.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr schema causing an error
Are there any error log messages?

The difference between a string and a text is that string is indexed and stored with no modification (it is solr.StrField). The text type is actually defined in the fieldType section and usually contains a tokenizer and some token filters (usually stemming, lowercasing, deduplication).

On 1/23/09 9:52 AM, "Johnny X" wrote:

> Hi there,
>
> I just configured my Solr schema file to support the data types I wish to
> submit for indexing. However, as soon as I try to start the Solr server I get
> an error trying to reach the admin page.
>
> I know this only has something to do with my definitions in the schema,
> because when I tried to revert back to the default schema it worked again.
>
> In my new schema I took out only the example definitions I was told to and
> input the below. Can someone tell me what's wrong?
>
> Also, what's the difference between text/string (I tried with both)? And am
> I right in thinking that I could set the type to "StrField" to prevent any
> analysis pre-index?
>
> Cheers for the help!
>
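To make the distinction concrete, a hedged schema.xml sketch in the style of the stock Solr example config; the field and type names here are illustrative, not taken from the poster's schema:

```xml
<types>
  <!-- string: value is indexed and stored verbatim, no analysis -->
  <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

  <!-- text: tokenized, then run through token filters at index and query time -->
  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>
</types>

<fields>
  <!-- identifiers are usually left unanalyzed; free text is analyzed -->
  <field name="message_id" type="string" indexed="true" stored="true"/>
  <field name="body"       type="text"   indexed="true" stored="true"/>
</fields>
```

So yes: declaring a field with a solr.StrField-based type is the way to prevent any analysis before indexing.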
Re: stats.jsp - maxDoc and numDoc-help
On Fri, Jan 23, 2009 at 10:54 PM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote:

> Hello,
>
> Those two numbers won't necessarily give you the number of duplicates, as
> they reflect the number of deletes in the index, and those deletes were not
> necessarily caused by Solr detecting a duplicate insert.
>
> Otis

thank you otis,

1) Then I can treat "maxDocs - numDocs" as the maximum (upper bound) duplicate post count so far, if I assume no deletions happened other than duplicate deletions.

2) I also have another question: the deletion of an indexed document happens when a duplicate is posted. My aim is to retrieve a particular field (not the unique field) from the indexed document before it is deleted due to duplication.

--
Yours,
S.Selvam
Re: Solr schema causing an error
Ah, gotcha. Where do I go to find the log messages? Obviously it prints a lot of jargon on the admin page reporting the error, but is that what you want? Jeff Newburn wrote: > > Are there any error log messages? > > The difference between a string and text is that string is basically > stored > with no modification (it is the solr.StrField). The text type is actually > defined in the fieldtype section and usually contains a tokenizer and some > analyzers (usually stemming, lowercasing, deduping). > > > On 1/23/09 9:52 AM, "Johnny X" wrote: > >> >> Hi there, >> >> >> I just configured my Solr schema file to support the data types I wish to >> submit for indexing. However, as soon as try and start the Solr server I >> get >> an error trying to reach the admin page. >> >> I know this only has something to do with my definitions in the schema, >> because when I tried to revert back to the default schema it worked >> again. >> >> In my new schema I took out only the example definitions I was told to >> and >> input the below. Can someone tell me what's wrong? >> >> >> >> >> >> >>> stored="true"/> >>> stored="true"/> >>> stored="true"/> >> >> >> >> >> >> >> >> >> >> >> Also, what's the difference between text/string (I tried with both). And >> am >> I right in thinking that I could set the type to "StrField" to prevent >> any >> analysis pre-index? >> >> >> Cheers for the help! >> > > > -- View this message in context: http://www.nabble.com/Solr-schema-causing-an-error-tp21629485p21630425.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr stemming -> preserve original words
hello,

Is it possible to retrieve the original words once Solr (Porter algorithm) stems them? I need to index a bunch of data, store it in Solr, and get back a list of the most frequent terms out of Solr, and I want to see the non-stemmed version of this data.

So basically, I want to enhance this: http://localhost:8983/solr/admin/schema.jsp to see the "top terms" in non-stemmed form.

thanks,
thushara
Re: Solr schema causing an error
The first 10-15 lines of the jargon might help. Additionally, the full exceptions will be in the webserver logs (ie tomcat or jetty logs). On 1/23/09 10:40 AM, "Johnny X" wrote: > > Ah, gotcha. > > Where do I go to find the log messages? Obviously it prints a lot of jargon > on the admin page reporting the error, but is that what you want? > > > > Jeff Newburn wrote: >> >> Are there any error log messages? >> >> The difference between a string and text is that string is basically >> stored >> with no modification (it is the solr.StrField). The text type is actually >> defined in the fieldtype section and usually contains a tokenizer and some >> analyzers (usually stemming, lowercasing, deduping). >> >> >> On 1/23/09 9:52 AM, "Johnny X" wrote: >> >>> >>> Hi there, >>> >>> >>> I just configured my Solr schema file to support the data types I wish to >>> submit for indexing. However, as soon as try and start the Solr server I >>> get >>> an error trying to reach the admin page. >>> >>> I know this only has something to do with my definitions in the schema, >>> because when I tried to revert back to the default schema it worked >>> again. >>> >>> In my new schema I took out only the example definitions I was told to >>> and >>> input the below. Can someone tell me what's wrong? >>> >>> >>> >>> >>> >>> >>>>> stored="true"/> >>>>> stored="true"/> >>>>> stored="true"/> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> Also, what's the difference between text/string (I tried with both). And >>> am >>> I right in thinking that I could set the type to "StrField" to prevent >>> any >>> analysis pre-index? >>> >>> >>> Cheers for the help! >>> >> >> >>
Re: Solr schema causing an error
The important info you are looking for is "undefined field sku at". It looks like there may be a copyfield in the schema looking for a field named sku which does not exist. Just search "sku" in the file and see what comes up. On 1/23/09 11:15 AM, "Johnny X" wrote: > > Well here are the first 10/15 lines: > > HTTP Status 500 - Severe errors in solr configuration. Check your log files > for more detailed information on what may be wrong. If you want solr to > continue after configuration errors, change: > false in null > - > org.apache.solr.common.SolrException: undefined field sku at > org.apache.solr.schema.IndexSchema.getField(IndexSchema.java:994) at > org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:652) > at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:613) at > org.apache.solr.schema.IndexSchema.(IndexSchema.java:92) at > org.apache.solr.core.SolrCore.(SolrCore.java:412) at > org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:1 > 19) > at > org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69) > at > org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterCo > nfig.java:275) > at > org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilte > rConfig.java:397) > at > org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfi > g.java:108) > at > org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709> ) > at org.apache.catalina.core.StandardContext.start(StandardContext.java:4363) > at > org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791> ) > at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771) > at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525) at > org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:830) at > org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:719) at > 
org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:490) at > org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149) at > org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311) > at > org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport. > java:117) > at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053) at > org.apache.catalina.core.StandardHost.start(StandardHost.java:719) at > org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045) at > org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443) at > org.apache.catalina.core.StandardService.start(StandardService.java:516) at > org.apache.catalina.core.StandardServer.start(StandardServer.java:710) at > org.apache.catalina.startup.Catalina.start(Catalina.java:578) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at > java.lang.reflect.Method.invoke(Unknown Source) at > org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288) at > org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413) > > > > Jeff Newburn wrote: >> >> The first 10-15 lines of the jargon might help. Additionally, the full >> exceptions will be in the webserver logs (ie tomcat or jetty logs). >> >> >> On 1/23/09 10:40 AM, "Johnny X" wrote: >> >>> >>> Ah, gotcha. >>> >>> Where do I go to find the log messages? Obviously it prints a lot of >>> jargon >>> on the admin page reporting the error, but is that what you want? >>> >>> >>> >>> Jeff Newburn wrote: Are there any error log messages? The difference between a string and text is that string is basically stored with no modification (it is the solr.StrField). The text type is actually defined in the fieldtype section and usually contains a tokenizer and some analyzers (usually stemming, lowercasing, deduping). 
On 1/23/09 9:52 AM, "Johnny X" wrote: > > Hi there, > > > I just configured my Solr schema file to support the data types I wish > to > submit for indexing. However, as soon as try and start the Solr server > I > get > an error trying to reach the admin page. > > I know this only has something to do with my definitions in the schema, > because when I tried to revert back to the default schema it worked > again. > > In my new schema I took out only the example definitions I was told to > and > input the below. Can someone tell me what's wrong? > > stored="true"/> > > > > > stored="true"/> > stored="true"/> > indexed="false" > stored="true"/> > > > > > > >>>>
Re: Solr schema causing an error
Well here are the first 10/15 lines: HTTP Status 500 - Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. If you want solr to continue after configuration errors, change: false in null - org.apache.solr.common.SolrException: undefined field sku at org.apache.solr.schema.IndexSchema.getField(IndexSchema.java:994) at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:652) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:613) at org.apache.solr.schema.IndexSchema.(IndexSchema.java:92) at org.apache.solr.core.SolrCore.(SolrCore.java:412) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:119) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69) at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275) at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397) at org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:108) at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709) at org.apache.catalina.core.StandardContext.start(StandardContext.java:4363) at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791) at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771) at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525) at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:830) at org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:719) at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:490) at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149) at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311) at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117) at 
org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053) at org.apache.catalina.core.StandardHost.start(StandardHost.java:719) at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045) at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443) at org.apache.catalina.core.StandardService.start(StandardService.java:516) at org.apache.catalina.core.StandardServer.start(StandardServer.java:710) at org.apache.catalina.startup.Catalina.start(Catalina.java:578) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288) at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413) Jeff Newburn wrote: > > The first 10-15 lines of the jargon might help. Additionally, the full > exceptions will be in the webserver logs (ie tomcat or jetty logs). > > > On 1/23/09 10:40 AM, "Johnny X" wrote: > >> >> Ah, gotcha. >> >> Where do I go to find the log messages? Obviously it prints a lot of >> jargon >> on the admin page reporting the error, but is that what you want? >> >> >> >> Jeff Newburn wrote: >>> >>> Are there any error log messages? >>> >>> The difference between a string and text is that string is basically >>> stored >>> with no modification (it is the solr.StrField). The text type is >>> actually >>> defined in the fieldtype section and usually contains a tokenizer and >>> some >>> analyzers (usually stemming, lowercasing, deduping). >>> >>> >>> On 1/23/09 9:52 AM, "Johnny X" wrote: >>> Hi there, I just configured my Solr schema file to support the data types I wish to submit for indexing. However, as soon as try and start the Solr server I get an error trying to reach the admin page. 
I know this only has something to do with my definitions in the schema, because when I tried to revert back to the default schema it worked again. In my new schema I took out only the example definitions I was told to and input the below. Can someone tell me what's wrong? >>> stored="true"/> >>> stored="true"/> >>> stored="true"/> >>> indexed="false" stored="true"/> >>> stored="true"/> Also, what's the difference between text/string (I tried with both). And am I right in thinking that I could set the type to "StrField" to prevent any analysis pre-index? Cheers for the help! >>> >>> >>> > > > -- View this message in context: http://www.nabble.com/Solr-schema-causing-an-error-tp21629485p21630937.html Sent from the Solr - User mailing list archive at N
Re: Solr schema causing an error
Wicked...you fixed it! Thanks very much. Pretty simple in the end I guess...but I thought it might be. Cheers. Jeff Newburn wrote: > > The important info you are looking for is "undefined field sku at". It > looks like there may be a copyfield in the schema looking for a field > named > sku which does not exist. Just search "sku" in the file and see what > comes > up. > > > On 1/23/09 11:15 AM, "Johnny X" wrote: > >> >> Well here are the first 10/15 lines: >> >> HTTP Status 500 - Severe errors in solr configuration. Check your log >> files >> for more detailed information on what may be wrong. If you want solr to >> continue after configuration errors, change: >> false in null >> - >> org.apache.solr.common.SolrException: undefined field sku at >> org.apache.solr.schema.IndexSchema.getField(IndexSchema.java:994) at >> org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:652) >> at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:613) at >> org.apache.solr.schema.IndexSchema.(IndexSchema.java:92) at >> org.apache.solr.core.SolrCore.(SolrCore.java:412) at >> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:1 >> 19) >> at >> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69) >> at >> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterCo >> nfig.java:275) >> at >> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilte >> rConfig.java:397) >> at >> org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfi >> g.java:108) >> at >> > org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709> > ) >> at >> org.apache.catalina.core.StandardContext.start(StandardContext.java:4363) >> at >> > org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791> > ) >> at >> org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771) >> at 
org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525) >> at >> org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:830) at >> org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:719) at >> org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:490) at >> org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149) at >> org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311) >> at >> org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport. >> java:117) >> at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053) >> at >> org.apache.catalina.core.StandardHost.start(StandardHost.java:719) at >> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045) at >> org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443) at >> org.apache.catalina.core.StandardService.start(StandardService.java:516) >> at >> org.apache.catalina.core.StandardServer.start(StandardServer.java:710) at >> org.apache.catalina.startup.Catalina.start(Catalina.java:578) at >> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at >> sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at >> java.lang.reflect.Method.invoke(Unknown Source) at >> org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288) at >> org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413) >> >> >> >> Jeff Newburn wrote: >>> >>> The first 10-15 lines of the jargon might help. Additionally, the full >>> exceptions will be in the webserver logs (ie tomcat or jetty logs). >>> >>> >>> On 1/23/09 10:40 AM, "Johnny X" wrote: >>> Ah, gotcha. Where do I go to find the log messages? Obviously it prints a lot of jargon on the admin page reporting the error, but is that what you want? Jeff Newburn wrote: > > Are there any error log messages? 
> > The difference between a string and text is that string is basically > stored > with no modification (it is the solr.StrField). The text type is > actually > defined in the fieldtype section and usually contains a tokenizer and > some > analyzers (usually stemming, lowercasing, deduping). > > > On 1/23/09 9:52 AM, "Johnny X" wrote: > >> >> Hi there, >> >> >> I just configured my Solr schema file to support the data types I >> wish >> to >> submit for indexing. However, as soon as try and start the Solr >> server >> I >> get >> an error trying to reach the admin page. >> >> I know this only has something to do with my definitions in the >> schema, >> because when I tried to revert back to the default schema it worked >> again. >> >> In my new schema I took out only the example definitions I was told >> to >>
Re: Solr stemming -> preserve original words
I think the best way to get non-stemmed top terms is to index the field using a fieldType that does not employ any stem filter. For example:

By using copyField you can store two (or more) versions of a field, stemmed and non-stemmed.

Just a new field:

And a copy field:

The Schema Browser (Field: text) will give you the top terms.

> Is it possible to retrieve the original words once solr
> (Porter algorithm) stems them?
> I need to index a bunch of data, store it in solr, and get
> back a list of most frequent terms out of solr. and i want to see the
> non-stemmed version of this data.
>
> so basically, i want to enhance this:
> http://localhost:8983/solr/admin/schema.jsp to see the
> "top terms" in non-stemmed form.
>
> thanks,
> thushara
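The schema snippets in the message above were stripped by the list archive; only a class="org.apache.lucene.analysis.standard.StandardAnalyzer" attribute survives in the quoted copy downthread. A hedged reconstruction of the pattern being described, with illustrative type and field names:

```xml
<!-- a text type with no stemming: tokenization only, via Lucene's
     StandardAnalyzer (the class preserved in the quoted reply) -->
<fieldtype name="text_nonstemmed" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</fieldtype>

<!-- a parallel, non-stemmed copy of the main text field -->
<field name="text_nonstemmed" type="text_nonstemmed" indexed="true"
       stored="false" multiValued="true"/>

<copyField source="text" dest="text_nonstemmed"/>
```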
Re: Solr stemming -> preserve original words
hi Ahmet,

thanks. When I look at the non_stemmed_text field to get the top terms, I will not be getting the useful feature of aggregating many related words into one (which is done by stemming).

For example: if a document has run(10), running(20), runner(2), runners(8), I would like to see a "top term" of "run" here. I think with the non-stemmed solution, I will see run, running, runner, runners as separate top terms, so if the term "weather" happens to occur 21 times in the document, it will replace any version of "run" as the top term.

Of course I could go back to the text field for top terms, where I will see "run", but some of the terms in the text field will be non-English (stemmed beyond English, e.g. archiv, perman). So how can I tell if a term I see in the text field is a "badly stemmed" word or not?

Maybe at this point I could use a dictionary? If a term in the text field is not in the dictionary, I would try to find a prefix match from the non-stemmed field? Or maybe there's a better way?

thanks,
thushara

On Fri, Jan 23, 2009 at 11:37 AM, AHMET ARSLAN wrote:

> I think the best way to get non-stemmed top terms is to index the field using a
> fieldType that does not employ any stem filter. For example:
>
> class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
>
> By using copyField you can store two (or more) versions of a field. Stemmed
> and non-stemmed.
>
> Just a new field:
>
> And a copy field:
>
> Schema Browser (Field: text) will give you top terms.
>
> > Is it possible to retrieve the original words once solr
> > (Porter algorithm) stems them?
> > I need to index a bunch of data, store it in solr, and get
> > back a list of most frequent terms out of solr. and i want to see the
> > non-stemmed version of this data.
> >
> > so basically, i want to enhance this:
> > http://localhost:8983/solr/admin/schema.jsp to see the
> > "top terms" in non-stemmed form.
> >
> > thanks,
> > thushara
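The dictionary-plus-prefix idea in the message above can be sketched in a few lines. This is a hedged stand-in, not working Solr code: the word lists are placeholders for a real dictionary and for the terms of a non-stemmed copyField.

```java
import java.util.*;

// Sketch of the proposed heuristic: treat an indexed term that is not a
// dictionary word as "badly stemmed" and recover candidates from the
// non-stemmed field by prefix match.
public class StemRecoveryDemo {
    static List<String> recover(String term, Set<String> dictionary,
                                Collection<String> nonStemmedTerms) {
        if (dictionary.contains(term)) {
            return Collections.singletonList(term); // already a real word
        }
        List<String> candidates = new ArrayList<String>();
        for (String t : nonStemmedTerms) {
            // e.g. "archiv" matches "archive" and "archives"
            if (t.startsWith(term)) candidates.add(t);
        }
        return candidates;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(
            Arrays.asList("weather", "archive", "archives", "run"));
        List<String> nonStemmed =
            Arrays.asList("archive", "archives", "weather", "running");
        System.out.println(recover("archiv", dict, nonStemmed));  // [archive, archives]
        System.out.println(recover("weather", dict, nonStemmed)); // [weather]
    }
}
```

Note the prefix match can over-generate (a stem like "run" would also pick up "rung"), so ranking candidates by frequency in the non-stemmed field may help.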
Issue indexing in Solr
I keep getting the error "FATAL: Solr returned an error: Bad Request".

Solr is running on a different port (8080), so I changed the command line request to "java -Durl=http://localhost:8080/solr/update -jar post.jar *.xml", which seems to at least initiate.

"WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported" appears, but I don't know if that's normal. These XML files were generated using a library in .NET so I'm not sure, but I'd guess they'd encode to UTF-8 by default?

My XML currently looks like this:

<field name="Message-ID"><12929996.1075855668941.javamail.ev...@thyme></field>
<field name="Date">Mon, 31 Dec 1979 16:00:00 -0800 (PST)</field>
<field name="From">phillip.al...@enron.com</field>
<field name="To">mul...@thedoghousemail.com</field>
<field name="Subject">Re: (No Subject)</field>
<field name="Mime-Version">1.0</field>
<field name="Content-Type">text/plain; charset=us-ascii</field>
<field name="Content-Transfer-Encoding">7bit</field>
<field name="X-From">Phillip K Allen</field>
<field name="X-To">mul...@thedoghousemail.com</field>
<field name="X-cc"></field>
<field name="X-Folder">\Phillip_Allen_Dec2000\Notes Folders\All documents</field>
<field name="X-Origin">Allen-P</field>
<field name="X-FileName">pallen.nsf</field>

How is your racing going? What category are you up to?

I

Would the "<12929996.1075855668941.javamail.ev...@thyme>" cause a problem?

Also, on a side note, do I need a <commit/> in the last XML document, or will Solr automatically commit after a set period post-indexing?

Thanks very much.

--
View this message in context: http://www.nabble.com/Issue-indexing-in-Solr-tp21632462p21632462.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Issue indexing in Solr
The best way to find out what was wrong with the request is the web server logs; Solr should log an exception that usually complains about fields being missing or incorrect.

As for committing, Solr has an autocommit option that will fire after a designated number of changes have been entered. If you are planning on updating a record here or there, I would advise sending in the commit yourself; this ensures your data is what you intend it to be. If you are planning on doing large numbers of commits, then autocommit is probably a better bet.

On 1/23/09 12:53 PM, "Johnny X" wrote:

> I keep getting the error "FATAL: Solr returned an error: Bad Request"
>
> Solr is running on a different port (8080) so I changed the command line
> request to "java -Durl=http://localhost:8080/solr/update -jar post.jar
> *.xml"
>
> which seems to at least initiate.
>
> "WARNING: Make sure your XML documents are encoded in UTF-8, other encodings
> are not currently supported" appears, but I don't know if that's normal.
> These XML files were generated using a library in dot net so I'm not sure,
> but I'd guess they'd encode to UTF-8 by default?
>
> My xml currently looks like this:
>
> name="Message-ID"><12929996.1075855668941.javamail.ev...@thyme>< field name="Date">Mon, 31 Dec 1979 16:00:00 -0800 (PST) name="From">phillip.al...@enron.com name="To">mul...@thedoghousemail.comRe: (No Subject)1.0 name="Content-Type">text/plain; charset=us-ascii name="Content-Transfer-Encoding">7bitPhillip K Allenmul...@thedoghousemail.com name="X-cc"> name="X-Folder">\Phillip_Allen_Dec2000\Notes Folders\All documentsAllen-P name="X-FileName">pallen.nsf
>
> How is your racing going? What category are you up to?
>
> I
>
> Would the "" cause a problem?
>
> Also, on a side note, do I need a in the last XML document, or
> will Solr automatically commit after a set period post-indexing?
>
> Thanks very much.
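As background for the autocommit suggestion above, autocommit lives in the updateHandler section of solrconfig.xml; a minimal sketch, with threshold values that are illustrative assumptions rather than recommendations:

```xml
<!-- solrconfig.xml sketch: inside <updateHandler> (values are illustrative) -->
<autoCommit>
  <maxDocs>1000</maxDocs>   <!-- commit after this many pending documents -->
  <maxTime>60000</maxTime>  <!-- or after this many milliseconds -->
</autoCommit>
```

A manual commit, by contrast, is just a `<commit/>` message posted to the update handler (for example via post.jar or curl against http://localhost:8080/solr/update).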
Should I extend DIH to handle POST too?
Hi,

I had earlier described my requirement of needing to 'post XMLs as-is' to SOLR and have them handled just as the DIH would do on import, using the mapping in data-config.xml. I got multiple answers for the 'post approach' - the top two being:

- Use SOLR CELL
- Use SOLRJ

In general I would like to keep all the 'data conversion' inside the SOLR-powered search system rather than having clients do the XSL and transform the XML before sending it (the CELL approach).

My question is: how should I design this?

- Tomcat servlet that provides this 'post' endpoint. Accepts the XML over HTTP, transforms it and calls SOLRJ to update. This is the same Tomcat that houses SOLR.
- SOLR handler (is this the right way?)
- Take this a step further and implement it as an extension to DIH - a handler that will refer to the DIH data-config XML and use the same transformation. This way I can invoke an import for 'batched files' or do a 'post' for the same XML with the same data-config mapping being applied. Maybe it can be a separate handler that just refers to the same data-config.xml and is not necessarily bundled with the DIH handler code.

Looking for some advice. If the DIH extension is the way to go then I would be happy to extend it and contribute it back to SOLR.

Regards,
Guna
Re: Solr stemming -> preserve original words
It seems like what's desired is not so much a stemmer as what you might call a "canonicalizer", which would translate each source word not into its "stem" but into its "most canonical form". Critically, the latter, by definition, is always a legitimate word, e.g. "run". What's more, it's always the "most appropriate word" or "most general word", or some such.

I'm not sure you could implement this except through a massive dictionary. And you'd have trouble because some words would probably be ambiguous between whether they should canonicalize this way or that.

On Fri, Jan 23, 2009 at 11:53 AM, Thushara Wijeratna wrote:

> hi Ahmet,
>
> thanks. when i look at the non_stemmed_text field to get the top terms, i
> will not be getting the useful feature of aggregating many related words
> into one (which is done by stemming).
>
> for ex: if a document has run(10), running(20), runner(2), runners(8) - i
> would like to see a "top term" to be "run" here. i think with the
> non-stemmed solution, i will see run, running, runner, runners as separate
> top terms so if the term "weather" happens to occur 21 times in the
> document, it will replace any version of "run" as the top term.
>
> of course i could go back to the text field for top terms where i will see
> "run", but some of the terms in the text field will be non-english (stemmed
> beyond english, ex: archiv, perman). so how can i tell if a term i see in
> the text field is a "badly stemmed" word or not?
>
> maybe at this point i could use a dictionary? if a term in the text field is
> not in the dictionary, i would try to find a prefix match from the
> non-stemmed field? or maybe there's a better way?
>
> thanks,
> thushara
>
> On Fri, Jan 23, 2009 at 11:37 AM, AHMET ARSLAN wrote:
>
> > I think the best way to get non-stemmed top terms is to index the field using a
> > fieldType that does not employ any stem filter. For example:
> >
> > class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
> >
> > By using copyField you can store two (or more) versions of a field. Stemmed
> > and non-stemmed.
> >
> > Just a new field:
> >
> > And a copy field:
> >
> > Schema Browser (Field: text) will give you top terms.
> >
> > > Is it possible to retrieve the original words once solr
> > > (Porter algorithm) stems them?
> > > I need to index a bunch of data, store it in solr, and get
> > > back a list of most frequent terms out of solr. and i want to see the
> > > non-stemmed version of this data.
> > >
> > > so basically, i want to enhance this:
> > > http://localhost:8983/solr/admin/schema.jsp to see the
> > > "top terms" in non-stemmed form.
> > >
> > > thanks,
> > > thushara
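The schema.xml arrangement suggested in the quoted reply (whose tags were stripped in transit) pairs an unstemmed field type with a copyField; a rough sketch, where the field and type names are illustrative assumptions:

```xml
<!-- schema.xml sketch: a field type with no stem filter -->
<fieldType name="text_nonstem" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</fieldType>

<!-- an unstemmed sibling of the stemmed "text" field -->
<field name="non_stemmed_text" type="text_nonstem" indexed="true" stored="false" multiValued="true"/>

<!-- copy the same source text into it at index time -->
<copyField source="text" dest="non_stemmed_text"/>
```

The Schema Browser's top-terms view for non_stemmed_text then shows surface forms, while the stemmed text field still drives the aggregated counts.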
Re: Solr stemming -> preserve original words
I didn't understand exactly what you want.

if a document has run(10), running(20), runner(2), runners(8):
(assuming the stemmer reduces all those words to run)

with non-stemmed you will see:
running(20)
run(10)
runners(8)
runner(2)

with stemmed you will see:
run(40)

You want to see run as a top term, but also the original words that formed that term?
run(40) => 20 from running, 10 from run, 8 from runners, 2 from runner

Or do you want to see the most frequent terms that passed through the stem filter verbatim (terms that the stemmer didn't change/modify)?

What do you mean by "badly stemmed" word?

> hi Ahmet,
>
> thanks. when i look at the non_stemmed_text field to get the top terms, i
> will not be getting the useful feature of aggregating many related words
> into one (which is done by stemming).
>
> for ex: if a document has run(10), running(20), runner(2), runners(8) - i
> would like to see a "top term" to be "run" here. i think with the
> non-stemmed solution, i will see run, running, runner, runners as separate
> top terms so if the term "weather" happens to occur 21 times in the
> document, it will replace any version of "run" as the top term.
>
> of course i could go back to the text field for top terms where i will see
> "run", but some of the terms in the text field will be non-english (stemmed
> beyond english, ex: archiv, perman). so how can i tell if a term i see in
> the text field is a "badly stemmed" word or not?
>
> maybe at this point i could use a dictionary? if a term in the text field is
> not in the dictionary, i would try to find a prefix match from the
> non-stemmed field? or maybe there's a better way?
>
> thanks,
> thushara
Re: Solr stemming -> preserve original words
Chris, Ahmet - thanks for the responses.

Ahmet - yes, i want to see "run" as a top term, plus the original words that formed that term. The reason is that due to mis-stemming, the terms could become non-english. ex: "permanent" would stem to "perm", "archive" would become "archiv". I need to extract a set of keywords from the indexed content - I'd like these to be correct full english words.

thanks,
thushara

On Fri, Jan 23, 2009 at 2:12 PM, AHMET ARSLAN wrote:

> I didn't understand what exactly you want.
>
> if a document has run(10), running(20), runner(2), runners(8):
> (assuming stemmer reduces all those words to run)
> with non-stemmed you will see:
> running(20)
> run(10)
> runners(8)
> runner(2)
>
> with stemmed you will see:
> run(40)
>
> You want to see run as a top term but also you want to see the original
> words that formed that term?
> run(40) => 20 from running, 10 from run, 8 from runners, 2 from runner
>
> Or do you want to see most frequent terms that passed through stem filter
> verbatim? (terms that stemmer didn't change/modify)
>
> What do you mean by saying "badly stemmed" word?
>
> > hi Ahmet,
> >
> > thanks. when i look at the non_stemmed_text field to get the top terms, i
> > will not be getting the useful feature of aggregating many related words
> > into one (which is done by stemming).
> >
> > for ex: if a document has run(10), running(20), runner(2), runners(8) - i
> > would like to see a "top term" to be "run" here. i think with the
> > non-stemmed solution, i will see run, running, runner, runners as separate
> > top terms so if the term "weather" happens to occur 21 times in the
> > document, it will replace any version of "run" as the top term.
> >
> > of course i could go back to the text field for top terms where i will see
> > "run", but some of the terms in the text field will be non-english (stemmed
> > beyond english, ex: archiv, perman). so how can i tell if a term i see in
> > the text field is a "badly stemmed" word or not?
> >
> > maybe at this point i could use a dictionary? if a term in the text field is
> > not in the dictionary, i would try to find a prefix match from the
> > non-stemmed field? or maybe there's a better way?
> >
> > thanks,
> > thushara
DataImport TXT file entity processor
Is there a way to use Data Import Handler to index non-XML (i.e. simple text) files (either via HTTP or the file system)? I need to put the entire contents of a text file into a single field of a document, and the other fields are being pulled out of Oracle...

-Nathan
faceting question
Hello;

I have a multivalued field named tagList which may contain multiple tags. I am making a query like:

tagList:a AND tagList:b AND tagList:c

and I am also getting a tagList facet returning me some values. What I would like is for Solr to return me facets as if the query was:

tagList:a AND tagList:b

is it even possible?

Best,
-C.B.
Results not appearing
I've indexed my XML using the below in the schema:

Message-ID

However, searching via the Message-ID or Content fields returns 0 results. Using Luke I can still see that these fields are stored, however.

Out of interest, by setting the other fields to just "stored=true", can they be returned in a query as part of a search?

Cheers.
--
View this message in context: http://www.nabble.com/Results-not-appearing-tp21637069p21637069.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Results not appearing
These might be obvious, but:

* I assume you did a Solr commit command after indexing, right?
* If you are using the fieldtype definitions from the default schema.xml, then your "string" fields are not being analyzed, which means you should expect search results only if you enter the entire, exact value of one of the Message-ID or Date fields in your query. Is that your intention?

And yes, your analysis of "stored" seems correct. Stored fields are those whose values you need back at query time, and indexed fields are those you can do queries on. For a few complications, see http://wiki.apache.org/solr/FieldOptionsByUseCase

On Fri, Jan 23, 2009 at 8:04 PM, Johnny X wrote:

> I've indexed my XML using the below in the schema:
>
> required="true"/>
>
> stored="true"/>
>
> Message-ID
>
> However searching via the Message-ID or Content fields returns 0. Using Luke
> I can still see these fields are stored however.
>
> Out of interest, by setting the other fields to just "stored=true", can they
> be returned in a query as part of a search?
>
> Cheers.
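To make the string-vs-text distinction concrete, here is a sketch of the two kinds of field definitions being discussed (the names follow the poster's schema, but the exact attributes and the query value below are illustrative assumptions):

```xml
<!-- schema.xml sketch: an unanalyzed identifier vs. an analyzed body field -->
<field name="Message-ID" type="string" indexed="true" stored="true" required="true"/>
<field name="Content" type="text" indexed="true" stored="true"/>
```

A query like Content:racing can match a single token inside an analyzed text field, whereas Message-ID:"<1234.example@host>" (a hypothetical value) matches only if the quoted string equals the entire indexed value.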
Re: Should I extend DIH to handle POST too?
There's another option: using DIH with Solrj. Take a look at:
https://issues.apache.org/jira/browse/SOLR-853

There's a patch there, but it hasn't been updated to trunk. A contribution would be most welcome.

On Sat, Jan 24, 2009 at 3:11 AM, Gunaranjan Chandraraju < chandrar...@apple.com> wrote:

> Hi
> I had earlier described my requirement of needing to 'post XMLs as-is' to
> SOLR and have it handled just as the DIH would do on import using the
> mapping in data-config.xml. I got multiple answers for the 'post approach'
> - the top two being
>
> - Use SOLR CELL
> - Use SOLRJ
>
> In general I would like to keep all the 'data conversion' inside the SOLR
> powered search system rather than having clients do the XSL and transforming
> the XML before sending them (CELL approach).
>
> My question is? How should I design this
> - Tomcat Servlet that provides this 'post' endpoint. Accepts the XML over
> HTTP, transforms it and calls SOLRJ to update. This is the same TOMCAT that
> houses SOLR.
> - SOLR Handler (Is this the right way?)
> - Take this a step further and implement it as an extension to DIH - a
> handler that will refer to DIH data-config xml and use the same
> transformation. This way I can invoke an import for 'batched files' or do a
> 'post' for the same XML with the same data-config mapping being applied.
> Maybe it can be a separate handler that just refers to the same
> data-config.xml and not necessarily bundled with DIH handler code.
>
> Looking for some advise. If the DIH extension is the way to go then I
> would be happy to extend it and contribute that back to SOLR.
>
> Regards,
> Guna

--
Regards,
Shalin Shekhar Mangar.
Re: DataImport TXT file entity processor
On Sat, Jan 24, 2009 at 5:56 AM, Nathan Adams wrote:

> Is there a way to use Data Import Handler to index non-XML (i.e. simple
> text) files (either via HTTP or FileSystem)? I need to put the entire
> contents of a text file into a single field of a document and the other
> fields are being pulled out of Oracle...

Not yet, but I think it would be nice to have. Can you open an issue in Jira? I think importing from HTTP was something another user asked for recently.

How do you get the url/path of this text file? That would help decide whether we need a Transformer or an EntityProcessor for these tasks.

--
Regards,
Shalin Shekhar Mangar.
Re: faceting question
On Sat, Jan 24, 2009 at 6:56 AM, Cam Bazz wrote:

> Hello;
>
> I got a multiField named tagList which may contain multiple tags. I am
> making a query like:
>
> tagList:a AND tagList:b AND tagList:c
>
> and I am also getting a tagList facet returning me some values.
>
> What I would like is Solr to return me facets as if the query was:
> tagList:a AND tagList:b
>
> is it even possible?

If I understand correctly:

1. You want to query for tagList:a AND tagList:b AND tagList:c
2. At the same time, you want to request facets for tagList, but only for tagList:a and tagList:b

If that is correct, you can use the features introduced by https://issues.apache.org/jira/browse/SOLR-911. However, you may need to put #1 as fq instead of q.

--
Regards,
Shalin Shekhar Mangar.
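The SOLR-911 feature referenced above works by tagging a filter query and excluding that tag when faceting. A sketch of the request, assuming a trunk/1.4 build carrying that patch (parameter names follow the patch's local-param syntax; the tag name is arbitrary):

```
http://localhost:8983/solr/select
  ?q=tagList:a AND tagList:b
  &fq={!tag=ctag}tagList:c
  &facet=true
  &facet.field={!ex=ctag}tagList
```

Here the tagList:c constraint still restricts the result set, but the {!ex=ctag} exclusion makes the tagList facet counts ignore it, so facets come back as if only tagList:a AND tagList:b had been applied.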