Problem of facet on 170M documents
I have an index with 170M documents, and two of the fields for each doc are source and url. I want to know the top 500 most frequent urls from the Video source, so I ran a facet query with fq=source:Video&facet=true&facet.field=url&facet.limit=500; the matching documents number about 9 million. The Solr cluster is hosted on two EC2 instances, each with 4 CPUs and 32GB of memory, of which 16GB is allocated to the Java heap. 4 master shards are on one machine and 4 replicas on the other, connected together via ZooKeeper. Whenever I run the query above, the response takes too long and the client times out. Sometimes an impatient end user waits a few seconds for the results, kills the connection, and then issues the same query again and again. The server then has to deal with multiple such heavy queries simultaneously and becomes so busy that we get a "no server hosting shard" error, probably due to lost communication between the Solr nodes and ZooKeeper. Is there any way to deal with such a problem? Thanks, Ming
Re: Problem of facet on 170M documents
Hi Ming, which Solr version are you using? If you are on one of the latest versions (4.5 or above), try the new parameter facet.threads with a reasonable value (4 to 8 gave me a massive performance speedup when working with large facets, i.e. ~10^7 terms). -Sascha
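For reference, a rough SolrJ sketch of the query from this thread with facet.threads added; the collection URL and thread count are illustrative, and the parameter assumes Solr 4.5 or later:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class UrlFacetExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical collection URL; point this at your own cluster.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery query = new SolrQuery("*:*");
        query.addFilterQuery("source:Video");   // restrict to the Video source
        query.setFacet(true);
        query.addFacetField("url");
        query.setFacetLimit(500);                // top 500 urls
        query.setRows(0);                        // only the facet counts are needed
        query.set("facet.threads", 4);           // parallel facet counting (Solr 4.5+)
        QueryResponse rsp = server.query(query);
        System.out.println(rsp.getFacetField("url").getValues());
    }
}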
Re: Store Solr OpenBitSets In Solr Indexes
Oh fine, the caution point was useful for me. Yes, I wanted to do something similar to filter queries. It is not an XY problem; I am simply trying to implement something as described below. I have [non-clinical] group sets in the system, and I want to build a bitset based on the documents belonging to each group and save it, so that while searching I can retrieve the corresponding bitset from Solr for the matched documents and then execute a logical XOR. [Am I clear with the problem explanation now?] So what I am looking for is: if I have to retrieve a bitset instance from the Solr search engine for the documents matched, how can I get it? And how do I save the bit mapping for the documents belonging to a particular group, to enable the XOR operation? Thanks - David On Fri, Nov 1, 2013 at 5:05 PM, Erick Erickson erickerick...@gmail.com wrote: Why are you saving this? Because if the bitset you're saving has anything to do with, say, filter queries, it's probably useless. The internal bitsets are often based on the internal Lucene doc ID, which will change when segment merges happen, thus the caution. Otherwise, there's the binary type you can probably use. It's not very efficient since I believe it uses base-64 encoding under the covers though... Is this an XY problem? Best, Erick On Wed, Oct 30, 2013 at 8:06 AM, David Philip davidphilipshe...@gmail.com wrote: Hi All, What should be the field type if I have to save Solr's OpenBitSet value within a Solr document object and retrieve it later for search? OpenBitSet bits = new OpenBitSet(); bits.set(0); bits.set(1000); doc.addField("SolrBitSets", bits); What should be the field type of SolrBitSets? Thanks
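For what it's worth, a minimal sketch of one way to round-trip an OpenBitSet through a stored binary field, assuming Lucene/Solr 4.x and a hypothetical field named groupBits of type solr.BinaryField; note Erick's caution that internal Lucene doc IDs shift on segment merges, so the saved bits should be keyed to your own stable IDs, not Lucene's:

import java.nio.ByteBuffer;
import org.apache.lucene.util.OpenBitSet;
import org.apache.solr.common.SolrInputDocument;

public class BitSetFieldExample {
    // Serialize the bitset's backing long[] words into a byte[] for a binary field.
    static byte[] toBytes(OpenBitSet bits) {
        long[] words = bits.getBits();
        ByteBuffer buf = ByteBuffer.allocate(words.length * 8);
        for (long w : words) buf.putLong(w);
        return buf.array();
    }

    // Rebuild the OpenBitSet from the stored bytes.
    static OpenBitSet fromBytes(byte[] data) {
        long[] words = new long[data.length / 8];
        ByteBuffer buf = ByteBuffer.wrap(data);
        for (int i = 0; i < words.length; i++) words[i] = buf.getLong();
        return new OpenBitSet(words, words.length);
    }

    public static void main(String[] args) {
        OpenBitSet bits = new OpenBitSet();
        bits.set(0);
        bits.set(1000);
        SolrInputDocument doc = new SolrInputDocument();
        // The binary field type stores this base64-encoded; "groupBits" is a made-up field name.
        doc.addField("groupBits", toBytes(bits));
        // ...index doc; on retrieval call fromBytes((byte[]) result.getFieldValue("groupBits"))
    }
}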
Re: Background merge errors with Solr 4.4.0 on Optimize call
See: https://issues.apache.org/jira/browse/SOLR-5418 Thanks Matthew and Robert! I'll see if I can get to this this weekend. On Wed, Oct 30, 2013 at 7:45 AM, Erick Erickson erickerick...@gmail.com wrote: Robert: Thanks. I'm on my way out the door, so I'll have to put up a JIRA with your patch later if it hasn't been done already. Erick On Tue, Oct 29, 2013 at 10:14 PM, Robert Muir rcm...@gmail.com wrote: I think it's a bug, but that's just my opinion. I sent a patch to dev@ for thoughts. On Tue, Oct 29, 2013 at 6:09 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, so you're saying that merging indexes where a field has been removed isn't handled. So you have some documents that do have a 'what' field, but your schema doesn't have it, is that true? It _seems_ like you could get by by putting the _what_ field back into your schema, just not sending any data to it in new docs. I'll let others who understand merging better than me chime in on whether this is a case that should be handled or a bug. I pinged the dev list to see what the opinion is. Best, Erick On Mon, Oct 28, 2013 at 6:39 PM, Matthew Shapiro m...@mshapiro.net wrote: Sorry for reposting after I just sent in a reply, but I just looked at the error trace closer and noticed: Caused by: java.lang.IllegalArgumentException: no such field what The 'what' field was removed by request of the customer, as they wanted the logic behind what gets queried in the 'what' field to be code side instead of Solr side (for easier changing without having to re-index everything). I didn't feel strongly either way, and since they are paying me, I took it out. This makes me wonder if it's crashing while merging because a field that used to be there is now gone. However, this seems odd to me, as Solr doesn't even let me delete the old data and instead it's leaving my collection in an extremely bad state, with the only remedy I can think of being to nuke the index at the filesystem level. If this is indeed the cause of the crash, is the only way to delete a field to completely empty your index first? On Mon, Oct 28, 2013 at 6:34 PM, Matthew Shapiro m...@mshapiro.net wrote: Thanks for your response. You were right, Solr is logging to the catalina.out file for Tomcat. When I click the optimize button in Solr's admin interface the following logs are written: http://apaste.info/laup About JVM memory, Solr's admin interface is listing JVM memory at 3.1% (221.7MB is dark grey, 512.56MB light grey and 6.99GB total). On Mon, Oct 28, 2013 at 6:29 AM, Erick Erickson erickerick...@gmail.com wrote: For Tomcat, the Solr output often goes into catalina.out by default, so the output might be there. You can configure Solr to send the logs most anywhere you please, but without some specific setup on your part the log output just goes to the default for the servlet container. I took a quick glance at the code, but since the merges are happening in the background, there's not much context for where that error is thrown. How much memory is there for the JVM? I'm grasping at straws a bit... Erick On Sun, Oct 27, 2013 at 9:54 PM, Matthew Shapiro m...@mshapiro.net wrote: I am working on implementing Solr as the search backend for our web system. So far things have been going well, but today I made some schema changes and now things have broken. I updated the schema.xml file and reloaded the core (via the admin interface). No errors were reported in the logs. I then pushed 100 records to be indexed.
A call to commit afterwards seemed fine, however my next call to optimize caused the following errors: java.io.IOException: background merge hit exception: _2n(4.4):C4263/154 _30(4.4):C134 _32(4.4):C10 _31(4.4):C10 into _37 [maxNumSegments=1] null:java.io.IOException: background merge hit exception: _2n(4.4):C4263/154 _30(4.4):C134 _32(4.4):C10 _31(4.4):C10 into _37 [maxNumSegments=1] Unfortunately, googling for "background merge hit exception" came up with two things: a corrupt index or not enough free space. The host machine that's hosting Solr has 227 out of 229GB free (according to df -h), so that's not it. I then ran CheckIndex on the index and got the following results: http://apaste.info/gmGU As someone who is new to Solr and Lucene, as far as I can tell this means my index is fine. So I am coming up at a loss. I'm fairly sure I could delete my data directory and rebuild it, but I am more interested in finding out why it is having issues, what is the best way to fix it, and what is the best way to prevent it from happening when this goes into production. Does anyone have any advice that may help?
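As an aside, Erick's workaround of putting the removed field back into the schema without sending any data to it might look roughly like this in schema.xml; the text_general type and the attribute values are assumptions, since the original definition of the 'what' field isn't shown in the thread:

<!-- Re-declare the removed field so merges of old segments still resolve it;
     new documents simply omit it. Type and attributes here are guesses. -->
<field name="what" type="text_general" indexed="true" stored="false" required="false"/>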
Re: Custom Plugin exception : Plugin init failure for [schema.xml]
Hi Shawn, Thank you for your answer. I have solved the problem. The problem was that in our code the constructor of TurkishFilterFactory was declared protected; that works without problems on Solr 3.x but gives the exception I mentioned here on 4.x versions. By analyzing the stack trace I saw that it throws an InstantiationException, and making the constructor public solves the problem. On Fri, Nov 1, 2013 at 6:34 PM, Shawn Heisey s...@elyograg.org wrote: On 11/1/2013 4:18 AM, Parvin Gasimzade wrote: I have a problem with custom plugin development in Solr 4.x versions. I have developed a custom filter and am trying to install it, but I get the following exception. Later you indicated that you can use it with Solr 3.x without any problem. Did you recompile your custom plugin against the Solr jars from the new version? There was a *huge* amount of Java class refactoring that went into the 4.0 version as compared to any 3.x version, and that continues with each new 4.x release. I would bet that if you tried that recompile, it would fail due to errors and/or warnings, which you'll need to fix. There might also be operational problems that the compiler doesn't find, due to changes in how the underlying APIs get used. Thanks, Shawn
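To illustrate the fix described above, a bare-bones factory skeleton, assuming the Solr 4.4+ style factory that takes an args map; the class name is just the one mentioned in the thread and the create() body is a placeholder:

import java.util.Map;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class TurkishFilterFactory extends TokenFilterFactory {
    // The constructor must be public: Solr 4.x instantiates factories reflectively,
    // and a protected constructor leads to the InstantiationException seen here.
    public TurkishFilterFactory(Map<String, String> args) {
        super(args);
    }

    @Override
    public TokenStream create(TokenStream input) {
        // Placeholder: the real factory would wrap the stream in the custom Turkish filter.
        return input;
    }
}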
Writing a Solr custom analyzer to post content to Stanbol {was: Need additional data processing in Data Import Handler prior to indexing}
Hi All, I went through possible solutions for my requirement of triggering a Stanbol enhancement during Solr indexing, and I got the requirement simplified. I only need to process the field named content to perform the Stanbol enhancement to extract Persons and Organizations. So I think it will be easier to do the Stanbol request while indexing the content field, after the data is imported (from the DIH). I think the best solution will be to write a custom Analyzer to process the content and post it to Stanbol. In the analyzer I also need to process the Stanbol enhancement response. The response should be processed as a new document to index, with the identified Person and Organization entities stored in a field called extractedEntities. So my current idea is as follows, in schema.xml:

<copyField source="content" dest="stanbolRequest"/>
<field name="stanbolRequest" type="stanbolRequestType" indexed="true" stored="true" docValues="true" required="false"/>
<fieldType name="stanbolRequestType" class="solr.TextField">
  <analyzer class="MyCustomAnalyzer"/>
</fieldType>

In the MyCustomAnalyzer class the content will be posted to and enhanced by Stanbol. The Person and Organization entities in the response should be indexed into the Solr field extractedEntities. Am I on the correct path for my requirement? Please share your ideas. Appreciate any relevant pointers to samples/documentation. Thanks, Dileepa On Wed, Oct 30, 2013 at 11:26 AM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Thanks guys for your ideas. I will go through them and come back with questions. Regards, Dileepa On Wed, Oct 30, 2013 at 7:00 AM, Erick Erickson erickerick...@gmail.com wrote: Third time tonight I've been able to paste this link. Also, you can consider just moving to SolrJ and taking DIH out of the process, see: http://searchhub.org/2012/02/14/indexing-with-solrj/ Whichever approach fits your needs, of course. Best, Erick On Tue, Oct 29, 2013 at 7:15 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: It's also possible to combine an Update Request Processor with DIH. That way if a debug entry needs to be inserted it could go through the same Stanbol process. Just define a processing chain in the DIH handler and write a custom URP to call out to the Stanbol web service. You have access to the full record in a URP, so you can add/delete/change the fields at will. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Oct 30, 2013 at 4:09 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Hi Dileepa, You can write your own Transformers in Java. If it doesn't make sense to run Stanbol calls in a Transformer, maybe setting up a web service that grabs a record out of MySQL, sends the data to Stanbol, and displays the results could be used in conjunction with HttpDataSource rather than JdbcDataSource. http://wiki.apache.org/solr/DIHCustomTransformer http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2FHTTP_Datasource Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc.
“The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/ On Tue, Oct 29, 2013 at 4:47 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I'm a newbie to Solr, and I have a requirement to import data from a MySQL database, enhance the imported content to identify Persons mentioned, and index that as a separate field in Solr along with the other fields defined for the original db query. I'm using Apache Stanbol [1] for the content enhancement requirement, and I can get enhancement results for 'Person' type data in the content. The data flow will be: mysql-db -> Solr data-import handler -> Stanbol enhancer -> Solr index For the above requirement I need to perform additional processing in the data-import handler prior to indexing, to send a request to Stanbol and process the enhancement response. I found some related examples on modifying the MySQL data import handler to customize the query results in db-data-config.xml by using a transformer script. As per my requirement, in the data-import handler I need to send a request to Stanbol and process the response prior to indexing. But I'm not sure if this can be achieved using a simple JavaScript transformer. Is there any other better way to do this?
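Following Alexandre's suggestion of a custom update request processor, a rough sketch of what such a processor could look like; the class names, the extractedEntities field, and callStanbol() are hypothetical placeholders, and the factory would be registered in an updateRequestProcessorChain referenced by the DIH request handler:

import java.io.IOException;
import java.util.Collections;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class StanbolEnhancerProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new StanbolEnhancerProcessor(next);
    }

    static class StanbolEnhancerProcessor extends UpdateRequestProcessor {
        StanbolEnhancerProcessor(UpdateRequestProcessor next) {
            super(next);
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            Object content = doc.getFieldValue("content");
            if (content != null) {
                // Hypothetical helper: POST the content to the Stanbol enhancer endpoint
                // and parse Person/Organization entities out of the enhancement response.
                for (String entity : callStanbol(content.toString())) {
                    doc.addField("extractedEntities", entity);
                }
            }
            super.processAdd(cmd); // continue the chain so the document still gets indexed
        }

        private List<String> callStanbol(String text) {
            return Collections.emptyList(); // placeholder for a real HTTP call to Stanbol
        }
    }
}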
Re: unable to load core after cluster restart
Hi Shawn, One thing I forgot to mention here is that the same setup (with no bootstrap) is working fine in our QA1 environment. I did not have the bootstrap option from the start; I added it thinking it would solve the problem. Nonetheless I followed Shawn's instructions, wherever they differed from my old approach:
1. I moved my zkHost from the JVM to solr.xml and added a chroot in it
2. removed the bootstrap option
3. created collections with the URL template suggested (I have tried it earlier too)
None of it worked for me; I am seeing the same errors. I am adding some more logs before and after the error occurs:
INFO - 2013-11-02 17:40:40.427; org.apache.solr.update.DefaultSolrCoreState; closing IndexWriter with IndexWriterCloser
INFO - 2013-11-02 17:40:40.428; org.apache.solr.core.SolrCore; [xyz] Closing main searcher on request.
INFO - 2013-11-02 17:40:40.431; org.apache.solr.core.CachingDirectoryFactory; Closing NRTCachingDirectoryFactory - 1 directories currently being tracked
INFO - 2013-11-02 17:40:40.432; org.apache.solr.core.CachingDirectoryFactory; looking to close /mnt/emc/App_name/data-UAT-refresh/SolrCloud/SolrHome2/solr/xyz/data [CachedDirrefCount=0;path=/mnt/emc/App_name/data-UAT-refresh/SolrCloud/SolrHome2/solr/xyz/data;done=false]
INFO - 2013-11-02 17:40:40.432; org.apache.solr.core.CachingDirectoryFactory; Closing directory: /mnt/emc/App_name/data-UAT-refresh/SolrCloud/SolrHome2/solr/xyz/data
ERROR - 2013-11-02 17:40:40.433; org.apache.solr.core.CoreContainer; Unable to create core: xyz
org.apache.solr.common.SolrException: Error opening new searcher
  at org.apache.solr.core.SolrCore.init(SolrCore.java:834)
  at org.apache.solr.core.SolrCore.init(SolrCore.java:625)
  at org.apache.solr.core.ZkContainer.createFromZk(ZkContainer.java:256)
  at org.apache.solr.core.CoreContainer.create(CoreContainer.java:555)
  at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:247)
  at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:239)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
  at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1477)
  at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1589)
  at org.apache.solr.core.SolrCore.init(SolrCore.java:821)
  ... 13 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/mnt/emc/App_name/data-UAT-refresh/SolrCloud/SolrHome2/solr/xyz/data/index/write.lock
  at org.apache.lucene.store.Lock.obtain(Lock.java:84)
  at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:695)
  at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:77)
  at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
  at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:267)
  at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:110)
  at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1440)
  ... 15 more
ERROR - 2013-11-02 17:40:40.443; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: Unable to create core: xyz
  at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:934)
  at org.apache.solr.core.CoreContainer.create(CoreContainer.java:566)
  at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:247)
  at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:239)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
  at org.apache.solr.core.SolrCore.init(SolrCore.java:834)
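For reference, a minimal sketch of step 1 above, assuming the new-style solr.xml available from Solr 4.4 onward; the ZooKeeper hostnames and the /solr chroot are placeholders:

<solr>
  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <!-- zkHost moved here from the JVM -DzkHost option; the /solr suffix is the chroot -->
    <str name="zkHost">zk1:2181,zk2:2181,zk3:2181/solr</str>
  </solrcloud>
</solr>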