Re: Distributed query: strange behavior.
Hi, Erick! That's it! I'm using a custom SolrServer implementation with distributed behavior that routes queries and updates using an in-house round-robin method. The reason I'm doing this myself is that I noticed duplicate documents appearing with the LBHttpSolrServer implementation. Last week I modified my implementation to avoid that with these changes:

- I normalized the key field across all documents. Every indexed document must now include an _id_ field that stores the selected key value. The value is set with a copyField.
- When I index a new document, an HttpSolrServer from the shard list is selected using a round-robin strategy. Then a field called _shard_ is set on the SolrInputDocument. That field's value identifies the main shard selected.
- If a document to be indexed/updated already includes the _shard_ field, the shard it belongs to (HttpSolrServer) is selected automatically.
- If a document to be indexed/updated does not include the _shard_ field, the key value is read from the _id_ field of the SolrInputDocument. With that key, a distributed search query is executed to retrieve the _shard_ field, and with that we can choose the correct shard (HttpSolrServer).

It's not good practice and the performance isn't the best, but it's safe.

Best regards,

- Luis Cappa

2013/5/26 Erick Erickson erickerick...@gmail.com

Valery: I share your puzzlement. _If_ you are letting Solr do the document routing, and not doing any custom routing, then the same unique key should go to the same shard and replace the previous doc with that key. But if you're using custom routing, or if you've been experimenting with different configurations and didn't start over, leaving your configuration in an interesting state, this could happen. So in the normal case, having a document with the same key indexed in multiple shards would indicate a bug.
But there are many ways, especially when experimenting, that you could have this happen which are _not_ a bug. I'm guessing that Luis may be trying the custom routing option? Best, Erick

On Fri, May 24, 2013 at 9:09 AM, Valery Giner valgi...@research.att.com wrote:

Shawn, how is it possible for more than one document with the same unique key to appear in the index, even in different shards? Isn't that a bug by definition? What am I missing here? Thanks, Val

On 05/23/2013 09:55 AM, Shawn Heisey wrote:

On 5/23/2013 1:51 AM, Luis Cappa Banda wrote: I've queried each Solr shard server one by one and the total number of documents is correct. However, when I change the rows parameter from 10 to 100, the total numFound changes.

I've seen this problem on the list before, and the cause has been determined each time to be documents with the same uniqueKey value appearing in more than one shard. What I think happens here: with rows=10, you get the top ten docs from each of the three shards, and each shard sends its numFound for that query to the core that's coordinating the search. The coordinator adds up the numFound values, looks through those thirty docs, and arranges them according to the requested sort order, returning only the top 10. In this case, there happen to be no duplicates. With rows=100, you get a total of 300 docs. This time, duplicates are found and removed by the coordinator. I think the coordinator adjusts the total numFound by the number of duplicate documents it removed, in an attempt to be more accurate. I don't know if adjusting numFound when duplicates are found in a sharded query is the right thing to do; I'll leave that for smarter people. Perhaps Solr should return a message with the results saying that duplicates were found, and if a config option is not enabled, the server should throw an exception and return a 4xx HTTP error code.
One idea for a config parameter name would be allowShardDuplicates, but something better can probably be found. Thanks, Shawn -- - Luis Cappa
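Shawn's description of the coordinator's merge step can be sketched as a toy simulation. This is not Solr's actual code; the class, method, and data shapes are invented purely for illustration:

```java
import java.util.HashSet;
import java.util.Set;

public class CoordinatorMerge {
    /**
     * Each shard reports its own numFound plus the uniqueKey values of the
     * rows it actually returned. The coordinator sums the numFound values,
     * then, while merging the returned rows, drops any id it has already
     * seen and decrements the total once per dropped duplicate.
     */
    public static long mergedNumFound(long[] shardNumFound, String[][] shardIds) {
        long total = 0;
        for (long n : shardNumFound) {
            total += n;
        }
        Set<String> seen = new HashSet<>();
        for (String[] ids : shardIds) {
            for (String id : ids) {
                if (!seen.add(id)) {
                    total--; // duplicate uniqueKey: adjust numFound downward
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // rows=10: the small windows happen not to overlap, numFound is untouched
        long small = mergedNumFound(new long[]{100, 100, 100},
                new String[][]{{"a", "b"}, {"c", "d"}, {"e", "f"}});
        // rows=100: wider windows expose two duplicated ids ("x" and "b")
        long large = mergedNumFound(new long[]{100, 100, 100},
                new String[][]{{"a", "b", "x"}, {"c", "x", "d"}, {"e", "f", "b"}});
        System.out.println(small + " vs " + large); // prints: 300 vs 298
    }
}
```

The coordinator can only compare the rows the shards actually return, which is why a larger rows value can expose duplicates that a smaller one misses.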
Re: Java heap space exception in 4.2.1
I have the same problem. On 4.1, a Solr instance could hold 8,000,000,000 docs, but on 4.2.1 an instance only holds 400,000,000 docs before it OOMs on a facet query. The facet field is tokenized on whitespace.

May 27, 2013 11:12:55 AM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:653)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:366)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1338)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
    at org.eclipse.jetty.server.Server.handle(Server.java:350)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:900)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:954)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:851)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:603)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:538)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.DocTermOrds.uninvert(DocTermOrds.java:448)
    at org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:179)
    at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:664)
    at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:426)
    at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:517)
    at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:252)
    at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:78)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1825)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1338)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
Re: Java heap space exception in 4.2.1
I am sorry about a typo: 8,000,000,000 should be 800,000,000.

2013/5/27 Jam Luo cooljam2...@gmail.com

I have the same problem. On 4.1, a Solr instance could hold 8,000,000,000 docs, but on 4.2.1 an instance only holds 400,000,000 docs before it OOMs on a facet query. The facet field is tokenized on whitespace. [...]
Re: Why would one not use RemoveDuplicatesTokenFilterFactory?
On Sun, May 26, 2013 at 8:16 PM, Jack Krupansky j...@basetechnology.com wrote: The only comment I was trying to make here is the relationship between the RemoveDuplicatesTokenFilterFactory and the KeywordRepeatFilterFactory. No, stemmed terms are not considered the same text as the original word. By definition, they are a new value for the term text. I see, for some reason I did not concentrate on this key quote of yours: ...to remove the tokens that did not produce a stem ... Now it makes perfect sense. Thank you, Jack! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
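For context, the interplay Jack describes is typically wired up in schema.xml along these lines (a minimal sketch; the fieldType name and the choice of stemmer are placeholders, not taken from the thread):

```xml
<fieldType name="text_keep_original" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emits every token twice: one copy is keyword-flagged so the stemmer skips it -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <!-- drops the second copy whenever stemming produced no new term text -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain, "running" indexes both "running" and "run", while "run" is indexed only once.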
Re: Indexing message module
On 27 May 2013 12:58, Arkadi Colson ark...@smartbit.be wrote:

Hi, we would like to index our messaging system. We need to be able to search for messages for specific recipients, due to performance issues on our databases. But the message is of course the same for all recipients, and the message text should be stored only once! Is it possible to have some kind of array field, included in the search query, where all the recipients are stored? Or should we, for example, use a simple text field filled with the recipients like this: <field>_434_3432_432_6546_75_8678_</field> [...]

Why couldn't you use a multi-valued string/int field for the recipient IDs?

Regards, Gora
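Gora's suggestion would look roughly like this in schema.xml (the field names here are invented for illustration):

```xml
<!-- the message body is stored once per message document -->
<field name="message_text" type="text_general" indexed="true" stored="true"/>
<!-- one value per recipient; a filter on it matches any of the values -->
<field name="recipient_id" type="int" indexed="true" stored="true" multiValued="true"/>
```

A search for messages visible to recipient 6546 would then be something like q=message_text:holiday&fq=recipient_id:6546.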
Indexing message module
Hi, we would like to index our messaging system. We need to be able to search for messages for specific recipients, due to performance issues on our databases. But the message is of course the same for all recipients, and the message text should be stored only once! Is it possible to have some kind of array field, included in the search query, where all the recipients are stored? Or should we, for example, use a simple text field filled with the recipients like this: <field>_434_3432_432_6546_75_8678_</field> Anybody have a good idea? BR, Arkadi
Re: Overlapping onDeckSearchers=2
Hi, thanks for the response. That seems to be the case, because there are no other applications that could fire commit/optimize calls. All commits are triggered by Solr, and the optimize is triggered by a cron task. Given all that, it looks like a bug in Solr: it probably should not run commits while an optimize is in progress, or should commit before the optimize, or should not run an optimize while a commit is in progress. I'm not sure exactly which scenario happens. Is there anything I can do to fix this? I don't have much memory on this server, and since this could double the RAM used, it is a serious issue for me. Best, Alex -- View this message in context: http://lucene.472066.n3.nabble.com/Overlapping-onDeckSearchers-2-tp772556p4066205.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Indexing message module
Yes indeed... Thx!

On 05/27/2013 09:33 AM, Gora Mohanty wrote:

[...] Why couldn't you use a multi-valued string/int field for the recipient IDs? Regards, Gora
RE: Tika: How can I import automatically all metadata without specifying them explicitly
Thanks for the help. @Alexandre: thanks for the suggestion, I'll try to use an ExtractingRequestHandler; I thought that I was missing some DIH option :). @Erick: I'm interested in knowing them all to do various forms of analysis. I have documents coming from heterogeneous sources, and I'm interested in searching inside the content, but also in being able to extract all possible metadata. I'm working in .NET, so it is useful to let Tika do everything for me directly in Solr and then retrieve all metadata for matched documents. Thanks again to everyone. -- Gian Maria Ricci Mobile: +39 320 0136949

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Sunday, May 26, 2013 5:30 PM
To: solr-user@lucene.apache.org; Gian Maria Ricci
Subject: Re: Tika: How can I import automatically all metadata without specifying them explicitly

In addition to Alexandre's comment: bq: ...I'd like to import in my index all metadata. Be a little careful here; this isn't actually very useful in my experience. Sure, it's nice to have all that data in the index, but... how do you search it meaningfully? Consider that some doc may have an "author" metadata field. Another may have a "last editor" field. Yet another may have a "main author" field. If you add all these under their own field names, what do you do to search for author? Somehow you have to create a mapping between the various metadata names and something that's searchable, so why not do this at index time? Not to mention that I've seen this done, and the result may be literally hundreds of different metadata fields which are not very useful. All that said, it may be perfectly valid to index them all, but before going there it's worth considering whether the result is actually _useful_.
Best, Erick

On Sat, May 25, 2013 at 4:44 AM, Gian Maria Ricci alkamp...@nablasoft.com wrote:

Hi to everyone, I've configured import of a document folder with FileListEntityProcessor, and everything went smoothly on the first try, but I have a simple question. I'm able to map metadata without any problem, but I'd like to import all metadata into my index, not only those I've configured with field nodes. In this example I've imported Author and title, but I do not know in advance which metadata a document could have, and I wish to have all of them inside my index. Here is my import config. It is my first try at importing with Tika, and I'm probably missing something simple.

<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor" baseDir="c:/temp/docs"
            fileName=".*\.(doc)|(pdf)|(docx)" onError="skip" recursive="true">
      <field column="file" name="id"/>
      <field column="fileAbsolutePath" name="path"/>
      <field column="fileSize" name="size"/>
      <field column="fileLastModified" name="lastModified"/>
      <entity name="documentImport" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text">
        <field column="file" name="fileName"/>
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

-- Gian Maria Ricci Mobile: +39 320 0136949 http://mvp.microsoft.com/en-us/mvp/Gian%20Maria%20Ricci-4025635
Application connecting to SOLR cloud
Hi, we have set up SolrCloud with ZooKeeper:

ZooKeeper (localhost:8000)
1 shard (localhost:9000)
2 replicas (localhost:9001, localhost:9002)

Question: we load the Solr index from a relational DB using DIH. Based on the SolrCloud documentation, the request to load the data will be forwarded to the leader.

Collection, Shard1 - Leader1 (localhost:9000)
  |_ Replica1 (localhost:9001)
  |_ Replica2 (localhost:9002)

1. To identify the leader (here localhost:9000), do we need to connect to ZooKeeper and read the clusterstate.json znode, looking for leader=true?
2. Do we need to keep an external load balancer in front of localhost:9000, 9001, and 9002 to route requests? Is there any other way?

Thanks, Sathish -- View this message in context: http://lucene.472066.n3.nabble.com/Application-connecting-to-SOLR-cloud-tp4066220.html Sent from the Solr - User mailing list archive at Nabble.com.
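If the client is Java, neither step is usually necessary: SolrJ's CloudSolrServer (the Solr 4.x API) watches ZooKeeper itself, sends updates to the current leader, and load-balances queries, so no external balancer or manual clusterstate.json parsing is needed. A rough sketch, using the ZooKeeper address from the setup above and a placeholder collection name (not tested against a live cluster):

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudClientSketch {
    public static void main(String[] args) throws Exception {
        // Point at ZooKeeper, not at any individual Solr node.
        CloudSolrServer server = new CloudSolrServer("localhost:8000");
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        server.add(doc);   // SolrJ forwards this to the current shard leader
        server.commit();
        server.shutdown();
    }
}
```

For non-Java clients, an external load balancer (or a client-side list of node URLs) is the usual approach; any node will forward a request to the right place.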
Re: Overlapping onDeckSearchers=2
The intent is that optimize is obsolete and should no longer be used, especially with the tiered merge policy running. In other words, merging should now occur on the fly in Lucene. What release of Solr are you running? -- Jack Krupansky

-----Original Message-----
From: heaven
Sent: Monday, May 27, 2013 3:51 AM
To: solr-user@lucene.apache.org
Subject: Re: Overlapping onDeckSearchers=2

Hi, thanks for the response. [...]
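Whatever the verdict on optimize, the warning itself is governed by solrconfig.xml: every commit or optimize that opens a new searcher starts warming one, and overlapping them is what produces onDeckSearchers=2. A sketch of the relevant knobs (the values are examples, not recommendations):

```xml
<query>
  <!-- cap on simultaneously warming searchers; exceeding it logs the warning -->
  <maxWarmingSearchers>2</maxWarmingSearchers>
</query>

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- space hard commits out so they are less likely to overlap an optimize,
       and avoid opening a searcher on every commit -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```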
Re: How can I import automatically all metadata without specifying them explicitly
Setting the uprefix parameter of SolrCell (ERH) to something like attr_ will result in all metadata attributes that are not named in the Solr schema being given the attr_ prefix to their metadata attribute names. For example:

curl "http://localhost:8983/solr/update/extract?literal.id=doc-1&commit=true&uprefix=attr_" -F "my.pdf=@my.pdf"

Once you figure out which of the metadata you want to keep, either add those metadata attribute names to your schema, or add explicit SolrCell field mappings for each piece of metadata: fmap.my-field=metadata-name. -- Jack Krupansky

-----Original Message-----
From: Gian Maria Ricci
Sent: Monday, May 27, 2013 4:21 AM
To: solr-user@lucene.apache.org
Subject: RE: Tika: How can I import automatically all metadata without specifying them explicitly

Thanks for the help. @Alexandre: thanks for the suggestion, I'll try to use an ExtractingRequestHandler; I thought that I was missing some DIH option :). @Erick: I'm interested in knowing them all to do various forms of analysis. [...]
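For uprefix=attr_ to work, the schema has to accept the prefixed names. The stock Solr example schema handles this with a catch-all dynamic field along these lines (a sketch; adjust the type to taste):

```xml
<!-- any otherwise-unknown metadata lands in an attr_* field -->
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
```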
RE: Tika: How can I import automatically all metadata without specifying them explicitly
Standalone Tika can also run in a network server mode. That increases data round-trips but gives you more options, even in .NET.

Regards, Alex

On 27 May 2013 04:22, Gian Maria Ricci alkamp...@nablasoft.com wrote:

Thanks for the help. @Alexandre: thanks for the suggestion, I'll try to use an ExtractingRequestHandler; I thought that I was missing some DIH option :). [...]
Re: Java heap space exception in 4.2.1
400M docs is quite a large number of documents for a single piece of hardware, and if you're faceting over a large number of unique values, this will chew up memory. So it's not surprising that you're seeing OOMs; I suspect you just have too many documents on a single machine. Best, Erick

On Mon, May 27, 2013 at 3:11 AM, Jam Luo cooljam2...@gmail.com wrote:

I am sorry about a typo: 8,000,000,000 should be 800,000,000.

2013/5/27 Jam Luo cooljam2...@gmail.com

I have the same problem. On 4.1, a Solr instance could hold 8,000,000,000 docs, but on 4.2.1 an instance only holds 400,000,000 docs before it OOMs on a facet query. The facet field is tokenized on whitespace. [...]
Re: Application connecting to SOLR cloud
There's no requirement to send documents to a leader; send updates to any node in the system and they will be automatically forwarded to the appropriate leaders. You may be getting confused by the leader-aware Solr client stuff. It's slightly more efficient to send updates to the leader directly and save the extra hop, but it's not a requirement at all. You don't need the external load balancer at all; internally, Solr does its own load balancing. That said, if your external app connects to a single node and that node goes down, regardless of any internal load balancing, that's a single point of failure, so having the external load balancer can still make sense. Best Erick On Mon, May 27, 2013 at 6:46 AM, sathish_ix skandhasw...@inautix.co.in wrote: Hi, We have set up the SOLR cloud with ZooKeeper. Zookeeper (localhost:8000) 1 shard (localhost:9000) 2 replicas (localhost:9001,localhost:9002) Question: We load the Solr index from a relational DB using DIH. Based on the Solr Cloud documentation, the request to load the data will be forwarded to the leader. CollectionShard1-Leader1(localhost:9000) | |_Replication1(localhost:9001) |Replication2 (localhost:9002) 1. To identify the leader (here localhost:9000) from ZooKeeper, do we need to read the znode clusterstate.json entry having leader=true by connecting to ZooKeeper? 2. Do we need to keep an external load balancer for (localhost:9000,9001,9002) to route requests? Is there any other way? Thanks, Sathish -- View this message in context: http://lucene.472066.n3.nabble.com/Application-connecting-to-SOLR-cloud-tp4066220.html Sent from the Solr - User mailing list archive at Nabble.com.
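Erick's point above — that any node can accept an update because routing is deterministic on the document's uniqueKey — can be sketched roughly as follows. SolrCloud itself hashes the id (MurmurHash3 over a hash range per shard); the hash-mod router below is only an illustrative stand-in for that idea, and the class name and shard URLs are made up for the example.

```java
import java.util.List;

// Illustrative stand-in for SolrCloud routing: the same uniqueKey always
// maps to the same shard, no matter which node first receives the update,
// so no client-side leader lookup is required.
class ShardRouter {
    private final List<String> shardUrls; // hypothetical shard leader URLs

    ShardRouter(List<String> shardUrls) {
        this.shardUrls = shardUrls;
    }

    // Deterministic: hash the id, pick a bucket. Solr uses MurmurHash3
    // over hash ranges; plain hashCode-mod is enough to show the shape.
    String shardFor(String uniqueKey) {
        int bucket = Math.floorMod(uniqueKey.hashCode(), shardUrls.size());
        return shardUrls.get(bucket);
    }
}
```

Because the mapping is a pure function of the id, forwarding from any entry node reaches the same leader every time — which is why re-sending the same document never creates duplicates under default routing.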
Re: Note on The Book
Hi Jack, I'd like to ask as a person who contributed a case study article about automatically acquiring synonym knowledge from Wikipedia to the book. (13/05/24 8:14), Jack Krupansky wrote: To those of you who may have heard about the Lucene/Solr book that I and two others are writing on Lucene and Solr, some bad and good news. The bad news: The book contract with O’Reilly has been canceled. The good news: I’m going to proceed with self-publishing (possibly on Lulu or even Amazon) a somewhat reduced scope Solr-only Reference Guide (with hints of Lucene). The scope of the previous effort was too great, even for O’Reilly – a book larger than 800 pages (or even 600) that was heavy on reference and lighter on “guide” just wasn’t fitting in with their traditional “guide” model. In truth, Solr is just too complex for a simple guide that covers it all, let alone Lucene as well. Will the reduced Solr-only reference guide include my article? If not (for now I think not, because my article is a Lucene case study, not Solr), I'd like to put it out on my blog or somewhere. BTW, for those who want to know how to acquire synonym knowledge from Wikipedia, a summary is available at slideshare: http://www.slideshare.net/KojiSekiguchi/wikipediasolr koji -- http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html
A strange RemoteSolrException
Hello, I'm writing my first little SolrJ program, but I can't get it running because of a RemoteSolrException: Server at http://localhost:8983/solr returned non ok status:404 The server is definitely running and the URL works in the browser. I am working with Solr 4.3.0. This is my source code:

public static void main(String[] args) {
    String url = "http://localhost:8983/solr";
    SolrServer server;
    try {
        server = new HttpSolrServer(url);
        server.ping();
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

with the stack trace: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://localhost:8983/solr returned non ok status:404, message:Not Found at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180) at org.apache.solr.client.solrj.request.SolrPing.process(SolrPing.java:62) at org.apache.solr.client.solrj.SolrServer.ping(SolrServer.java:293) at de.epublius.blogindexer.App.main(App.java:47) If I call server.shutdown(), there is no such exception, but it occurs for almost all other SolrServer methods. What am I doing wrong? Thanks in advance Hans-Peter
Re: Note on The Book
If you would like to Solr-ize your contribution, that would be great. The focus of the book will be hard-core Solr. -- Jack Krupansky -Original Message- From: Koji Sekiguchi Sent: Monday, May 27, 2013 8:07 AM To: solr-user@lucene.apache.org Subject: Re: Note on The Book Hi Jack, I'd like to ask as a person who contributed a case study article about Automatically acquiring synonym knowledge from Wikipedia to the book. (13/05/24 8:14), Jack Krupansky wrote: To those of you who may have heard about the Lucene/Solr book that I and two others are writing on Lucene and Solr, some bad and good news. The bad news: The book contract with O’Reilly has been canceled. The good news: I’m going to proceed with self-publishing (possibly on Lulu or even Amazon) a somewhat reduced scope Solr-only Reference Guide (with hints of Lucene). The scope of the previous effort was too great, even for O’Reilly – a book larger than 800 pages (or even 600) that was heavy on reference and lighter on “guide” just wasn’t fitting in with their traditional “guide” model. In truth, Solr is just too complex for a simple guide that covers it all, let alone Lucene as well. Will the reduced Solr-only reference guide include my article? If not (for now I think it is not because my article is for Lucene case study, not Solr), I'd like to put it out on my blog or somewhere. BTW, those who want to know how to acquire synonym knowledge from Wikipedia, the summary is available at slideshare: http://www.slideshare.net/KojiSekiguchi/wikipediasolr koji -- http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html
Re: index multiple files into one index entity
You did not open source it by any chance? :-) Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Sun, May 26, 2013 at 8:23 PM, Yury Kats yuryk...@yahoo.com wrote: That's exactly what happens. Each stream goes into a separate document. If all streams share the same unique id parameter, the last stream will overwrite everything. I asked this same question last year, got no responses, and ended up writing my own UpdateRequestProcessor. See http://tinyurl.com/phhqsb4 On 5/26/2013 11:15 AM, Alexandre Rafalovitch wrote: If I understand correctly, the issue is: 1) The client provides multiple content streams and expects Tika to parse all of them and stick all the extracted content into one big SolrDoc. 2) Tika (looking at the load() method of ExtractingDocumentLoader.java; GitHub link: http://bit.ly/12GsDl9 ) does not actually suspect that its load() method may be called multiple times, and therefore happily submits the document at the end of each call. It probably submits a new document for each content source, which probably means it just overwrites the same doc over and over again. If I am right, then we have a bug in the Tika handler's expectations (of a single load() call). The next step would be to put together a very simple use case and open a Jira issue with it. Regards, Alex. P.s. I am not a Solr code wrangler, so this MAY be completely wrong. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Sun, May 26, 2013 at 10:46 AM, Erick Erickson erickerick...@gmail.com wrote: I'm still not quite getting the issue. Separate requests (i.e.
any addition of a SolrInputDocument) are treated as separate documents. There's no notion of appending the contents of one doc to another based on ID, unless you're doing atomic updates. And Tika takes some care to index separate files as separate documents. Now, if you don't need these to have the same uniqueKey, you might index them as separate documents and include a field that lets you associate these documents somehow (see the group/field collapsing wiki page). But otherwise, I think I need a higher-level view of what you're trying to accomplish to make an intelligent comment. Best Erick On Thu, May 23, 2013 at 9:05 AM, mark.ka...@t-systems.com wrote: Hello Erick, Thank you for your fast answer. Maybe I didn't express my question clearly. I want to index many files into one index entity, with the same behavior as any other multivalued field, which can be indexed under one unique id. So I think every ContentStreamUpdateRequest represents one index entity, doesn't it? And with each addContentStream I will add one file to this entity. Thank you and with best regards Mark -Ursprüngliche Nachricht- Von: Erick Erickson [mailto:erickerick...@gmail.com] Gesendet: Donnerstag, 23. Mai 2013 14:11 An: solr-user@lucene.apache.org Betreff: Re: index multiple files into one index entity I just skimmed your post, but I'm responding to the last bit. If you have uniqueKey defined as id in schema.xml then no, you cannot have multiple documents with the same ID. Whenever a new doc comes in, it replaces the old doc with that ID. You can remove the uniqueKey definition and do what you want, but there are very few Solr installations with no uniqueKey, and it's probably a better idea to make your ids truly unique. Best Erick On Thu, May 23, 2013 at 6:14 AM, mark.ka...@t-systems.com wrote: Hello Solr team, I want to index multiple files into one Solr index entity, with the same id.
We are using Solr 4.1. I tried it with the following source fragment:

public void addContentSet(ContentSet contentSet) throws SearchProviderException {
    ...
    ContentStreamUpdateRequest csur = generateCSURequest(contentSet.getIndexId(), contentSet);
    String indexId = contentSet.getIndexId();
    ConcurrentUpdateSolrServer server = serverPool.getUpdateServer(indexId);
    server.request(csur);
    ...
}

private ContentStreamUpdateRequest generateCSURequest(String indexId, ContentSet contentSet) throws IOException {
    ContentStreamUpdateRequest csur = new ContentStreamUpdateRequest(confStore.getExtractUrl());
    ModifiableSolrParams parameters = csur.getParams();
    if (parameters == null) {
        parameters = new ModifiableSolrParams();
    }
    parameters.set("literalsOverride", false);
    // maps the tika default content attribute to
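Given the findings in this thread (the extracting handler turns each content stream into its own document), one hedged workaround is to do the text extraction client-side and accumulate every file's text into a single document with a multivalued field, then send one update. The sketch below models only the accumulation step with plain collections — the Map stands in for a SolrInputDocument, and the field names are illustrative, not anything the poster's code defines.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical client-side workaround: build ONE logical document from the
// already-extracted text of several files, rather than sending several
// content streams that each become a separate Solr document.
class OneEntityBuilder {
    // "id" and "content" are illustrative field names; "content" is
    // multivalued, one value per source file.
    static Map<String, List<String>> buildDoc(String id, List<String> extractedTexts) {
        Map<String, List<String>> doc = new HashMap<>();
        doc.put("id", List.of(id));
        doc.put("content", new ArrayList<>(extractedTexts));
        return doc;
    }
}
```

In a real client you would run Tika (or /update/extract with extractOnly=true) per file, then copy each result into one SolrInputDocument before a single add — avoiding the overwrite behavior described above.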
Re: index multiple files into one index entity
No, the implementation was very specific to my needs. On 5/27/2013 8:28 AM, Alexandre Rafalovitch wrote: You did not open source it by any chance? :-) Regards, Alex.
using solr for web page classification
Hello, I am working on the implementation of a system to categorize URLs/web pages. I would have categories like ... Adult Health Business Arts Home Science I am looking at how Lucene/Solr could help me achieve this. I came across links that mention MoreLikeThis could be of help. I found LucidWorks Search helpful, as it handles the installation of Jetty and Solr in a few clicks. Importing data and querying were also straightforward. My question is: - I have a pre-defined list of categories, for which I would have web pages + documents that could be stored in the Solr index, each assigned a category - I would have input processors run on each page: text extractor (from HTML, PDF, Office formats), text language detection, standard text processors (stemming, stopword removal, lowercasing, etc.), title extractor, summary extractor, field mapping, header and footer remover - All these documents could be processed and stored in the Solr index with a known category - When a new request comes in, I need to run an MLT or Solr query based on the content of the web page and get similar documents. Based on the results, I could reply with the top 3 categories. Please let me know if using Solr for this problem is the right approach. If yes, how do I form the query based on the web page contents? Thanks Rajesh
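The last step described above — turning the top MoreLikeThis hits into a category answer — can be sketched as a simple majority vote over the categories stored on the returned documents. This is an illustrative sketch, not Solr API code; the category labels are the ones from the poster's list, and "top3" is a made-up helper name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of k-nearest-neighbor style classification on top of MLT results:
// each similar document votes with its stored category; reply with the
// top 3 categories by vote count.
class CategoryVoter {
    static List<String> top3(List<String> categoriesOfTopHits) {
        Map<String, Integer> votes = new HashMap<>();
        for (String c : categoriesOfTopHits) {
            votes.merge(c, 1, Integer::sum); // count one vote per hit
        }
        List<String> ranked = new ArrayList<>(votes.keySet());
        ranked.sort((a, b) -> votes.get(b) - votes.get(a)); // most votes first
        return ranked.subList(0, Math.min(3, ranked.size()));
    }
}
```

A weighted variant could scale each vote by the hit's relevance score instead of counting hits equally — a common refinement for this kind of kNN classification.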
sourceId of JMX
Hello Our team faced a problem with the sourceId JMX attribute when getting JMX information from the Tomcat manager. Command: curl http://localhost:${PORT}/manager/jmxproxy?qry=solr:type=documentCache,* Here is the error log (tomcat/manager log). --- 2013/05/27 0:04:01 org.apache.catalina.core.ApplicationContext log JMXProxy: Error getting attribute solr:type=documentCache,id=org.apache.solr.search.LRUCache sourceId javax.management.AttributeNotFoundException: sourceId --- Solr ver. : 4.1.0 I think this error occurs when JMX cannot get the sourceId. BTW, let's look at this issue: https://issues.apache.org/jira/browse/SOLR-3329 It was decided to drop getSourceId() in this issue. But in org.apache.solr.core.JmxMonitoredMap.SolrDynamicMBean, staticStats.add(sourceId) is still defined in SolrDynamicMBean at L211. http://javasourcecode.org/html/open-source/solr/solr-3.3.0/org/apache/solr/core/JmxMonitoredMap.SolrDynamicMBean.java.html#line.202 -- l.211 staticStats.add(sourceId); -- Maybe this error comes from this inconsistency. This problem is not critical, but I think it is inconsistent. 1. Does anyone know why staticStats.add(sourceId) still remains in SolrDynamicMBean? Do you have any idea? 2. Has anyone faced such an error? How did you solve it? Thank you. Regards suganuma
Solr/Lucene Analayzer That Writes To File
Hi; I want to use Solr for academic research. One step of my project is that I want to store tokens in a file (I will store them in a database later) and I don't want to index them. For this kind of purpose, should I use core Lucene or Solr? Is there an example of writing a custom analyzer that just stores tokens in a file?
Re: Solr/Lucene Analayzer That Writes To File
Hello! Take a look at custom posting formats. For example here is a nice post showing what you can do with Lucene SimpleText codec: http://blog.mikemccandless.com/2010/10/lucenes-simpletext-codec.html However please remember that it is not advised to use that codec in production environment. -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch Hi; I want to use Solr for an academical research. One step of my purpose is I want to store tokens in a file (I will store it at a database later) and I don't want to index them. For such kind of purposes should I use core Lucene or Solr? Is there an example for writing a custom analyzer and just storing tokens in a file?
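As a rough illustration of what the poster asks for — getting tokens into a file rather than into an index — here is a minimal stand-in. A real solution would consume a Lucene TokenStream (incrementToken() with a CharTermAttribute) or a custom codec as Rafał suggests; this sketch just lower-cases and splits on non-letters to show the shape of the output, and the class and method names are made up.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative token dump: tokenize text and write one token per line to a
// file, with no indexing involved. NOT a Lucene Analyzer -- just the idea.
class TokenDumper {
    // Crude analysis chain stand-in: lowercase + split on non-letter runs.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    static void dumpToFile(String text, Path out) throws IOException {
        Files.write(out, tokenize(text)); // one token per line
    }
}
```

To reuse Solr's actual analysis chains instead, the analyzer configured in schema.xml can be obtained from the core's IndexSchema and its TokenStream iterated the same way, writing each term out instead of indexing it.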
Re: Overlapping onDeckSearchers=2
On Mon, May 27, 2013 at 7:11 AM, Jack Krupansky j...@basetechnology.com wrote: The intent is that optimize is obsolete and should no longer be used That's incorrect. People need to understand the cost of optimize, and that its use is optional. It's up to the developer to figure out if the benefits of calling optimize outweigh the costs in their particular situation. The wiki currently says: An optimize is like a hard commit except that it forces all of the index segments to be merged into a single segment first. Depending on the use cases, this operation should be performed infrequently (like nightly), if at all, since it is very expensive and involves reading and re-writing the entire index. Segments are normally merged over time anyway (as determined by the merge policy), and optimize just forces these merges to occur immediately. -Yonik http://lucidworks.com
Re: A strange RemoteSolrException
I downloaded solr 4.3.0, started it up with java -jar start.jar (from inside the example directory) and executed your program. No exceptions are thrown. Is there something you did differently? On Mon, May 27, 2013 at 5:45 PM, Hans-Peter Stricker stric...@epublius.dewrote: Hello, I'm writing my first little Solrj program, but don't get it running because of an RemoteSolrException: Server at http://localhost:8983/solrreturned non ok status:404 The server is definitely running and the url works in the browser. I am working with Solr 4.3.0. This is my source code: public static void main(String[] args) { String url = http://localhost:8983/solr;; SolrServer server; try { server = new HttpSolrServer(url); server.ping(); } catch (Exception ex) { ex.printStackTrace(); } } with the stack trace: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://localhost:8983/solr returned non ok status:404, message:Not Found at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180) at org.apache.solr.client.solrj.request.SolrPing.process(SolrPing.java:62) at org.apache.solr.client.solrj.SolrServer.ping(SolrServer.java:293) at de.epublius.blogindexer.App.main(App.java:47) If I call server.shutdown(), there is no such exception, but for almost all other SolrServer-methods. What am I doing wrong? Thanks in advance Hans-Peter -- Regards, Shalin Shekhar Mangar.
RE: Overlapping onDeckSearchers=2
forceMerge is very useful if you delete a significant portion of an index. It can take a very long time before any merge policy decides to finally merge the deletions all away, especially for a static or infrequently changing index. Also, having a lot of deleted docs in the index can be an issue if your similarity uses maxDoc for IDF. The cost is also much lower when using SSDs; forceMerging a 1GB core is a matter of seconds. -Original message- From:Yonik Seeley yo...@lucidworks.com Sent: Mon 27-May-2013 15:47 To: solr-user@lucene.apache.org Subject: Re: Overlapping onDeckSearchers=2 On Mon, May 27, 2013 at 7:11 AM, Jack Krupansky j...@basetechnology.com wrote: The intent is that optimize is obsolete and should no longer be used That's incorrect. People need to understand the cost of optimize, and that its use is optional. It's up to the developer to figure out if the benefits of calling optimize outweigh the costs in their particular situation. The wiki currently says: An optimize is like a hard commit except that it forces all of the index segments to be merged into a single segment first. Depending on the use cases, this operation should be performed infrequently (like nightly), if at all, since it is very expensive and involves reading and re-writing the entire index. Segments are normally merged over time anyway (as determined by the merge policy), and optimize just forces these merges to occur immediately. -Yonik http://lucidworks.com
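The trade-off debated in this thread — optimize/forceMerge is expensive, but pays off when much of the index is deletions — can be reduced to a simple heuristic. The threshold and names below are illustrative, not anything Solr provides; the only grounded fact is that a core's maxDoc statistic counts deleted documents while numDocs does not.

```java
// Illustrative decision rule: only pay the optimize/forceMerge cost when a
// large share of the index consists of deleted documents that the merge
// policy has not yet reclaimed.
class MergeHeuristic {
    // maxDoc includes deleted docs, numDocs excludes them (both are
    // reported in a Solr core's statistics). Threshold is a judgment call.
    static boolean shouldForceMerge(long maxDoc, long numDocs, double deletedRatioThreshold) {
        if (maxDoc <= 0) return false;
        double deletedRatio = (maxDoc - numDocs) / (double) maxDoc;
        return deletedRatio >= deletedRatioThreshold;
    }
}
```

With a threshold of, say, 0.5, a nightly job would optimize only on days when deletions actually dominate — matching the "if at all" guidance from the wiki text quoted above.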
Re: Overlapping onDeckSearchers=2
As the wiki does say: if at all ... Segments are normally merged over time anyway (as determined by the merge policy), and optimize just forces these merges to occur immediately. So, the only real question here is if the optimize really does lie outside the if at all category and whether Segments are normally merged over time anyway is in fact not good enough. This is why I referred to the intent - whether the actual reality of a specific Solr application does align with the expectation that Segments are normally merged over time anyway. But, start with the presumption that merge policy does eliminate the need for optimize. As a general proposition: 1. Try to avoid using optimize if at all possible. Let merge policy do its thing. 2. Try to take a server out offline while optimizing if an optimize is really absolutely needed. 3. Try to understand why #1 is not sufficient and resolve the cause(s), so that optimize is no longer needed. -- Jack Krupansky -Original Message- From: Yonik Seeley Sent: Monday, May 27, 2013 9:46 AM To: solr-user@lucene.apache.org Subject: Re: Overlapping onDeckSearchers=2 On Mon, May 27, 2013 at 7:11 AM, Jack Krupansky j...@basetechnology.com wrote: The intent is that optimize is obsolete and should no longer be used That's incorrect. People need to understand the cost of optimize, and that it's use is optional. It's up to the developer to figure out of the benefits of calling optimize outweigh the costs in their particular situations. The wiki currently says: An optimize is like a hard commit except that it forces all of the index segments to be merged into a single segment first. Depending on the use cases, this operation should be performed infrequently (like nightly), if at all, since it is very expensive and involves reading and re-writing the entire index. Segments are normally merged over time anyway (as determined by the merge policy), and optimize just forces these merges to occur immediately. -Yonik http://lucidworks.com
Re: sourceId of JMX
This is a bug. The sourceId should have been removed from the SolrDynamicMBean. I'll create an issue. On Mon, May 27, 2013 at 6:39 PM, 菅沼 嘉一 yo_sugan...@waku-2.com wrote: Hello Our team faced the problem regarding the sourceId of JMX when getting the information of JMX from tomcat manager. Command: curl http://localhost: ${PORT}/manager/jmxproxy?qry=solr:type=documentCache,* Here is the error log (tomcat/manager log). --- 2013/05/27 0:04:01 org.apache.catalina.core.ApplicationContext log JMXProxy: Error getting attribute solr:type=documentCache,id=org.apache.solr.search.LRUCache sourceId javax.management.AttributeNotFoundException: sourceId --- Solr ver. : 4.1.0 I think this error comes from when JMX cannot get the sourceId. BTW Let's look at this issue. https://issues.apache.org/jira/browse/SOLR-3329 It is decided to drop getSourceId() in this issue. But in org.apache.solr.core.JmxMonitoredMap.SolrDynamicMBean, staticStats.add(sourceId) is still defined in SolrDynamicMBean at L211. http://javasourcecode.org/html/open-source/solr/solr-3.3.0/org/apache/solr/core/JmxMonitoredMap.SolrDynamicMBean.java.html#line.202 -- l.211 staticStats.add(sourceId); -- Maybe this error comes from this inconsistency. This problem is not critical, but I think this is inconsistent. 1. Anyone knows why staticStats.add(sourceId) still remained in SolrDynamicMBean? Do you have any idea? 2. Has anyone faced such error ? How did you solved it? Thank you. Regards suganuma -- Regards, Shalin Shekhar Mangar.
Re: Overlapping onDeckSearchers=2
I am on 4.2.1 @Yonik Seeley I do understand the cost and run it once per 24 hours, and perhaps later this interval will be increased to a few days. In general I am optimizing not to merge the fragments but to remove deleted docs. My index refreshes quickly and the number of deleted docs can reach a few million per week. The question is: if optimize does the same as a hard commit plus some other optimizations, why does Solr schedule a new commit while the optimize is in progress? That's the problem. I optimize once per day and all commits are scheduled by Solr itself:

<autoCommit>
  <maxDocs>25000</maxDocs>
  <maxTime>30</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>

If Solr sees that an optimization is running, it could delay all scheduled hard commits until the optimization is complete. Additionally, it could perform a soft commit (for cases when the application needs to see the updated docs in the index and fires commits) and run the delayed hard commit when the optimization is complete. That would warm a new searcher, but at least it would prevent multiple searchers from warming simultaneously. Best, Alex -- View this message in context: http://lucene.472066.n3.nabble.com/Overlapping-onDeckSearchers-2-tp772556p4066267.html Sent from the Solr - User mailing list archive at Nabble.com.
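Solr does not do the commit-deferral Alex proposes here, but the idea can be sketched client-side for applications that fire their own explicit commits: hold commits while an optimize is known to be in flight and replay a single deferred commit afterwards. Everything below is illustrative — the class, the guard, and doCommit() are made-up names, not Solr behavior.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical client-side guard implementing the deferral Alex asks for:
// explicit commits issued during an optimize are collapsed into one commit
// that runs when the optimize finishes.
class CommitGate {
    private final AtomicBoolean optimizing = new AtomicBoolean(false);
    private final AtomicBoolean commitPending = new AtomicBoolean(false);

    void beginOptimize() { optimizing.set(true); }

    void endOptimize() {
        optimizing.set(false);
        // replay at most one deferred commit once the optimize is done
        if (commitPending.getAndSet(false)) doCommit();
    }

    // Returns true if the commit ran now, false if it was deferred.
    boolean requestCommit() {
        if (optimizing.get()) {
            commitPending.set(true);
            return false;
        }
        doCommit();
        return true;
    }

    private void doCommit() { /* e.g. call server.commit() here */ }
}
```

This only covers commits the application fires itself; Solr's own autoCommit/autoSoftCommit timers are server-side and cannot be gated this way.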
AW: Core admin action CREATE fails to persist some settings in solr.xml with Solr 4.3
I created SOLR-4862 ... I found no way to assign the ticket to somebody, though (I guess it is under Workflow, but the button is greyed out). Thanks, André
Re: sourceId of JMX
I opened https://issues.apache.org/jira/browse/SOLR-4863 On Mon, May 27, 2013 at 7:35 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: This is a bug. The sourceId should have been removed from the SolrDynamicMBean. I'll create an issue. On Mon, May 27, 2013 at 6:39 PM, 菅沼 嘉一 yo_sugan...@waku-2.com wrote: Hello Our team faced the problem regarding the sourceId of JMX when getting the information of JMX from tomcat manager. Command: curl http://localhost: ${PORT}/manager/jmxproxy?qry=solr:type=documentCache,* Here is the error log (tomcat/manager log). --- 2013/05/27 0:04:01 org.apache.catalina.core.ApplicationContext log JMXProxy: Error getting attribute solr:type=documentCache,id=org.apache.solr.search.LRUCache sourceId javax.management.AttributeNotFoundException: sourceId --- Solr ver. : 4.1.0 I think this error comes from when JMX cannot get the sourceId. BTW Let's look at this issue. https://issues.apache.org/jira/browse/SOLR-3329 It is decided to drop getSourceId() in this issue. But in org.apache.solr.core.JmxMonitoredMap.SolrDynamicMBean, staticStats.add(sourceId) is still defined in SolrDynamicMBean at L211. http://javasourcecode.org/html/open-source/solr/solr-3.3.0/org/apache/solr/core/JmxMonitoredMap.SolrDynamicMBean.java.html#line.202 -- l.211 staticStats.add(sourceId); -- Maybe this error comes from this inconsistency. This problem is not critical, but I think this is inconsistent. 1. Anyone knows why staticStats.add(sourceId) still remained in SolrDynamicMBean? Do you have any idea? 2. Has anyone faced such error ? How did you solved it? Thank you. Regards suganuma -- Regards, Shalin Shekhar Mangar. -- Regards, Shalin Shekhar Mangar.
Re: Distributed query: strange behavior.
Hello, guys! Well, I've done some tests and I think there is some kind of bug related to distributed search. Currently I'm using a key field that cannot be duplicated, and I have experienced the same wrong behavior with the numFound value when changing the rows parameter. Has anyone experienced the same? Best regards, - Luis Cappa 2013/5/27 Luis Cappa Banda luisca...@gmail.com Hi, Erick! That's it! I'm using a custom implementation of a SolrServer with distributed behavior that routes queries and updates using an in-house Round Robin method. But the thing is that I'm doing this myself because I've noticed that duplicated documents appears using LBHttpSolrServer implementation. Last week I modified my implementation to avoid that with this changes: - I have normalized the key field to all documents. Now every document indexed must include *_id_* field that stores the selected key value. The value is setted with a *copyField*. - When I index a new document a *HttpSolrServer* from the shard list is selected using a Round Robin strategy. Then, a field called *_shard_ * is setted to *SolrInputDocument*. That field value includes a relationship with the main shard selected. - If a document wants to be indexed/updated and it includes *_shard_*field to update it automatically the belonged shard ( *HttpSolrServer*) is selected. - If a document wants to be indexed/updated and *_shard_* field is not included then the key value from *_id_* is getted from * SolrInputDocument*. With that key a distributed search query is executed by it's key to retrieve *_shard_* field. With *_shard_* field we can now choose the correct shard (*HttpSolrServer*). It's not a good practice and performance isn't the best, but it's secure. Best Regards, - Luis Cappa 2013/5/26 Erick Erickson erickerick...@gmail.com Valery: I share your puzzlement.
_If_ you are letting Solr do the document routing, and not doing any of the custom routing, then the same unique key should be going to the same shard and replacing the previous doc with that key. But, if you're using custom routing, if you've been experimenting with different configurations and didn't start over, in general if you're configuration is in an interesting state this could happen. So in the normal case if you have a document with the same key indexed in multiple shards, that would indicate a bug. But there are many ways, especially when experimenting, that you could have this happen which are _not_ a bug. I'm guessing that Luis may be trying the custom routing option maybe? Best Erick On Fri, May 24, 2013 at 9:09 AM, Valery Giner valgi...@research.att.com wrote: Shawn, How is it possible for more than one document with the same unique key to appear in the index, even in different shards? Isn't it a bug by definition? What am I missing here? Thanks, Val On 05/23/2013 09:55 AM, Shawn Heisey wrote: On 5/23/2013 1:51 AM, Luis Cappa Banda wrote: I've query each Solr shard server one by one and the total number of documents is correct. However, when I change rows parameter from 10 to 100 the total numFound of documents change: I've seen this problem on the list before and the cause has been determined each time to be caused by documents with the same uniqueKey value appearing in more than one shard. What I think happens here: With rows=10, you get the top ten docs from each of the three shards, and each shard sends its numFound for that query to the core that's coordinating the search. The coordinator adds up numFound, looks through those thirty docs, and arranges them according to the requested sort order, returning only the top 10. In this case, there happen to be no duplicates. With rows=100, you get a total of 300 docs. This time, duplicates are found and removed by the coordinator. 
I think that the coordinator adjusts the total numFound by the number of duplicate documents it removed, in an attempt to be more accurate. I don't know if adjusting numFound when duplicates are found in a sharded query is the right thing to do, I'll leave that for smarter people. Perhaps Solr should return a message with the results saying that duplicates were found, and if a config option is not enabled, the server should throw an exception and return a 4xx HTTP error code. One idea for a config parameter name would be allowShardDuplicates, but something better can probably be found. Thanks, Shawn -- - Luis Cappa -- - Luis Cappa
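Shawn's explanation of why numFound shifts with the rows parameter can be modeled in a few lines: the coordinator sums each shard's numFound, then subtracts only the duplicate uniqueKeys it actually sees among the returned docs — duplicates outside the returned window go unnoticed. This is an illustrative sketch of that behavior, not Solr's actual coordinator code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Model of distributed-search merging with duplicate ids across shards:
// total numFound is adjusted only by duplicates visible in the returned
// rows, which is why it changes as rows grows.
class ShardMerger {
    static long mergedNumFound(List<Long> shardNumFounds, List<List<String>> shardDocIds) {
        long total = shardNumFounds.stream().mapToLong(Long::longValue).sum();
        Set<String> seen = new HashSet<>();
        long dupes = 0;
        for (List<String> ids : shardDocIds) {
            for (String id : ids) {
                if (!seen.add(id)) dupes++; // duplicate id returned by another shard
            }
        }
        return total - dupes;
    }
}
```

With rows=10 the duplicate may not appear in any shard's top ten, so nothing is subtracted; with rows=100 it does appear twice and numFound drops — exactly the symptom Luis reports.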
Re: A strange RemoteSolrException
Yes, I started it up with java -Dsolr.solr.home=example-DIH/solr -jar start.jar. Without the java options I don't get the exceptions either! (I should have checked.) What now? -- From: Shalin Shekhar Mangar shalinman...@gmail.com Sent: Monday, May 27, 2013 3:58 PM To: solr-user@lucene.apache.org Subject: Re: A strange RemoteSolrException I downloaded Solr 4.3.0, started it up with java -jar start.jar (from inside the example directory) and executed your program. No exceptions are thrown. Is there something you did differently? On Mon, May 27, 2013 at 5:45 PM, Hans-Peter Stricker stric...@epublius.de wrote: Hello, I'm writing my first little Solrj program, but don't get it running because of a RemoteSolrException: Server at http://localhost:8983/solr returned non ok status:404 The server is definitely running and the URL works in the browser. I am working with Solr 4.3.0. This is my source code:

public static void main(String[] args) {
    String url = "http://localhost:8983/solr";
    SolrServer server;
    try {
        server = new HttpSolrServer(url);
        server.ping();
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

with the stack trace:

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://localhost:8983/solr returned non ok status:404, message:Not Found
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
    at org.apache.solr.client.solrj.request.SolrPing.process(SolrPing.java:62)
    at org.apache.solr.client.solrj.SolrServer.ping(SolrServer.java:293)
    at de.epublius.blogindexer.App.main(App.java:47)

If I call server.shutdown(), there is no such exception, but there is for almost all other SolrServer methods. What am I doing wrong? Thanks in advance Hans-Peter -- Regards, Shalin Shekhar Mangar.
Re: A strange RemoteSolrException
On 5/27/2013 6:15 AM, Hans-Peter Stricker wrote: I'm writing my first little Solrj program, but don't get it running because of an RemoteSolrException: Server at http://localhost:8983/solr returned non ok status:404 The server is definitely running and the url works in the browser. I am working with Solr 4.3.0. Hans, To use SolrJ against the URL that you provided, you must have a defaultCoreName attribute in solr.xml that points at a core that exists. The defaultCoreName used in the old-style solr.xml (4.3.1 and earlier) is collection1. The new-style solr.xml (4.4 and later, when released) does not define a default core name. A far safer option for any Solr client API is to use a base URL that includes the name of the core. If you are using SolrCloud, you can optionally use the collection name instead. This option will be required for the new style solr.xml. Here's the format: http://server:port/solr/corename In the UI, the cores that exist will be in a left-side dropdown that says Core Selector. If you are using SolrCloud, you can click on the Cloud option in the UI and then on Graph to see the collection names. They will be on the left side of the graph. NB: If you are using SolrCloud, it is better to use CloudSolrServer instead of HttpSolrServer. Thanks, Shawn
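Shawn's safer-URL advice boils down to always composing the base URL from server, port, and core name before handing it to the client. A tiny helper (hypothetical, for illustration; the core name "collection1" is just the old-style default he mentions) makes the format explicit:

```java
public class SolrUrls {
    // Build http://server:port/solr/corename -- the per-core base URL that
    // works regardless of whether solr.xml defines a defaultCoreName.
    static String coreUrl(String host, int port, String coreName) {
        return "http://" + host + ":" + port + "/solr/" + coreName;
    }

    public static void main(String[] args) {
        // Pass the result to new HttpSolrServer(...) instead of the bare
        // /solr root, which 404s when no default core is configured.
        System.out.println(coreUrl("localhost", 8983, "collection1"));
        // http://localhost:8983/solr/collection1
    }
}
```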
Re: A strange RemoteSolrException
On 5/27/2013 8:24 AM, Hans-Peter Stricker wrote: Yes, I started it up with java -Dsolr.solr.home=example-DIH/solr -jar start.jar. That explains it. See my other reply. The solr.xml file for example-DIH does not have a defaultCoreName attribute. Thanks, Shawn
Re: A strange RemoteSolrException
Dear Shawn, dear Shalin, thanks for your valuable replies! Could/should I have known better (by reading the manual more carefully)? I'll try to fix it, and I am confident that it will work! Best regards Hans -- From: Shawn Heisey s...@elyograg.org Sent: Monday, May 27, 2013 4:29 PM To: solr-user@lucene.apache.org Subject: Re: A strange RemoteSolrException On 5/27/2013 8:24 AM, Hans-Peter Stricker wrote: Yes, I started it up with java -Dsolr.solr.home=example-DIH/solr -jar start.jar. That explains it. See my other reply. The solr.xml file for example-DIH does not have a defaultCoreName attribute. Thanks, Shawn
RE: Tika: How can I import automatically all metadata without specifying them explicitly
Thanks a lot, more useful hints; standalone Tika could probably be a solution. I have another little question: how can I express filters in the DIH configuration to run imports incrementally? I actually have two distinct scenarios. In the first scenario documents are stored inside a database, so I need to write a DIH config to import data from the database, and since I have a timestamp column this is not a problem. Second scenario: I need to monitor one folder and do incremental population every 15 minutes. Usually with a SQL DIH I use some column as a filter to do incremental population, but I wonder if it is possible to pass a filter to BinFileDataSource, telling it to process only new files and those modified after a timestamp (the last run). Thanks again for all your precious suggestions. -- Gian Maria Ricci Mobile: +39 320 0136949 -Original Message- From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] Sent: Monday, May 27, 2013 1:44 PM To: solr-user@lucene.apache.org Subject: RE: Tika: How can I import automatically all metadata without specifying them explicitly Standalone Tika can also run in a network server mode. That increases data roundtrips but gives you more options. Even in .NET. Regards, Alex On 27 May 2013 04:22, Gian Maria Ricci alkamp...@nablasoft.com wrote: Thanks for the help. @Alexandre: Thanks for the suggestion, I'll try to use an ExtractingRequestHandler; I thought that I was missing some DIH option :). @Erik: I'm interested in knowing them all to do various forms of analysis. I have documents coming from heterogeneous sources and I'm interested in searching inside the content, but also in being able to extract all possible metadata. I'm working in .NET, so it is useful to let Tika do everything for me directly in Solr and then retrieve all metadata for matched documents. Thanks again to everyone. 
-- Gian Maria Ricci Mobile: +39 320 0136949 -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Sunday, May 26, 2013 5:30 PM To: solr-user@lucene.apache.org; Gian Maria Ricci Subject: Re: Tika: How can I import automatically all metadata without specifying them explicitly In addition to Alexandre's comment: bq: ...I'd like to import in my index all metadata Be a little careful here; this isn't actually very useful in my experience. Sure, it's nice to have all that data in the index, but... how do you search it meaningfully? Consider that some doc may have an author metadata field. Another may have a last editor field. Yet another may have a main author field. If you add all these under their own field names, what do you do to search for author? Somehow you have to create a mapping between the various metadata names and something that's searchable, so why not do this at index time? Not to mention I've seen this done, and the result may be literally hundreds of different metadata fields which are not very useful. All that said, it may be perfectly valid to index them all, but before going there it's worth considering whether the result is actually _useful_. Best Erick On Sat, May 25, 2013 at 4:44 AM, Gian Maria Ricci alkamp...@nablasoft.com wrote: Hi everyone, I've configured import of a document folder with FileListEntityProcessor; everything went smoothly on the first try, but I have a simple question. I'm able to map metadata without any problem, but I'd like to import all metadata into my index, not only those I've configured with field nodes. In this example I've imported Author and title, but I do not know in advance which metadata a document could have, and I wish to have all of them inside my index. Here is my import config. It is my first try at importing with Tika, and probably I'm missing something simple. 
<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/temp/docs"
            fileName=".*\.(doc)|(pdf)|(docx)"
            onError="skip"
            recursive="true">
      <field column="file" name="id" />
      <field column="fileAbsolutePath" name="path" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />
    </entity>
  </document>
</dataConfig>
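Regarding the incremental-import question earlier in this thread: FileListEntityProcessor accepts a newerThan attribute, so (assuming the syntax documented on the DIH wiki) the file entity can be restricted to files modified since the last DIH run:

```xml
<!-- Sketch only: newerThan filters on file modification time, and the
     ${dataimporter.last_index_time} variable holds the previous run's
     timestamp. Verify the exact quoting against the DIH wiki page. -->
<entity name="files" dataSource="null" rootEntity="false"
        processor="FileListEntityProcessor"
        baseDir="c:/temp/docs"
        fileName=".*\.(doc)|(pdf)|(docx)"
        newerThan="'${dataimporter.last_index_time}'"
        onError="skip" recursive="true">
  <field column="file" name="id" />
  <field column="fileAbsolutePath" name="path" />
  <field column="fileSize" name="size" />
  <field column="fileLastModified" name="lastModified" />
</entity>
```

Running delta-style imports every 15 minutes then becomes a matter of invoking full-import on this filtered entity from a scheduler.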
[blog post] Automatically Acquiring Synonym Knowledge from Wikipedia
Hello, Sorry for cross post. I just wanted to announce that I've written a blog post on how to create synonyms.txt file automatically from Wikipedia: http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html Hope that the article gives someone a good experience! koji -- http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html
Re: A strange RemoteSolrException
On 5/27/2013 8:34 AM, Hans-Peter Stricker wrote: Dear Shawn, dear Shalin, thanks for your valuable replies! Could/should I have known better (by reading more carefully the manual)? I just looked at the wiki. The SolrJ wiki page doesn't mention using the core name, which I find surprising, because Solr has had multicore capability for a REALLY long time, and it has been the default in the example since the 3.x days. The only wiki example code that has a URL with a core name is the code in the database example: http://wiki.apache.org/solr/Solrj#Reading_data_from_a_database Oddly enough, I wrote and contributed that example, but it was a few years ago and I haven't looked at it since. When I find the time, I will go through the SolrJ wiki page and bring it into this decade. Multicore operation is very likely going to be required on Solr 5.0 when that version comes out. If anyone else wants to update the wiki, feel free. If you don't already have edit permission, just ask. We can add your wiki username to the contributors group. Thanks, Shawn
Re: Note on The Book
Now my contribution can be read on soleami blog in English: Automatically Acquiring Synonym Knowledge from Wikipedia http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html koji (13/05/27 21:16), Jack Krupansky wrote: If you would like to Solr-ize your contribution, that would be great. The focus of the book will be hard-core Solr. -- Jack Krupansky -Original Message- From: Koji Sekiguchi Sent: Monday, May 27, 2013 8:07 AM To: solr-user@lucene.apache.org Subject: Re: Note on The Book Hi Jack, I'd like to ask as a person who contributed a case study article about Automatically acquiring synonym knowledge from Wikipedia to the book. (13/05/24 8:14), Jack Krupansky wrote: To those of you who may have heard about the Lucene/Solr book that I and two others are writing on Lucene and Solr, some bad and good news. The bad news: The book contract with O’Reilly has been canceled. The good news: I’m going to proceed with self-publishing (possibly on Lulu or even Amazon) a somewhat reduced scope Solr-only Reference Guide (with hints of Lucene). The scope of the previous effort was too great, even for O’Reilly – a book larger than 800 pages (or even 600) that was heavy on reference and lighter on “guide” just wasn’t fitting in with their traditional “guide” model. In truth, Solr is just too complex for a simple guide that covers it all, let alone Lucene as well. Will the reduced Solr-only reference guide include my article? If not (for now I think it is not because my article is for Lucene case study, not Solr), I'd like to put it out on my blog or somewhere. BTW, those who want to know how to acquire synonym knowledge from Wikipedia, the summary is available at slideshare: http://www.slideshare.net/KojiSekiguchi/wikipediasolr koji -- http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html
Specifiy colums to return for mlt results
Hi, I'm executing a search and retrieving more-like-this results. For the search results, I can specify the columns to be returned via the fl parameter. The mlt.fl parameter defines the columns to be used for similarity calculation. The mlt results seem to return the columns specified in fl too. Is there a way to specify different columns for the search and mlt results? kind regards, Achim
Problems with DIH in Solrj
I start the SOLR example with java -Dsolr.solr.home=example-DIH/solr -jar start.jar and run

public static void main(String[] args) {
    String url = "http://localhost:8983/solr/rss";
    SolrServer server;
    SolrQuery query;
    try {
        server = new HttpSolrServer(url);
        query = new SolrQuery();
        query.setParam(CommonParams.QT, "/dataimport");
        QueryRequest request = new QueryRequest(query);
        QueryResponse response = request.process(server);
        server.commit();
        System.out.println(response.toString());
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

without exception, and the response string is

{responseHeader={status=0,QTime=0},initArgs={defaults={config=rss-data-config.xml}},status=idle,importResponse=,statusMessages={},WARNING=This response format is experimental. It is likely to change in the future.}

The Lucene index is touched but not really updated: there are only segments.gen and segments_a files of size 1Kb. If I execute /dataimport (full-import with option commit checked) from http://localhost:8983/solr/#/rss/dataimport//dataimport I get

{
  "responseHeader": { "status": 0, "QTime": 1 },
  "initArgs": [ "defaults", [ "config", "rss-data-config.xml" ] ],
  "command": "status",
  "status": "idle",
  "importResponse": "",
  "statusMessages": {
    "Total Requests made to DataSource": "1",
    "Total Rows Fetched": "10",
    "Total Documents Skipped": "0",
    "Full Dump Started": "2013-05-27 17:57:07",
    "": "Indexing completed. Added/Updated: 10 documents. Deleted 0 documents.",
    "Committed": "2013-05-27 17:57:07",
    "Total Documents Processed": "10",
    "Time taken": "0:0:0.603"
  },
  "WARNING": "This response format is experimental. It is likely to change in the future."
}

What am I doing wrong?
Re: Problems with DIH in Solrj
Your program is not specifying a command. You need to add: query.setParam("command", "full-import"); On Mon, May 27, 2013 at 9:31 PM, Hans-Peter Stricker stric...@epublius.de wrote: I start the SOLR example with java -Dsolr.solr.home=example-DIH/solr -jar start.jar and run public static void main(String[] args) { String url = "http://localhost:8983/solr/rss"; SolrServer server; SolrQuery query; try { server = new HttpSolrServer(url); query = new SolrQuery(); query.setParam(CommonParams.QT, "/dataimport"); QueryRequest request = new QueryRequest(query); QueryResponse response = request.process(server); server.commit(); System.out.println(response.toString()); } catch (Exception ex) { ex.printStackTrace(); } } without exception and the response string as {responseHeader={status=0,QTime=0},initArgs={defaults={config=rss-data-config.xml}},status=idle,importResponse=,statusMessages={},WARNING=This response format is experimental. It is likely to change in the future.} The Lucene index is touched but not really updated: there are only segments.gen and segments_a files of size 1Kb. If I execute /dataimport (full-import with option commit checked) from http://localhost:8983/solr/#/rss/dataimport//dataimport I get { responseHeader: { status: 0, QTime: 1 }, initArgs: [ defaults, [ config, rss-data-config.xml ] ], command: status, status: idle, importResponse: , statusMessages: { Total Requests made to DataSource: 1, Total Rows Fetched: 10, Total Documents Skipped: 0, Full Dump Started: 2013-05-27 17:57:07, : Indexing completed. Added/Updated: 10 documents. Deleted 0 documents., Committed: 2013-05-27 17:57:07, Total Documents Processed: 10, Time taken: 0:0:0.603 }, WARNING: This response format is experimental. It is likely to change in the future. } What am I doing wrong? -- Regards, Shalin Shekhar Mangar.
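Shalin's fix corresponds to one extra HTTP parameter: without command, the /dataimport handler only reports its status, which is why the original program got a harmless status response and no import. A small sketch of the parameters SolrJ ends up sending (qt, command, and commit are documented DIH/Solr parameter names; the helper class itself is illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DihRequest {
    // Render a parameter map as the query string that reaches Solr.
    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("qt", "/dataimport");      // route to the DIH handler
        params.put("command", "full-import"); // the missing piece: without it,
                                              // DIH only reports status
        params.put("commit", "true");
        System.out.println(toQueryString(params));
        // qt=/dataimport&command=full-import&commit=true
    }
}
```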
Re: Problems with DIH in Solrj
Marvelous!! Once again: where could/should I have read this? What kinds of concepts/keywords are "command" and "full-import"? (I couldn't find them in any config file. Where are they explained?) Anyway: now it works like a charm! Thanks Hans -- From: Shalin Shekhar Mangar shalinman...@gmail.com Sent: Monday, May 27, 2013 6:09 PM To: solr-user@lucene.apache.org Subject: Re: Problems with DIH in Solrj Your program is not specifying a command. You need to add: query.setParam("command", "full-import"); On Mon, May 27, 2013 at 9:31 PM, Hans-Peter Stricker stric...@epublius.de wrote: I start the SOLR example with java -Dsolr.solr.home=example-DIH/solr -jar start.jar and run public static void main(String[] args) { String url = "http://localhost:8983/solr/rss"; SolrServer server; SolrQuery query; try { server = new HttpSolrServer(url); query = new SolrQuery(); query.setParam(CommonParams.QT, "/dataimport"); QueryRequest request = new QueryRequest(query); QueryResponse response = request.process(server); server.commit(); System.out.println(response.toString()); } catch (Exception ex) { ex.printStackTrace(); } } without exception and the response string as {responseHeader={status=0,QTime=0},initArgs={defaults={config=rss-data-config.xml}},status=idle,importResponse=,statusMessages={},WARNING=This response format is experimental. It is likely to change in the future.} The Lucene index is touched but not really updated: there are only segments.gen and segments_a files of size 1Kb. If I execute /dataimport (full-import with option commit checked) from http://localhost:8983/solr/#/rss/dataimport//dataimport I get { responseHeader: { status: 0, QTime: 1 }, initArgs: [ defaults, [ config, rss-data-config.xml ] ], command: status, status: idle, importResponse: , statusMessages: { Total Requests made to DataSource: 1, Total Rows Fetched: 10, Total Documents Skipped: 0, Full Dump Started: 2013-05-27 17:57:07, : Indexing completed. Added/Updated: 10 documents. 
Deleted 0 documents., Committed: 2013-05-27 17:57:07, Total Documents Processed: 10, Time taken: 0:0:0.603 }, WARNING: This response format is experimental. It is likely to change in the future. } What am I doing wrong? -- Regards, Shalin Shekhar Mangar.
Re: Problems with DIH in Solrj
Details about the DataImportHandler are on the wiki: http://wiki.apache.org/solr/DataImportHandler In general, the SolrJ client just makes HTTP requests to the corresponding Solr APIs, so you need to learn about the HTTP parameters for the corresponding Solr component. The Solr wiki is your best bet: http://wiki.apache.org/solr/FrontPage On Mon, May 27, 2013 at 9:50 PM, Hans-Peter Stricker stric...@epublius.de wrote: Marvelous!! Once again: where could/should I have read this? What kinds of concepts/keywords are command and full-import? (Couldn't find them in any config file. Where are they explained?) Anyway: Now it works like a charm! Thanks Hans -- From: Shalin Shekhar Mangar shalinman...@gmail.com Sent: Monday, May 27, 2013 6:09 PM To: solr-user@lucene.apache.org Subject: Re: Problems with DIH in Solrj Your program is not specifying a command. You need to add: query.setParam("command", "full-import"); On Mon, May 27, 2013 at 9:31 PM, Hans-Peter Stricker stric...@epublius.de wrote: I start the SOLR example with java -Dsolr.solr.home=example-DIH/solr -jar start.jar and run public static void main(String[] args) { String url = "http://localhost:8983/solr/rss"; SolrServer server; SolrQuery query; try { server = new HttpSolrServer(url); query = new SolrQuery(); query.setParam(CommonParams.QT, "/dataimport"); QueryRequest request = new QueryRequest(query); QueryResponse response = request.process(server); server.commit(); System.out.println(response.toString()); } catch (Exception ex) { ex.printStackTrace(); } } without exception and the response string as {responseHeader={status=0,QTime=0},initArgs={defaults={config=rss-data-config.xml}},status=idle,importResponse=,statusMessages={},WARNING=This response format is experimental. It is likely to change in the future.} The Lucene index is touched but not really updated: there are only segments.gen and segments_a files of size 1Kb. 
If I execute /dataimport (full-import with option commit checked) from http://localhost:8983/solr/#/rss/dataimport//dataimport I get { responseHeader: { status: 0, QTime: 1 }, initArgs: [ defaults, [ config, rss-data-config.xml ] ], command: status, status: idle, importResponse: , statusMessages: { Total Requests made to DataSource: 1, Total Rows Fetched: 10, Total Documents Skipped: 0, Full Dump Started: 2013-05-27 17:57:07, : Indexing completed. Added/Updated: 10 documents. Deleted 0 documents., Committed: 2013-05-27 17:57:07, Total Documents Processed: 10, Time taken: 0:0:0.603 }, WARNING: This response format is experimental. It is likely to change in the future. } What am I doing wrong? -- Regards, Shalin Shekhar Mangar. -- Regards, Shalin Shekhar Mangar.
Re: Problems with DIH in Solrj
On 5/27/2013 10:20 AM, Hans-Peter Stricker wrote: Marvelous!! Once again: where could/should I have read this? What kinds of concepts/keywords are command and full-import? (Couldn't find them in any config file. Where are they explained?) Anyway: Now it works like a charm! http://wiki.apache.org/solr/DataImportHandler#Commands The CommonParams.QT syntax that you used only works with SolrJ 4.0 and newer, and those versions have a shortcut that's slightly easier to read: query.setRequestHandler(/dataimport); The reason that there are no real examples of using DIH with SolrJ is because if you are using SolrJ, it is expected that your application will be doing the indexing itself, with the add method on the server object. I'll point you once again to the database example: http://wiki.apache.org/solr/Solrj#Reading_data_from_a_database I do use DIH on occasion - whenever I do a full rebuild of my index, DIH does the job a lot faster than my own code. I handle it from SolrJ. Sending a full-import or delta-import command to Solr returns to SolrJ immediately. You will only see a failure on that request if something major fails with the request itself, it won't tell you anything about whether the import succeeded or failed. You must periodically check the status. Interpreting the status in a program is a complicated endeavor, because the status is human readable, not machine readable, and important information is added or removed from the response at various success and error stages. There have been a number of issues on this. I filed most of them: https://issues.apache.org/jira/browse/SOLR-2728 https://issues.apache.org/jira/browse/SOLR-2729 https://issues.apache.org/jira/browse/SOLR-3319 https://issues.apache.org/jira/browse/SOLR-3689 https://issues.apache.org/jira/browse/SOLR-4241 I do have SolrJ code that interprets DIH status, but it's tied up in a larger work and will require some cleanup before I can share it. Thanks, Shawn
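Because the DIH status response Shawn describes is human-readable rather than machine-readable, client code that polls it ends up string-matching. A rough, hypothetical sketch of such a check (this is not Shawn's code; the markers are taken from the status output quoted earlier in this thread, and it is brittle by design, which is what SOLR-2728 and friends are about):

```java
public class DihStatus {
    // The import is finished when the handler reports "idle" again.
    // Matches both the toString() and the JSON renderings of the response.
    static boolean isIdle(String response) {
        return response.contains("status=idle")
            || response.contains("\"status\": \"idle\"");
    }

    // Success is only signaled by a free-text message in statusMessages.
    static boolean completed(String response) {
        return response.contains("Indexing completed");
    }

    public static void main(String[] args) {
        String resp = "{responseHeader={status=0,QTime=0},status=idle,"
                + "statusMessages={: Indexing completed. Added/Updated: 10 documents.}}";
        System.out.println(isIdle(resp) && completed(resp)); // true
    }
}
```

A real poller would fetch this response periodically over HTTP and also watch for failure markers, since errors add and remove fields as Shawn notes.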
Unable to start solr 4.3
I've a test VM where I usually test Solr installations. In that VM I already configured Solr 4.0 and everything went well. Today I downloaded the 4.3 version, unpacked everything, and configured Tomcat as I did for the 4.0 version, but the application does not start, and in the catalina log I find only May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext startInternal SEVERE: Error filterStart May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext startInternal SEVERE: Context [/TestInstance43] startup failed due to previous errors Where can I find more info on what is wrong? It seems that in the log file there is no detailed information on why the application does not start. -- Gian Maria Ricci Mobile: +39 320 0136949 http://mvp.microsoft.com/en-us/mvp/Gian%20Maria%20Ricci-4025635 http://www.linkedin.com/in/gianmariaricci https://twitter.com/alkampfer http://feeds.feedburner.com/AlkampferEng skype://alkampferaok/
Re: Unable to start solr 4.3
The usual answer (which may or may not be relevant) is that Solr 4.3 has moved the logging libraries around and you need to copy specific library implementations to your Tomcat lib files. If that sounds like a possible cause, search the mailing list for a number of detailed discussions on this topic. Regards, Alex Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Mon, May 27, 2013 at 1:34 PM, Gian Maria Ricci alkamp...@nablasoft.com wrote: I've a test VM where I usually test Solr installations. In that VM I already configured Solr 4.0 and everything went well. Today I downloaded the 4.3 version, unpacked everything, and configured Tomcat as I did for the 4.0 version, but the application does not start, and in the catalina log I find only May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext startInternal SEVERE: Error filterStart May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext startInternal SEVERE: Context [/TestInstance43] startup failed due to previous errors Where can I find more info on what is wrong? It seems that in the log file there is no detailed information on why the application does not start. -- Gian Maria Ricci Mobile: +39 320 0136949 http://mvp.microsoft.com/en-us/mvp/Gian%20Maria%20Ricci-4025635 http://www.linkedin.com/in/gianmariaricci https://twitter.com/alkampfer http://feeds.feedburner.com/AlkampferEng
Prevention of heavy wildcard queries
Hi. Searching for terms with a wildcard at the start is solved with ReversedWildcardFilterFactory. But what about terms with a wildcard at both the start AND the end? Such a query is heavy, and I want to disallow these queries for my users. I'm looking for a way to cause them to fail. I guess there is no built-in support for my need, so it is OK to write a new solution. My current plan is to create a search component (which will run before QueryComponent). It should analyze the query string and drop the query if too-heavy wildcards are found. Another option is to create a query parser which wraps the current (specified or default) qparser and does the same work as above. These two options require an analysis of the query text, which might be ugly work (just think about nested queries [using _query_], or even a lot of more basic scenarios like quoted terms, etc.). Am I missing a simple and clean way to do this? What would you do? P.S. If no simple solution exists, the timeAllowed limit is the best work-around I could think of. Any other suggestions?
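The search-component idea in the question can be prototyped as a pre-check on the raw query string. This regex heuristic is hypothetical and deliberately naive (it ignores quoted phrases, field prefixes, and nested queries, which is exactly the ugliness the poster anticipates), but it shows the double-ended-wildcard test itself:

```java
import java.util.regex.Pattern;

public class WildcardGuard {
    // Matches a whitespace-delimited term that both starts and ends with
    // * or ? (e.g. *foo* or ?bar?) -- the double-ended case that neither a
    // normal index nor ReversedWildcardFilterFactory can serve cheaply.
    private static final Pattern DOUBLE_WILDCARD =
            Pattern.compile("(^|\\s)[*?]\\S*[*?](\\s|$)");

    static boolean tooHeavy(String q) {
        return DOUBLE_WILDCARD.matcher(q).find();
    }

    public static void main(String[] args) {
        System.out.println(tooHeavy("*foo*")); // true  -> reject
        System.out.println(tooHeavy("foo*"));  // false -> allow
        System.out.println(tooHeavy("*foo"));  // false -> handled by
                                               // ReversedWildcardFilterFactory
    }
}
```

In a real SearchComponent or wrapping QParserPlugin, a positive match would abort the request before QueryComponent runs.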
Re: Unable to start solr 4.3
On 5/27/2013 12:00 PM, Alexandre Rafalovitch wrote: The usual answer (which may or may not be relevant) is that Solr 4.3 has moved the logging libraries around and you need to copy specific library implementations to your Tomcat lib files. If that sounds as a possible, search the mailing list for a number of detailed discussions on this topic. snip alkamp...@nablasoft.comwrote: I’ve a test VM where I usually test solr installation. In that VM I already configured solr4.0 and everything went good. Today I download the 4.3 version, unpack everything, configuring TOMCAT as I did for the 4.0 version but the application does not start, and in catilina log I find only Alexandre has probably nailed the problem here. A little more detail on how to fix it: The section of the SolrLogging wiki page on how to use the example logging with another container will be exactly what you need. http://wiki.apache.org/solr/SolrLogging#Using_the_example_logging_setup_in_containers_other_than_Jetty This section is also followed by instructions for switching back to java.util.logging. Because the example setup takes logging control away from tomcat, this is something that many tomcat users will want to do. Thanks, Shawn
RE: Unable to start solr 4.3
Thanks, I'll check :) -- Gian Maria Ricci Mobile: +39 320 0136949 -Original Message- From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] Sent: Monday, May 27, 2013 8:00 PM To: solr-user@lucene.apache.org; alkamp...@nablasoft.com Subject: Re: Unable to start solr 4.3 The usual answer (which may or may not be relevant) is that Solr 4.3 has moved the logging libraries around and you need to copy specific library implementations to your Tomcat lib files. If that sounds like a possible cause, search the mailing list for a number of detailed discussions on this topic. Regards, Alex Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Mon, May 27, 2013 at 1:34 PM, Gian Maria Ricci alkamp...@nablasoft.com wrote: I've a test VM where I usually test Solr installations. In that VM I already configured Solr 4.0 and everything went well. Today I downloaded the 4.3 version, unpacked everything, and configured Tomcat as I did for the 4.0 version, but the application does not start, and in the catalina log I find only May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext startInternal SEVERE: Error filterStart May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext startInternal SEVERE: Context [/TestInstance43] startup failed due to previous errors Where can I find more info on what is wrong? It seems that in the log file there is no detailed information on why the application does not start. -- Gian Maria Ricci Mobile: +39 320 0136949 http://mvp.microsoft.com/en-us/mvp/Gian%20Maria%20Ricci-4025635 http://www.linkedin.com/in/gianmariaricci https://twitter.com/alkampfer http://feeds.feedburner.com/AlkampferEng
Re: Prevention of heavy wildcard queries
You are right that starting to parse the query before the query component can soon get very ugly and complicated. You should take advantage of the flex parser; it is already in Lucene contrib, but if you are interested in the better version, look at https://issues.apache.org/jira/browse/LUCENE-5014 The way you can solve this is: 1. use the standard syntax grammar (which allows *foo*) 2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case, or raise an error, etc. This way, you are changing semantics but don't need to touch the syntax definition; of course, you may also change the grammar and allow only one instance of a wildcard (or some combination), but for that you should probably use LUCENE-5014. roman On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi. Searching terms with wildcard in their start, is solved with ReversedWildcardFilterFactory. But, what about terms with wildcard in both start AND end? This query is heavy, and I want to disallow such queries from my users. I'm looking for a way to cause these queries to fail. I guess there is no built-in support for my need, so it is OK to write a new solution. My current plan is to create a search component (which will run before QueryComponent). It should analyze the query string, and to drop the query if too heavy wildcard are found. Another option is to create a query parser, which wraps the current (specified or default) qparser, and does the same work as above. These two options require an analysis of the query text, which might be an ugly work (just think about nested queries [using _query_], OR even a lot of more basic scenarios like quoted terms, etc.) Am I missing a simple and clean way to do this? What would you do? P.S. if no simple solution exists, timeAllowed limit is the best work-around I could think about. Any other suggestions?
RE: Unable to start solr 4.3
Thanks a lot - it seems that Solr wouldn't start because all the logging libraries were missing. Once I copied the needed logging libraries into c:\tomcat\libs, Solr started with no problem. If anyone else is interested, here is the wiki page that describes the logging changes in Solr 4.3: http://wiki.apache.org/solr/SolrLogging#What_changed Thanks Alexandre for pointing me in the right direction :)

-- Gian Maria Ricci Mobile: +39 320 0136949

-Original Message- From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] Sent: Monday, May 27, 2013 8:00 PM To: solr-user@lucene.apache.org; alkamp...@nablasoft.com Subject: Re: Unable to start solr 4.3

The usual answer (which may or may not be relevant) is that Solr 4.3 has moved the logging libraries around, and you need to copy the specific library implementations into your Tomcat lib directory. If that sounds plausible, search the mailing list for a number of detailed discussions on this topic. Regards, Alex

Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Mon, May 27, 2013 at 1:34 PM, Gian Maria Ricci alkamp...@nablasoft.com wrote:

> I have a test VM where I usually test Solr installations. In that VM I already configured Solr 4.0 and everything went well. Today I downloaded the 4.3 version, unpacked everything, and configured Tomcat as I did for the 4.0 version, but the application does not start, and in the catalina log I find only:
>
> May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext startInternal SEVERE: Error filterStart
> May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext startInternal SEVERE: Context [/TestInstance43] startup failed due to previous errors
>
> Where can I find more info on what is wrong?
It seems that in the log file there is no detailed information on why the application did not start.

-- Gian Maria Ricci Mobile: +39 320 0136949 http://mvp.microsoft.com/en-us/mvp/Gian%20Maria%20Ricci-4025635
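For reference, the jar shuffle described above usually amounts to copying the SLF4J/log4j jars that Solr 4.3 ships in example/lib/ext into the servlet container's lib directory. A sketch on a Unix-style layout (the install paths are assumptions; adjust to your own setup, e.g. c:\tomcat\lib on Windows):

```shell
# Solr 4.3 no longer bundles the logging implementation inside solr.war,
# so the servlet container must provide it (see the SolrLogging wiki page).
# Copy the logging jars shipped with the Solr example into Tomcat's lib:
cp /opt/solr-4.3.0/example/lib/ext/*.jar /opt/tomcat/lib/
# Optionally copy the example log4j configuration as well:
cp /opt/solr-4.3.0/example/resources/log4j.properties /opt/tomcat/lib/
```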
Re: Keeping a rolling window of indexes around solr
Hi, SolrCloud now has the same index aliasing as Elasticsearch. I can't look up the link now, but Zoie from LinkedIn has Hourglass, which it uses for a circular-buffer sort of index setup, if I recall correctly. Otis Solr ElasticSearch Support http://sematext.com/

On May 24, 2013 10:26 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:

> Hello Solr community folks, I am doing some investigative work around how to roll and manage indexes inside our Solr configuration. To date I've come up with an architecture that separates a set of masters focused on writes, which get replicated periodically, from a set of slave shards strictly focused on reads; additionally, for each master index the design contains partial purges which get performed on each of the slave shards as well as the master to keep the data current. However, the architecture seems a bit more complex than I'd like, with a lot of moving pieces. I was wondering if anyone has ever handled/designed an architecture around a conveyor belt or rolling window of indexes holding n days of data, and if there are best practices around this. One thing I was thinking about was whether to keep a conveyor-belt list of the slave shards and rotate them as needed, and periodically drop the master and make its backup temporarily the master. Anyway, I would love to hear thoughts and similar use cases from the community. Regards
Re: Keeping a rolling window of indexes around solr
But how is Hourglass going to help Solr? Or is it a portable implementation?

Regards, Alex.

Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Mon, May 27, 2013 at 3:48 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

> Hi, SolrCloud now has the same index aliasing as Elasticsearch. I can't look up the link now, but Zoie from LinkedIn has Hourglass, which it uses for a circular-buffer sort of index setup, if I recall correctly. Otis [...]
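The collection aliasing Otis mentions can be driven from the SolrCloud Collections API. A hedged sketch of a rolling-window setup (the collection names, alias name, and host are invented for the example; this assumes a SolrCloud deployment on Solr 4.2+, where CREATEALIAS is available):

```shell
# Create one collection per day (done elsewhere), then point a stable
# alias at the most recent N collections. Clients only ever query "logs".
curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=logs&collections=logs_20130526,logs_20130527,logs_20130528"

# Rolling the window forward is just re-issuing CREATEALIAS with the new
# list, then deleting the collection that fell out of the window:
curl "http://localhost:8983/solr/admin/collections?action=DELETE&name=logs_20130525"
```

This sidesteps the master/slave rotation machinery described in the original question: dropping old data becomes a whole-collection delete rather than a partial purge.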
Re: Prevention of heavy wildcard queries
Thanks Roman. Based on some of your suggestions, will the steps below do the work?

* Create (and register) a new SearchComponent.
* In its prepare method, run the check for Q and for all of the FQs (so this SearchComponent should run AFTER QueryComponent, in order to see all of the FQs).
* Create an org.apache.lucene.queryparser.flexible.core.StandardQueryParser with a special implementation of QueryNodeProcessorPipeline, which contains my NodeProcessor at the top of its list.
* Set my analyzer into that StandardQueryParser.
* My NodeProcessor will be called for each term in the query, so it can throw an exception if a (basic) query node contains a wildcard at both the start and the end of the term.

Do I have a way to avoid reimplementing the whole StandardQueryParser class? Will this work for both LuceneQParser and EdismaxQParser queries? Any other solution/work-around? How do other production environments of Solr overcome this issue?

On Mon, May 27, 2013 at 10:15 PM, Roman Chyla roman.ch...@gmail.com wrote:

> You are right that starting to parse the query before the query component can soon get very ugly and complicated. You should take advantage of the flex parser; it is already in lucene contrib, but if you are interested in the better version, look at https://issues.apache.org/jira/browse/LUCENE-5014 [...]
RE: sourceId of JMX
Thank you, Shalin. I'll see it.

-Original Message- From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] Sent: Monday, May 27, 2013 11:11 PM To: solr-user@lucene.apache.org Subject: Re: sourceId of JMX

I opened https://issues.apache.org/jira/browse/SOLR-4863

On Mon, May 27, 2013 at 7:35 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

> This is a bug. The sourceId should have been removed from the SolrDynamicMBean. I'll create an issue.

On Mon, May 27, 2013 at 6:39 PM, 菅沼 嘉一 yo_sugan...@waku-2.com wrote:

> Hello. Our team faced a problem regarding the sourceId of JMX when getting JMX information from the Tomcat manager. Command: curl http://localhost:${PORT}/manager/jmxproxy?qry=solr:type=documentCache,*
> Here is the error log (tomcat/manager log):
> ---
> 2013/05/27 0:04:01 org.apache.catalina.core.ApplicationContext log JMXProxy: Error getting attribute solr:type=documentCache,id=org.apache.solr.search.LRUCache sourceId javax.management.AttributeNotFoundException: sourceId
> ---
> Solr version: 4.1.0. I think this error occurs when JMX cannot get the sourceId. BTW, let's look at this issue: https://issues.apache.org/jira/browse/SOLR-3329 It was decided to drop getSourceId() there. But in org.apache.solr.core.JmxMonitoredMap.SolrDynamicMBean, staticStats.add(sourceId) is still present at line 211: http://javasourcecode.org/html/open-source/solr/solr-3.3.0/org/apache/solr/core/JmxMonitoredMap.SolrDynamicMBean.java.html#line.202
> --
> l.211 staticStats.add(sourceId);
> --
> Maybe this error comes from that inconsistency. This problem is not critical, but I think it is inconsistent. 1. Does anyone know why staticStats.add(sourceId) remained in SolrDynamicMBean? Do you have any idea? 2. Has anyone faced such an error? How did you solve it? Thank you. Regards suganuma

-- Regards, Shalin Shekhar Mangar.
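The failure mode suganuma describes - an attribute advertised in the MBean info but no longer backed by a getter - can be reproduced in isolation with the stock javax.management API. A small sketch (class, domain, and attribute names are invented for the demo; it only mimics the shape of the Solr bug, where sourceId stayed in staticStats after getSourceId() was dropped):

```java
import javax.management.*;
import java.lang.management.ManagementFactory;

// A dynamic MBean that ADVERTISES an attribute ("sourceId") it can no
// longer serve - mirroring SolrDynamicMBean after getSourceId() was dropped.
public class StaleAttributeMBean implements DynamicMBean {
    public Object getAttribute(String name) throws AttributeNotFoundException {
        throw new AttributeNotFoundException(name); // nothing backs it anymore
    }
    public void setAttribute(Attribute a) {}
    public AttributeList getAttributes(String[] names) { return new AttributeList(); }
    public AttributeList setAttributes(AttributeList l) { return new AttributeList(); }
    public Object invoke(String n, Object[] p, String[] s) { return null; }
    public MBeanInfo getMBeanInfo() {
        MBeanAttributeInfo stale = new MBeanAttributeInfo(
                "sourceId", "java.lang.String", "stale static stat", true, false, false);
        return new MBeanInfo(getClass().getName(), "demo",
                new MBeanAttributeInfo[]{stale}, null, null, null);
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("demo:type=documentCache");
        server.registerMBean(new StaleAttributeMBean(), name);
        try {
            // Tomcat's jmxproxy does the equivalent of this for every
            // attribute listed in the MBeanInfo - hence the logged error.
            server.getAttribute(name, "sourceId");
        } catch (AttributeNotFoundException e) {
            System.out.println("AttributeNotFoundException: " + e.getMessage());
        }
    }
}
```

This is why simply deleting the staticStats.add(sourceId) line (as SOLR-4863 does) makes the error disappear: the attribute is no longer advertised, so jmxproxy never asks for it.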
Solr 4.3.0 geo search with multiple coordinates
Hi Solr experts, I have a Solr 4.3 schema:

<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" geo="true" distErrPct="0.025" maxDistErr="0.09" units="degrees"/>
<field name="location_geo" type="location" indexed="true" stored="true" multiValued="true"/>

and XML data:

<field name="location_geo">51.1164,6.9612</field>
<field name="location_geo">52.3473,9.77564</field>

If I run this query: fq={!geofilt pt=51.11,6.9 sfield=location_geo d=20} I get no result. But if I remove the second geo line and only have this one coordinate, it works:

<field name="location_geo">51.1164,6.9612</field>

Thus it seems that the multi-valued index does not work, even though Solr returns the doc values as:

<arr name="location_geo"><str>51.1164,6.9612</str><str>52.3473,9.77564</str></arr>

Is my schema wrongly configured? Thanks Ericz
Re: Prevention of heavy wildcard queries
Hi Isaac, it is as you say, with the exception that you create a QParserPlugin, not a search component:

* create a QParserPlugin and give it some name, e.g. 'nw'
* make a copy of the pipeline - your component should be at the same place as, or just above, the wildcard processor

Also make sure you are setting your qparser for FQ queries, i.e. fq={!nw}foo

On Mon, May 27, 2013 at 5:01 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

> Thanks Roman. Based on some of your suggestions, will the steps below do the work? [...]
> Do I have a way to avoid reimplementing the whole StandardQueryParser class?

you can try subclassing it, if it allows it

> Will this work for both LuceneQParser and EdismaxQParser queries?

this will not work for edismax; nothing but changing the edismax qparser will do the trick

> Any other solution/work-around? How do other production environments of Solr overcome this issue?

you can also try modifying the standard Solr parser, or even the JavaCC generated classes. I believe many people do just that (or some sort of preprocessing)

roman

On Mon, May 27, 2013 at 10:15 PM, Roman Chyla roman.ch...@gmail.com wrote:

> You are right that starting to parse the query before the query component can soon get very ugly and complicated. You should take advantage of the flex parser [...]
Re: Solr 4.3.0 geo search with multiple coordinates
I think I found the reason/bug - the type was wrong (the field was declared with type="location" instead of the location_rpt type actually defined in the schema). It should be:

<field name="location_geo" type="location_rpt" indexed="true" stored="true" multiValued="true"/>

On Tue, May 28, 2013 at 1:37 AM, Eric Grobler impalah...@googlemail.com wrote:

> Hi Solr experts, I have a Solr 4.3 schema with a solr.SpatialRecursivePrefixTreeFieldType and a multiValued location_geo field. If I run fq={!geofilt pt=51.11,6.9 sfield=location_geo d=20} I get no result, but it works with only one coordinate indexed. [...]
RE: sourceId of JMX
Shalin, We tried it after removing staticStats.add(sourceId), and it seems to run with no problem. Do you know of any other side effects from removing it? Regards suganuma

-Original Message- From: 菅沼 嘉一 [mailto:yo_sugan...@waku-2.com] Sent: Tuesday, May 28, 2013 9:30 AM To: solr-user@lucene.apache.org Subject: RE: sourceId of JMX

> Thank you, Shalin. I'll see it.

-Original Message- From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] Sent: Monday, May 27, 2013 11:11 PM To: solr-user@lucene.apache.org Subject: Re: sourceId of JMX

> I opened https://issues.apache.org/jira/browse/SOLR-4863 [...]

-- Regards, Shalin Shekhar Mangar.
Re: Prevention of heavy wildcard queries
I don't want to affect the (correctness of the) real query parsing, so creating a QParserPlugin is risky. Instead, if I parse the query in my own search component, it will be detached from the real query parsing (obviously this causes double parsing, but assume that's OK)...

On Tue, May 28, 2013 at 3:52 AM, Roman Chyla roman.ch...@gmail.com wrote:

> Hi Isaac, it is as you say, with the exception that you create a QParserPlugin, not a search component: create a QParserPlugin and give it some name, e.g. 'nw'; make a copy of the pipeline - your component should be at the same place as, or just above, the wildcard processor. Also make sure you are setting your qparser for FQ queries, i.e. fq={!nw}foo [...]
Re: [blog post] Automatically Acquiring Synonym Knowledge from Wikipedia
Hello Koji, This seems like a pretty useful post on how to create a synonyms file. Thanks a lot for sharing it! Have you shared the source code / jar as well, so that it could be reused? Thanks, Rajesh

On Mon, May 27, 2013 at 8:44 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

> Hello, Sorry for the cross post. I just wanted to announce that I've written a blog post on how to create a synonyms.txt file automatically from Wikipedia: http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html Hope the article gives someone a good experience! koji -- http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html
Re: sourceId of JMX
Suganuma, No, there shouldn't be any side effects.

On Tue, May 28, 2013 at 7:13 AM, 菅沼 嘉一 yo_sugan...@waku-2.com wrote:

> Shalin, We tried it after removing staticStats.add(sourceId), and it seems to run with no problem. Do you know of any other side effects from removing it? Regards suganuma [...]

-- Regards, Shalin Shekhar Mangar.