external file field and fl parameter
I am playing with an external file field for sorting. I created a dynamic field using the ExternalFileField type. I naively assumed that the fl argument would allow me to return the value of the external field, but it doesn't seem to do so. For instance, I have defined a dynamic field *_efloat and then used: sort=foo_efloat desc&fl=foo_efloat,score,description. I get the score and description, but foo_efloat seems to be missing in action. Thoughts? C
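For context, a typical ExternalFileField setup looks something like this (a sketch following the stock example schema; the type and field names here are illustrative, not taken from the post). The per-document values live in a file named external_<fieldname> (an extension such as .txt also works) in the core's data directory, keyed by the unique key field:

<fieldType name="efloat" class="solr.ExternalFileField" keyField="id" defVal="0" valType="pfloat"/>
<dynamicField name="*_efloat" type="efloat"/>

# data/external_foo_efloat.txt -- one uniqueKey=value pair per line
doc1=1.5
doc2=3.25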
Getting indexed content of files using ExtractingRequestHandler
Hi, I'm using the PHP Solr client (ver: 1.0.2). I'm indexing content from my database. Suppose $data is a stdClass object having id, name, title, etc. from a database entry. Next, I declare a Solr document and assign fields to it:

$doc = new SolrInputDocument();
$doc->addField('id', $data->id);
$doc->addField('name', $data->name);

I wanted to know how I can store the contents of a PDF file (whose path I've stored in $data->filepath) in the same Solr document, say in a field ('filecontent'). Referring to the wiki, I was unable to figure out the proper cURL request for achieving this. I was able to create a completely new Solr document, but how do I get the contents of the PDF file into the same Solr document so that I can store it in a field?

$doc = new SolrInputDocument();
$doc->addField('id', $data->id);
$doc->addField('name', $data->name);
// fire the cURL request here, referring to the file at $data->filepath
$doc->addField('filecontent', /* content of the PDF file */);

Also, instead of firing the raw cURL request, is there a better way? I don't know if the current PECL Solr client 1.0.2 has the feature of indexing PDF files.
Re: Problem using Term Component in solr
Hi, the vocabulary is not known, that's the main issue; otherwise I would implement synonyms instead. What do you mean by 'regularizing the title'? Please let me know some solution...
How to exclude a specific “Tag” from a Solr facet?
Solr 4.3. These are my query request params:

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">15</int>
  <lst name="params">
    <str name="facet">true</str>
    <str name="indent">true</str>
    <str name="q">*:*</str>
    <str name="_">1373713374569</str>
    <arr name="facet.field">
      <str>{!ex=city}CityId</str>
      <str>{!ex=company}CompanyId</str>
    </arr>
    <str name="wt">xml</str>
    <str name="fq">{!tag=city}CityId:729 AND {!tag=company}CompanyId:16122</str>
  </lst>
</lst>

This is the facet content of the query response:

<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="CityId">
      <int name="11">100171</int>
      <int name="1404">89406</int>
      <int name="0">77477</int>
      <int name="1366">65780</int>
      <int name="1362">58092</int>
      <int name="729">29213</int>
      <int name="798">28975</int>
      ...
      <int name="7262">808</int>
      <int name="432">776</int>
      <int name="1146">772</int>
      <int name="1653">765</int>
      <int name="1078">668</int>
      <int name="814">667</int>
      <int name="2049">402</int>
      <int name="456">401</int>
      <int name="401">390</int>
    </lst>
    <lst name="CompanyId">
      <int name="16122">971</int>
      <int name="69">0</int>
      <int name="71">0</int>
      <int name="72">0</int>
      <int name="79">0</int>
      <int name="80">0</int>
      <int name="85">0</int>
      <int name="88">0</int>
      <int name="94">0</int>
      ...
      <int name="98">0</int>
      <int name="104">0</int>
      <int name="112">0</int>
      <int name="113">0</int>
      <int name="118">0</int>
      <int name="123">0</int>
      <int name="126">0</int>
      <int name="131">0</int>
      <int name="136">0</int>
      <int name="139">0</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
  <lst name="facet_ranges"/>
</lst>

You can see that the CityId facet is correct: it excludes the {!tag=city}CityId:729 filter. But the CompanyId facet is not correct: it did not exclude the {!tag=company}CompanyId:16122 filter. How can I solve this?
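For reference, local params such as {!tag=city} apply to the entire fq parameter they start, so with both conditions inside one fq only that single "city" tag exists. A common way to make each filter independently excludable is to send each condition as its own tagged fq (a sketch using the values from the question):

fq={!tag=city}CityId:729
fq={!tag=company}CompanyId:16122
facet.field={!ex=city}CityId
facet.field={!ex=company}CompanyId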
Re: Does Solrj Batch Processing Querying May Confuse?
Well, if you can find one of the docs, or you know one of the IDs that's missing, try explainOther, see: http://wiki.apache.org/solr/CommonQueryParameters#explainOther Best Erick On Fri, Jul 12, 2013 at 8:29 AM, Furkan KAMACI furkankam...@gmail.com wrote: I've crawled some webpages and indexed them at Solr. I've queried data at Solr via Solrj. url is my unique field and I've defined my query like that:

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("q", "lang:tr");
params.set("fl", "url");
params.set("sort", "url desc");

I've run my program to query 1000 rows at each query and wrote them to a file. However, I realized that there are some documents that are indexed at Solr (I can query them from the admin page, just not from Solrj as part of the 1000-row batch process) but are not in my file. What may be the problem?
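A minimal SolrJ sketch of Erick's suggestion (the Solr URL and the url value for the missing document are made-up examples): run the same query with debugQuery turned on and point explainOther at one document you know is missing.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ExplainMissingDoc {
    public static void main(String[] args) throws Exception {
        // adjust host/core for your installation
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("q", "lang:tr");
        params.set("fl", "url");
        params.set("sort", "url desc");
        params.set("debugQuery", "true");
        // explain the match (or non-match) for one specific missing document
        params.set("explainOther", "url:\"http://example.com/missing-page\"");
        QueryResponse rsp = solr.query(params);
        System.out.println(rsp.getDebugMap().get("explainOther"));
    }
}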
Re: Multiple queries or Filtering Queries in Solr
Isn't this just a filter query (fq=)? Something like q=query2&fq=query1. Although I don't quite understand the 500 > 50 part, you can always tack on additional fq clauses; it's basically set intersection. As for limiting the results a user sees, that's what the rows parameter is for. So another way of looking at this is: can you form a query that expresses the use-case and just show the top N (in this case 50)? Does that work? Best Erick On Fri, Jul 12, 2013 at 10:44 AM, dcode darshan.bengal...@gmail.com wrote: My problem is I have n fields (say around 10) in Solr that are searchable; they are all indexed and stored. I would like to run a query first on my whole index of say 5000 docs, which will hit around an average of 500 docs. Next I would like to query using a different set of keywords on these 500 docs and NOT on the whole index. So the first time I send a query, a score will be generated; the second time I run a query, the new score generated should be based on the 500 documents of the previous query, or in other words Solr should consider only these 500 docs as the whole index. To summarise: an index of 5000 will be filtered to 500 and then 50 (5000 > 500 > 50). It's basically filtering, but I would like to do this in Solr. I have reasonable basic knowledge and am still learning. Update: represented mathematically it would look like this: results1=f(query1) results2=f(query2, results1) final_results=f(query3, results2) I would like to accomplish this in a program, and the end-user will only see 50 results. So faceting is not an option.
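A small SolrJ sketch of what Erick describes (query1/query2/query3 are the placeholders from the question): the earlier stages become filter queries, only the last query is scored, and rows caps what the user sees.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.ModifiableSolrParams;

public class CascadedFilters {
    // query3 is scored; query1/query2 only filter (set intersection), so the
    // final top 50 is "query3 within the docs matched by query1 and query2".
    static QueryResponse topFifty(SolrServer solr) throws SolrServerException {
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("q", "query3");
        params.add("fq", "query1");   // 5000 -> ~500
        params.add("fq", "query2");   // 500 -> ~50
        params.set("rows", 50);       // show the user at most 50 results
        return solr.query(params);
    }
}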
Re: add to ContributorsGroup - Instructions for setting up SolrCloud on jboss
Done, sorry it took so long, hadn't looked at the list in a couple of days. Erick On Fri, Jul 12, 2013 at 5:46 PM, Ali, Saqib docbook@gmail.com wrote: username: saqib On Fri, Jul 12, 2013 at 2:35 PM, Ali, Saqib docbook@gmail.com wrote: Hello, Can you please add me to the ContributorsGroup? I would like to add instructions for setting up SolrCloud using Jboss. thanks.
Re: Custom processing in Solr Request Handler plugin and its debugging ?
Not sure how to do the pass to another request handler thing, but the debugging part is pretty straightforward. I use IntelliJ, but as far as I know Eclipse has very similar capabilities. First, I cheat and path to the jar that's the output from my IDE, which saves copying the jar around. So my solrconfig.xml file has a lib directive like ../../../../../eoe/project/out/artifact/jardir, where this is wherever your IDE wants to put it. It can sometimes be tricky to get enough ../../../ in there. Second, edit configurations and select remote; a form comes up. Fill in host and port, something like localhost and 5900 (the latter is whatever port you want). IntelliJ will then give you the specific command to use to start Solr so you can attach. It looks like the following for my setup: java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=5900 -jar start.jar Now just fire up Solr as above, fire up your remote debugging session in IntelliJ, and set breakpoints as you wish. NOTE: the suspend=y bit above means that Solr will do _nothing_ until you attach the debugger and hit go. HTH Erick On Sat, Jul 13, 2013 at 6:57 AM, Tony Mullins tonymullins...@gmail.com wrote: Please, any help on how to pass the search request to a different RequestHandler from within the custom RequestHandler, and how to debug the custom RequestHandler plugin? Thanks, Tony On Fri, Jul 12, 2013 at 4:41 PM, Tony Mullins tonymullins...@gmail.com wrote: Hi, I have defined my new Solr RequestHandler plugin like this in solrconfig.xml: <requestHandler name="/myendpoint" class="com.abc.MyRequestPlugin"/> And it's working fine. Now I want to do some custom processing from this plugin by making a search query to the regular '/select' handler: <requestHandler name="/select" class="solr.SearchHandler"/> And then receive the results back from the '/select' handler, perform some custom processing on those results, and send the response back to my custom /myendpoint handler. For this I need help on how to make a call to the '/select' handler from within the MyRequestPlugin class and perform some calculation on the results. I also need some help on how to debug my plugin. As its .jar is deployed to solr_home/lib, how can I attach my plugin's code in Eclipse to the Solr process so I can debug it when a user sends a request to my plugin? Thanks, Tony
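For reference, the lib directive Erick mentions is ordinary solrconfig.xml syntax; it would look something like this (the path is the one from his setup; point it at your own IDE's output directory):

<config>
  <!-- load the plugin jar straight from the IDE's build output -->
  <lib dir="../../../../../eoe/project/out/artifact/jardir" />
  ...
</config>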
Re: external file field and fl parameter
Did you store the field? I.e. set stored=true? And does the EFF contain values for the docs you're returning? Best Erick On Sun, Jul 14, 2013 at 3:32 AM, Chris Collins ch...@geekychris.com wrote: I am playing with external file field for sorting. ...
Re: Getting indexed content of files using ExtractingRequestHandler
Well, cURL is generally not what people use for production. What I'd consider is using SolrJ (which you can access Tika from) and then storing the raw PDF (or whatever) document as a binary data type in Solr. Here's an example (with DB indexing mixed in, but you should be able to pull that part out). Best Erick On Sun, Jul 14, 2013 at 4:05 AM, xan p...@prateeksachan.com wrote: Hi, I'm using the PHP Solr client (ver: 1.0.2). I'm indexing content from my database. ...
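A minimal sketch of the SolrJ-plus-Tika approach Erick describes (the field names match the question; the Solr URL, document values and file path are made-up): extract the PDF text client-side with Tika, then add it as an ordinary field alongside the database fields.

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class IndexPdfWithTika {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "42");
        doc.addField("name", "some name from the database");

        // extract the PDF body text client-side with Tika
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no size limit
        try (InputStream in = new FileInputStream(new File("/path/to/file.pdf"))) {
            parser.parse(in, handler, new Metadata());
        }
        doc.addField("filecontent", handler.toString());

        solr.add(doc);
        solr.commit();
    }
}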
Re: Problem using Term Component in solr
by regularizing the title I meant either indexing and searching exactly: Medical Engineering and Physics or Medical Eng. and Phys. Or you could remove the stopwords yourself at both index and query time, which would fix your Physics of Fluids example. The problem here is that you'll be forever fiddling with this and getting it _almost_ right, then the next anomaly will happen Siiigh You might actually be much better off with an ngram or edgeNgram approach. You'd probably want to tokenize the titles, and perhaps auto-generate phrase queries... Best Erick On Sun, Jul 14, 2013 at 7:30 AM, Parul Gupta(Knimbus) parulgp...@gmail.com wrote: Hi, the vocabulary is not known, that's the main issue; otherwise I would implement synonyms instead. ...
Re: Getting indexed content of files using ExtractingRequestHandler
Sorry, but did you forget to send me the example's link?
Re: SolrCloud leader
The problem is that I don't want to invoke data import on 8 server nodes but to choose only one for scheduling. Of course, if this server shuts down then another one needs to take the scheduler role. I can see that there is a task for scheduling, https://issues.apache.org/jira/browse/SOLR-2305 ; I hope they will take SolrCloud into account. That's why I wanted to know if the current node is *currently* elected as the leader: the leader would be the scheduler. In the meanwhile, any ideas of how to solve data import scheduling on a SolrCloud architecture? Kowish
Re: HTTP Status 503 - Server is shutting down
Hi Shawn, I'm also getting the HTTP Status 503 - Server is shutting down error when navigating to http://localhost:8080/solr-4.3.1/ I already copied the logging.properties file from C:\Dropbox\Databases\solr-4.3.1\example\etc to C:\Dropbox\Databases\solr-4.3.1\example\lib Here's my Tomcat console log:

14-jul-2013 14:21:57 org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: C:\Program Files\Apache Software Foundation\Tomcat 6.0\bin;C:\Windows\Sun\Java\bin;C:\Windows\system32;C:\Windows;C:\Program Files\Common Files\Microsoft Shared\Windows Live;C:\Program Files (x86)\Common Files\Microsoft Shared\Windows Live;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\TortoiseSVN\bin;c:\msxsl;C:\Program Files (x86)\Windows Live\Shared;C:\Program Files\Microsoft\Web Platform Installer\;C:\Program Files (x86)\Microsoft ASP.NET\ASP.NET Web Pages\v1.0\;C:\Program Files (x86)\Windows Kits\8.0\Windows Performance Toolkit\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files\Microsoft SQL Server\110\DTS\Binn\;C:\Program Files (x86)\Microsoft SQL Server\110\Tools\Binn\ManagementStudio\;C:\Program Files (x86)\Microsoft SQL Server\110\DTS\Binn\;C:\Program Files (x86)\Java\jre6\bin;C:\Program Files\Java\jre631\bin;.
14-jul-2013 14:21:57 org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-8080
14-jul-2013 14:21:57 org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 283 ms
14-jul-2013 14:21:57 org.apache.catalina.core.StandardService start
INFO: Starting service Catalina
14-jul-2013 14:21:57 org.apache.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.37
14-jul-2013 14:21:57 org.apache.catalina.startup.HostConfig deployDescriptor
INFO: Deploying configuration descriptor manager.xml
14-jul-2013 14:21:57 org.apache.catalina.startup.HostConfig deployWAR
INFO: Deploying web application archive solr-4.3.1.war
log4j:WARN No appenders could be found for logger (org.apache.solr.servlet.SolrDispatchFilter).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14-jul-2013 14:21:58 org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory ROOT
14-jul-2013 14:21:58 org.apache.coyote.http11.Http11Protocol start
INFO: Starting Coyote HTTP/1.1 on http-8080
14-jul-2013 14:21:58 org.apache.jk.common.ChannelSocket init
INFO: JK: ajp13 listening on /0.0.0.0:8009
14-jul-2013 14:21:58 org.apache.jk.server.JkMain start
INFO: Jk running ID=0 time=0/55 config=null
14-jul-2013 14:21:58 org.apache.catalina.startup.Catalina start
INFO: Server startup in 719 ms
Re: external file field and fl parameter
Yep, I did switch on stored=true in the field type. I was able to confirm that there are values for the EFF in two ways: 1) changing desc to asc produced drastically different results; 2) debugging FileFloatSource, the following was getting triggered, filling the vals array:

while ((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
    vals[doc] = fval;
}

At least by you asking these questions I guess it should work. I will continue dissecting. Thanks Erick. C On Jul 14, 2013, at 5:16 AM, Erick Erickson erickerick...@gmail.com wrote: Did you store the field? I.e. set stored=true? And does the EFF contain values for the docs you're returning? ...
Re: SolrCloud leader
In theory, each of the nodes uses the same configuration, right? So, in theory, ANY of the nodes can do a DIH import. It is only way down low in the update processing chain that an individual Solr input document needs to have its key hashed and the request routed to the leader of the appropriate shard. In short, YOU decide whichever node YOU want the DIH import to run on, and Solr will automatically take care of the actual distribution of individual document update requests. If you want to pick a leader node, fine, but there is no requirement or need that you do so. Scheduling is currently outside of the scope of Solr and SolrCloud. -- Jack Krupansky -Original Message- From: kowish.adamosh Sent: Sunday, July 14, 2013 8:42 AM To: solr-user@lucene.apache.org Subject: Re: SolrCloud leader The problem is that I don't want to invoke data import on 8 server nodes but to choose only one for scheduling. ...
Re: Getting indexed content of files using ExtractingRequestHandler
Right, sorry... http://searchhub.org/dev/2012/02/14/indexing-with-solrj/ On Sun, Jul 14, 2013 at 8:31 AM, xan p...@prateeksachan.com wrote: Sorry, but did you forget to send me the example's link?
ACL implementation: Pseudo-join performance & Atomic Updates
Hello all, Situation: we have a collection of files in SOLR with ACLs applied: each file has a multi-valued field that contains the list of userIDs that can read it. Here is sample data:

Id | content   | userId
1  | text text | 4,5,6,2
2  | text text | 4,5,9
3  | text text | 4,2

Problem: when the ACL is changed for a big folder, we compute the ACL for all child items and reindex in SOLR using atomic updates (updating only the 'userIds' column), but because it deletes/reindexes the record, the performance is very poor. Question: I suppose the delete/reindex approach will not change soon (probably it's due to the actual SOLR architecture)? Possible solution: assuming atomic updates will be super fast on an index without fulltext, keep a separate ACLIndex and FullTextIndex and use pseudo-joins. Example: searching 'foo' as user '999':

/solr/FullTextIndex/select/?q=foo&fq={!join fromIndex=ACLIndex from=Id to=Id}userId:999

Question: what about performance here? What if the index is 100,000 records? Notice that the worst situation is when everyone has access to all the files; it means the first filter will be the full index. I would be happy to get any links that deal with the issue of pseudo-join performance for large datasets (i.e. an initial filtered set of IDs). Regards, Oleg P.S. We found that having the list of all users that have access for each record is better overall, because there are many more read requests (people accessing the library) than write requests (a new user is added/removed).
Re: Getting indexed content of files using ExtractingRequestHandler
Thanks for the link. Also, having gone quite far with my work using the PHP Solr client, is there anything that could be done using the PHP Solr client only?
Re: ACL implementation: Pseudo-join performance & Atomic Updates
Take a look at LucidWorks Search and its access control: http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control Role-based security is an easier nut to crack. Karl Wright of ManifoldCF had a Solr patch for document access control at one point: SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing ManifoldCF security at search time https://issues.apache.org/jira/browse/SOLR-1895 http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011 For some other thoughts: http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security I'm not sure if external file fields will be of any value in this situation. There is also a proposal for bitwise operations: SOLR-1913 - QParserPlugin plugin for Search Results Filtering Based on Bitwise Operations on Integer Fields https://issues.apache.org/jira/browse/SOLR-1913 But the bottom line is that clearly updating all documents in the index is a non-starter. -- Jack Krupansky -Original Message- From: Oleg Burlaca Sent: Sunday, July 14, 2013 11:02 AM To: solr-user@lucene.apache.org Subject: ACL implementation: Pseudo-join performance & Atomic Updates Hello all, Situation: we have a collection of files in SOLR with ACLs applied: each file has a multi-valued field that contains the list of userIDs that can read it. ...
Re: Getting indexed content of files using ExtractingRequestHandler
I'm completely ignorant of all things PHP, including the state of any Solr client code, so I'm afraid I can't help with that... Best Erick On Sun, Jul 14, 2013 at 11:03 AM, xan p...@prateeksachan.com wrote: Thanks for the link. Also, having gone quite far with my work using the PHP Solr client, is there anything that could be done using the PHP Solr client only?
Re: ACL implementation: Pseudo-join performance & Atomic Updates
Join performance is most sensitive to the number of values in the field being joined on. So if you have lots and lots of distinct values in the corpus, join performance will be affected. bq: I suppose the delete/reindex approach will not change soon There is ongoing work (search the JIRA for Stacked Segments) on actually doing something about this, but it's been under consideration for at least 3 years, so your guess is as good as mine. bq: notice that the worst situation is when everyone has access to all the files, it means the first filter will be the full index. One way to deal with this is to implement a post filter, sometimes called a no cache filter. The distinction here is that 1) it is not cached (duh!), 2) it is only called for documents that have made it through all the other lower-cost filters (and the main query of course), and 3) lower cost means the filter is either a standard, cached filter or any no-cache filter with a cost (explicitly stated in the query) lower than this one's. Critically, and unlike normal filter queries, the result set is NOT calculated for all documents ahead of time. You _still_ have to deal with the sysadmin doing a *:* query, as you are well aware. But one can mitigate that by having the post-filter fail all documents after some arbitrary N, and display a message in the app like too many documents, man. Please refine your query. Partial results below. Of course this may not be acceptable, but HTH Erick On Sun, Jul 14, 2013 at 12:05 PM, Jack Krupansky j...@basetechnology.com wrote: Take a look at LucidWorks Search and its access control: ...
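For reference, the cost and caching Erick mentions are expressed as local params on the filter query; a sketch using the userId filter from earlier in this thread:

fq={!cache=false cost=150}userId:999

A non-cached fq with cost of 100 or more runs as a true post filter when the query type supports it (the {!frange} parser does, as do custom PostFilter implementations), i.e. it only sees documents that survived the main query and all cheaper filters.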
Re: HTTP Status 503 - Server is shutting down
On 7/14/2013 6:43 AM, PeterKerk wrote: Hi Shawn, I'm also getting the HTTP Status 503 - Server is shutting down error when navigating to http://localhost:8080/solr-4.3.1/ snip INFO: Deploying web application archive solr-4.3.1.war log4j:WARN No appenders could be found for logger (org.apache.solr.servlet.SolrDispatchFilter). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. The logging.properties file is used for JDK logging, which was the default in Solr prior to version 4.3.0. In older versions, jarfiles were embedded in the .war file that set up slf4j to use java.util.logging, also known as JDK logging because this logging framework comes with Java. Solr 4.3.0 and later does not have ANY slf4j jarfiles in the .war file, so you need to put them in your classpath. Jarfiles are included in the example, in example/lib/ext, and those jarfiles set up logging to use log4j, a much more flexible logging framework than JDK logging. JDK logging is typically set up with a file called logging.properties, which I think you must use a system property to configure. You aren't using JDK logging; you are using log4j, which uses a file called log4j.properties. http://wiki.apache.org/solr/SolrLogging#Using_the_example_logging_setup_in_containers_other_than_Jetty It appears that you have followed part of the instructions above and copied jars from example/lib/ext to a lib directory on your classpath. Now if you copy example/resources/log4j.properties to the same place, logging should work. It will not log to the Tomcat log; it will log to the location specified in log4j.properties, which by default is logs/solr.log relative to the current working directory. As I already said on this thread, if you want Tomcat to be in control of the logging, you must switch back to java.util.logging as described in the wiki: http://wiki.apache.org/solr/SolrLogging#Switching_from_Log4J_back_to_JUL_.28java.util.logging.29 Thanks, Shawn
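For reference, the log4j.properties shipped in example/resources is a small file along these lines (a sketch from memory; use the copy in your download as the source of truth):

# where the rolling log file goes, relative to the current working directory
solr.log=logs
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=${solr.log}/solr.log
log4j.appender.file.MaxFileSize=4MB
log4j.appender.file.MaxBackupIndex=9
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m%n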
Re: ACL implementation: Pseudo-join performance & Atomic Updates
Hello Jack, Thanks for so many links; my comments are below. I'll rephrase all my questions in one: how do you implement a DAC (Discretionary Access Control) similar to the Windows OS using SOLR? What we have: a hierarchical filesystem, users and groups, permissions applied at the level of a file/folder. What we need: full-text search restricting access based on ACLs. How to deal with a change in permissions for a big folder? How to check if the user can delete a folder? (It means he should have write access to all files/sub-folders.)

Role-based security is an easier nut to crack
Yep, but we need DAC :(

http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
The documentation doesn't reveal what happens when content should be reindexed, although the last chapter, Document-based Authorization, shows the same approach: a user list specified at the level of the document.

Karl Wright of ManifoldCF had a Solr patch for document access control at one point: SOLR-1895 https://issues.apache.org/jira/browse/SOLR-1895
It states LCF SearchComponent which filters returned results based on access tokens provided by LCF's authority service. That means filtering is applied on the results only. Issues: faceting doesn't work correctly (i.e. the counting), because the filter isn't applied yet. Even worse: you have to scroll through the result set until you find records accessible by the user (what if the user has access to 10 of 100,000 files?).

http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
Page 9 says docs and access tokens. Separate bins for allow tokens, deny tokens for file. It's similar to the approach we use: each record in SOLR has two fields, readAccess and writeAccess; both are multivalued fields with userIds. This allows us to quickly delete a bunch of items the user has access to, for example (or to check a hierarchical delete).

http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
It works by adding security tokens from the source repositories as metadata on the indexed documents. Again, the permission info is stored within the record itself, and if we change access for a big folder, it means reindexing.

https://issues.apache.org/jira/browse/SOLR-1913
Thanks for the link; I need to meditate on whether I can find a way to use it.

But the bottom line is that clearly updating all documents in the index is a non-starter.
I have scratched my head, and have been monitoring SOLR features for a long time, trying to find something I can use. Today I watched Yonik Seeley's video: http://vimeopro.com/user11514798/apache-lucene-eurocon-2012/video/55387447 and found PSEUDO-JOINS, nice. This seems a perfect solution: I can have two indexes, one with full text and another one with objId and userIds; the second one should be fast to update, I hope. But the question is: what about performance? Regards On Sun, Jul 14, 2013 at 7:05 PM, Jack Krupansky j...@basetechnology.com wrote: Take a look at LucidWorks Search and its access control: ...
Re: SolrCloud leader
On 7/14/2013 6:42 AM, kowish.adamosh wrote: The problem is that I don't want to invoke data import on 8 server nodes but to choose only one for scheduling. Of course if this server will shut down then another one needs to take the scheduler role. I can see that there is task for sheduling https://issues.apache.org/jira/browse/SOLR-2305 . I hope they will take into account SolrCloud. And that's why I wanted to know if current node is *currently* elected as the leader. The leader would be the scheduler. In the meanwhile, any ideas of how to solve data import scheduling on SolrCloud architecture? As Jack already replied, this is outside the scope of Solr. SOLR-2305 has been around for a VERY long time. Adding scheduling capability to the dataimport handler is not very hard, but nobody has done so because we do not believe this is something Solr should be handling. Also, it's easy to get something wrong, so users can run into bugs that would break their scheduling. Every operating system has scheduling capability. Windows has the task scheduler. On virtually all other operating systems, you'll find cron. These systems have had years of operation for their authors to work out the bugs, and they are VERY solid. We would not be able to make the same robustness guarantee if we included scheduling in Solr. Additionally, we really want to be sure that Solr never does anything on its own that has not been specifically requested by a user or program, or through certain external events such as a hardware or software failure. For my own multi-server Linux Solr installation, which doesn't use SolrCloud even though it's got two complete copies of the index and uses shards, I have worked out how to do clustered scheduling. I have a corosync/pacemaker cluster set up on my servers, which ensures that only one copy of my cronjobs is running on the cluster. If a server dies, it will start up the cronjobs on another server. Thanks, Shawn
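For reference, once you have picked a node, the scheduling itself can be a plain cron entry that calls the DataImportHandler over HTTP (host, core name and schedule here are illustrative):

# full import every night at 02:00, on this node only
0 2 * * * curl -s "http://localhost:8983/solr/collection1/dataimport?command=full-import" > /dev/null

A cluster manager like the corosync/pacemaker setup Shawn describes then only has to ensure this one crontab is active on a single live server at a time.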
Re: external file field and fl parameter
On 7/14/2013 7:05 AM, Chris Collins wrote: Yep I did switch on stored=true in the field type. I was able to confirm a few ways that there are values for the eff by two methods: 1) changing desc to asc produced drastically different results. 2) debugging FileFloatSource the following was getting triggered filling the vals array: while ((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { vals[doc] = fval; } At least by you asking these questions I guess it should work. I will continue dissecting. Did you reindex when you changed the schema? Sorting uses indexed values, not stored values. The fl parameter requires the stored values. These are separate within the index, and one cannot substitute for the other. If you didn't reindex, then you won't have the stored values for existing documents. http://wiki.apache.org/solr/HowToReindex Thanks, Shawn
Apache Solr 4 - after 1st commit the index does not grow
I have written my own plugin for Apache Nutch 2.2.1 to crawl images, videos and podcasts from selected sites (I have 180 URLs in my seed). I put this metadata in an HBase store and now I want to save it to the index (Solr). I have a lot of metadata to save (webpages + images + videos + podcasts). I am using the Nutch script bin/crawl for the whole process (inject, generate, fetch, parse... and finally solrindex and dedup), but I have one problem. When I run this script for the first time, approximately 6000 documents are stored in the index (let's say 3700 docs for images, 1700 for webpages, and the rest for videos and podcasts). That is OK... but... When I run the script a second time, third time and so on, the index does not increase its number of documents (there are still 6000 documents), but the count of rows stored in the HBase table grows (there are 97383 rows now)... Do you know where the problem is? I have been fighting this problem for a really long time and I don't know... If it helps, this is my configuration of solrconfig.xml http://pastebin.com/uxMW2nuq and this is my nutch-site.xml http://pastebin.com/4bj1wdmT
Re: external file field and fl parameter
Why would I be re-indexing an external file field? The whole purpose is that it's brought in at runtime and not part of the index? C On Jul 14, 2013, at 10:13 AM, Shawn Heisey s...@elyograg.org wrote: Did you reindex when you changed the schema? Sorting uses indexed values, not stored values. ...
Re: ACL implementation: Pseudo-join performance & Atomic Updates
Hello Erick,

Join performance is most sensitive to the number of values in the field being joined on. So if you have lots and lots of distinct values in the corpus, join performance will be affected.
Yep, we have a list of unique IDs that we get by first searching for records where loggedInUser IS IN (userIDs). This corpus is stored in memory, I suppose (not a problem), and then the bottleneck is matching this huge set against the core where I'm searching? Somewhere in the mailing list archive people were talking about an external list of Solr unique IDs, but I didn't find whether there is a solution. Back in 2010 Yonik posted a comment: http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd

bq: I suppose the delete/reindex approach will not change soon
There is ongoing work (search the JIRA for Stacked Segments)
Ah, OK, I had a feeling it affects the architecture; now the only hope is pseudo-joins.

One way to deal with this is to implement a post filter, sometimes called a no cache filter.
Thanks, I will have a look, but as you describe it, it's not the best option. Does the 'too many documents, man. Please refine your query. Partial results below' approach mean that faceting will not work correctly?

I have in mind a hybrid approach, comments welcome: most of the time users are not searching but browsing content, so our virtual filesystem stored in SOLR will use only the index with the Id of the file and the list of users that have access to it, i.e. not touching the fulltext index at all. Files may have metadata (EXIF info for images, for example) that we'd like to filter by and calculate facets on. Metadata will be stored in both indexes. In the case of a fulltext query:
1. Search the FT index (the fulltext index) and get only the number of search results; let it be Rf.
2. Search the DAC index (the index with permissions) and get the number of search results; let it be Rd.
Let maxR be the maximum size of the corpus for the pseudo-join. *That was actually my question: what is a reasonable number? 10, 100, 1000?*
If (Rf < maxR) or (Rd < maxR), then use the smaller corpus to join onto the second one. This happens when (only a few documents contain the search query) OR (the user has access to a small number of files). In case neither of these happens, we can use the 'too many documents, man. Please refine your query. Partial results below' approach, but searching the FT index first, because we want relevant results first. What do you think? Regards, Oleg On Sun, Jul 14, 2013 at 7:42 PM, Erick Erickson erickerick...@gmail.com wrote: Join performance is most sensitive to the number of values in the field being joined on. ...
Re: external file field and fl parameter
Hi Chris, Try wrapping the field name in a field() function in your fl parameter list, like so: fl=field(eff_field_name) Alan Woodward www.flax.co.uk On 14 Jul 2013, at 18:41, Chris Collins wrote: Why would I be re-indexing an external file field? The whole purpose is that its brought in at runtime and not part of the index? C On Jul 14, 2013, at 10:13 AM, Shawn Heisey s...@elyograg.org wrote: On 7/14/2013 7:05 AM, Chris Collins wrote: Yep I did switch on stored=true in the field type. I was able to confirm a few ways that there are values for the eff by two methods: 1) changing desc to asc produced drastically different results. 2) debugging FileFloatSource the following was getting triggered filling the vals array: while ((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { vals[doc] = fval; } At least by you asking these questions I guess it should work. I will continue dissecting. Did you reindex when you changed the schema? Sorting uses indexed values, not stored values. The fl parameter requires the stored values. These are separate within the index, and one cannot substitute for the other. If you didn't reindex, then you won't have the stored values for existing documents. http://wiki.apache.org/solr/HowToReindex Thanks, Shawn
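Putting Alan's suggestion together with the parameters from the start of the thread, the request would then look something like:

sort=foo_efloat desc&fl=field(foo_efloat),score,description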
Re: external file field and fl parameter
Yes that worked, thanks Alan. The consistency of this api is challenging. C On Jul 14, 2013, at 11:03 AM, Alan Woodward a...@flax.co.uk wrote: Hi Chris, Try wrapping the field name in a field() function in your fl parameter list, like so: fl=field(eff_field_name) Alan Woodward www.flax.co.uk On 14 Jul 2013, at 18:41, Chris Collins wrote: Why would I be re-indexing an external file field? The whole purpose is that its brought in at runtime and not part of the index? C On Jul 14, 2013, at 10:13 AM, Shawn Heisey s...@elyograg.org wrote: On 7/14/2013 7:05 AM, Chris Collins wrote: Yep I did switch on stored=true in the field type. I was able to confirm a few ways that there are values for the eff by two methods: 1) changing desc to asc produced drastically different results. 2) debugging FileFloatSource the following was getting triggered filling the vals array: while ((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { vals[doc] = fval; } At least by you asking these questions I guess it should work. I will continue dissecting. Did you reindex when you changed the schema? Sorting uses indexed values, not stored values. The fl parameter requires the stored values. These are separate within the index, and one cannot substitute for the other. If you didn't reindex, then you won't have the stored values for existing documents. http://wiki.apache.org/solr/HowToReindex Thanks, Shawn
Re: Apache Solr 4 - after 1st commit the index does not grow
When I look into the log, there is:

SEVERE: auto commit error...:java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2668)
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2834)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2814)
at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:529)
at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
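That error usually means the JVM running Solr ran out of heap, and one common first step is simply to give it more. A sketch, assuming the Jetty example start (the sizes are illustrative, not tuned recommendations):

java -Xms512m -Xmx2g -jar start.jar

Under Tomcat the same -Xms/-Xmx flags would go into CATALINA_OPTS (or JAVA_OPTS) instead.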
Re: HTTP Status 503 - Server is shutting down
Ok, still getting the same error HTTP Status 503 - Server is shutting down, so here's what I did now:
- reinstalled Tomcat
- deployed solr-4.3.1.war in C:\Program Files\Apache Software Foundation\Tomcat 6.0\webapps
- copied log4j-1.2.16.jar, slf4j-api-1.6.6.jar and slf4j-log4j12-1.6.6.jar to C:\Program Files\Apache Software Foundation\Tomcat 6.0\webapps\solr-4.3.1\WEB-INF\lib
- copied log4j.properties from C:\Dropbox\Databases\solr-4.3.1\example\resources to C:\Dropbox\Databases\solr-4.3.1\example\lib
- restarted Tomcat

Now this shows in my Tomcat console:

14-jul-2013 20:54:38 org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: C:\Program Files\Apache Software Foundation\Tomcat 6.0\bin;C:\Windows\Sun\Java\bin;C:\Windows\system32;C:\Windows;C:\Program Files\Common Files\Microsoft Shared\Windows Live;C:\Program Files (x86)\Common Files\Microsoft Shared\Windows Live;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\TortoiseSVN\bin;c:\msxsl;C:\Program Files (x86)\Windows Live\Shared;C:\Program Files\Microsoft\Web Platform Installer\;C:\Program Files (x86)\Microsoft ASP.NET\ASP.NET Web Pages\v1.0\;C:\Program Files (x86)\Windows Kits\8.0\Windows Performance Toolkit\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files\Microsoft SQL Server\110\DTS\Binn\;C:\Program Files (x86)\Microsoft SQL Server\110\Tools\Binn\ManagementStudio\;C:\Program Files (x86)\Microsoft SQL Server\110\DTS\Binn\;C:\Program Files (x86)\Java\jre6\bin;C:\Program Files\Java\jre631\bin;.
14-jul-2013 20:54:39 org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-8080
14-jul-2013 20:54:39 org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 287 ms
14-jul-2013 20:54:39 org.apache.catalina.core.StandardService start
INFO: Starting service Catalina
14-jul-2013 20:54:39 org.apache.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.37
14-jul-2013 20:54:39 org.apache.catalina.startup.HostConfig deployDescriptor
INFO: Deploying configuration descriptor manager.xml
14-jul-2013 20:54:39 org.apache.catalina.startup.HostConfig deployWAR
INFO: Deploying web application archive solr-4.3.1.war
log4j:WARN No appenders could be found for logger (org.apache.solr.servlet.SolrDispatchFilter).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14-jul-2013 20:54:39 org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory ROOT
14-jul-2013 20:54:39 org.apache.coyote.http11.Http11Protocol start
INFO: Starting Coyote HTTP/1.1 on http-8080
14-jul-2013 20:54:39 org.apache.jk.common.ChannelSocket init
INFO: JK: ajp13 listening on /0.0.0.0:8009
14-jul-2013 20:54:39 org.apache.jk.server.JkMain start
INFO: Jk running ID=0 time=0/55 config=null
14-jul-2013 20:54:39 org.apache.catalina.startup.Catalina start
INFO: Server startup in 732 ms

And the catalina.log contains the same entries.
Re: solr autodetectparser tikaconfig dataimporter error
Hi, is there no one with an idea of what this error is, or who can give me a pointer where to look? If not, is there an alternative way to import documents from an XML file with metadata and the filename to parse? Thanks for any help. On 12. Jul 2013, at 10:38 PM, Andreas Owen wrote: I am using solr 3.5, tika-app-1.4 and tagsoup 1.2.1. When I try to import a file via XML I get this error; it doesn't matter what file format I try to index, txt, cfm, pdf, all give the same error:

SEVERE: Exception while processing: rec document : SolrInputDocument[{id=id(1.0)={myTest.txt}, title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie kommuniziert man}, author=author(1.0)={Peter Z.}, path=path(1.0)={download/online}}]: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
... 6 more

Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
... 6 more

Jul 11, 2013 5:23:36 PM org.apache.solr.update.DirectUpdateHandler2 rollback

data-config.xml:

<dataConfig>
  <dataSource type="BinURLDataSource" name="data"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImport.xml" forEach="/albums/album" dataSource="main">
      <field column="title" xpath="//title" />
      <field column="id" xpath="//file" />
      <field column="contents" xpath="//description" />
      <field column="path" xpath="//path" />
      <field column="Author" xpath="//author" />
      <entity processor="TikaEntityProcessor" url="file:///C:\web\development\tkb\internet\public\download\online\${rec.id}" dataSource="data" onerror="skip">
        <field column="contents" name="text" />
      </entity>
    </entity>
  </document>
</dataConfig>

The libs are included and declared in the logs; I have also tried tika-app 1.0 and tagsoup 1.2 with the same result. Can someone please help? I don't know where to start looking for the error.
Re: solr autodetectparser tikaconfig dataimporter error
Caused by: java.lang.NoSuchMethodError:

That means you have some out-of-date jars, or some newer jars mixed in with the old ones.

-- Jack Krupansky
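A quick way to check for the jar mix Jack describes (a generic diagnostic sketch, not something from this thread) is to ask the JVM where it actually loaded the class from, via the standard ProtectionDomain API. Run a scratch class like the following with the same lib directories on the classpath that the import uses:

    import org.apache.tika.parser.AutoDetectParser;

    public class WhichJar {
        public static void main(String[] args) {
            // Prints the jar (or directory) AutoDetectParser was loaded from.
            // If this points at an older tika-core than expected, that stale
            // jar is the likely source of the NoSuchMethodError.
            System.out.println(AutoDetectParser.class
                    .getProtectionDomain()
                    .getCodeSource()
                    .getLocation());
        }
    }

(The class name WhichJar is just a placeholder; note that getCodeSource() can return null for bootstrap classes, but for classes loaded from a jar it returns the jar's URL.)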
Re: Apache Solr 4 - after 1st commit the index does not grow
Well, that's one. OutOfMemoryErrors will stop things from happening for sure; the cure is to give the JVM more memory. Additionally, multiple updates of a doc with the same uniqueKey will replace the old copy with a new one, which might be what you're seeing. But get rid of the OOM first.

Best
Erick

On Sun, Jul 14, 2013 at 2:40 PM, glumet jan.bouch...@gmail.com wrote:

When I look into the log, there is:

SEVERE: auto commit error...:java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2668)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2834)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2814)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:529)
    at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)

--
View this message in context: http://lucene.472066.n3.nabble.com/Apache-Solr-4-after-1st-commit-the-index-does-not-grow-tp4077913p4077924.html
Sent from the Solr - User mailing list archive at Nabble.com.
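As a footnote to "give the JVM more memory": heap size is controlled by the standard -Xms/-Xmx JVM flags. The values below are placeholders, not a recommendation, since the right numbers depend on index size and cache configuration:

    java -Xms512m -Xmx2g -jar start.jar

For a servlet-container deployment the same flags go into the container's JVM options (e.g. JAVA_OPTS or CATALINA_OPTS for Tomcat) rather than on the command line.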
Re: Solr caching clarifications
Alright, thanks Erick. For the question about memory usage of merges, taken from Mike McCandless' blog: "The big thing that stays in RAM is a logical int[] mapping old docIDs to new docIDs, but in more recent versions of Lucene (4.x) we use a much more efficient structure than a simple int[] ... see https://issues.apache.org/jira/browse/LUCENE-2357. How much RAM is required is mostly a function of how many documents (lots of tiny docs use more RAM than fewer huge docs)."

A related clarification: as my users are not aware of the fq possibility, I was wondering how to make the best out of the filter cache. Would it be efficient to transform their queries implicitly into filter queries on clauses that are boolean searches (date ranges etc. that do not affect the score of a document)? Is this a good practice? Is there any plugin or query parser that does this?

Inline.

On Thu, Jul 11, 2013 at 8:36 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote:

Hello,
As a result of frequent Java OOM exceptions, I am trying to investigate the Solr JVM memory heap usage. Please correct me if I am mistaken; this is my understanding of the uses of the heap (per replica on a Solr instance):

1. Buffers for indexing - bounded by ramBufferSize
2. Solr caches
3. Segment merges
4. Miscellaneous - buffers for tlogs, servlet overhead, etc.

Particularly I'm concerned by Solr caches and segment merges.

1. How memory-consuming (bytes per doc) are filterCache entries (BitDocSet) and queryResultCache entries (DocList)? I understand it is related to the skip spaces between the doc ids that match (so it's not saved as a bitmap). But basically, is every id saved as a Java int?

Different beasts. filterCache consumes, essentially, maxDoc/8 bytes (you can get the maxDoc number from your Solr admin page), plus some overhead for storing the fq text, but that's usually not much. This is for each entry, up to the configured size. queryResultCache is usually trivial unless you've configured it extravagantly: it's the query string length plus queryResultWindowSize integers per entry (queryResultWindowSize is from solrconfig.xml).

2. Does queryResultMaxDocsCached (for example = 100) mean that any query resulting in more than 100 docs will not be cached (at all) in the queryResultCache? Or does it have to do with the documentCache?

It's just a limit on the queryResultCache entry size, as far as I can tell. But again, this cache is relatively small; I'd be surprised if it used significant resources.

3. documentCache - the wiki says it should be greater than max_results * concurrent_queries. Max results is just the number of rows displayed (rows - start), right? Not queryResultWindowSize?

Yes. This is a cache (I think) for the _contents_ of the documents you'll be returning, to be manipulated by various components during the life of the query.

4. enableLazyFieldLoading=true - when querying for ids only (fl=id), will this cache be used (at the expense of evicting docs that were already loaded with stored fields)?

Not sure, but I don't think this will contribute much to memory pressure. This is about how many fields are loaded to get a single value from a doc in the results list, and since one is usually working with 20 or so docs, this is usually a small amount of memory.

5. How large is the heap used by merges? Assuming we have a merge of 10 segments of 500MB each (half inverted files - *.pos, *.doc, etc.; half non-inverted files - *.fdt, *.tvd), how much heap should be left unused for this merge?

Again, I don't think this is much of a memory consumer, although I confess I don't know the internals. Merging is mostly about I/O.

Thanks in advance,
Manu

But take a look at the admin page; you can see how much memory the various caches are using by looking at the plugins/stats section.

Best
Erick
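To put the maxDoc/8 rule of thumb in numbers, here is a back-of-the-envelope sketch; all the figures are hypothetical, and it assumes the worst case where every cache entry is a full bitset (small result sets may be stored more compactly):

    // Illustrative only: worst-case filterCache heap footprint.
    public class FilterCacheEstimate {
        public static void main(String[] args) {
            long maxDoc = 50_000_000L; // hypothetical maxDoc, read off the admin page
            int cacheSize = 512;       // hypothetical filterCache size from solrconfig.xml
            long bytesPerEntry = maxDoc / 8;            // one bit per doc: ~6.25 MB
            long worstCase = bytesPerEntry * cacheSize; // all entries full: ~3.2 GB
            System.out.printf("~%.1f MB per entry, ~%.1f GB worst case%n",
                    bytesPerEntry / 1e6, worstCase / 1e9);
        }
    }

On the fq question, the usual manual form of the rewrite is simply to move non-scoring clauses out of q, e.g. (field names hypothetical) q=ipod&fq=price:[0 TO 100], so that the range filter is cached and reused across queries without influencing scores.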
Re: How to from solr facet exclude specific “Tag”!
Make your two fq clauses separate fq params? Would be better for your caches, and would mean the tag is easily associated with the whole fq query string.

Upayavira
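Spelled out against the parameters in the original question, the separated form would look roughly like this (a sketch, with line breaks added for readability):

    q=*:*
    &fq={!tag=city}CityId:729
    &fq={!tag=company}CompanyId:16122
    &facet=true
    &facet.field={!ex=city}CityId
    &facet.field={!ex=company}CompanyId

Local params such as {!tag=...} are only read at the start of a parameter value, so in the original single fq the whole clause was tagged "city" and nothing carried the "company" tag; that is why ex=city appeared to work while ex=company excluded nothing. With two separate fq parameters, each tag names exactly one filter, and each facet.field can exclude just the filter it should.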
Re: Norms
On Jul 10, 2013, at 4:39 AM, Daniel Collins danwcoll...@gmail.com wrote:

QueryNorm is what I'm still trying to get to the bottom of exactly :)

If you have not seen it, some reading from the past here…

https://issues.apache.org/jira/browse/LUCENE-1896

- Mark
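For anyone still chasing queryNorm: in Lucene's default TF-IDF scoring (documented in the 4.x TFIDFSimilarity javadocs) it is computed as

    queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)

which is the same factor for every document matched by a given query. It therefore never changes the ranking within one query; it only makes absolute scores somewhat more comparable across queries.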