Re: For an XML fieldtype
Thanks Chris, this idea has been discussed before, most notably in this thread... http://www.nabble.com/Indexing-XML-files-to7705775.html ...as discussed there, the crux of the issue is not a special fieldtype, but a custom ResponseWriter that outputs the XML you want, and leaves any field values you want unescaped (assuming you trust them to be well-formed). How you decide which field values to leave unescaped could either be hardcoded, or driven by the FieldType of each field (in which case you might write an XmlField that subclasses StrField, but you wouldn't need to override any methods -- just see that the FieldType is XmlField and use that as your guide).

Sorry not to have found that link earlier. I discovered that I have done exactly the same as mirko-9 http://www.nabble.com/Re%3A-Indexing-XML-files-p7742668.html xmlWriter.writePrim(xml, name, f.stringValue(), false); So, this is a good way to implement our need, but there are good reasons not to commit it to the Solr core:

: XmlResponseWriter schema, code injection risks.

Such prudence makes us very confident in Solr.

: I would be glad if this class could be committed, so that I do not need to
: keep it up to date with future Solr releases.

As long as you stick to the contracts of FieldType and/or ResponseWriter you don't need to worry -- these are published SolrPlugin APIs that Solr won't break ... we expect people to implement them, and people can expect their plugins to keep working when they upgrade Solr.

-- Frédéric Glorieux
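To make the escape-or-passthrough decision concrete, here is a minimal self-contained sketch (this is not the Solr API; the class name and the isXmlField flag are illustrative). In a real ResponseWriter the flag would come from checking whether the field's FieldType is the XmlField subclass mentioned above:

```java
// Illustrative only: the escape-or-passthrough choice a custom
// ResponseWriter would make per field. Not Solr API.
public class XmlEscapeSketch {

    /** Minimal XML escaping for text content. */
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '&': sb.append("&amp;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Write a field value: escaped by default, raw when the field is
     * flagged as containing trusted, well-formed XML.
     */
    static String writeFieldValue(String value, boolean isXmlField) {
        return isXmlField ? value : escapeXml(value);
    }
}
```

The injection risk mentioned above is exactly the `isXmlField == true` branch: the raw value reaches the client unescaped, so it must be trusted at index time.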
Re: Conceptual Question
Hi Yonik, sorry to jump on an old post.

: There is a change interface in JIRA, as long as all of the fields originally sent are stored.

Do you remember the JIRA issue, or a keyword to find it? It sounds useful in some cases, for example when you are working on analysers. That could be a real need for me in the future.

-- Frédéric Glorieux École nationale des chartes direction des nouvelles technologies et de l'informatique
Re: Multiple doc types in schema
Otis, thanks for the link and the work! Around September I will probably need this patch, if it's not already committed to the Solr sources. I will also need searches across multiple indexes, but I understand that there is no simple, fast and generic solution in the Solr context. Maybe I would lose Solr caching, but it doesn't seem an impossible job to design my own custom request handler to query different indexes, as Lucene allows.

: SOLR-215 supports multiple indices on a single Solr instance. It does *not* support searching of multiple indices at once (e.g. parallel search) and merging of results.

-- Frédéric Glorieux École nationale des chartes direction des nouvelles technologies et de l'informatique
Re: Multiple doc types in schema
Hi Yonik,

: I will also need multiple indexes searches,
: Do you mean:
: 2) Multiple indexes with different schemas, search will search across all or some subset and combine the results (federated search)

Exactly that. I'm coming from a fairly old Lucene-based project called SDX http://www.nongnu.org/sdx/docs/html/doc-sdx2/en/presentation/bases.html. Sorry for the link, the project is mainly documented in French. The framework is Cocoon-based, perhaps heavy by today's standards. It hosts multiple applications, each with multiple "bases"; a base was a kind of Solr schema, back in 2000. From this experience I can say that cross-searching between different schemas is possible, and users may find it important.

Take a library, for example. It has different collections, let's say: CSV records obtained from digitized photos (a lightweight model, with no further writes expected), and a complex librarian model updated every day. These collections share at least a title and an author field, and should be searchable behind the same form for the public; but each one should also have its own application, according to its own information model. With the SDX framework above, I know of real-life applications with 30 Lucene indexes. It's possible, because Lucene allows it (MultiReader) http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/MultiReader.html.

-- Frédéric Glorieux École nationale des chartes direction des nouvelles technologies et de l'informatique

: 1) Multiple unrelated indexes with different schemas, that you will search separately... but you just want them in the same JVM for some reason.
: 3) Multiple indexes with the same schema, each index is a shard that contains part of the total collection. Search will merge results across all shards to give the appearance of a single large collection (distributed search).
: -Yonik
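Independently of Lucene's MultiReader, the federated idea — combining hits from collections that share only a few fields — can be sketched in plain Java. The field names "title" and "score" and the use of maps are illustrative assumptions; a real implementation would merge Lucene hits:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustration only: merging hits from two collections that share a few
// fields (here "title" and "score"), the way a federated search would.
// In Lucene itself, MultiReader handles this at the index level.
public class FederatedMergeSketch {

    static List<Map<String, Object>> merge(List<Map<String, Object>> a,
                                           List<Map<String, Object>> b) {
        List<Map<String, Object>> all = new ArrayList<>(a);
        all.addAll(b);
        // Sort the combined hit list by descending score.
        all.sort((x, y) -> Double.compare((Double) y.get("score"),
                                          (Double) x.get("score")));
        return all;
    }
}
```

Each collection keeps its own schema; only the shared fields participate in the combined ranking, which is the weak point of federated search (scores from different indexes are not directly comparable).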
Re: Multiple doc types in schema
Thanks Yonik for sharing your thoughts.

: This doesn't sound like true federated search,

I'm afraid I don't understand federated search; you seem to have a precise idea in mind.

: since you have a number of fields that are the same in each index that you search across, and you treat them all the same. This is functionally equivalent to having a single schema and a single index. You can still have multiple applications that query the single collection differently.

Short of a pointer or a web example from you, what you describe sounds to me like implementing a complete database with a single table (not easy to understand and maintain, but possible). In my experience, a collection is a schema, with thousands or millions of XML documents, perhaps 10, 20 or more fields, and the search configuration is generated from a kind of data schema (there's no real standard for expressing, for example, that a title or a subject needs one field for exact match and another for word search). If an index grew too big (happily I never hit that limit with Lucene), I guess there are solutions. My problem is to maintain different collections, each with its own intellectual logic, some shared field names (like Dublin Core, or at least fulltext), but also fields specific to each one.

: Depending on update patterns and index sizes, you can probably get better efficiency with multiple indexes, but not really more functionality (in your case), right?

Maybe keeping things understandable could count as a functionality? Perhaps less so now, but there was a time when a Lucene index could become corrupted, so keeping them separate was important. I guess these specific problems will not be Solr priorities, but until I'm corrected, I still feel that multiple indexes are useful.

-- Frédéric Glorieux École nationale des chartes direction des nouvelles technologies et de l'informatique
Re: Multiple doc types in schema
After further reading, especially http://people.apache.org/~hossman/apachecon2006us/faceted-searching-with-solr.pdf (thanks Hoss),

: Depending on update patterns and index sizes, you can probably get better efficiency with multiple indexes, but not really more functionality (in your case), right?

maybe I'm coming around to your point of view: a loose schema with dynamic fields is probably my solution. It still feels strange to me to treat a Lucene index as a blob, but if it works for projects bigger than mine, I should follow. So it means one fieldtype per analyzer, with the data-model logic kept entirely on the collection side. I think I have my plan for September, but I would be very glad if you have anything to add.

-- Frédéric Glorieux École nationale des chartes direction des nouvelles technologies et de l'informatique
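For what the "loose schema with dynamic fields" approach might look like, here is a hypothetical schema.xml fragment (all field and type names are assumptions, not from this thread): a few shared Dublin Core-ish fields, plus suffix-typed dynamic fields so each collection can add its own:

```xml
<!-- Hypothetical fragment: shared fields for every collection,
     plus per-collection dynamic fields typed by suffix. -->
<fields>
  <!-- fields every collection shares -->
  <field name="id"     type="string" indexed="true" stored="true" required="true"/>
  <field name="title"  type="text"   indexed="true" stored="true"/>
  <field name="author" type="text"   indexed="true" stored="true"/>
  <!-- collection-specific fields: one fieldtype per analyzer -->
  <dynamicField name="*_s"   type="string" indexed="true" stored="true"/>
  <dynamicField name="*_txt" type="text"   indexed="true" stored="true"/>
</fields>
```

A librarian collection could then index `shelfmark_s` and `abstract_txt` while a photo collection indexes `camera_s`, all against the same schema, with the data model documented on the collection side.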
Re: indexing documents (or pieces of a document) by access controls
Hello,

: With all due respect, I really think the problem is largely underestimated here, and is far more complex than these suggestions... unless we are talking about 100,000 documents, a couple of users, and updates once a day. If you want millions of documents, faceted authorised navigation including counts, a newly indexed document every second which should be reflected in the results instantly, and changing authorisations, the problem isn't relatively easy to solve anymore :-)

When I had that kind of problem (less complex) with Lucene, the only idea was to filter from the front end, according to the ACL policy. Lucene docs and fields weren't protected, but tagged. Every search was combined with a clause on an audience field, with hierarchical values like public, reserved, protected, secret, so that a public document carries the secret value as well and is found with audience:secret, according to the rights of the user who searches. As for the fields, those not allowed for a given user were stripped from the response.

Maybe you can have a look at the XML database eXist? Its XQuery-based search engine is not focused on the same goals as Lucene, but I can promise you that no query will ever return results from documents you are not allowed to read.

-- Frédéric Glorieux École nationale des chartes direction des nouvelles technologies et de l'informatique
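The hierarchical audience tagging described above can be sketched in plain Java (the class name is illustrative; in practice the tags would be indexed into a Lucene/Solr field at write time). A document classified "public" receives every more restrictive tag as well, so a search on the user's own clearance level matches everything that user may read:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of hierarchical "audience" tagging: a public document is also
// tagged reserved, protected and secret, so a query on audience:<level>
// (the searching user's clearance) finds all documents at or below it.
public class AudienceTags {

    // Least to most restrictive.
    static final List<String> LEVELS =
        Arrays.asList("public", "reserved", "protected", "secret");

    /** Tags to index for a document classified at the given level. */
    static List<String> tagsFor(String docLevel) {
        int i = LEVELS.indexOf(docLevel);
        if (i < 0) throw new IllegalArgumentException("unknown level: " + docLevel);
        // The document's own level plus every more restrictive one.
        return LEVELS.subList(i, LEVELS.size());
    }
}
```

So `tagsFor("public")` yields all four tags while `tagsFor("secret")` yields only `secret`: a user cleared for `reserved` who always searches with `audience:reserved` sees public and reserved documents, nothing above.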
Re: Wildcards / Binary searches
Chris Hostetter wrote:

: : It could be a useful request handler ? Giving a field, with a
: perhaps, but as I said -- I think it requires more than just a special request handler; you want a special index as well. FYI: there is an ongoing thread on this general topic on the java-user list. I didn't have the time/energy to follow it, but the concepts discussed there might prove interesting for you (most of the people involved have spent a lot more time on problems like this than I have)... http://www.nabble.com/How-to-implement-AJAX-search%7ELucene-Search-part--tf3887286.html

Interesting. Here is my idea: WildcardTermEnum (NOT a query) http://www.nabble.com/Re%3A-How-to-implement-AJAX-search%7ELucene-Search-part--p11027221.html

-- Frédéric Glorieux École nationale des chartes direction des nouvelles technologies et de l'informatique
Re: Wildcards / Binary searches
Sorry to jump on a side note of the thread, but the topic matches one of my needs of the moment.

: Side note: it's my opinion that 'type ahead' or 'auto complete' style functionality is best addressed by customized logic (most likely using specially built fields containing all of the prefixes of the key words up to N characters as separate tokens).

Do you mean something like this? <field name="autocomplete">w wo wor word</field>

: Simple uses of PrefixQueries are only going to get you so far, particularly under heavy load or in an index with a large number of unique terms.

For a bibliographic app with Lucene, I implemented a suggest feature on different fields (especially subject terms, like topic or place), to populate a form with already-used values. I used the Lucene IndexReader to get, very quickly, a list of terms in sorted order, without duplicate values. http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexReader.html#terms(org.apache.lucene.index.Term)

There is a drawback to this approach: the enumeration is ordered by Term.compareTo(), whose sort order is natively ASCII, so uppercase comes before lowercase. I had to patch Lucene's Term.compareTo() for that project, definitely not good practice for index portability. A duplicate field with an analyser producing a sortable ASCII version would be better. Opinions from the list on this topic would be welcome.

-- Frédéric Glorieux École nationale des chartes direction des nouvelles technologies et de l'informatique
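Both ideas above — prefix tokens for an autocomplete field, and a sortable ASCII duplicate field instead of patching Term.compareTo() — can be sketched in plain Java. Class and method names are illustrative; in Solr this logic would normally live inside an analyzer:

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Two sketches: (1) emit prefix tokens for an autocomplete field,
// (2) derive a lowercased, accent-stripped sort key for a duplicate field.
public class AutocompleteSketch {

    /** All prefixes of a word up to maxLen chars, e.g. w, wo, wor, word. */
    static List<String> prefixes(String word, int maxLen) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i <= Math.min(word.length(), maxLen); i++) {
            out.add(word.substring(0, i));
        }
        return out;
    }

    /** Sortable ASCII key: decompose accents (NFD), drop combining marks, lowercase. */
    static String sortKey(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
    }
}
```

Indexing `sortKey("École")` gives "ecole", so the term enumeration sorts accented and capitalized forms together without touching Lucene's own comparator, and the original form stays in a stored field for display.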
Re: highlight and wildcards ?
Xuesong (?),

Thanks a lot for your answer; sorry for not having scanned the archives first. This is a really good and understandable reason, but sad for my project. Prefix queries will be the main activity of my users (they need to search Latin texts, so domin* is needed to match dominus or domino). So, I need to investigate some more.

Xuesong Luo wrote:

: Frédéric, I asked a similar question several days ago; it seems we don't have a perfect solution when using a prefix wildcard with highlighting. Here is what Chris said: in Solr 1.1, highlighting used the info from the raw query to do highlighting, hence in your query for consult* it would highlight the Consult part of Consultant even though the prefix query was matching the whole word. In the trunk (soon to be Solr 1.2) Mike fixed that so the query is rewritten to its expanded form before highlighting is done ... this works great for true wildcard queries (ie: cons*t* or cons?lt*), but for prefix queries (ie: consult*) Solr has an optimization to reduce the likelihood of Solr crashing if the prefix matches a lot of terms ... unfortunately this breaks highlighting of prefix queries, and no one has implemented a solution yet...

-- Frédéric Glorieux École nationale des chartes direction des nouvelles technologies et de l'informatique
Re: highlight and wildcards ?
Same in my project.

: Chris does mention we can put a ? before the *, so instead of domin*, you can use domin?*; however, that requires at least one char following your search string.

Right, it works well, and the extra char is a detail. With a?* I get the documented Lucene error: maxClauseCount is set to 1024 http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/search/BooleanQuery.html#getMaxClauseCount() I know that some of my users will want to find big lists of words or phrases with a common prefix, like ante for example. I should evaluate RegexQuery.

-- Frédéric Glorieux École nationale des chartes direction des nouvelles technologies et de l'informatique
Re: highlight and wildcards ?
Hoss,

Thanks for all your information and pointers. I know that my problems are not mainstream.

ConstantScoreQuery (@author yonik):

  public void extractTerms(Set terms) {
    // OK to not add any terms when used for MultiSearcher,
    // but may not be OK for highlighting
  }

ConstantScoreRangeQuery (@author yonik), ConstantScorePrefixQuery (@author yonik).

Maybe a kind of ConstantScoreRegexQuery could be part of my solution for things like (ante|post).* (our users are linguists). Score would be lost, but that is not a problem for this kind of user, who wants to read all matches of a pattern. For a highlighter, I should investigate your code to see where the regexp could be plugged in without losing the analysers (which we also need; nothing is simple).

-- Frédéric Glorieux École nationale des chartes direction des nouvelles technologies et de l'informatique

: With a?* I get the documented lucene error
: maxClauseCount is set to 1024

Which is why Solr converts PrefixQueries to ConstantScorePrefixQueries that don't have that problem -- the trade-off being that they can't be highlighted, and we're right back where we started. It's a question of priorities. In developing Solr, we prioritized consistent stability regardless of query or index characteristics, and highlighting of PrefixQueries suffered. Working around that decision by using wildcards may get highlighting working for you, but the stability issue of maxClauseCount is always going to be there (you can increase maxClauseCount in the solrconfig, but there's always the chance that a user will specify a wildcard that results in one more clause than you've configured).

: I should evaluate RegexQuery.

For the record, I don't think that will help ... RegexQuery works just like WildcardQuery but with a different syntax -- it rewrites itself to a BooleanQuery containing all of the terms in the index that match your regex.

-Hoss
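For what it's worth, the term-level behaviour a hypothetical ConstantScoreRegexQuery would need can be sketched in plain Java (the class name is an assumption; a real implementation would walk the index's term enumeration rather than a collection, and would build a filter instead of one clause per matching term):

```java
import java.util.Collection;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Sketch: keep the terms matching a regex, as a constant-score query
// would, without expanding to one BooleanQuery clause per term
// (hence no maxClauseCount limit, and no per-term score).
public class RegexTermFilter {

    static List<String> matching(Collection<String> terms, String regex) {
        Pattern p = Pattern.compile(regex);
        return terms.stream()
                    .filter(t -> p.matcher(t).matches())
                    .collect(Collectors.toList());
    }
}
```

Given a Latin term list, `matching(terms, "(ante|post).*")` keeps exactly the words the linguists are after; the matched terms could then feed a highlighter directly, which is where a plain RegexQuery rewritten to a BooleanQuery would instead hit the clause limit.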
custom writer, working but... a strange exception in logs
Hi all,

First of all: a Lucene user for years, I really should thank you for Solr. For a start, I wrote a little results writer for an app. It works the way I understand Solr should, except for a strange exception I'm not able to puzzle out. Version: fresh Subversion checkout.

1. Class
2. Stacktrace
3. Maybe ?

1. Class

public class HTMLResponseWriter implements QueryResponseWriter {
  public static String CONTENT_TYPE_HTML_UTF8 = "text/html; charset=UTF-8";
  /** A custom HTML header configured from solrconfig.xml */
  static String HEAD;
  /** A custom HTML footer configured from solrconfig.xml */
  static String FOOT;
  /** get some snippets from conf */
  public void init(NamedList n) {
    String s = (String) n.get("head");
    if (s != null && !"".equals(s)) HEAD = s;
    s = (String) n.get("foot");
    if (s != null && !"".equals(s)) FOOT = s;
  }
  public void write(Writer writer, SolrQueryRequest req, SolrQueryResponse rsp) throws IOException {
    // causes the exception below
    writer.write(HEAD);
    /* loop on my results, working like it should */
    // causes the exception below
    writer.write(FOOT);
  }
  public String getContentType(SolrQueryRequest request, SolrQueryResponse response) {
    return CONTENT_TYPE_HTML_UTF8;
  }
}

2. Stacktrace

GRAVE: org.apache.solr.core.SolrException: Missing required parameter: q
  at org.apache.solr.request.RequiredSolrParams.get(RequiredSolrParams.java:50)
  at org.apache.solr.request.StandardRequestHandler.handleRequestBody(StandardRequestHandler.java:72)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
  at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:66)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
  ...

3. Maybe ?

I can't figure out why, but when writer.write(HEAD) is executed, I see code from StandardRequestHandler executed twice in the debugger; the first call is OK, the second doesn't have the q parameter. Displaying the results is always OK.
Without those lines, there is only one call to StandardRequestHandler and no exception in the log, but no more head or foot. When the HEAD and FOOT values are hard-coded rather than configured, there's no exception. If HEAD and FOOT are not static, the problem is the same. Is it a mistake in my code? Every piece of advice is welcome, and if I've hit a bug, be sure I will do my best to help.

-- Frédéric Glorieux École nationale des chartes direction des nouvelles technologies et de l'informatique
Re: custom writer, working but... a strange exception in logs
Thanks for the answer, I'm feeling less guilty.

: I don't see a non-null default for HEAD/FOOT... perhaps do if (HEAD != null) writer.write(HEAD);
: There may be an issue with how you register in solrconfig.xml

I get everything I want from solrconfig.xml; I was suspecting some classloader mystery. Following your advice from another post, I will write a specific request handler, so it will be easier to trace the problem, with a very simple first workaround: stop the exception from being logged (to avoid gigabytes of logs).

-- Frédéric Glorieux École nationale des chartes direction des nouvelles technologies et de l'informatique
SOLVED Re: custom writer, working but... a strange exception in logs
: I'm baffled. [Yonik] I don't know why that would be... what is the client sending the request? If it gets an error, does it retry or something?

Found it! It's the favicon.ico effect. There is nothing in the logs when the class is requested from curl, but a browser (here Opera), on receiving a response that begins with HTML, also requests favicon.ico.

-- Frédéric Glorieux École nationale des chartes direction des nouvelles technologies et de l'informatique
Re: SOLVED Re: custom writer, working but... a strange exception in logs
Frédéric Glorieux wrote:

: I'm baffled. [Yonik] I don't know why that would be... what is the client sending the request? If it gets an error, does it retry or something?
: Found it! Nothing in the logs when the class is requested from curl,

Sorry, same idea, but it's a CSS link.

-- Frédéric Glorieux École nationale des chartes direction des nouvelles technologies et de l'informatique