Re: For an XML fieldtype

2008-02-07 Thread Frédéric Glorieux (École nationale des chartes)


Thanks Chris,


this idea has been discussed before, most notably in this thread...

http://www.nabble.com/Indexing-XML-files-to7705775.html
...as discussed there, the crux of the issue is not a special fieldtype, 
but a custom ResponseWriter that outputs the XML you want, and leaves any 
field values you want unescaped (assuming you trust them to be well-formed). 
How you decide which field values to leave unescaped could either be 
hardcoded, or driven by the FieldType of each field (in which case you 
might write an XmlField that subclasses StrField, but you wouldn't need to 
override any methods -- just see that the FieldType is XmlField and use 
that as your guide).



Sorry for not having found this link. I discovered that I had done 
exactly the same thing as mirko-9:

http://www.nabble.com/Re%3A-Indexing-XML-files-p7742668.html
xmlWriter.writePrim(xml, name, f.stringValue(), false);

So this is a good way to implement our need, but there are good reasons 
not to commit it to the Solr core: the XmlResponseWriter schema, code 
injection risks. Such prudence makes us very confident in Solr.
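The escape-or-leave-alone decision discussed above can be sketched in plain Java. This is only an illustration: the class and method names (XmlFieldWriter, escapeXml, writeFieldValue) are hypothetical, not Solr's actual API.

```java
// Sketch of a ResponseWriter's per-field decision: escape the value as
// XML text, unless the field is declared as trusted, well-formed XML.
public class XmlFieldWriter {

    /** Minimal XML escaping for text content. */
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '&': sb.append("&amp;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Pass the value through unescaped only when the field type says so. */
    static String writeFieldValue(String value, boolean isXmlField) {
        return isXmlField ? value : escapeXml(value);
    }
}
```

In a real writer the boolean would come from checking the field's FieldType against XmlField, as Hoss suggests.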



: I would be glad that this class could be commited, so that I do not need to
: keep it up to date with future Solr release.

as long as you stick to the contracts of FieldType and/or ResponseWriter 
you don't need to worry -- these are published SolrPlugin APIs that Solr 
won't break ... we expect people to implement them, and people can expect 
their plugins to work when they upgrade Solr.




--
Frédéric Glorieux


Re: Conceptual Question

2007-06-22 Thread Frédéric Glorieux

Hi Yonik,

Sorry to jump in on an old post.


There is a change interface in JIRA, as long as all of the fields
originally sent are stored.


Do you remember the JIRA issue, or a keyword to find it? It sounds useful 
in some cases, for example when you are working on analyzers. That could 
be real life for me in the future.


--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: Multiple doc types in schema

2007-06-21 Thread Frédéric Glorieux


Otis,

Thanks for the link and the work!
Maybe around September I will need this patch, if it's not already 
committed to the Solr sources.


I will also need multiple-index searches, but I understand that there is 
no simple, fast and generic solution in the Solr context. Maybe I would 
lose Solr caching, but it does not seem impossible to design one's own 
custom request handler to query different indexes, as Lucene allows.



SOLR-215 supports multiple indices on a single Solr instance.  It does *not* 
support searching of multiple indices at once (e.g. parallel search) and 
merging of results.





--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: Multiple doc types in schema

2007-06-21 Thread Frédéric Glorieux


Hi Yonik,


I will also need multiple-index searches,


Do you mean:



2) Multiple indexes with different schemas, search will search across
all or some subset and combine the results (federated search)


Exactly that. I'm coming from a quite old Lucene-based project called SDX:
http://www.nongnu.org/sdx/docs/html/doc-sdx2/en/presentation/bases.html. 
Sorry for the link; the project is mainly documented in French. The 
framework is Cocoon-based, maybe heavy now. It allows hosting multiple 
applications, with multiple "bases" -- a "base" was a kind of Solr 
schema, back in 2000.


From this experience, I can say that cross-searching between different 
schemas is possible, and users may find it important. Take for example a 
library. They have different collections, let's say: CSV records 
obtained from digitized photos, a light model with no writes expected; 
and a complex librarian model documented every day. These collections 
share at least a title and an author field, and should be searchable 
behind the same form for the public; but each one should also have its 
own application, according to its information model.


With the SDX framework above, I know of real-life applications with 30 
Lucene indexes. It's possible because Lucene allows it (MultiReader):
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/MultiReader.html



--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


 1) Multiple unrelated indexes with different schemas, that you will
 search separately... but you just want them in the same JVM for some
 reason.



3) Multiple indexes with the same schema, each index is a shard that
contains part of the total collection.  Search will merge results
across all shards to give appearance of a single large collection
(distributed search).

-Yonik





Re: Multiple doc types in schema

2007-06-21 Thread Frédéric Glorieux

Thanks Yonik for sharing your thoughts,

This doesn't sound like true federated search, 


I'm afraid I don't understand "federated search"; you seem to have a 
precise idea in mind.



since you have a number
of fields that are the same in each index that you search across, and
you treat them all the same.  This is functionally equivalent to
having a single schema and a single index.  You can still have
multiple applications that query the single collection differently.


Until you give a pointer or a web example, what you describe seems to me 
like implementing a complete database with a single table (not easy to 
understand and maintain, but possible). In my experience, a collection 
is a schema with thousands or millions of XML documents, maybe 10, 20 or 
more fields, and the search configuration is generated from a kind of 
data schema (there's no real standard for explaining, for example, that 
a title or a subject needs one field for exact match and another for 
word search). If an index were too big (fortunately I have never hit 
this limit with Lucene), I guess there are solutions. My problem is 
maintaining different collections, each with its own intellectual logic, 
some shared field names, like Dublin Core, or at least full text, but 
also fields specific to each one.



Depending on update patterns and index sizes, you can probably get
better efficiency with multiple indexes, but not really more
functionality (in your case), right?


Maybe keeping things understandable could be accepted as a 
functionality? Perhaps less so now, but there was a time when a Lucene 
index could become corrupted, so separating them was important.


I guess these specific problems will not be Solr priorities, but until I 
am corrected, I still feel that multiple indexes are useful.



--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: Multiple doc types in schema

2007-06-21 Thread Frédéric Glorieux


After further reading, especially 
http://people.apache.org/~hossman/apachecon2006us/faceted-searching-with-solr.pdf

(Thanks Hoss)


Depending on update patterns and index sizes, you can probably get
better efficiency with multiple indexes, but not really more
functionality (in your case), right?


Maybe I'm approaching your point of view: "Loose Schema with Dynamic 
Fields" is probably my solution. It feels strange to me to consider a 
Lucene index as a blob, but if it works for bigger projects than mine, I 
should follow. So it means one fieldtype per analyzer, and the 
data-model logic stays only on the collection side. I think I have my 
idea for September, but I would be very glad if you have something to add.


--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: indexing documents (or pieces of a document) by access controls

2007-06-13 Thread Frédéric Glorieux



Hello,

With all due respect, I really think the problem is largely 
underestimated here, and is far more complex than these 
suggestions... unless we are talking about 100,000 documents, a couple 
of users, and updating once a day. If you want millions of documents, 
faceted authorized navigation including counting, a newly indexed 
document every second which should be reflected in the results 
instantly, and changing authorisations... the problem isn't relatively 
easy to solve anymore :-)


When I had that kind of problem (less complex) with Lucene, the only 
idea was to filter from the front end, according to the ACL policy. 
Lucene docs and fields weren't protected, but tagged. Searching was 
always applied with an "audience" field, with hierarchical values like 
public, reserved, protected, secret, so that a public document also 
carries the "secret" value, to be found with audience:secret, according 
to the rights of the user who searches. Fields not allowed for some 
users were stripped.
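The hierarchical audience tagging described above can be sketched as follows. The level names come from the mail; the class and method names are illustrative only, not part of any real ACL library.

```java
import java.util.Arrays;
import java.util.List;

// A document tagged at a given audience level also carries every more
// restrictive level, so a query on audience:<clearance> finds every
// document at or below the user's clearance.
public class AudienceTags {
    // Ordered from least to most restrictive.
    static final List<String> LEVELS =
        Arrays.asList("public", "reserved", "protected", "secret");

    /** Tags to index for a document at the given level. */
    static List<String> tagsFor(String docLevel) {
        int i = LEVELS.indexOf(docLevel);
        if (i < 0) throw new IllegalArgumentException("unknown level: " + docLevel);
        // A public doc gets all four values; a secret doc gets only "secret".
        return LEVELS.subList(i, LEVELS.size());
    }
}
```

With this scheme the front end never has to post-filter results: it simply adds audience:&lt;clearance&gt; to every query.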


Maybe you can have a look at the XML database eXist? Its search engine, 
XQuery-based, is not focused on the same goals as Lucene, but I can 
promise you that queries will never return results from documents you 
are not allowed to read.



--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: Wildcards / Binary searches

2007-06-10 Thread Frédéric Glorieux

Chris Hostetter wrote:

: It could be a useful request handler ? Giving a field, with a

perhaps, but as i said -- i think it requires more than just a special
request handler, you want a special index as well.

FYI: there is an ongoing thread on this general topic on the java-user
list, i didn't have the time/energy to follow it but the concepts
discussed there might prove interesting for you (most of the people
involved have spent a lot more time on problems like this than i have)...

http://www.nabble.com/How-to-implement-AJAX-search%7ELucene-Search-part--tf3887286.html


Interesting. Here is my idea: WildcardTermEnum (not a Query):

http://www.nabble.com/Re%3A-How-to-implement-AJAX-search%7ELucene-Search-part--p11027221.html


--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: Wildcards / Binary searches

2007-06-07 Thread Frédéric Glorieux




Sorry to jump in on a side note of the thread, but the topic touches on 
one of my needs of the moment.



Side Note: It's my opinion that "type ahead" or "auto complete" style
functionality is best addressed by customized logic (most likely using
specially built fields containing all of the prefixes of the key words up
to N characters as separate tokens).  


Do you mean something like below?
<field name="autocomplete">w wo wor word</field>
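The prefix tokens shown in that example can be generated like this. This is a plain-Java sketch of the idea (in practice it would live in a custom analyzer); the helper name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Emit every prefix of a word up to maxLen characters, as separate
// tokens: "word" -> [w, wo, wor, word].
public class PrefixTokens {
    static List<String> prefixes(String word, int maxLen) {
        List<String> out = new ArrayList<String>();
        for (int i = 1; i <= Math.min(maxLen, word.length()); i++) {
            out.add(word.substring(0, i));
        }
        return out;
    }
}
```

Indexing these tokens lets an exact TermQuery stand in for a PrefixQuery at autocomplete time, trading index size for query-time stability.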


simple uses of PrefixQueries are
only going to get you so far, particularly under heavy load or in an index
with a large number of unique terms.


For a bibliographic app with Lucene, I implemented a "suggest" on 
different fields (especially subject terms, like topic or place), to 
populate a form with already-used values. I used the Lucene IndexReader 
to get, very quickly, a list of terms in sorted order, without 
duplicate values.


http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexReader.html#terms(org.apache.lucene.index.Term)

There's a bad drawback to this approach: the enumeration is ordered by 
Term.compareTo(), and the sort order is natively ASCII, with uppercase 
before lowercase. I had to patch Lucene's Term.compareTo() for this 
project, definitely not a good practice for index portability. A 
duplicate field with an analyzer producing a sortable ASCII version 
would be better.
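Such a "sortable ASCII version" might be produced as in the sketch below, using java.text.Normalizer: decompose accented characters, strip the combining marks, and lowercase. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Fold a term to a sortable ASCII key: "École" -> "ecole", so that
// accented and uppercase terms interleave naturally in the term order.
public class SortKey {
    static String fold(String s) {
        // NFD separates base letters from their combining accent marks.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Drop the combining marks, then lowercase for case-insensitive order.
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ENGLISH);
    }
}
```

Indexing this key in a duplicate field avoids patching Term.compareTo() and keeps the index portable.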


Opinions of the list on this topic would be welcome.

--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: highlight and wildcards ?

2007-06-07 Thread Frédéric Glorieux

Xuesong (?),

Thanks a lot for your answer, and sorry not to have scanned the archives 
before. This is a really good and understandable reason, but sad for my 
project. Prefix queries will be the main activity of my users (they 
need to search Latin texts, so domin* is enough to match dominus 
or domino). So I need to investigate some more.


Xuesong Luo wrote:


Frédéric,
I asked a similar question several days ago; it seems we don't have a perfect solution when using a prefix wildcard with highlighting. Here is what Chris said:


in Solr 1.1, highlighting used the info from the raw query to do highlighting, hence in 
your query for consult* it would highlight the "Consult" part of "Consultant" even though the 
prefix query was matching the whole word.  In the trunk (soon to be Solr 1.2) Mike fixed 
that so the query is rewritten to its expanded form before highlighting is 
done ...
this works great for true wildcard queries (ie: cons*t* or cons?lt*), but 
Solr has an optimization for prefix queries (ie:
consult*) to reduce the likelihood of Solr crashing if the prefix matches a 
lot of terms ... unfortunately this breaks highlighting of prefix queries, and 
no one has implemented a solution yet...




--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: highlight and wildcards ?

2007-06-07 Thread Frédéric Glorieux

Same in my project. Chris does mention we can put a ? before the *; so instead 
of domin* you can use domin?*, but that requires at least one char 
following your search string.


Right, it works well, and one char is a detail.

With a?* I get the documented Lucene error:
"maxClauseCount is set to 1024"

http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/search/BooleanQuery.html#getMaxClauseCount()

I know that some of my users will want to find big lists of words or 
phrases with a common prefix, like "ante" for example.


I should evaluate RegexQuery.

--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: highlight and wildcards ?

2007-06-07 Thread Frédéric Glorieux

Hoss,

Thanks for all your information and pointers. I know that my problems 
are not mainstream.


ConstantScoreQuery @author yonik
  public void extractTerms(Set terms) {
// OK to not add any terms when used for MultiSearcher,
// but may not be OK for highlighting
  }
ConstantScoreRangeQuery @author yonik
ConstantScorePrefixQuery @author yonik

Maybe a kind of ConstantScoreRegexQuery will be part of my solution
for things like (ante|post).* (our users are linguists).

Score will be lost, but this is not a problem for this kind of user, who 
wants to read all matches of a pattern. For a highlighter, I should 
investigate your code to see where the regexp could be plugged in, 
without losing analyzers (which we also need; nothing is simple).
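The core of what such a query would do -- enumerate the ordered index terms and keep those matching a pattern -- can be sketched in plain Java over an in-memory term list. No real Lucene API is used here; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Walk an ordered list of terms (standing in for a Lucene TermEnum)
// and keep the terms fully matched by the regex.
public class RegexTerms {
    static List<String> matching(List<String> terms, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> out = new ArrayList<String>();
        for (String t : terms) {
            if (p.matcher(t).matches()) out.add(t);
        }
        return out;
    }
}
```

A constant-score variant would turn the surviving terms into a filter bitset instead of BooleanQuery clauses, which is what avoids the maxClauseCount limit.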


--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique




: With a?* I get the documented lucene error
: maxClauseCount is set to 1024

Which is why Solr converts PrefixQueries to ConstantScorePrefixQueries
that don't have that problem -- the trade-off being that they can't be
highlighted, and we're right back where we started.

It's a question of priorities.  In developing Solr, we prioritized
consistent stability regardless of query or index characteristics, and
highlighting of PrefixQueries suffered.  Working around that decision by
using wildcards may get highlighting working for you, but the stability
issue of the maxClauseCount is always going to be there (you can increase
maxClauseCount in the solrconfig, but there's always the chance that a user
will specify a wildcard that results in one more clause than you've
configured)

: I should evaluate RegexQuery.

for the record, i don't think that will help ... RegexQuery works just
like WildcardQuery but with a different syntax -- it rewrites itself to a
BooleanQuery containing all of the Terms in the index that match your
regex.


-Hoss





custom writer, working but... a strange exception in logs

2007-06-06 Thread Frédéric Glorieux


Hi all,

At first: a Lucene user for years, I really should thank you for Solr.

For a start, I wrote a little results writer for an app. It works as far 
as I understand Solr, except for a strange exception I'm not able to 
puzzle out.


Version : fresh subversion.
 1. Class
 2. stacktrace
 3. maybe ?

1. Class


public class HTMLResponseWriter implements QueryResponseWriter {
  public static String CONTENT_TYPE_HTML_UTF8 = "text/html; charset=UTF-8";
  /** A custom HTML header configured from solrconfig.xml */
  static String HEAD;
  /** A custom HTML footer configured from solrconfig.xml */
  static String FOOT;

  /** get some snippets from conf */
  public void init(NamedList n) {
    String s = (String) n.get("head");
    if (s != null && !"".equals(s)) HEAD = s;
    s = (String) n.get("foot");
    if (s != null && !"".equals(s)) FOOT = s;
  }

  public void write(Writer writer, SolrQueryRequest req,
      SolrQueryResponse rsp) throws IOException {
    // causes the exception below
    writer.write(HEAD);
    /* loop on my results, working like it should */
    // causes the exception below
    writer.write(FOOT);
  }

  public String getContentType(SolrQueryRequest request,
      SolrQueryResponse response) {
    return CONTENT_TYPE_HTML_UTF8;
  }
}

2. Stacktrace
=

GRAVE: org.apache.solr.core.SolrException: Missing required parameter: q
	at 
org.apache.solr.request.RequiredSolrParams.get(RequiredSolrParams.java:50)
	at 
org.apache.solr.request.StandardRequestHandler.handleRequestBody(StandardRequestHandler.java:72)
	at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)

at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:66)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
...

3. Maybe ?
==

I can't figure out why, but when writer.write(HEAD) is executed, I see 
code from StandardRequestHandler executed twice in the debugger: the 
first call is OK, the second doesn't have the q parameter. Displaying 
results is always OK. Without those lines there is only one call to 
StandardRequestHandler and no exception in the log, but also no head or 
foot. When the HEAD and FOOT values are hard-coded rather than 
configured, there's no exception. If HEAD and FOOT are not static, the 
problem is the same.


Is it a mistake in my code? Every piece of advice is welcome, and if 
I've hit a bug, be sure I will do my best to help.


--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: custom writer, working but... a strange exception in logs

2007-06-06 Thread Frédéric Glorieux


Thanks for answer,

I'm feeling less guilty.

 I don't see a non-null default for HEAD/FOOT... perhaps
 do   if (HEAD!=null) writer.write(HEAD);
 There may be an issue with how you register in solrconfig.xml

I get everything I want from solrconfig.xml; I was suspecting some 
classloader mystery. Following your advice from another post, I will 
write a specific request handler, so it will be easier to trace the 
problem, with a very simple first fix: stop sending the exception (to 
avoid gigabytes of logs).


--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


SOLVED Re: custom writer, working but... a strange exception in logs

2007-06-06 Thread Frédéric Glorieux


 I'm baffled.

[Yonik]
 I don't know why that would be... what is the client sending the request?
 If it gets an error, does it retry or something?

Good!
It's the favicon.ico effect.
Nothing in the logs when the class is requested from curl, but with a 
browser (here Opera), it begins an HTML response and then requests 
favicon.ico.




--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: SOLVED Re: custom writer, working but... a strange exception in logs

2007-06-06 Thread Frédéric Glorieux

Frédéric Glorieux wrote:


  I'm baffled.

[Yonik]
  I don't know why that would be... what is the client sending the 
request?

  If it gets an error, does it retry or something?

Good!


Nothing in the logs when the class is requested from curl,


Sorry, same idea, but it's a CSS link.

--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique