XMLWriter escaping issue

2006-04-21 Thread Erik Hatcher
I encountered an escaping issue with XMLWriter.  Locally I've added  
the following test to BasicFunctionalityTest to demonstrate:


  public void testXMLWriter() throws Exception {
    // A response whose entry name contains double quotes
    SolrQueryResponse rsp = new SolrQueryResponse();
    rsp.add("\"quoted\"", "value");

    StringWriter writer = new StringWriter(32000);
    XMLWriter.writeResponse(writer, req("foo"), rsp);

    System.out.println("writer.toString() = " + writer.toString());

    // Parsing throws if the generated XML is not well-formed
    DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    builder.parse(new ByteArrayInputStream(
        writer.toString().getBytes("UTF-8")));
  }


Quotes within XML attributes cause invalid XML to be generated.
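
To see the failure in isolation, here is a minimal sketch that does not use Solr at all. The class name QuoteInAttributeDemo, the parses helper and the <str name="..."> element are assumptions chosen for illustration; only the JDK's built-in parser is used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class QuoteInAttributeDemo {

  /** Returns true if the JDK parser accepts the string as well-formed XML. */
  public static boolean parses(String xml) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // DefaultHandler rethrows fatal errors without printing them to stderr.
      builder.setErrorHandler(new DefaultHandler());
      builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return true;
    } catch (SAXException e) {
      return false; // not well-formed
    } catch (ParserConfigurationException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String name = "\"quoted\"";
    // Name written as-is: the embedded quote ends the attribute value early.
    String raw = "<str name=\"" + name + "\">value</str>";
    // Name written with the quote replaced by an entity reference.
    String escaped = "<str name=\"" + name.replace("\"", "&quot;") + "\">value</str>";

    System.out.println("raw parses:     " + parses(raw));
    System.out.println("escaped parses: " + parses(escaped));
  }
}
```

The raw form is rejected because the parser sees an empty name attribute followed by stray text inside the start tag.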

I've corrected this in my local copy with the patch below, which escapes
the value of the name attribute and adds &quot; to XML.chardata_escapes.
The question is: is it appropriate to escape quotes everywhere, or
should it be done only when writing attribute values?  It should be
fine to do it across the board for attribute values and element text,
but I wanted to verify that with solr-dev before committing it.


Comments?

Erik



Index: src/java/org/apache/solr/request/XMLWriter.java
===
--- src/java/org/apache/solr/request/XMLWriter.java (revision 395873)
+++ src/java/org/apache/solr/request/XMLWriter.java (working copy)
@@ -178,7 +178,7 @@
 writer.write(tag);
 if (name!=null) {
   writer.write(" name=\"");
-  writer.write(name);
+  XML.escapeCharData(name, writer);
   if (closeTag) {
 writer.write("\"/>");
   } else {
Index: src/java/org/apache/solr/util/XML.java
===
--- src/java/org/apache/solr/util/XML.java  (revision 395873)
+++ src/java/org/apache/solr/util/XML.java  (working copy)
@@ -32,7 +32,7 @@
   // many chars less than 0x20 are *not* valid XML, even when escaped!
   // for example, &#0; is invalid XML.
   private static final String[] chardata_escapes=
-  {"#0;","#1;","#2;","#3;","#4;","#5;","#6;","#7;","#8;",null,null,"#11;","#12;",null,"#14;","#15;","#16;","#17;","#18;","#19;","#20;","#21;","#22;","#23;","#24;","#25;","#26;","#27;","#28;","#29;","#30;","#31;",null,null,null,null,null,null,"&amp;",null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,"&lt;"};
+  {"#0;","#1;","#2;","#3;","#4;","#5;","#6;","#7;","#8;",null,null,"#11;","#12;",null,"#14;","#15;","#16;","#17;","#18;","#19;","#20;","#21;","#22;","#23;","#24;","#25;","#26;","#27;","#28;","#29;","#30;","#31;",null,null,"&quot;",null,null,null,"&amp;",null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,"&lt;"};




Re: XMLWriter escaping issue

2006-04-21 Thread Yonik Seeley
On 4/21/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> I've corrected this in my local copy with this patch adding the
> escaping to attribute names and the " to XML.chardata_escapes.
> The question is, is it appropriate to escape quotes everywhere, or
> should it just be done when writing attribute values?

I'd prefer just escaping quotes in attribute values, as it makes things
like debugging output that contains query strings easier to read, and
easier to paste back into the query box for debugging from someone
else's output.

The attribute values definitely need to be XML escaped though.

-Yonik


patch submission

2006-04-21 Thread Mike Baranczak
First of all, I want to thank all of you. I was just assigned to  
write a Lucene-based search server, when I found out that such a  
system already existed and its name was Solr. You just saved me a lot  
of work.


My question: what's the preferred way to submit patches? Should I  
just send them to this list?


-MB




Re: patch submission

2006-04-21 Thread Yoav Shapira
Hola,

> My question: what's the preferred way to submit patches? Should I
> just send them to this list?

Post them to our issue tracker, JIRA, at
http://issues.apache.org/jira/browse/SOLR.  Consult
http://www.apache.org/dev/contributors.html for actual tips like
preferred patch format if unsure.  And thanks in advance for
contributing ;)

Have a great weekend,

Yoav


[jira] Created: (SOLR-13) patch to Config class; improves the loading of resources from classpath

2006-04-21 Thread Mike Baranczak (JIRA)
patch to Config class; improves the loading of resources from classpath
---

 Key: SOLR-13
 URL: http://issues.apache.org/jira/browse/SOLR-13
 Project: Solr
Type: Improvement

 Environment: Mac OS 10.4.6, Java 1.5.0_06, JBoss 4.0.1 
Reporter: Mike Baranczak
Priority: Minor


If config files aren't found in the expected places, Config attempts to find 
them on the classpath. The trouble is, it's using the current thread's context 
classpath, which means that the web application's own classpath is ignored.
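
The attached config.patch is not reproduced in this thread, so the following is not that patch. It is only a sketch of one common way to widen a classpath lookup: try the thread's context class loader first, then fall back to the loader that loaded the calling class. The class name ResourceLookupSketch and the method openResource are hypothetical.

```java
import java.io.InputStream;

public class ResourceLookupSketch {

  /**
   * Looks for a classpath resource, first through the thread's context
   * class loader and then through the loader of this class.
   * Returns null if neither loader can find it.
   */
  public static InputStream openResource(String name) {
    ClassLoader context = Thread.currentThread().getContextClassLoader();
    InputStream in = (context != null) ? context.getResourceAsStream(name) : null;
    if (in == null) {
      // In a web application this is typically the webapp's own class loader.
      ClassLoader own = ResourceLookupSketch.class.getClassLoader();
      if (own != null) {
        in = own.getResourceAsStream(name);
      }
    }
    return in;
  }

  public static void main(String[] args) {
    // "solrconfig.xml" is just an example name; null means it was not found.
    System.out.println(openResource("solrconfig.xml") != null);
  }
}
```

Which loader should take precedence depends on the container, so the order used here is an assumption rather than a recommendation.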

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (SOLR-13) patch to Config class; improves the loading of resources from classpath

2006-04-21 Thread Mike Baranczak (JIRA)
 [ http://issues.apache.org/jira/browse/SOLR-13?page=all ]

Mike Baranczak updated SOLR-13:
---

Attachment: config.patch

patch submitted by Mike Baranczak of epublishing.com






Re: XMLWriter escaping issue

2006-04-21 Thread Erik Hatcher
I've committed a change to escape attributes and character data  
differently, all tests pass.  Let me know if there are any issues  
with it and I'd be happy to address them.
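
The committed change itself is not shown in this thread. As a rough sketch of what escaping the two contexts differently can look like, the hypothetical class below replaces & and < in character data, and additionally replaces the double quote in attribute values. It ignores control characters and the "]]>" sequence, both of which a complete implementation would also have to handle.

```java
public class XmlEscapeSketch {

  /** Escapes text that appears between tags: only '&' and '<' are replaced. */
  public static String escapeCharData(String s) {
    return escape(s, false);
  }

  /** Escapes text for a double-quoted attribute value: '&', '<' and '"'. */
  public static String escapeAttributeValue(String s) {
    return escape(s, true);
  }

  private static String escape(String s, boolean escapeQuotes) {
    StringBuilder sb = new StringBuilder(s.length() + 16);
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '&') {
        sb.append("&amp;");
      } else if (c == '<') {
        sb.append("&lt;");
      } else if (c == '"' && escapeQuotes) {
        sb.append("&quot;");
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String query = "title:\"solr\" AND a<b";
    // Quotes stay readable in element text, as Yonik preferred...
    System.out.println("<str>" + escapeCharData(query) + "</str>");
    // ...but are escaped where they would end the attribute value.
    System.out.println("<str name=\"" + escapeAttributeValue(query) + "\"/>");
  }
}
```

Keeping quotes unescaped in element text is what makes debugging output with query strings easy to read and paste back.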


Erik






multiple same-named parameters

2006-04-21 Thread Erik Hatcher
In my application I'm POSTing queries to Solr, and need to
specify any number of "constraints" (which correspond to BitSet
filters that get combined into a BitDocSet as a filter on the
query).  Rather than have the client uniquely name each of the
parameters, I want to name them all "constraint" in the request.  The
servlet API supports this, but SolrQueryRequest currently does not.
I've patched my local copy with the following and it all works fine.
For LocalSolrQueryRequest, since it is using a Map for the arguments
anyway, I simply added the new getParams method to return a
single-item array.


Any objections to me committing this?

Thanks,
Erik



Index: src/java/org/apache/solr/request/LocalSolrQueryRequest.java
===
--- src/java/org/apache/solr/request/LocalSolrQueryRequest.java (revision 395947)
+++ src/java/org/apache/solr/request/LocalSolrQueryRequest.java (working copy)
@@ -70,6 +70,10 @@
 return (String)args.get(name);
   }
+  public String[] getParams(String name) {
+return new String[] {(String)args.get(name)};
+  }
+
   public String getQueryString() {
 return query;
   }
Index: src/java/org/apache/solr/request/SolrQueryRequest.java
===
--- src/java/org/apache/solr/request/SolrQueryRequest.java (revision 395947)
+++ src/java/org/apache/solr/request/SolrQueryRequest.java (working copy)
@@ -31,6 +31,8 @@
   public String getParam(String name);
+  public String[] getParams(String name);
+
   public String getQueryString();
   // signifies the syntax and the handler that should be used
Index: src/webapp/src/org/apache/solr/servlet/SolrServletRequest.java
===
--- src/webapp/src/org/apache/solr/servlet/SolrServletRequest.java (revision 395947)
+++ src/webapp/src/org/apache/solr/servlet/SolrServletRequest.java (working copy)
@@ -25,7 +25,11 @@
 return req.getParameter(name);
   }
+  public String[] getParams(String name) {
+return req.getParameterValues(name);
+  }
+
   public String getParamString() {
 StringBuilder sb = new StringBuilder(128);
 try {
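
For readers unfamiliar with repeated parameters, this sketch shows how a request body such as q=foo&constraint=a&constraint=b groups into one name with several values, which is what getParameterValues exposes in the servlet API. The class RepeatedParamsSketch and its parse method are hypothetical and stand in for the container's own parsing.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RepeatedParamsSketch {

  /** Groups the values of a form-encoded string by parameter name, in order. */
  public static Map<String, List<String>> parse(String query) {
    Map<String, List<String>> params = new LinkedHashMap<>();
    if (query == null || query.isEmpty()) {
      return params;
    }
    for (String pair : query.split("&")) {
      if (pair.isEmpty()) {
        continue;
      }
      int eq = pair.indexOf('=');
      String name = decode(eq < 0 ? pair : pair.substring(0, eq));
      String value = decode(eq < 0 ? "" : pair.substring(eq + 1));
      // Repeated names accumulate instead of overwriting each other.
      params.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }
    return params;
  }

  private static String decode(String s) {
    try {
      return URLDecoder.decode(s, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e); // UTF-8 is always available
    }
  }

  public static void main(String[] args) {
    Map<String, List<String>> p = parse("q=foo&constraint=a&constraint=b%20c");
    System.out.println(p.get("q"));
    System.out.println(p.get("constraint"));
  }
}
```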



Re: multiple same-named parameters

2006-04-21 Thread Chris Hostetter

: For LocalSolrQueryRequest, since it is using a Map for the arguments
: anyway, I simply added the new getParams method to return a single
: item array.

I would suggest that LocalSolrQueryRequest be modified a little more so
that if the value of an entry in the Map is an array, then getParams
returns the array, and getParam returns the first item in the array.
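
A minimal sketch of that suggestion, using a hypothetical stand-alone class rather than the real LocalSolrQueryRequest. Returning null for a missing parameter is an assumption made here to match what getParameterValues does in the servlet API.

```java
import java.util.HashMap;
import java.util.Map;

public class LocalParamsSketch {

  // Values are either a String or a String[]; anything else is stringified.
  private final Map<String, Object> args = new HashMap<>();

  /** Hypothetical helper for filling the map in this example. */
  public void put(String name, Object value) {
    args.put(name, value);
  }

  /** Returns all values for the name, or null if the name is absent. */
  public String[] getParams(String name) {
    Object v = args.get(name);
    if (v == null) {
      return null;
    }
    if (v instanceof String[]) {
      return (String[]) v;
    }
    return new String[] { v.toString() };
  }

  /** Returns the first value for the name, or null if there is none. */
  public String getParam(String name) {
    String[] values = getParams(name);
    return (values == null || values.length == 0) ? null : values[0];
  }

  public static void main(String[] a) {
    LocalParamsSketch req = new LocalParamsSketch();
    req.put("q", "foo");
    req.put("constraint", new String[] { "type:book", "year:2006" });
    System.out.println(req.getParam("constraint"));
    System.out.println(req.getParams("q").length);
  }
}
```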




-Hoss



Re: exception in rendering /select XML

2006-04-21 Thread Erik Hatcher
I've finally revisited this issue.  I switched to Tomcat (5.5.16) and  
all is well, so it certainly appears as if this is a Jetty issue.   
*sigh*


I'll look into whether there is a newer version of Jetty and if that  
fixes this.


I did make some improvements to text encoding in my indexing process,  
which is actually quite involved: RDF files, parsed by Java, some  
pointers to URLs that get fetched via HttpClient, and then packaged  
into a JDOM XML DOM, serialized to a String, and then sent to Solr  
via HttpClient.  Even if I had the wrong encoding somewhere along the  
way, if a valid String is retrieved from a Lucene Field it should be  
serializable again, at least I'm assuming so - so it is at least  
reassuring that the bug is in Jetty and not in my complicated process.


Erik


On Apr 12, 2006, at 10:24 AM, Yonik Seeley wrote:


On 4/12/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:

: > The weird thing is that the last Solr line in the trace is
: > org.apache.solr.util.XML.escapeCharData(XML.java:100)
: >
: > 99    if (start==0) {
: > 100     out.write(str);


Thanks, I had missed that.  I just verified that line 100 is the same
in both versions of the file, so the most likely explanation is a
corrupt string (the string might end in the first char of a multi char
character) that triggers the exception in Sun's UTF-8 encoder.

So the question then is, how did this bad string come about?
Chris' guess about a bad charset somewhere is probably right.
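
As an illustration of the kind of string described above (not a reproduction of the actual failure, whose cause is not established in this thread), the sketch below shows that a strict UTF-8 encoder rejects a Java String ending in the first half of a surrogate pair. The class and method names are hypothetical.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class LoneSurrogateDemo {

  /** Returns true if a strict UTF-8 encoder accepts the whole string. */
  public static boolean encodesCleanly(String s) {
    try {
      // A fresh CharsetEncoder reports malformed input by throwing,
      // unlike String.getBytes, which silently substitutes a replacement.
      StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    char high = (char) 0xD83D; // first char of a two-char (surrogate) pair
    char low = (char) 0xDE00;  // matching second char

    System.out.println(encodesCleanly("abc"));              // plain ASCII
    System.out.println(encodesCleanly("abc" + high + low)); // complete pair
    System.out.println(encodesCleanly("abc" + high));       // truncated pair
  }
}
```

A check like this could be run on field values before indexing to find where a truncated character enters the pipeline.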

-Yonik




commit

2006-04-21 Thread jason rutherglen
In using Solr I've found the need to have a reload command in addition to a 
commit.  The reason for this is sometimes updates are made but are not 
available via the server.  The commit makes a snapshot which on a large index 
is a potentially expensive operation.  Is there a way to do reload today?



Re: commit

2006-04-21 Thread Chris Hostetter

: In using Solr I've found the need to have a reload command in addition
: to a commit.  The reason for this is sometimes updates are made but are
: not available via the server.  The commit makes a snapshot which on a
: large index is a potentially expensive operation.  Is there a way to do
: reload today?

I'm a little confused, here's a bunch of thoughts on your email in no
particular order...


Making snapshots should be really really really fast and easy -- it's just
creating hardlinks to files, so it shouldn't take very long ... are you
sure it's really an issue?


Generally speaking, all a "commit" operation really is is an instruction
to:
  1) close/reopen the current writer/reader used for adds/deletes
  2) open a new reader for searches
  3) close the older reader for searches once all currently processing
     requests finish.

...the concept of "reloading" the index really requires that all three of
those things happen -- so in my mind that's what a commit is.
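
Steps 2 and 3 can be pictured with a small reference-counting sketch. This is purely illustrative and is not Solr's implementation; SearcherSwapSketch and its nested Searcher class are hypothetical stand-ins for whatever holds the index reader.

```java
public class SearcherSwapSketch {

  /** Stand-in for an index reader/searcher; not a Solr or Lucene class. */
  public static final class Searcher {
    public final String name;
    int refs = 1;                    // the holder itself owns one reference
    volatile boolean closed = false;

    Searcher(String name) {
      this.name = name;
    }

    public boolean isClosed() {
      return closed;
    }
  }

  private int generation = 0;
  private Searcher current = new Searcher("gen-0");

  /** A request borrows the current searcher and must release it later. */
  public synchronized Searcher acquire() {
    current.refs++;
    return current;
  }

  /** Gives a reference back; the last reference closes the searcher. */
  public synchronized void release(Searcher s) {
    if (--s.refs == 0) {
      s.closed = true;
    }
  }

  /** Opens a new searcher and retires the old one. */
  public synchronized void commit() {
    Searcher old = current;
    current = new Searcher("gen-" + (++generation));
    // Drops only the holder's reference, so the old searcher stays open
    // until every in-flight request has released it.
    release(old);
  }

  public static void main(String[] args) {
    SearcherSwapSketch holder = new SearcherSwapSketch();
    Searcher inFlight = holder.acquire();    // a request starts
    holder.commit();                         // new searcher becomes current
    System.out.println(inFlight.isClosed()); // request still running
    holder.release(inFlight);                // request finishes
    System.out.println(inFlight.isClosed());
  }
}
```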

The Solr app won't create snapshots automatically -- it will only do that
if you have a call to the snapshooter script registered as part of a
listener -- it sounds like you have a postCommit event listener which does
this.  You might want to turn that off, or change it to a postOptimize
event listener so that snapshots are only made when you optimize -- or
you could not have any listeners, and just run snapshooter yourself
whenever you want a snapshot.

(Bill/Yonik: sanity check me here: there's no reason snapshooter can't be
run manually right?)

When registering a listener, there is also a "wait" option that controls
whether the operation will block until the listener is done ...  i don't
know if there's any particular reason why the example for snapshooter has
wait=true, but i think you can change that to false if you think
snapshooting is taking too long (again: bill/yonik, am i wrong?)

Another thing that might take a while when doing a commit -- separate from
snapshooting -- is the (auto)warming of the various caches that happens
when opening the new reader for searching.  if you are doing lots of
commits at a rapid rate because you really want the newly added docs to
appear right away, you may want to turn off any newSearcher listeners you
have, and change the autowarm count on your caches to be 0.


-Hoss



Re: commit

2006-04-21 Thread jason rutherglen
Is there a way to decouple the snapshot creation from the index reloading 
currently?  If not I was going to build it in.  We have a 700 meg index, so 
creating a snapshot basically copies that, and after several snapshots takes up 
a lot of storage.  Sometimes I just want to see a change show up on the master, 
sometimes I want to create a snapshot for the slave servers.  This was very 
confusing when I first started using Solr.  

Thanks,

Jason
