[jira] Closed: (LUCENE-477) Build an index which allows me to browse by category.

2005-12-06 Thread Erik Hatcher (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-477?page=all ]
 
Erik Hatcher closed LUCENE-477:
---

Resolution: Invalid

Yes, please bring this topic to the user list rather than JIRA.

> Build an index which allows me to browse by category.
> -
>
>  Key: LUCENE-477
>  URL: http://issues.apache.org/jira/browse/LUCENE-477
>  Project: Lucene - Java
> Type: Task
>   Components: Index
> Versions: 1.4
>  Environment: JDK 1.4, Windows 2003, Tomcat 5.0.28
> Reporter: Mark Dos Santos

>
> Hello there,
> I have a collection of documents that I am using Lucene to build an index
> for, and then I have a JSP app to search my documents. This all works great.
> I believe Lucene is such an amazing product, but that's a whole other topic.
> Anyway, maybe it's my lack of experience in building indexes, but I am having
> trouble coming up with an index that kind of mimics Verity's parametric
> index.  You see, my documents all have a category path (I have over 50,000
> docs).  A document can be at any level of the category path, and that same
> path can have many different documents. For example, Document x has a
> category path USA//New Jersey//Trenton//09890 and Document y has a category
> path USA//New Jersey//Trenton//09890.
> Basically, I would like to build an index using Lucene where, when I search,
> if my results were to bring back those two documents, I would like to
> retrieve the distinct category path for those two documents.  Of course I can
> loop through and build a vector with only the unique paths that come in the
> search results, but that obviously would take too long when I get, let's say,
> 1 results from my search.
> So the question, I guess, is: how can I build an index that would facilitate
> this functionality for me?  If anyone has any suggestions I would greatly
> appreciate it.
> Thanks,
> Mark

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-477) Build an index which allows me to browse by category.

2005-12-06 Thread Hoss Man (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-477?page=comments#action_12359495 ] 

Hoss Man commented on LUCENE-477:
-

This isn't a "bug" or a "feature" or a "task" as much as it is a "question"
about using Lucene in a particular way.  Questions generally receive more
comments on the java-user list (java-user at lucene dot apache dot org) than
they do when posted in JIRA.

In particular, you should search the mailing list archive for "facet" or
"faceted" before you ask the question; previous discussions may give you enough
info to solve your problem.
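
As a rough illustration of what "faceted" browsing means for the category paths in this issue, here is a minimal sketch in plain Java. It is not taken from the list archive and uses no Lucene API. It assumes the category path of every hit is already available as a string, and the class and method names are hypothetical. It counts hits per path and per parent level, and says nothing about how to fetch those paths cheaply from the index.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CategoryCountSketch {
    /**
     * Counts how many hits fall under each category path and each of its
     * parent levels. Paths use "//" as the level separator, as in the issue.
     */
    static Map<String, Integer> countByPrefix(List<String> hitPaths) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String path : hitPaths) {
            StringBuilder prefix = new StringBuilder();
            for (String level : path.split("//")) {
                if (prefix.length() > 0) {
                    prefix.append("//");
                }
                prefix.append(level);
                // One more hit under this level of the hierarchy.
                counts.merge(prefix.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For the two documents in the report, the distinct full path appears once as a key with a count of 2, next to one key for each parent level.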





Re: "Advanced" query language

2005-12-06 Thread Steven Rowe

For normal text data, with valid unicode characters that aren't legal
XML, I'd rather have a simple escaping mechanism.  Something like
backslash escaping that is easily understood.  Maybe something as
simple as \00 for &#0; (backslash followed by two hex digits).


Similar RFC for an extension to XML-RPC to enable the same thing 
(proposed syntax is "\n..n;", e.g. "\0;" for character zero):




-Steve




Re: "Advanced" query language

2005-12-06 Thread Steven Rowe

Yonik wrote:

For normal text data, with valid unicode characters that aren't legal
XML, I'd rather have a simple escaping mechanism.  Something like
backslash escaping that is easily understood.  Maybe something as
simple as \00 for &#0; (backslash followed by two hex digits).


I agree with your goal of transparency, especially for the cases of 
human authorship.


However, I don't agree with the idea of an application-specific escape 
syntax.  What if someone wants to use the query metacharacter(s) ('\' in 
your example) literally?  The usual answer is to escape the 
metacharacters, e.g. "\\00" to encode literal "\00".  But *especially* 
for the human-authored cases, introduction of this complexity is less 
than ideal.


An alternative mechanism could be empty XML elements, e.g.:



Or less verbosely, with a fixed set of element names (and there are 29
of these, right?: [#x00-#x08] | #x0B | #x0C | [#x0E-#x1F]):



  


-Steve




[jira] Created: (LUCENE-477) Build an index which allows me to browse by category.

2005-12-06 Thread Mark Dos Santos (JIRA)
Build an index which allows me to browse by category.
-

 Key: LUCENE-477
 URL: http://issues.apache.org/jira/browse/LUCENE-477
 Project: Lucene - Java
Type: Task
  Components: Index  
Versions: 1.4
 Environment: JDK 1.4, Windows 2003, Tomcat 5.0.28
Reporter: Mark Dos Santos


Hello there,

I have a collection of documents that I am using Lucene to build an index for,
and then I have a JSP app to search my documents. This all works great. I
believe Lucene is such an amazing product, but that's a whole other topic.
Anyway, maybe it's my lack of experience in building indexes, but I am having
trouble coming up with an index that kind of mimics Verity's parametric index.
You see, my documents all have a category path (I have over 50,000 docs).  A
document can be at any level of the category path, and that same path can have
many different documents. For example, Document x has a category path USA//New
Jersey//Trenton//09890 and Document y has a category path USA//New
Jersey//Trenton//09890.

Basically, I would like to build an index using Lucene where, when I search, if
my results were to bring back those two documents, I would like to retrieve the
distinct category path for those two documents.  Of course I can loop through
and build a vector with only the unique paths that come in the search results,
but that obviously would take too long when I get, let's say, 1 results from my
search.

So the question, I guess, is: how can I build an index that would facilitate
this functionality for me?  If anyone has any suggestions I would greatly
appreciate it.

Thanks,
Mark




Re: "Advanced" query language

2005-12-06 Thread Paul Elschot
On Tuesday 06 December 2005 03:20, Chris Hostetter wrote:
...
> 
> I can think of at least two big use cases that I'm concerned about
> 
> 1) Human creation
...
> 
> 2) Aliasing
> 
...

Meanwhile I scratched the surface of XSL, and I think it can allow
both simplification and aliasing in one go.

> 
> Especially if I can convince Yonik's boss to pay him to do all the hard
> work. :)
> 

My strategy now is to wait and see what XML structures will be introduced
and then try and define some XSL in front of these.

Regards,
Paul Elschot.





Re: "Advanced" query language

2005-12-06 Thread Wolfgang Hoschek
That's basically what I'm implementing with Nux, except that the  
syntax and calling conventions are a bit different, and that Lucene  
analyzers can optionally be specified, which makes it a lot more  
powerful (but also a bit more complicated).


Wolfgang.

On Dec 6, 2005, at 10:48 AM, Incze Lajos wrote:


Maybe I'm a bit late with this, but:

There is an ongoing effort at the W3C to define a full-text
search language that could extend their XPath and XQuery
languages (which clearly makes sense).

These are the current documents on the topic:

http://www.w3.org/TR/2005/WD-xquery-full-text-20051103/
http://www.w3.org/TR/2005/WD-xmlquery-full-text-use-cases-20051103/

incze

(In this case, the query language itself is not XML, since it has to
serve as a selection criterion in an XPath or XQuery expression,
but it is XML-conformant, so it may be embedded in any XML doc.)




Re: "Advanced" query language

2005-12-06 Thread Incze Lajos
Maybe I'm a bit late with this, but:

There is an ongoing effort at the W3C to define a full-text
search language that could extend their XPath and XQuery
languages (which clearly makes sense).

These are the current documents on the topic:

http://www.w3.org/TR/2005/WD-xquery-full-text-20051103/
http://www.w3.org/TR/2005/WD-xmlquery-full-text-use-cases-20051103/

incze

(In this case, the query language itself is not XML, since it has to
serve as a selection criterion in an XPath or XQuery expression,
but it is XML-conformant, so it may be embedded in any XML doc.)




Re: "Advanced" query language

2005-12-06 Thread Yonik Seeley
> Are you aware, though, of an existing Unicode serialization/markup
> mechanism without XML's gaps?

No, but I'm not advocating anything other than XML.  I'm just pointing
out a problem that needs to be solved.

> Base64 is frequently used as an escape mechanism for binary data in XML.

Yeah, but it's not necessarily binary data.  I just want to be able to
express all of unicode.

> One possible solution to the escaping issue is a standard optional
> attribute named "encoding",

It's an application level convention, not a standard, and it's still
not clear what is being encoded in base64.  Is it UTF-8, Java
characters, or true binary?

For normal text data, with valid unicode characters that aren't legal
XML, I'd rather have a simple escaping mechanism.  Something like
backslash escaping that is easily understood.  Maybe something as
simple as \00 for &#0; (backslash followed by two hex digits).
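
A minimal sketch of what such a backslash scheme could look like, in plain Java. The class and method names are hypothetical. It assumes that only the control characters XML 1.0 rejects (below 0x20, excluding tab, LF and CR) are escaped as a backslash plus two hex digits, and that a literal backslash is doubled so the encoding stays reversible. Neither detail was settled in this thread, and malformed input is not handled.

```java
public class BackslashEscapeSketch {
    /** Escapes XML-illegal control characters as \XX and doubles literal backslashes. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                sb.append("\\\\");
            } else if (c < 0x20 && c != 0x9 && c != 0xA && c != 0xD) {
                sb.append('\\').append(String.format("%02x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Reverses escape(); assumes the input was produced by escape(). */
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                sb.append(c);
            } else if (i + 1 < s.length() && s.charAt(i + 1) == '\\') {
                sb.append('\\');   // doubled backslash means one literal backslash
                i += 1;
            } else {
                // backslash followed by exactly two hex digits
                sb.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            }
        }
        return sb.toString();
    }
}
```

With two hex digits the scheme only reaches code points below 0x100, which covers the control characters but not U+FFFE, U+FFFF or unpaired surrogates.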


-Yonik




Re: renaming IndexReader.delete() ?

2005-12-06 Thread slagraulet
I certainly agree with this change!

We all have interest in having a clean and understandable API.




   
From: Erik Hatcher <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Date: 05/12/2005 23:43
Subject: Re: renaming IndexReader.delete() ?
Reply-To: [EMAIL PROTECTED]
   
   
   






On Dec 5, 2005, at 3:04 PM, Andi Vajda wrote:
> Since Lucene 1.9 is a long-awaited release where we're considering
> deprecating some APIs, how about renaming and deprecating
> IndexReader.delete() as follows:
>
>  - IndexReader.delete(int)  -> IndexReader.deleteDocument(int)
>  - IndexReader.delete(Term) -> IndexReader.deleteDocuments(Term)


+1

I volunteer to make this change if it is agreed upon.  I have a
vested interest in seeing GCJ work as well as possible :)
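
A rename like the one proposed above can be staged by keeping the old method as a deprecated wrapper that forwards to the new one. The following is a minimal sketch of that pattern on a toy class. It is not the Lucene source, and ToyReader and its fields are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class RenameSketch {
    /** Toy stand-in for a reader; it only tracks which document numbers were deleted. */
    static class ToyReader {
        private final Set<Integer> deleted = new HashSet<>();

        /** New, more descriptive name. */
        void deleteDocument(int docNum) {
            deleted.add(docNum);
        }

        /** Old name, kept so existing callers still compile, forwarding to the new one. */
        @Deprecated
        void delete(int docNum) {
            deleteDocument(docNum);
        }

        boolean isDeleted(int docNum) {
            return deleted.contains(docNum);
        }
    }
}
```

Callers of the old name keep working and get a deprecation warning at compile time, which leaves room to remove the old method in a later release.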

 Erik





SearchBlox adds REST-API and Spelling Suggestions in Version 3.1

2005-12-06 Thread Robert Selvaraj

SearchBlox Software has released Version 3.1 of its J2EE Content Search
Software. SearchBlox delivers out-of-the-box search functionality for quick
and easy integration with websites, applications, intranets and portals.
SearchBlox uses the Lucene Search API and incorporates integrated
HTTP/HTTPS, File System and Feed (RSS/Atom) crawlers, support for various
document formats including HTML, Word, PDF, PowerPoint and Excel, support
for indexing and searching content in 30 languages and customizable search
results, all controlled from a browser-based Admin Console. 

Main features in Version 3.1: 
=
- REST API (Free and Enterprise Editions Only) for indexing and deleting
custom content. The built-in browser-based SearchBlox Development
Environment provides developers with an easy-to-use interface to develop and
test using the REST API. 
- Spelling Suggestions based on the indexed content in the collection
- Support for selective indexing of content within HTML documents using
  or   tags 
- Support for JDK 1.5. 


SearchBlox is available as a Web Archive (WAR) and has been tested with all
major Java Application Servers. It is also available as a standalone
application for Windows and Mac OS X.

The SearchBlox FREE Edition is available free of charge and can index up to
1000 documents. 

The software can be downloaded from http://www.searchblox.com








Re: "Advanced" query language

2005-12-06 Thread Steven Rowe

Yonik Seeley wrote:

On 12/6/05, Erik Hatcher <[EMAIL PROTECTED]> wrote:

Also I'd be curious to see a problem with Unicode code points in XML,
if you have one handy.


The definition of valid XML 1.0 characters:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

The simplest example is code-point 0.  It's a valid unicode character,
but it's not a valid XML character (even when you replace it with an
entity).
Example: NullTerminated&#0;  is not valid XML


Are you aware, though, of an existing Unicode serialization/markup 
mechanism without XML's gaps?



I'm confident that XML can accommodate our needs just fine, and any
other text transmission would have to re-solve many issues that XML
has already solved.


Agreed.  It wasn't a blocker, but it was something I wanted to see
tackled up front.  It means adding a little more application logic to
handle escaping/unescaping.

The bottom line is I want to be able to represent the perfectly valid
lucene query new TermQuery(new Term("field","\u0000")).


Base64 is frequently used as an escape mechanism for binary data in XML. 
 It has the nice property that it can be used directly as XML character 
data, since its standard representation does not use any XML metacharacters.


One possible solution to the escaping issue is a standard optional 
attribute named "encoding", the value of which could be extensible, with 
value "base64" built into the initial implementation.  Then, unless the 
attribute is present, all data is taken literally.  E.g. (taking Yonik's 
example 'TermQuery(new Term("field","\u0000"))'):



  AA==


Note that this solution would limit the serialization syntax, though, 
because unless there is a single attribute name for possibly-escaped 
data (very unlikely, methinks), escapable text would only be 
representable as text node children of elements, and *not* as attribute 
values.
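
The base64 idea can be sketched in a few lines of Java. This assumes the term text is converted to UTF-8 bytes before encoding, which is an assumption of the sketch, not something the proposal specifies. The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64TermSketch {
    /** Encodes term text as base64 over its UTF-8 bytes. */
    static String encode(String termText) {
        return Base64.getEncoder().encodeToString(termText.getBytes(StandardCharsets.UTF_8));
    }

    /** Decodes base64 back to term text, again assuming UTF-8. */
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }
}
```

Under that assumption, encode("\u0000") gives "AA==", the value used in the example above, and decode("AA==") gives the single NUL character back.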


Steve




Re: "Advanced" query language

2005-12-06 Thread Yonik Seeley
On 12/6/05, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> Suppose a user of the Swing or RoR client enters "some phrase", who
> is responsible for analyzing that phrase so that it is suitable for
> PhraseQuery.add()?  Right?

Right, and even more.  The query one specifies may be morphed into
another type since the analysis phase can produce multiple tokens out
of one, and can produce multiple tokens at the same position (as well
as stopping out words, etc).

laptop
  with a synonym filter, would end up parsed into (using the current syntax)
foo:laptop foo:notebook

OR

WiFi
  with a case-change splitter could end up as
foo:wifi foo:"wi fi"

That's what I meant about a lot of the QueryParser logic needing to be
duplicated.
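
To make the laptop example concrete, here is a toy sketch in plain Java of analysis turning one input term into several query clauses. It does not use Lucene's Analyzer or QueryParser. The synonym table, class name and method names are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class AnalysisExpansionSketch {
    // Hypothetical synonym table standing in for a synonym filter.
    static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("laptop", Arrays.asList("laptop", "notebook"));
    }

    /** Toy analysis: lower-case the text, then expand it through the synonym table. */
    static List<String> analyze(String text) {
        String token = text.toLowerCase(Locale.ROOT);
        return SYNONYMS.getOrDefault(token, Arrays.asList(token));
    }

    /** Renders the analyzed tokens as clauses in the current query syntax. */
    static String toQueryString(String field, String text) {
        List<String> clauses = new ArrayList<>();
        for (String token : analyze(text)) {
            clauses.add(field + ":" + token);
        }
        return String.join(" ", clauses);
    }
}
```

Here toQueryString("foo", "laptop") produces foo:laptop foo:notebook, so a single term in the request ends up as two clauses once analysis has run.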

> I'm currently thinking that we want to support both the client and
> the server having this option.

The server side is a must.  Can't have all the clients having to
duplicate the logic and stay in sync with how things were indexed.



> A client definitely must be able to
> construct a phrase query in XML precisely with each term (all XML
> just for example, no endorsement implied):
>
> 
> 
> 
> 
>
> but we should also allow for the client to push the analysis
> responsibility to the server:

I agree that level of control is needed, but the server side analysis
still needs to be done right (stemming, stopping, etc)

There may be an option to not do analysis in the event that it was
already done (that's the easier part of all of this).

> 
> some phrase
> 

Hmmm, I had thought it working like the current QueryParser... you
give it an Analyzer, and it knows about how to treat the different
fields.  Explicitly specifying it is interesting, but should be
optional.

>
> Interesting topic, even if this isn't what you originally meant Yonik :)

I think it was what I meant... I realized this isn't just query
serialization.  To be useful, all the related analysis stuff must be
there.

-Yonik




Re: "Advanced" query language

2005-12-06 Thread Yonik Seeley
On 12/6/05, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> > example:  &#0; is not valid XML
> Can you give an example of a query that needs binary information?

It's never an absolute need - one could always work around the
problem, for sure.  The issue was more a desire to be able to
represent everything that *currently* works in lucene (as far as
queries go).

- hacking the bits of numerics directly into chunks (7 or 15 bits for example)
  (I actually do this)
- representing separation of values or sentences with a null byte

Previously, all I had to watch out for was UTF-16 surrogates: as long
as I stayed below 0xD800, everything worked fine.
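
The actual encoding is not shown in this message, so the following is only one possible way to pack a number into 15-bit chunks, written as a hedged Java sketch with hypothetical names. It flips the sign bit so that string order matches numeric order, and every resulting char stays below 0x8000 and therefore below the surrogate range.

```java
public class SortableIntSketch {
    /** Packs an int into three chars of at most 15 bits each, most significant first. */
    static String encode(int value) {
        // Flip the sign bit so negative numbers sort before positive ones.
        long u = (value ^ 0x80000000) & 0xFFFFFFFFL;
        char[] out = new char[3];
        out[0] = (char) (u >>> 30);             // top 2 bits
        out[1] = (char) ((u >>> 15) & 0x7FFF);  // middle 15 bits
        out[2] = (char) (u & 0x7FFF);           // low 15 bits
        return new String(out);
    }

    /** Reverses encode(). */
    static int decode(String s) {
        long u = ((long) s.charAt(0) << 30) | ((long) s.charAt(1) << 15) | s.charAt(2);
        return (int) u ^ 0x80000000;
    }
}
```

In this sketch encode(0) contains two NUL characters, which is exactly the kind of term text that plain XML cannot carry.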

> Also I'd be curious to see a problem with Unicode code points in XML,
> if you have one handy.

The definition of valid XML 1.0 characters:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

The simplest example is code-point 0.  It's a valid unicode character,
but it's not a valid XML character (even when you replace it with an
entity).
Example: NullTerminated&#0;  is not valid XML
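
The character definition above translates directly into a range check. A minimal Java sketch follows. The class and method names are hypothetical, and the ranges are those of the XML 1.0 Char production.

```java
public class XmlCharSketch {
    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Counts the code points below 0x20 that XML 1.0 rejects. */
    static int countIllegalControls() {
        int n = 0;
        for (int cp = 0; cp < 0x20; cp++) {
            if (!isXml10Char(cp)) {
                n++;
            }
        }
        return n;
    }
}
```

isXml10Char(0) is false, and countIllegalControls() returns 29: the 32 code points below 0x20 minus tab, LF and CR.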


> http://www.fawcette.com/javapro/2003_02/magazine/features/ehatcher/
> (must register to see the full article, unfortunately)
>
> I'm confident that XML can accommodate our needs just fine, and any
> other text transmission would have to re-solve many issues that XML
> has already solved.

Agreed.  It wasn't a blocker, but it was something I wanted to see
tackled up front.  It means adding a little more application logic to
handle escaping/unescaping.

The bottom line is I want to be able to represent the perfectly valid
lucene query new TermQuery(new Term("field","\u0000")).


-Yonik




Re: "Advanced" query language

2005-12-06 Thread DM Smith
One thing I like about the possibility of XML (as opposed to other 
syntax) is that I could create query templates and process them with 
XSLT. And I can do this client side and also in most modern browsers.
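
A small sketch of the template idea, using the XSLT support that ships with the JDK. No XML query format had been agreed on at this point in the thread, so the byAuthor and term element names and the field attribute are purely hypothetical, as is the class name.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class QueryTemplateSketch {
    // Hypothetical stylesheet: rewrites a <byAuthor> shorthand into a <term> element.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='byAuthor'>"
      + "<term field='author'><xsl:value-of select='.'/></term>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given XML and returns the transformed text. */
    static String expand(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not transform query template", e);
        }
    }
}
```

Calling expand("<byAuthor>hatcher</byAuthor>") yields a term element on the author field containing "hatcher". The same stylesheet could equally run in a browser's XSLT engine.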





Re: "Advanced" query language

2005-12-06 Thread mark harwood
> but we should also allow for the client to push the
> analysis  
> responsibility to the server:

Yet another variation we could support is to use the
existing QueryParser server-side for handling
user-typed input. On the client user input is unparsed
and combined with the lower-level constraints created
by application code e.g:



 

 "some phrase"

   
   
 
  
Java 
XML 
  
   










Re: "Advanced" query language

2005-12-06 Thread Erik Hatcher


On Dec 5, 2005, at 9:18 PM, Yonik Seeley wrote:


If we go with XML, I think this must be solved (or else we are at the
point where we can only represent a subset of queries that lucene can
handle again).


Hmmm, maybe it's not quite so serious if the XML represents a
pre-analyzed query vs post-analyzed.

This doesn't appear quite as simple as serialization of Query objects
as XML any more.  The analysis phase still needs to be done, right?
Still doable, but much of the QueryParser logic needs to be
duplicated.


I started to post that I was confused by what you mean by pre and  
post analyzed queries, but I think I understand after some pondering.


Suppose a user of the Swing or RoR client enters "some phrase", who  
is responsible for analyzing that phrase so that it is suitable for  
PhraseQuery.add()?  Right?


That's a great question, and to be honest one I hadn't considered.   
I'm currently thinking that we want to support both the client and  
the server having this option.  A client definitely must be able to  
construct a phrase query in XML precisely with each term (all XML  
just for example, no endorsement implied):







but we should also allow for the client to push the analysis  
responsibility to the server:



some phrase


Interesting topic, even if this isn't what you originally meant Yonik :)

Erik





Re: "Advanced" query language

2005-12-06 Thread Erik Hatcher


On Dec 5, 2005, at 9:07 PM, Yonik Seeley wrote:

There is one little problem with XML though...  Its inability to
directly represent binary data, or even all unicode code points (no,
entities don't fix this).  I use binary data in lucene to represent
some numerics, and that can't be represented in standard XML.  An
application specific escaping mechanism can be used, but then you are
a step away from standard XML.

example:  &#0; is not valid XML

If we go with XML, I think this must be solved (or else we are at the
point where we can only represent a subset of queries that lucene can
handle again).


Can you give an example of a query that needs binary information?   
Also I'd be curious to see a problem with Unicode code points in XML,  
if you have one handy.


Even something like setBoost(float f) isn't taking a String.  But the  
XML->Query mapping would translate  by parsing the  
boost attribute and calling setBoost appropriately.  Is this what you  
mean?
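
A minimal sketch of that kind of attribute-to-setter mapping, using only the JDK's DOM parser. ToyQuery stands in for a Lucene query class, and the element name used in the usage note is hypothetical. Only the boost attribute and the setBoost(float) call come from the discussion above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

public class BoostMappingSketch {
    /** Stand-in for a query class; only the boost setter matters here. */
    static class ToyQuery {
        private float boost = 1.0f;
        void setBoost(float b) { boost = b; }
        float getBoost() { return boost; }
    }

    /** Parses one XML element and maps its boost attribute onto setBoost(float). */
    static ToyQuery fromXml(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            ToyQuery q = new ToyQuery();
            if (root.hasAttribute("boost")) {
                // The attribute arrives as text; the mapping layer does the conversion.
                q.setBoost(Float.parseFloat(root.getAttribute("boost")));
            }
            return q;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a usable query element", e);
        }
    }
}
```

For example, fromXml("<term boost=\"2.5\">lucene</term>") returns a query whose boost is 2.5f, while an element without the attribute keeps the default of 1.0f.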


For example, Ant has this sort of type mapping capability built-in.
Here's some info on how that works:


	http://ant.apache.org/manual/develop.html#writingowntask  
("Conversions Ant will perform for attributes" section)


I described this in some more detail in a JavaPro article a couple of  
years ago also:


	http://www.fawcette.com/javapro/2003_02/magazine/features/ehatcher/  
(must register to see the full article, unfortunately)


I'm confident that XML can accommodate our needs just fine, and any  
other text transmission would have to re-solve many issues that XML  
has already solved.


Erik

