Hi Paul

My apologies for not acknowledging your message earlier.

I had not thought of indexing the same content twice, once into a field with 
the whitespace analyser and once into a field with the standard analyser, but 
that may be sufficient for our needs, at least to begin with. Then I can do a 
crude test on each search pattern to decide which field to query against.
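Just so I have it written down, here is roughly what I am picturing for the 
indexing side. This is only a sketch under my own assumptions: the field names 
`body_text` and `body_code` and the index path are placeholders of mine, and 
the exact package names can shift a little between Lucene versions.

```java
import java.nio.file.Paths;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class DualFieldIndexer {
  public static void main(String[] args) throws Exception {
    // One analyzer per field: standard for the prose view, whitespace for the code view.
    Analyzer perField = new PerFieldAnalyzerWrapper(
        new StandardAnalyzer(),
        Map.of("body_code", new WhitespaceAnalyzer()));

    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("index")),
        new IndexWriterConfig(perField))) {
      String content = "the abort() function, or the stop() function.";
      Document doc = new Document();
      // The same content is indexed twice, once under each field name.
      doc.add(new TextField("body_text", content, Field.Store.NO));
      doc.add(new TextField("body_code", content, Field.Store.NO));
      writer.addDocument(doc);
    }
  }
}
```

At query time I would then pick `body_text` or `body_code` (or both) depending 
on that crude test of the search pattern.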

When I get stuck I will be back!

Cheers
T

-----Original Message-----
From: Paul Libbrecht <p...@hoplahup.net.INVALID> 
Sent: Monday, 23 November 2020 21:23
To: java-user@lucene.apache.org
Subject: Re: Using Lucene for technical documentation

Hello Trevor,

I don’t know of an analyzer for mixes of code and text, but I do know of one 
for mixes of code and formulæ.

Clearly, you could build a custom analyzer that tokenizes differently 
depending on whether you’re in code or in text. That’s not super hard.
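To make that concrete, here is a rough sketch of the code-oriented side only: 
a whitespace tokenizer followed by a filter that strips sentence punctuation 
from the end of each token, so that `key.` becomes `key` while `abort()` and 
`!F1` survive. The class name and the regex are mine, not something Lucene 
ships, and package names differ a bit between Lucene versions.

```java
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceFilter;

public class CodeAwareAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    // Drop trailing sentence punctuation: "key." -> "key", "function," -> "function",
    // but keep "abort()" and "!F1" intact.
    TokenStream result = new PatternReplaceFilter(
        source, Pattern.compile("[.,;:]+$"), "", true);
    result = new LowerCaseFilter(result);
    return new TokenStreamComponents(source, result);
  }
}
```

A real mixed analyzer would additionally need to know where code starts and 
stops (e.g. from the document markup) and switch filters accordingly.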

However, where things get complicated is the mixing, and that shows up at 
query time at the latest: if you query `while` you want to find matches for 
both the exact word and the stemmed word. If you use Lucene for tasks other 
than search, however, this may be a problem (e.g. clustering, LSA…).
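One way to picture the query side of a two-field setup, again only as a 
sketch (the field names are the made-up ones from above, and the per-field 
analyzers must match what was used at index time):

```java
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.search.Query;

public class MixedQuery {
  public static void main(String[] args) throws Exception {
    // Re-use the same per-field analyzers as at index time.
    Analyzer perField = new PerFieldAnalyzerWrapper(
        new StandardAnalyzer(),
        Map.of("body_code", new WhitespaceAnalyzer()));

    // The user's input is parsed once per field; a match in either field counts.
    MultiFieldQueryParser parser = new MultiFieldQueryParser(
        new String[] { "body_text", "body_code" }, perField);
    Query q = parser.parse("while");
    System.out.println(q); // a boolean query with one clause per field
  }
}
```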

In the case of the formula-enabled search I built, the query modalities were 
different (two different input fields), so that you knew how to transform the 
query (for math, span queries were used).

I suspect you should decide on this first: if you want to just search and 
query over the mix, then I’d recommend simply using different field names, one 
with a whitespace analyzer and one with a standard analyzer. Later on, the 
code-oriented one can, say, enrich code tokens with alternative names (e.g. 
use “loop” as a weaker alternative of the “for” token). Solr and Lucene can do 
this really well (eDismax provides an easy parametrisation).
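For the enrichment part, a sketch of what I mean: a synonym filter in the 
code-side analyzer (I’d apply it at query time; the single “for” → “loop” 
mapping is just the example above, and the class name is mine):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

public class EnrichedCodeAnalyzer extends Analyzer {
  private final SynonymMap synonyms;

  public EnrichedCodeAnalyzer() throws Exception {
    SynonymMap.Builder builder = new SynonymMap.Builder(true);
    // Emit "loop" as an alternative token alongside the original "for".
    builder.add(new CharsRef("for"), new CharsRef("loop"), true);
    this.synonyms = builder.build();
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    TokenStream result = new SynonymGraphFilter(source, synonyms, true);
    return new TokenStreamComponents(source, result);
  }
}
```

Making the alternative genuinely weaker is then a matter of boosts; on the 
Solr side, eDismax’s `qf` parameter is where per-field weights would go, e.g. 
something like `qf=body_text^1.0 body_code^2.0`.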

But I’d be happy to read of others’ works on this!

In the W3C Math working group at the time, work stopped once we considered 
the complexity of compound documents: the alternatives above (mix the words, 
or recognise the math pieces?) certainly made things difficult.

paul


PS: [paper for my math search 
here](https://hoplahup.net/paul_pubs/AccessRetrievalAM.html). Please ask for 
the source code; it is old and built on Lucene 3.5, so it would need quite 
some upgrading.

On 23 Nov 2020, at 8:42, Trevor Nicholls wrote:

> Hello, I'd better begin by identifying myself as a newbie.
>
>
>
> I am investigating using Lucene as a search tool for a library of 
> technical documents, much of which consists of pieces of source code 
> and discussion of the content.
>
>
>
> The standard analyzer does an adequate job with normal text but strips 
> out non-alpha characters in code fragments; the whitespace analyzer 
> does an adequate job with source code but at the expense of treating 
> punctuation characters as significant text.
>
>
>
> As a couple of trivial examples, the line "The !F1 key." ideally needs 
> to be analyzed as [the] [!f1] [key]. The standard analyzer turns it 
> into [the] [f1] [key] while the whitespace analyzer turns it into 
> [the] [!f1] [key.].
>
>
>
> Similarly "the abort() function, or the stop() function." ideally 
> needs to be analyzed as [the] [abort()] [function] [or] [the] [stop()] 
> [function].
> But no analyzer will retain the parentheses while discarding the comma 
> and full stop.
>
>
>
> Are there examples of analyzers for technical documentation around, or 
> any helpful pointers? Or am I barking up a rotten tree here?
>
>
>
> cheers
>
> T


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
