AW: Lexical analysis tools for German language data

2012-04-13 Thread Michael Ludwig
 Von: Tomas Zerolo

   There can be transformations or inflections, like the "s" in
   "Weihnachtsbaum" (Weihnachten/Baum).
 
  I remember from my linguistics studies that the terminus technicus
  for these is Fugenmorphem (interstitial or joint morpheme) [...]
 
 IANAL (I am not a linguist -- pun intended ;) but I've always read
 that as a genitive. Any pointers?

Admittedly, that's what you'd think, and despite linguistics telling me
otherwise I'd maintain there's some truth in it. For this case, however,
consider: die Weihnacht declines like die Nacht, so:

nom. die Weihnacht
gen. der Weihnacht
dat. der Weihnacht
akk. die Weihnacht

As you can see, there's no "s" to be found anywhere, not even in the
genitive. But my gut feeling, like yours, is that this should indicate
genitive, and I would make a point of well-argued gut feeling being at
least as relevant as formalist analysis.

Michael


Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
Given an input of "Windjacke" (probably "wind jacket" in English), I'd
like the code that prepares the data for the index (tokenizer etc) to
understand that this is a "Jacke" (jacket) so that a query for "Jacke"
would include the "Windjacke" document in its result set.

It appears to me that such an analysis requires a dictionary-backed
approach, which doesn't have to be perfect at all; a list of the most
common 2000 words would probably do the job and fulfil a criterion of
reasonable usefulness.

Do you know of any implementation techniques or working implementations
to do this kind of lexical analysis for German language data? (Or other
languages, for that matter?) What are they, where can I find them?

I'm sure there is something out there (commercial or free) because I've
seen lots of engines grokking German and the way it builds words.

Failing that, what are the proper terms to refer to these techniques so
you can search more successfully?

Michael


AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
 Given an input of "Windjacke" (probably "wind jacket" in English),
 I'd like the code that prepares the data for the index (tokenizer
 etc) to understand that this is a "Jacke" (jacket) so that a
 query for "Jacke" would include the "Windjacke" document in its
 result set.
 
 It appears to me that such an analysis requires a dictionary-
 backed approach, which doesn't have to be perfect at all; a list
 of the most common 2000 words would probably do the job and fulfil
 a criterion of reasonable usefulness.

A simple approach would obviously be a word list and a regular
expression. There will, however, be nuts and bolts to take care of.
A more sophisticated and tested approach might be known to you.
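
To make the word-list idea concrete, here is a minimal sketch in Java.
The mini-dictionary and the plain suffix match are assumptions of mine;
Fugenmorpheme and the other nuts and bolts are deliberately ignored:

  import java.util.Arrays;
  import java.util.List;

  /** Naive decompounder: find the known base word a compound ends with. */
  public class NaiveDecompounder {

      // Hypothetical mini-dictionary; a real one would hold the ~2000
      // most common base words for the domain.
      private static final List<String> DICT =
          Arrays.asList("jacke", "baum", "musik");

      /** Returns the base word if token is a compound ending in one, else null. */
      public static String baseWord(String token) {
          String t = token.toLowerCase();
          for (String base : DICT) {
              // Require a non-empty prefix, so "Jacke" itself is no compound.
              if (t.length() > base.length() && t.endsWith(base)) {
                  return base;
              }
          }
          return null;
      }

      public static void main(String[] args) {
          System.out.println(baseWord("Windjacke"));      // jacke
          System.out.println(baseWord("Weihnachtsbaum")); // baum
          System.out.println(baseWord("Jacke"));          // null
      }
  }

Hooked into the analysis chain, the returned base word would simply be
emitted as an additional token.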

Michael


AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
 Von: Valeriy Felberg

 If you want the query "jacke" to match a document containing the word
 "windjacke" or "kinderjacke", you could use a custom update processor.
 This processor could search the indexed text for words matching the
 pattern ".*jacke" and inject the word "jacke" into an additional field
 which you can search against. You would need a whole list of possible
 suffixes, of course.

Merci, Valeriy - I agree on the feasibility of such an approach. The
list would likely have to be composed of the most frequently used terms
for your specific domain.
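
For what it's worth, a rough sketch of Valeriy's suggestion against
Solr's UpdateRequestProcessor API - the field names are made up, and
the crude suffix check stands in for real decompounding:

  import java.io.IOException;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;

  /** Injects base words such as "jacke" into an extra field at index time. */
  public class BaseWordInjector extends UpdateRequestProcessor {

      public BaseWordInjector(UpdateRequestProcessor next) {
          super(next);
      }

      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
          SolrInputDocument doc = cmd.getSolrInputDocument();
          Object text = doc.getFieldValue("text");   // hypothetical source field
          if (text != null) {
              for (String token : text.toString().split("\\s+")) {
                  String t = token.toLowerCase();
                  // Crude stand-in for matching against a real suffix list.
                  if (t.length() > "jacke".length() && t.endsWith("jacke")) {
                      doc.addField("base_words", "jacke");  // hypothetical field
                  }
              }
          }
          super.processAdd(cmd);  // hand the document on down the chain
      }
  }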

In our case, it's things people would buy in shops. Reducing overly
complicated and convoluted product descriptions to proper basic terms -
that would do the job. It's like going to a restaurant boasting fancy
and unintelligible names for the dishes you may order when they are
really just ordinary stuff like pork and potatoes.

Thinking some more about it, giving sufficient boost to the attached
category data might also do the job. That would shift the burden of
supplying proper semantics to the guys doing the categorization.

 It would slow down the update process but you don't need to split
 words during search.

  Le 12 avr. 2012 à 11:52, Michael Ludwig a écrit :
 
  Given an input of "Windjacke" (probably "wind jacket" in English),
  I'd like the code that prepares the data for the index (tokenizer
  etc) to understand that this is a "Jacke" (jacket) so that a
  query for "Jacke" would include the "Windjacke" document in its
  result set.

A query for "Windjacke" or "Kinderjacke" would probably not have to be
de-specialized to "Jacke" because, well, that's the user input and users
looking for specific things are probably doing so for a reason. If no
matches are found you can still tell them to just broaden their search.

Michael


AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
 Von: Markus Jelsma

 We've done a lot of tests with the HyphenationCompoundWordTokenFilter
 using an FOP XML file generated from TeX for the Dutch language and
 have seen decent results. A bonus was that now some tokens can be
 stemmed properly because not all compounds are listed in the
 dictionary for the HunspellStemFilter.

Thank you for pointing me to these two filter classes.
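
For reference, wiring the hyphenation-based filter into a schema.xml
field type might look roughly like this - file names and parameter
values are placeholders, so check the factory documentation for the
exact attributes:

  <fieldType name="text_compound" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
              hyphenator="hyph_de.xml" dictionary="dictionary-de.txt"
              minSubwordSize="4" onlyLongestMatch="true"/>
    </analyzer>
  </fieldType>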

 It does introduce a recall/precision problem but it at least returns
 results for those many users that do not properly use compounds in
 their search query.

Could you define what the term "recall" should be taken to mean in this
context? I've also encountered it on the BASIStech website. Okay, I
found a definition:

http://en.wikipedia.org/wiki/Precision_and_recall

Dank je wel!

Michael


AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
 Von: Walter Underwood

 German noun decompounding is a little more complicated than it might
 seem.
 
 There can be transformations or inflections, like the "s" in
 "Weihnachtsbaum" (Weihnachten/Baum).

I remember from my linguistics studies that the terminus technicus for
these is "Fugenmorphem" (interstitial or joint morpheme). But there are
not many of them - phrased as a regex, it's /e?[ns]/. The Weihnachtsbaum
in the example above is formed from the singular (die Weihnacht), then
"s", then "Baum". Still, it's much more complex than, say, English or
Italian.

 Internal nouns should be recapitalized, like Baum above.

Casing won't matter for indexing, I think. The way I would go about
obtaining stems from compound words is by using a dictionary of stems
and a regex. We'll see how far that'll take us.

 Some compounds probably should not be decompounded, like Fahrrad
 (fahren/Rad). With a dictionary-based stemmer, you might decide to
 avoid decompounding for words in the dictionary.

Good point.

 Note that highlighting gets pretty weird when you are matching only
 part of a word.

Guess it'll be weird when you get it wrong, like matching "Noten" in
"Notentriegelung" (which is Not + Entriegelung).

 Luckily, a lot of compounds are simple, and you could well get a
 measurable improvement with a very simple algorithm. There isn't
 anything complicated about compounds like Orgelmusik or
 Netzwerkbetreuer.

Exactly.

 The Basis Technology linguistic analyzers aren't cheap or small, but
 they work well.

We will consider our needs and options. Thanks for your thoughts.

Michael


Re: Implementing PhraseQuery and MoreLikeThis Query in one app

2009-07-02 Thread Michael Ludwig

SergeyG schrieb:


Can both queries - PhraseQuery and MoreLikeThis Query - be implemented
in the same app, taking into account the fact that for the former to
work the stop words list needs to be included, and this results in the
latter putting stop words among the most important words?


Why would the inclusion of a stopword list result in stopwords being of
top importance in the MoreLikeThis query?

Michael Ludwig


Re: Search for phrase including prepositions

2009-07-01 Thread Michael Ludwig

akinori schrieb:

When I search "make for", solr returns words that include both "make"
and "for", but when I type more than 3 words, such as "in order to",
the result becomes 0, though the index is sure to have several entries
including all 3 of the words. 2 words are OK, but more than 3 words
yield zero results. Why does this happen?


Hi Akinori,

I guess you're using the DisMax query parser. Please read this entire
page: http://wiki.apache.org/solr/DisMaxRequestHandler

The parameter that allows you to tweak this is the mm parameter.
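
For example, to require only two out of three query terms to match,
something along these lines should work (values illustrative):

  q=in order to&qt=dismax&mm=2

mm also accepts conditional specs such as 2<-1, i.e. for queries with
more than two terms, all but one must match; the wiki page documents
the full syntax.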

Michael Ludwig


Re: Installing a patch in a solr nightly on Windows

2009-07-01 Thread Michael Ludwig

Koji Sekiguchi schrieb:

I'm not a Windows user, but I think you can use Linux command (e.g.
patch, to apply SOLR-284 patch to Solr nightly build) on cygwin
environment.


The standalone patch utility for Win32 is another option.

http://gnuwin32.sourceforge.net/packages/patch.htm

Michael Ludwig


Re: Monitor search traffic

2009-07-01 Thread Michael Ludwig

Gurjot Singh schrieb:

Hi,
Is there a way to monitor the number of search queries made on the
solr index?


http://localhost:8983/solr/admin/stats.jsp

Look for "requests :".

Michael Ludwig


Re: spelling suggestion in solr.

2009-06-30 Thread Michael Ludwig

Radha C. schrieb:


Is the spelling suggestion feature available in Solr? If yes, can
you point me to some documentation?


Have you tried googling for "solr spelling"? First hit:

http://wiki.apache.org/solr/SpellCheckComponent

Michael Ludwig


Re: SOLR SpeelChecker and german Umlauts

2009-06-30 Thread Michael Ludwig

Kraus, Ralf | pixelhouse GmbH schrieb:

When I am searching for ONE word with a German umlaut like
"kräuterkeckse" (the right word is "kräuterkekse"), the spellchecker
gives me two corrections:

Spellcheck for "kr" = "kren"
Spellcheck for "uterkeksse" = "butterkekse"

WHY does SOLR break this ONE word apart?


Moin Ralf,

please read the following threads to understand the issue. In short,
you need to specify your query in spellcheck.q as well.

Re: French and SpellingQueryConverter - Shalin Shekhar Mangar
http://markmail.org/message/k35r7qmpatjvllsc

SpellCheckComponent: queryAnalyzerFieldType - Michael Ludwig
http://markmail.org/thread/dgi4llhc7x5wuroc

(BTW, the patch in SOLR-1204 is ready but still awaiting clarification.
See comments from June 11 and 18.)


My Config is :

spellcheck = 'true';
spellcheck.dictionary = 'jarowinkler'
spellcheck.onlyMorePopular = 'true'
spellcheck.build = 'false'
spellcheck.count = 1


So add: spellcheck.q = 'your query'
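
A full request would then look something like this (host and handler
path are illustrative):

  http://localhost:8983/solr/select?q=kräuterkeckse
    &spellcheck=true
    &spellcheck.q=kräuterkeckse
    &spellcheck.dictionary=jarowinkler
    &spellcheck.count=1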

Michael Ludwig


Re: Search for phrase including prepositions

2009-06-30 Thread Michael Ludwig

akinori schrieb:

I indexed an English dictionary into solr.
When I search "apple juice", for example, solr understands the query
as "apple" AND "juice", which is what I want. However, when I search
"apple for", solr thinks that the query is just "apple".
How can I solve this? I think I have to understand the analyzer.


Exactly.


Could anyone navigate me?


Go to your analysis page, enter your field name (or type), check
verbose output, enter your query, and press Analyze.

http://localhost:8983/solr/admin/analysis.jsp

You'll probably find that the word "for" is removed as a so-called
stopword.

Michael Ludwig


Re: nested dismax queries

2009-06-29 Thread Michael Ludwig

Ensdorf Ken schrieb:


For example, a user might enter "Alabama Biotechnology" in the main
search box, triggering a dismax request which returns lots of
different types of results.  They may then want to refine their search
by selecting a specific industry from a drop-down box.  We handle this
by adding a filter query (fq=) to the original query.  We have dozens
of additional fields like this - some with a finite set of discrete
values, some with arbitrary text values.  The combinations are
infinite, and I'm worried we will overwhelm the filterCache by
supporting all of these cases as filter queries.


Filter queries with arbitrary text values may swamp the cache in 1.3.

Otherwise, the combinations aren't infinite. Keep the filters separate
in order to limit their number. Specify two simple filters instead of
one composite filter: fq=x:bla and fq=y:blub instead of fq=x:bla
AND y:blub. See:

filterCache/@size, queryResultCache/@size, documentCache/@size
http://markmail.org/thread/tb6aanicpt43okcm
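
Spelled out as request parameters, the difference is:

  ?q=something&fq=x:bla&fq=y:blub      <- two filters, cached separately
  ?q=something&fq=x:bla AND y:blub     <- one composite filter, cached as a whole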

Michael Ludwig


Re: nested dismax queries

2009-06-29 Thread Michael Ludwig

Ensdorf Ken schrieb:

Filter queries with arbitrary text values may swamp the cache in 1.3.


Are you implying this won't happen in 1.4?


I intended to say just this, but I was on the wrong track.


Can you point me to the feature that would mitigate this?


What I was thinking of is the following:

[#SOLR-475] multi-valued faceting via un-inverted field
https://issues.apache.org/jira/browse/SOLR-475

But as you can see, this refers to faceting on multi-valued fields, not
to filter queries with arbitrary text. I was off on a tangent. Sorry.

To get back to your initial mail, I tend to think that drop-down boxes
(the values of which you control) are a nice match for the filter query,
whereas user-entered text is more likely to be a candidate for the main
query.

Michael Ludwig


Re: Searching across multivalued fields

2009-06-19 Thread Michael Ludwig

MilkDud schrieb:

Michael Ludwig wrote:

What do you expect the user to enter?

* dream theater innocence faded - certainly wrong
* "dream theater" "innocence faded" - much better


Most likely they would just enter dream theater innocence faded, no
quotes.  Without any quotes around any fields, which is a large cause
of the problem.  Now if I index on the track level, then all those
words would have to show up in just one track (including the album,
artist, and track name), which is expected.  If I index on the album
level, however, those words just need to show up anywhere
throughout the entire album.


Give the user separate form fields, in this case, don't use DisMax, and
route each form field value to the appropriate field.

Or go with DisMax, it has the mm option to fine-tune how multiple
terms in the query should influence matching.


So, while it will match dream theater - innocence faded, it will also
match an album that has all the words dream theater innocence faded
mentioned across all tracks, which for small queries can be very
common.

Basically, I'm looking for a way to say match all the words in the
search query across the artist, album, and track name, but only
looking at one track (a multivalued field) at a time given a query
without any quotes. Does that make sense at all?


If that's your use case (which I may have been unable to see up to now),
then your approach of splitting up albums in tiny track documents makes
sense.


That is why I was leaning towards the track level index, such as: id,
artist, album, track (all single valued)


Yes, that makes sense. Good luck! (Off for a week now.)

Michael Ludwig


Re: Searching across multivalued fields

2009-06-18 Thread Michael Ludwig

MilkDud schrieb:

OK, so let's suppose I did index across just the album.  Using that
index, how would I be able to handle searches of the form "artist name
track name"?


What does the user interface look like? Do you have separate fields for
artists and tracks? Or just one field?


If i do the search using a phrase query, this won't match anything
because the artist and track are not in one field (hence my idea of
creating a third concatenated field).


What do you expect the user to enter?

* dream theater innocence faded - certainly wrong
* "dream theater" "innocence faded" - much better

Use the DisMax query parser to read the query, as I suggested in my
first reply. You need to become more familiar with the various search
facilities, that will probably steer your ideas in more promising
directions. Read up about DisMax.


If I make it a non-phrase query, it'll return albums that have those
words spread across all the tracks, which is not ideal.  I.e. if you
search for a track titled "love me", you will get back albums with the
words "love" and "me" in different tracks.


That doesn't make sense to me. Did you inspect your query using
debugQuery=true as I suggested? What did it boil down to?


Basically, i'd like it to look at each track individually


That tells me you're thinking database and table scan.


and if the artist + just one track match all the search terms, then
that counts as a match.  Does that make sense?  If I index on the
track level, that should work, but then I have to store album/artist
info on each track.


I think the following makes much more sense:


An album should be a document and have the following fields (and
maybe more, if you have more data attached to it):

id - unique, an identifier
title - album title
interpret - the musician, possibly multi-valued
track - every song or whatever, definitely multi-valued


Read up about multi-valued fields (sample schema.xml, for example, or
Google) if you're unsure what this is; your posting subject, however,
suggests you aren't.
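
In schema.xml, such a document might be declared along these lines
(field types illustrative):

  <field name="id"        type="string" indexed="true" stored="true" required="true"/>
  <field name="title"     type="text"   indexed="true" stored="true"/>
  <field name="interpret" type="text"   indexed="true" stored="true" multiValued="true"/>
  <field name="track"     type="text"   indexed="true" stored="true" multiValued="true"/>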

Regards,

Michael Ludwig


Re: Few Queries regarding indexes in Solr

2009-06-18 Thread Michael Ludwig

Otis Gospodnetic schrieb:

[...] nothing prevents the indexing client from sending the same doc
to multiple shards.  In some scenarios that's exactly what you want
to do.

What kind of scenario would that be?


One scenario is making use of a small and a large core to provide near
real-time search - you index to both - to the smaller one so you can
flip/drop/purge+reopen it frequently and quickly, to the large one to
persist.  You search across both of them and remove dupes.


This makes sense. Thanks for taking the time to answer this.


Q: What is the most annoying thing in e-mail?


A: it never stops!


Imagine it did one day!

Michael Ludwig


Re: FilterCache issue

2009-06-18 Thread Michael Ludwig

Manepalli, Kalyan schrieb:

I am seeing an issue with the filtercache setting on my solr app
which is causing slower faceting.

Here is the configuration.
<filterCache class="solr.LRUCache" size="512" initialSize="512"
             autowarmCount="256"/>



hitratio : 0.00
inserts : 973531
evictions : 972978
size : 512



cumulative_hitratio : 0.00
cumulative_inserts : 61170111
cumulative_evictions : 61153787

As we can see, the cache hit ratio is almost zero. How do I improve the
filter cache?


Maybe these pages add some ideas to the mix:

http://wiki.apache.org/solr/FilterQueryGuidance
https://issues.apache.org/jira/browse/SOLR-475
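
If the number of distinct filters really is that large and they do
repeat, the first knob to turn is the cache size, e.g. (values purely
illustrative):

  <filterCache class="solr.LRUCache" size="16384"
               initialSize="4096" autowarmCount="1024"/>

But with 61 million cumulative inserts, it is at least as important to
check whether the same filters recur at all - a cache only pays off for
queries that repeat.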

Michael Ludwig


Re: Distributed querying using solr multicore.

2009-06-18 Thread Michael Ludwig

Rakhi Khatwani schrieb:

[...] how do we do a distributed search across multicores?  Is it
just like how we query using multiple shards?


I don't know how we're supposed to use it. I did the following:

http://flunder:8983/solr/xpg/select?q=bla&shards=flunder:8983/solr/xpg,flunder:8983/solr/kk

For SolrJ, see this thread:

Using SolrJ with multicore/shards - ahammad
http://markmail.org/thread/qnytfrk4dytmgjis


if so, isn't there a better way to do that?


No idea.

Michael Ludwig


Re: Distributed querying using solr multicore.

2009-06-18 Thread Michael Ludwig

Rakhi Khatwani schrieb:

On Thu, Jun 18, 2009 at 3:51 PM, Michael Ludwig m...@as-guides.com
wrote:



I don't know how we're supposed to use it. I did the following:

http://flunder:8983/solr/xpg/select?q=bla&shards=flunder:8983/solr/xpg,flunder:8983/solr/kk


i am getting a page load error... cannot find server


This is not a public server, just an example for the syntax I found by
trial and error.

Michael Ludwig


Re: Searching across multivalued fields

2009-06-18 Thread Michael Ludwig

Hi Vicky,

Vicky_Dev schrieb:

We are also facing the same problem mentioned in the post (we are using
the dismax request handler):



When we are searching for q=prdTitle_s:ladybird&qt=dismax, we are
getting 2 results -- unique key ID=1000 and unique key ID=1001


(1) Append debugQuery=true to your query and see how the DisMax query
parser rewrites your query, interpreting what you think is a field name
as just another query term.

(2) Proceed immediately to read the whole Wiki page explaining DisMax:

http://wiki.apache.org/solr/DisMaxRequestHandler


Is it possible to do just an exact match, which is nothing but unique
key = 1001?


Yes, it is:  q=id:1001

(1) Don't use DisMax here, that will not interpret field names.
(2) Replace id by whatever name you gave to your unique key field.

Michael Ludwig


Re: Searching across multivalued fields

2009-06-17 Thread Michael Ludwig

MilkDud schrieb:


To be more specific, I'm indexing a collection of music albums that
have multiple tracks and an album artist.  So, some searches will
contain both the artist name and the track name.  I can't make this a
single phrase query as it is indexed across two separate fields.


Use the DisMaxRequestHandler and specify all fields you want to use in
your query in the qf parameter.

  <!-- qf = query fields: list of fields with boost factor -->
  <str name="qf">artist^3 album^2 track^1</str>

http://wiki.apache.org/solr/DisMaxRequestHandler

Michael Ludwig


Re: Few Queries regarding indexes in Solr

2009-06-17 Thread Michael Ludwig

Otis Gospodnetic schrieb:

Regarding that 3rd answer below:


Putting it back in context (where it belongs :-) :


My (very limited) understanding of shards is that you repartition
your documents among shards and send each document to only one
shard. (Not sure this is correct.)



Yes, that's what most people do, though nothing prevents the indexing
client from sending the same doc to multiple shards.  In some
scenarios that's exactly what you want to do.


What kind of scenario would that be?

Michael Ludwig

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?


Re: what date format to pass for search in Solr?

2009-06-17 Thread Michael Ludwig

chem leakhina schrieb:

Does anyone know what date format to pass to search in Solr?


A restricted subset of the W3C datetime format. See:

http://wiki.apache.org/solr/IndexingDates
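
For example, a valid value, and a typical range query against a date
field (field name invented):

  2009-06-17T00:00:00Z
  q=timestamp:[2009-01-01T00:00:00Z TO 2009-06-17T00:00:00Z]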


Could you give me any examples for searching with dates in solr?


Examples can be very easily found searching for something like "solr
date range query". For example, see:

http://www.nabble.com/Date-Range-Query-%2B-Fields-to16108517.html

Michael Ludwig


Re: Could solr build two different indexes?

2009-06-17 Thread Michael Ludwig

fei dong schrieb:

I want to build many instances of solr. My requirement is to satisfy
different product searches. Could I do that?


Yes. Read all of the following:

Multi-index Design - Chris Masters
http://markmail.org/thread/6p7viwpinrwmj6my

http://wiki.apache.org/solr/MultipleIndexes

http://wiki.apache.org/solr/CoreAdmin

Michael Ludwig


Re: Solr Query | Field:value with dismaxquery

2009-06-17 Thread Michael Ludwig

prerna07 schrieb:


I am facing an issue with a query with dismaxrequest.



?q=facetFormat_product_s:Pfqs ePub eBook Sfqs - returns correct
results

?q=facetFormat_product_s:Pfqs ePub eBook Sfqs&qt=dismaxrequest -
does not return results, although field facetFormat_product_s is
defined in the dismaxrequest handler of solrconfig.xml


You mustn't include the fieldname in the query when sending the query
to the DisMax query parser. The fieldname will be interpreted as just
another term to build a query clause from.


?q=facetFormat_product_s:Pfqs Cassette Sfqs&qt=dismaxrequest -
returns correct results


I'd attribute that to the mm (minimum match) parameter, the meaning
of which you can understand reading the following page, which it would
probably make a lot of sense to read anyway:

http://wiki.apache.org/solr/DisMaxRequestHandler

Michael Ludwig


Re: fq vs. q

2009-06-17 Thread Michael Ludwig

Fergus McMenemie schrieb:


While q= and fq= affect the results portion of a search response,
facet.query only affects the facets portion of a response.
facet.query(s) are only used where you want a facet summary of your
query based on some kind of complex expression rather than the terms
within a single field.

I added the comment because I think that a wiki page discussing fq vs.
q should also mention facet.query.


It now does: http://wiki.apache.org/solr/FilterQueryGuidance

Michael Ludwig


Re: Searching across multivalued fields

2009-06-17 Thread Michael Ludwig

MilkDud schrieb:


Basically, what I am trying to do is index a collection of music for
an online music store.  This contains information on the track, album,
and artist levels.  These are all different object types in the same
schema and it does contain a lot of redundant information.


What's a document in your case? If I were you, I'd probably organize
the data so that each album is one document, because that's what you'd
expect (shopping experience).


For example, a track will have its own listing, but will show up again
in the album listing and the artist listing for the objects that own
that track.


Sounds a bit bizarre to me, but then I don't know much about your
requirements.


There are reasons it is done this way as we search/display across the
three differently.


Hmm.


That said, I have thought of ways of just indexing tracks and
maintaining all the relevant information, but that seems to introduce
its own issues.


An album should be a document and have the following fields (and maybe
more, if you have more data attached to it):

id - unique, an identifier
title - album title
interpret - the musician, possibly multi-valued
track - every song or whatever, definitely multi-valued

Michael Ludwig


Re: Few Queries regarding indexes in Solr

2009-06-16 Thread Michael Ludwig

Rakhi Khatwani schrieb:


1. Is it possible to query from another index folder (say
index1) in solr?


I think you're looking for the multi-core feature.

http://wiki.apache.org/solr/MultipleIndexes
http://wiki.apache.org/solr/CoreAdmin


2. Is it possible to query 2 indexes(folders index1 and index2)
stored in the same machine using the same port on a single solr
instance?


Sounds like multi-core.


3. consider a case: I have indexes in 2 shards, and I merge the
indexes (present in 2 shards) onto the 3rd shard. Now I add more
documents into shard 1 and delete some documents from shard 2 and
update the indexes. Is it possible to send the differences only
into shard 3 and then merge them at shard 3?


My (very limited) understanding of shards is that you repartition
your documents among shards and send each document to only one
shard. (Not sure this is correct.)

Michael Ludwig


Re: fq vs. q

2009-06-15 Thread Michael Ludwig

Ensdorf Ken schrieb:


I ran into this very issue recently as we are using a freshness
filter for our data that can be 6/12/18 months etc.  I discovered
that even though we were only indexing with day-level granularity, we
were specifying the query by computing a date down to the second and
thus virtually every filter was unique.  It's amazing how something
this simple could bring solr to its knees on a large data set.


I want to retrieve documents (TV programs) by a particular date and
decided to convert the date to an integer, so I have:

* 20090615
* 20090616
* 20090617 etc.

I lose all date logic (timezones) for that date field, but it works for
this particular use case, as the date is merely a tag, and not a real
date I need to perform more logic on than an integer allows.

Also, an integer looks about as efficient as it gets, so I thought it
preferable to a date for this use case. YMMV.

I think if you truncate dates to incomplete dates, you effectively also
lose all the date logic. You may still apply it, but what would you take
the result to mean? You can't regain precision you've decided to drop.

The actual points in time where my TV programs start and end are
encoded as a UNIX timestamp with exactitude down to the second, also
stored as an integer, as I don't need sub-second precision.

This makes sense for my client, which is not Java, but PHP, so it uses
the C library strftime and friends, which need UNIX timestamps.

Bottom line, I think it may make perfect sense to store dates and times
in integers, depending on your use case and your client.

Michael Ludwig


Re: fq vs. q

2009-06-15 Thread Michael Ludwig

Fergus McMenemie schrieb:


The article could explain the difference between fq= and
facet.query= and when you should use one in preference to
the other.


My understanding is that while these query modifiers rely on the
same implementation (cached filters) to boost performance, they
simply and obviously differ in that fq limits the result set to
your filter criterion whereas facet.query does not restrict the
result but instead enhances it with statistical information gained
from applying set intersection of result and facet query filters.

It looks like facet.query is just a more flexible means of
defining a filter than is possible using a mere facet.field.

Would that be approximately correct?

A question of mine:

It appears to me that each facet.query invariably leads to one
boolean filter, so if you wanted to do range faceting for a given
field and obtain, say, results reduced from their actual continuum
of values to three ranges {A,B,C}, you'd have to define three
facet.query parameters accordingly. A mere facet.field, on the
other hand, creates as many filters as there are unique values in
the field. Is that correct?
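
To make the three-range case concrete (field name and bounds invented):

  facet=true
  &facet.query=price:[* TO 10]
  &facet.query=price:[10 TO 100]
  &facet.query=price:[100 TO *]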

Michael Ludwig


Re: fq vs. q

2009-06-15 Thread Michael Ludwig

Shalin Shekhar Mangar schrieb:

On Mon, Jun 15, 2009 at 4:39 PM, Michael Ludwig m...@as-guides.com
wrote:



I think if you truncate dates to incomplete dates, you effectively
also lose all the date logic. You may still apply it, but what would
you take the result to mean? You can't regain precision you've
decided to drop.


Note that with Trie search coming in (see example schema.xml in the
nightly builds), this rounding may not be necessary any more.


http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/schema.xml

Not sure I understand correctly, but this sounds as if given an
integer field and a @precisionStep of 3, the original value is stored
along with three copies that omit (1) the last bit, (2) the two last
bits, (3) the three last bits. So a given range query might be
optimized to an equality query. But I'm not sure I'm on the right
track here.
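
For reference, the example schema declares trie fields roughly like
this (quoted from memory, so double-check the actual file):

  <fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
             omitNorms="true" positionIncrementGap="0"/>

A smaller precisionStep stores more precomputed prefix terms per value,
trading index size for faster range queries.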

Michael Ludwig


Re: Joins or subselects in solr

2009-06-15 Thread Michael Ludwig

Nasseam Elkarra schrieb:


I am storing items in an index. Each item has a comma-separated list
of related items. Is it possible to bring back an item and all of its
related items in one query? If so, how? And how would you distinguish
the main item from the related ones?


Think about the data structure. You're saying there is a main item,
which suggests there is some regularity to the underlying data
structure, possibly a tree.

If there is a main item, each item should store a reference to the main
item. You could then perform a lookup specifying q=mainitem:12345. That
would retrieve all items related to 12345 and solve the problem more
efficiently than having each item store a list of all its related items.

I'm thinking of small or moderately sized trees here, such as they grow
in mailing lists or discussion boards.

If it's not a tree, but some less regular graph, then the notion of a
main item needs clarification.

Michael Ludwig


Re: fq vs. q

2009-06-12 Thread Michael Ludwig

Michael Ludwig schrieb:

Martin Davidsson schrieb:

I've tried to read up on how to decide, when writing a query, what
criteria goes in the q parameter and what goes in the fq parameter,
to achieve optimal performance. Is there [...] some kind of rule of
thumb to help me decide how to split things up when querying against
one or more fields.


This is a good question. I don't know if there is any such rule. I'm
going to sum up my understanding of filter queries hoping that the
pros will point out any flaws in my assumptions.


I've summarized what I've learnt about filter queries on this page:

http://wiki.apache.org/solr/FilterQueryGuidance

Michael Ludwig


Re: Customizing results

2009-06-11 Thread Michael Ludwig

revas schrieb:


What is GNU gettext, and how can it be used in a multilanguage
scenario?


It's an internationalization technology, so it is well suited to the
tasks of internationalizing and localizing applications.

http://www.gnu.org/software/gettext/manual/
http://www.gnu.org/software/gettext/manual/html_node/Why.html

In your case, it might mean that the client is equipped with the
language packages it needs and uses the name returned by Solr (likely
the English term) to look up the translation by means of Gettext. But
it certainly depends very much on your particular setup. It might be
overkill for your particular situation.

Michael Ludwig


Re: Build Failed

2009-06-11 Thread Michael Ludwig

Mukerjee, Neiloy (Neil) schrieb:

When running ant example to do an example configuration, I get the
following message:

BUILD FAILED



/home/stagger2/Solr/apache-solr-1.3.0/common-build.xml:149: Compile
failed; see the compiler error output for details.

I've tried reading through the files in question, but I can't seem to
find the issue. Any suggestions?


Run: ant -verbose

Michael Ludwig


Re: dismax parsing applied to specific fields

2009-06-11 Thread Michael Ludwig

Nick Jenkin schrieb:

Hi
I was wondering if there is a way of applying dismax parsing to
specific fields, where there are multiple fields being searched
- all with different query values
e.g.

author:(tolkien) AND title:(the lord of the rings)

would be something like:

dismax(author, tolkien) AND dismax(title, the lord of the rings)

I guess this can be thought of having two separate dismax
configurations, one searching author and one searching title -
and the intersection of the results is returned.


http://wiki.apache.org/solr/DisMaxRequestHandler

This says that the DisMaxRequestHandler is simply the standard request
handler with the default query parser set to the DisMax Query Parser.
So maybe you could program your own CustomDisMaxRequestHandler that
reuses the DisMax query parser (and probably other components) to
achieve what you want.

Michael Ludwig


Re: Build Failed

2009-06-11 Thread Michael Ludwig

Mukerjee, Neiloy (Neil) schrieb:

Running ant -verbose still doesn't allow me to run an example
configuration. I get the same error from ant example after getting
the following from ant -verbose:



Build sequence for target(s) `usage' is [usage]



usage:
 [echo] Welcome to the Solr project!
 [echo] Use 'ant example' to create a runnable example configuration.
 [echo] And for developers:
 [echo] Use 'ant clean' to clean compiled files.
 [echo] Use 'ant compile' to compile the source code.
 [echo] Use 'ant dist' to build the project WAR and JAR files.
 [echo] Use 'ant generate-maven-artifacts' to generate maven artifacts.
 [echo] Use 'ant package' to generate zip, tgz, and maven artifacts for 
distribution.
 [echo] Use 'ant test' to run unit tests.

BUILD SUCCESSFUL


You might want to read up on Ant usage in the Ant User Manual, a copy of
which should be part of your installation, or can be found on the web.
Quick overview:

ant -help

When I wrote ant -verbose, I meant ant -verbose your-target, so:

ant -verbose example

Michael Ludwig


Re: Faceting on text fields

2009-06-11 Thread Michael Ludwig

Yao Ge schrieb:

BTW, Carrot2 has a very impressive Clustering Workbench (based on
Eclipse) that has built-in integration with Solr. If you have a Solr
service running, it is just a matter of pointing the workbench to it.
The clustering results and visualization are amazing.
(http://project.carrot2.org/download.html).


A new world opens up for me ...

Thanks for pointing out how cool this is!

Hint for other newcomers: Open the View Menu to configure the details of
how you perform your search, e.g. your Solr URL in case it differs from
the default, or your summary field, which is what gets used to analyze
the data in order to determine clusters, if I understand correctly.

Michael Ludwig


Re: fq vs. q

2009-06-10 Thread Michael Ludwig

Fergus McMenemie schrieb:

On Tue, Jun 9, 2009 at 7:25 PM, Michael Ludwig m...@as-guides.com
wrote:



A filter query is cached, which means that it is the more useful
the more often it is repeated. We know how often certain queries
arise, or at least have the means to collect that data - so we
know what might be candidates for filtering.

Sorry, but I can't make any sense of the above. Could you have
another go at explaining it?


Filtering a given query result R on bla:eins, bla:zwei, bla:drei or
bla:vier is very common in my application. So while I could include
this criterion in my main query (q) and hope for the queryResultCache
to kick in, this would be unlikely to be efficient as my primary
query, which gave me R, likely varies a lot, resulting in a high
number of distinct queries, with relatively low probability for a
given query to occur frequently. So each of these query result sets
would enter the queryResultCache as a distinct set, hence high
contention, high eviction rate, poor cache efficiency.

Now I'm going to factor out those bla:{eins,zwei,drei,vier} filters
from my primary query (q) and put them in the filter query (fq). The
benefit is double:

(1) Solr has a dedicated cachespace for filters the usage of which I
control by my usage of the filter query (fq). I can set up things so
the usage of the primary query (q) is under the user's control while
the usage of the filter query (fq) is under my application's control.
I control this cache, I ensure its efficiency.

(2) Factoring out the filter query bla:{eins,zwei,drei,vier} from the
primary query also reduces variation in the primary query, thus making
the queryResultCache more efficient.

So instead of having, say, 10,000 distinct primary queries, no usage of
the filterCache, and poor usage of the queryResultCache, I may have
only, say, 3000 distinct primary queries, four cached filters in the
filterCache (bla:{eins,zwei,drei,vier}), and a somewhat better usage
of the queryResultCache.
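
In request terms, the refactoring is simply (query values invented):

  before:  q=text:stiefel AND bla:eins
  after:   q=text:stiefel&fq=bla:eins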

I wrote that we know how often certain queries arise, or at least
have the means to collect that data, because we know the application
we're writing, so we either know the frequency of a given search
pattern based on the usage our application makes of Solr and on the
restrictions it imposes on the user by, say, using Dismax; or - if we
give the user fine-grained control over the query language - we may
somehow collect and analyze the actual queries in order to empirically
determine actual search engine usage and optimize accordingly.


The result of a filter query is cached and then used to filter a
primary query result using set intersection. If my filter query
result comprises more than 50 % of the entire document collection,
its selectivity is poor. I might need it despite this fact, but it
might also be worthwhile thinking about how to reframe the
requirement, allowing for more efficient filters.


So, just to be explicit, if I have a query containing:

   fq=EventType:fair&fq=EventType:film&fq=LAT:[50 TO 60]&fq=LONG:[-1 TO 1]

The first time this is encountered, it is going to cause four
queries against the entire index and cause four sets of document IDs
to be cached. Subsequent queries will reuse the various cached
entries as appropriate. Is that correct?


I do think so.


I guess in the above case where my GEO search window will keep
changing I should ideally arrange that the lat and long element is
added to the q parameter to stop my cache being cluttered.


My understanding is that what varies heavily should *not* go into the
filterCache. Your GEO search window might vary quite a bit (probably
much more than EventType), so to me it looks like a candidate for the
main query.


Also, what happens when the filter cache is full? Is there any
accounting of which cache entries are getting the most or most recent
hits?


Good question!

Michael Ludwig


Re: Faceting on text fields

2009-06-10 Thread Michael Ludwig

Yonik Seeley schrieb:

Yep, all that sounds right.
An additional optimization counts terms for the documents *not* in the
set when the base set is over half the size of the index.


Cool :-) Thanks for confirming my assumptions!

Michael Ludwig


Re: Faceting on text fields

2009-06-10 Thread Michael Ludwig

Otis Gospodnetic schrieb:


Solr can already cluster top N hits using Carrot2:
http://wiki.apache.org/solr/ClusteringComponent


Would it be fair to say that clustering as detailed on the page you're
referring to is a kind of dynamic faceting? The faceting not being done
based on distinct values of certain fields, but on the presence (and
frequency) of terms in one field?

The main difference seems to be that with faceting, grouping criteria
(facets) are known beforehand, while with clustering, grouping criteria
(the significant terms which create clusters - the cluster keys) have
yet to be determined. Is that a correct assessment?

Michael Ludwig


Re: Customizing results

2009-06-10 Thread Michael Ludwig

Manepalli, Kalyan schrieb:

Hi,
I am trying to customize the response that I receive from Solr. In the
index I have multiple fields that contain the same data in different
languages.
At query time the client specifies the language. Based on this param,
I want to return the value, copied into a different field.
E.g.:
<str name="location_da_dk">Lubang, Filippinerne</str>
<str name="location_de_de">Lubang, Philippinen</str>
<str name="location_en_us">Lubang, Philippines</str>
<str name="location_es_es">Lubang, Filipinas</str>

If the user specifies language as de_de, then I want to return the
result as <str name="location">Lubang, Philippinen</str>


If you control how the client works, you could also consider using an
internationalization technology such as GNU Gettext for this purpose.
May or may not make sense in your particular situation.

Michael Ludwig


Re: How to disable posting updates from a remote server

2009-06-10 Thread Michael Ludwig

ashokc schrieb:

I find that I am freely able to post to my production SOLR server,
from any other host that can run the post command. So somebody can
wipe out the whole index by posting a delete query.


Control this at the IP level, have your server listen on 127.0.0.1
or on a private subnet address.

Michael Ludwig


Re: Solr relevancy score - conversion

2009-06-10 Thread Michael Ludwig

Vijay_here schrieb:


I would need a more proportionate score, rounded to a percentage (95%
relevant, 80% relevant, and so on). Is there a way to make solr
return such relevance scores?


In XSLT:

  <xsl:template match="result/doc">
    <xsl:variable name="score-percentage" select="
      round( 100 * float[@name='score'] div ../@maxScore )"/>

The div is the XPath division operator. Should be a straightforward
mapping to any other language.

Michael Ludwig


Re: copyfield and 'store' and highlighting

2009-06-10 Thread Michael Ludwig

ashokc schrieb:


Do I have to declare 'field1' also to be stored? 'field1' is never
returned in the response.


I find the following Wiki page helpful when dealing with @stored,
@indexed and friends:

http://wiki.apache.org/solr/FieldOptionsByUseCase

Michael Ludwig


Re: fq vs. q

2009-06-09 Thread Michael Ludwig

Martin Davidsson schrieb:

I've tried to read up on how to decide, when writing a query, what
criteria goes in the q parameter and what goes in the fq parameter, to
achieve optimal performance. Is there [...] some kind of rule of thumb
to help me decide how to split things up when querying against one or
more fields.


This is a good question. I don't know if there is any such rule. I'm
going to sum up my understanding of filter queries hoping that the pros
will point out any flaws in my assumptions.

http://wiki.apache.org/solr/SolrCaching - filterCache

A filter query is cached, which means that it is the more useful the
more often it is repeated. We know how often certain queries arise, or
at least have the means to collect that data - so we know what might be
candidates for filtering.

The result of a filter query is cached and then used to filter a primary
query result using set intersection. If my filter query result comprises
more than 50 % of the entire document collection, its selectivity is
poor. I might need it despite this fact, but it might also be worthwhile
thinking about how to reframe the requirement, allowing for more
efficient filters.

Memory consumption is probably not a great concern here as the cache
stores only document IDs. (And if those are integers, it's just 4 bytes
each.) So having 100 filters containing 100,000 items on average, the
memory consumption increase should be around 40 MB.

By the way, are these document IDs (used in filterCache, documentCache,
queryResultCache) the ones I configure in schema.xml or does Solr map my
IDs to integers in order to ensure efficiency?

A filter query should probably be orthogonal to the primary query, which
means in plain English: unrelated to the primary query. To give an
example, I have a field category, which is a required field. In the
class of searches where I use a filter on that field, the primary search
is for something entirely different, so in most cases, it will not, or
not necessarily, bias the primary result to any particular distribution
of the category values. I then allow the application to apply filtering
by category, incidentally, using faceting, which is a typical usage
pattern, I guess.

Michael Ludwig


filterCache/@size, queryResultCache/@size, documentCache/@size

2009-06-09 Thread Michael Ludwig

Common cache configuration parameters include @size (size attribute).

http://wiki.apache.org/solr/SolrCaching

For each of the following, does this mean the maximum size of:

* filterCache/@size - filter query results?
* queryResultCache/@size - query results?
* documentCache/@size - documents?

So if I know my tiny documents don't take up much memory (just 500
Bytes on average), I'd want to have very different settings for the
documentCache than if I decided to store 10 KB per doc in Solr?

And if I know that only 100 filters are possible, there is no point
raising the filterCache/@size above that threshold?

Given the following three filtering scenarios of (a) x:bla, (b) y:blub,
and (c) x:bla AND y:blub, will I end up with two or three distinct
filters? In other words, may filters be composites or are they
decomposed as far as their number (relevant for @size) is concerned?

Michael Ludwig


Re: filter on millions of IDs from external query

2009-06-09 Thread Michael Ludwig

Ryan McKinley schrieb:

I am working with an in index of ~10 million documents.  The index
does not change often.

I need to preform some external search criteria that will return some
number of results -- this search could take up to 5 mins and return
anywhere from 0-10M docs.


If it really takes so long, then something is likely wrong. You might be
able to achieve a significant improvement by reframing your requirement.


I would like to use the output of this long running query as a filter
in solr.

Any suggestions on how to wire this all together?


Just use it as a filter query. The result will be cached, the query
won't have to be executed again (if I'm not mistaken) until a new index
searcher is opened (after an index update and a commit), or until the
filter query result is evicted from the cache, which you should make
sure won't happen if your query really is so terribly expensive.

Michael Ludwig


Re: Field Compression

2009-06-09 Thread Michael Ludwig

Fer-Bj schrieb:

for all the documents we have a field called small_body, which is a
60-chars-max text field where we store the abstract for each
article.



we need to display this small_body, which we want to compress, every time.


If this works like compressing individual files, the overhead for just
60 characters (which may be no more than 60 bytes) may mean that any
attempt at compression results in inflation.

On the other hand, if lower-level units (pages) are compressed (as
opposed to individual fields), then I don't know what sense a
configurable compression threshold might make.

Maybe one of the pros can clarify.


Last question: what's the best way to determine the compress
threshold ?


One fairly obvious way would be to index the same set of documents
twice, with compression and then without, and then to compare the index
size on disk. If you don't save, say, five or ten percent (YMMV), it
might not be worth the effort.

Michael Ludwig


Re: Faceting on text fields

2009-06-09 Thread Michael Ludwig

Yao Ge schrieb:


The facet query is considerably slower compared to other facets from
structured database fields (with highly repeated values). What I found
interesting is that even after I constrained search results to just a
few hundred hits using other facets, these text facets are still very
slow.

I understand that text fields are not good candidates for faceting as
they can contain a very large number of unique values. However, why is
it still slow after my matching documents are reduced to hundreds? Is
it because the whole filter is cached (regardless of the matching
docs) and I don't have enough filter cache size to fit the whole list?


Very interesting questions! I think an answer would both require and
further an understanding of how filters work, which might even lead to
a more general guideline on when and how to use filters and facets.

Even though faceting appears to have changed in 1.4 vs 1.3, it would
still be interesting to understand the 1.3 side of things.


Lastly, what I really want is to give the user a chance to visualize
and filter on the top relevant words in the free-text fields. Are
there alternatives to the facet field approach? Term vectors? I can do
client-side processing based on the top N (say 100) hits for this, but
it is my last option.


Also a very interesting data mining question! I'm sorry I don't have any
answers for you. Maybe someone else does.

Best,

Michael Ludwig


Re: Faceting on text fields

2009-06-09 Thread Michael Ludwig

Yonik Seeley schrieb:

Are you using Solr 1.3?
You might want to try the latest 1.4 test build -
faceting has changed a lot.


I found two significant changes (but there may well be more):

[#SOLR-911] multi-select facets - ASF JIRA
https://issues.apache.org/jira/browse/SOLR-911

Yao,

it sounds like the following (which is in 1.4) might have a chance of
helping your faceting performance issue:

[#SOLR-475] multi-valued faceting via un-inverted field - ASF JIRA
https://issues.apache.org/jira/browse/SOLR-475

Yonik,

from your initial comment for SOLR-475:

| * To save space and speed up faceting, any term that matches enough
| * documents will not be un-inverted... it will be skipped while
| * building the un-inverted field structure, and will use a set
| * intersection method during faceting.

Does this mean that frequently occurring terms (which we can use for
faceting in 1.3 without problems) are handled exactly as they were
before, by allocating a slot in the filter cache upon request, while
those zillions of pesky little fringe terms outside the mainstream,
for which allocating a slot in the filter cache would be overkill
(and possibly cause inefficient contention, eviction, and, hence,
a performance penalty) are now handled by the new structure mapping
documents to term numbers?

So doing faceting for a given set of documents would result in (a) doing
set intersection using those filter query results that have been set up
(for the terms occurring in many documents), and (b) collecting all the
pesky little terms from the new structure mapping documents to term
numbers?

So basically, depending on expediency, you (a) know the facets and count
the documents which display them, or you (b) take the documents and see
what facets they have?

Michael Ludwig


Re: statistics about word distances in solr

2009-06-09 Thread Michael Ludwig

Moin Jens,

Jens Fischer schrieb:

I was wondering if there's an option to return statistics about
distances from the query terms to the most frequent terms in the
result documents.



The additional information I'm looking for is the average distance
between these terms and my search term.

So let's say I have two docs

"the house is red"
"I live in a red house"

The search for "house" should also return the info

the:1
is:1
red:1.5
I:5
live:4


Could you explain what the distance here is? Something like edit
distance? Ah, I see: You want the textual distance between the search
term and other terms in the document, and then you want that averaged,
i.e. the cumulative distance divided by the number of occurrences. For
instance, "red" is 2 positions from "house" in the first document and
1 position in the second, giving (2+1)/2 = 1.5.

No idea if that functionality is available.

However, the sort of calculation you want to perform requires the engine
to not only collect all the terms to present as facets (much improved in
1.4, as I've just learned), but to also analyze each document (if I'm
not mistaken) to determine the distance for each facet term from your
primary query term. (Or terms.)

The number of lookup operations is likely to scale as the product of
the number of your primary search results, the number of your search
terms, and the number of your facets.

I assume this is an expensive operation.

Michael Ludwig


Re: fq vs. q

2009-06-09 Thread Michael Ludwig

Shalin Shekhar Mangar schrieb:

On Tue, Jun 9, 2009 at 7:25 PM, Michael Ludwig m...@as-guides.com
wrote:



A filter query should probably be orthogonal to the primary query,
which means in plain English: unrelated to the primary query. To give
an example, I have a field category, which is a required field. In
the class of searches where I use a filter on that field, the primary
search is for something entirely different, so in most cases, it will
not, or not necessarily, bias the primary result to any particular
distribution of the category values. I then allow the application to
apply filtering by category, incidentally, using faceting, which is a
typical usage pattern, I guess.


Yes and no. There are use-cases where the query is applicable only to
the filtered set. For example, when the same index contains many
different types of documents. It is just that the intersection may
need to do more or less work.


Sorry, I don't understand. I used to think that the engine applies the
filter to the primary query result. What you're saying here sounds as if
it could also pre-filter my document collection to then apply a query to
it (which should yield the same result). What does it mean that the
query is applicable only to the filtered set?

And thanks for having clarified the other points!

Michael Ludwig


Re: filterCache/@size, queryResultCache/@size, documentCache/@size

2009-06-09 Thread Michael Ludwig

Shalin Shekhar Mangar schrieb:

On Tue, Jun 9, 2009 at 7:47 PM, Michael Ludwig m...@as-guides.com
wrote:



Given the following three filtering scenarios of (a) x:bla, (b)
y:blub, and (c) x:bla AND y:blub, will I end up with two or three
distinct filters? In other words, may filters be composites or are
they decomposed as far as their number (relevant for @size) is
concerned?


It will be three. If you want to cache separately, send them as
separate fq parameters.


Thanks a lot for clarifying all my questions.

Michael Ludwig


Re: fq vs. q

2009-06-09 Thread Michael Ludwig

Shalin Shekhar Mangar schrieb:


No, both filters and queries are computed on the entire index.

My comment was related to the A filter query should probably be
orthogonal to the primary query... part. I meant that both kinds of
use-cases are common.


Got it. Thanks :-)

Michael Ludwig


Re: SpellCheckComponent: queryAnalyzerFieldType

2009-06-05 Thread Michael Ludwig

Shalin Shekhar Mangar schrieb:


Is it correct to say that when I intend to always use
the spellcheck.q parameter I do not need to specify a
queryAnalyzerFieldType in my spellcheck searchComponent,
which I define in solrconfig.xml?


Yes, that is correct.

Even if a queryAnalyzerFieldType is not specified and your query uses
q, then WhitespaceTokenizer is used by default.


Thanks for clarifying.


SpellingQueryConverter was written for a very simple use-case dealing
with ASCII only. But there is no reason why we cannot extend it to
cover the full UTF-8 set.



Can you please open an issue and if possible, give a patch?


Please see: https://issues.apache.org/jira/browse/SOLR-1204

Regards,

Michael Ludwig


Re: spell checking

2009-06-05 Thread Michael Ludwig

Walter Underwood schrieb:

query suggest --wunder


That's very good.

On the other hand, I noticed how the term spellcheck is spread
all over the place, and that would be a massive renaming orgy.
An explanation at the appropriate place in the documentation is
less invasive. I added two sentences to the Introduction of:

http://wiki.apache.org/solr/SpellCheckComponent

Michael Ludwig


Re: spell checking

2009-06-04 Thread Michael Ludwig

Yao Ge schrieb:


Maybe we should call this "alternative search terms" or
"suggested search terms" instead of "spell checking". It is
misleading as there is no right or wrong in spelling; there
are only popular (term frequency?) alternatives.


I had exactly the same difficulty in understanding the concept
because of the name given to the feature, which usually denotes
just what it says, i.e. a spellchecker, which is driven by an
authoritative dictionary and a set of rules, as integrated in
word processors, in order to ensure orthography.

What we have here is quite different from a spellchecker.

IMHO, a name conveying the actual meaning, along the lines of
suggest, would make more sense.

Michael Ludwig


SpellCheckComponent: queryAnalyzerFieldType

2009-06-04 Thread Michael Ludwig

Shalin Shekhar Mangar wrote:

| If you use spellcheck.q parameter for specifying
| the spelling query, then the field's analyzer will
| be used [...] If you use the q parameter, then the
| SpellingQueryConverter is used.

http://markmail.org/message/k35r7qmpatjvllsc - message
http://markmail.org/thread/gypvpfnsd5sggkpx  - whole thread

Is it correct to say that when I intend to always use
the spellcheck.q parameter I do not need to specify a
queryAnalyzerFieldType in my spellcheck searchComponent,
which I define in solrconfig.xml?

Given the limitations of the SpellingQueryConverter laid
out in the thread referred to above, it seems you want to
use the spellcheck.q parameter for anything that cannot
be encoded in ASCII. Is that true?

Michael Ludwig


Re: French and SpellingQueryConverter

2009-05-19 Thread Michael Ludwig

Shalin Shekhar Mangar schrieb:

On Mon, May 11, 2009 at 2:46 PM, Michael Ludwig m...@as-guides.com
wrote:


Could you give an example of how the spellcheck.q parameter can be
brought into play (to take non-ASCII characters into account, so
that "Käse" isn't mishandled) given the following example:


You will need to set the correct tokenizer and filters for your field
which can handle your language correctly. Look at the GermanAnalyzer
in Lucene contrib-analysis. It uses StandardTokenizer, StandardFilter,
LowerCaseFilter, StopFilter, GermanStemFilter with a custom stopword
list.
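
A minimal schema.xml sketch of what Shalin describes (type name and
stopword file hypothetical; the Snowball stemmer is substituted for
GermanStemFilter here, since solr.SnowballPorterFilterFactory ships
with the Solr examples):

<fieldType name="text_de" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_de.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>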


Hello Shalin,

thanks for your kind answer, and sorry for my delay in responding.

Due to my newbieness in this domain, I misphrased my question. What
I wanted to say (and Jonathan, too, I think) is that the regular
expression in that SpellingQueryConverter only deals with ASCII,
which is insufficient for most languages, including French and
German.

I think the regular expression in SpellingQueryConverter should be
something like:

(?:(?!(\w+:|\d+)))[\p{javaLowerCase}\p{javaUpperCase}\d_]+
vs. (?:(?!(\w+:|\d+)))\w+

Then, correct German and French TokenStreams are generated in the
example program I posted.
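
A minimal sketch to compare the two patterns (class name hypothetical):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternTest {
    public static void main(String[] args) {
        // \w+ splits "Käse" at the non-ASCII character,
        // [\p{L}\d_]+ keeps it in one token
        String[] regexes = {
            "(?:(?!(\\w+:|\\d+)))\\w+",
            "(?:(?!(\\w+:|\\d+)))[\\p{L}\\d_]+"
        };
        for (String regex : regexes) {
            Matcher m = Pattern.compile(regex).matcher("Käse");
            StringBuilder sb = new StringBuilder();
            while (m.find()) sb.append('[').append(m.group()).append(']');
            System.out.println(sb); // [K][se], then [Käse]
        }
    }
}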

But I may well have misunderstood the purpose of this class. You
will know.

Michael Ludwig


Re: French and SpellingQueryConverter

2009-05-19 Thread Michael Ludwig

Jonathan Mamou schrieb:

Thanks Michael for your answer!
I think that (?:(?!(\w+:|\d+)))[\p{L}]+
should also be OK.


Oh yes, that's much simpler and clearer than my suggestion.
(Newbieness factor for Java style regular expressions, too.)

Or maybe this: (?:(?!(\w+:|\d+)))[\p{L}\d_]+ :-)

Michael Ludwig


Re: Replication master+slave

2009-05-15 Thread Michael Ludwig

Bryan Talbot schrieb:

So how are people managing solrconfig.xml files which are largely the
same other than differences for replication?

I don't think it's a good thing to maintain two copies of the same
file and I'd like to avoid that. Maybe enabling the XInclude feature
in DocumentBuilders would make it possible to modularize the
configuration files?


This is already possible using the XML feature called entities,
more precisely external general parsed entities (EGPE). I've never
seen a parser that doesn't do entities.

C:\MILU\dev\XML # type egpe-net.xml
<!DOCTYPE Urmel [
<!ENTITY egpe_from_the_net
  SYSTEM "http://lobster.as-guides.com/ds/solr.schema.ent" >
<!ENTITY egpe_from_the_local_disk
  SYSTEM "egpe-local.ent" >
]>
<Urmel>
&egpe_from_the_net;
&egpe_from_the_local_disk;
</Urmel>

C:\MILU\dev\XML # type egpe-local.ent
<eins/>
<zwei/>
<drei/>

Michael Ludwig


Re: Selective Searches Based on User Identity

2009-05-13 Thread Michael Ludwig

Terence Gannon schrieb:

Paul -- thanks for the reply, I appreciate it.  That's a very
practical approach, and is worth taking a closer look at.  Actually,
taking your idea one step further, perhaps three fields; 1) ownerUid
(uid of the document's owner) 2) grantedUid (uid of users who have
been granted access), and 3) deniedUid (uid of users specifically
denied access to the document).


Grants might change quite a bit; the owner will likely remain the same.

Wouldn't it be better to include only the owner in the document and
store grants someplace else, like in an RDBMS or - if you don't want
one - a lightweight embedded database like BDB?

That way you could have your application tag an ineluctable filter query
onto each and every user query, ensuring that the results include only
those documents whose owners have granted the user access.

Considering that I'm a Solr/Lucene newbie, this approach might have a
disadvantage that escapes me, which is why other people haven't made
this particular suggestion. If so, I'd be happy to learn why this isn't
preferable.

Michael Ludwig


Re: Selective Searches Based on User Identity

2009-05-13 Thread Michael Ludwig

Hi Terence,

Terence Gannon schrieb:

Yes, the ownerUid will likely be assigned once and never changed.  But
you still need it, in order to keep track of who has contributed which
document.


Yes, of course!


I've been going over some of the simpler query scenarios, and Solr is
capable of handling them without having to resort to an external
RDBMS.


The database is only to store grants - it's not to help with searching.
It would look like this:

  grantee | granted by
  --------+-----------------
  fritz   | fred,frank,egon
  frank   | egon,fritz
  egon    | terence,frank
  ...

Each user is granted access to his own documents and to those he
has received grants for.


In order to limit documents to those which a given user owns,
or those to which he has been granted access, the syntax fragment
would be something like:

ownerUid:ab2734 or grantedUid:ab2734


I think it could be:

  ownerUid:egon OR ownerUid:terence OR ownerUid:frank

No need to embed grants in the document.

Ah, I see my mistake now. You want grants based on the document, not on
the user - I had overlooked that fact. That makes my suggestion invalid.


I'll plead ignorance of the 'ineluctable filter query' and will have
to read up on that one.


I meant a filter query that the application tags onto the query on
behalf of the user and without the user being able to do anything about
it so he cannot circumvent the filter.
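
As a sketch, a query on behalf of egon, using the grants table above,
would then go out as something like (values hypothetical):

  q=<whatever egon typed>&fq=ownerUid:(egon OR terence OR frank)

where the fq value is assembled from the grants table before the
request is sent.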

Best regards,

Michael Ludwig


Re: French and SpellingQueryConverter

2009-05-11 Thread Michael Ludwig

Shalin Shekhar Mangar schrieb:

On Fri, May 8, 2009 at 2:14 AM, Jonathan Mamou ma...@il.ibm.com
wrote:



SpellingQueryConverter always splits words with special
characters. I think that the issue is in the SpellingQueryConverter
class:

  Pattern.compile("(?:(?!(\\w+:|\\d+)))\\w+");

According to
http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html,
\w  A word character: [a-zA-Z_0-9]
I think that special characters should also be added to the regex.


Same issue for the GermanAnalyzer as for the FrenchAnalyzer.

http://wiki.apache.org/solr/SpellCheckComponent says:

  The SpellingQueryConverter class does not deal properly with
  non-ASCII characters. In this case, you have either to use
  spellcheck.q, or to implement your own QueryConverter.


If you use spellcheck.q parameter for specifying the spelling
query, then the field's analyzer will be used (in this case,
FrenchAnalyzer). If you use the q parameter, then the
SpellingQueryConverter is used.


Could you give an example of how the spellcheck.q parameter can be
brought into play (to take non-ASCII characters into account, so
that "Käse" isn't mishandled) given the following example:

package org.apache.solr.spelling;
import org.apache.lucene.analysis.de.GermanAnalyzer;
public class GermanTest {
    public static void main(String[] args) {
        SpellingQueryConverter sqc = new SpellingQueryConverter();
        sqc.analyzer = new GermanAnalyzer();
        System.out.println(sqc.convert("Käse"));
    }
}

Note that the result of the above (which is plain wrong) reads:

  [(k,0,1,type=<ALPHANUM>), (se,2,4,type=<ALPHANUM>)]

Thanks.

Michael Ludwig


Organizing multiple searchers around overlapping subsets of data

2009-05-08 Thread Michael Ludwig

I have one type of document, but different searchers, each of
which is interested in a different subset of the documents,
which are different configurations of TV channels {A,B,C,D}.

* Application S1 is interested in all channels, i.e. {A,B,C,D}.
* Application S2 is interested in {A,B,C}.
* Application S3 is interested in {A,C,D}.
* Application S4 is interested in {B,D}.

As can be seen from this simplified example, the subsets are
not disjoint, but do have considerable overlaps.

The total data volume is only about 200 MB. There are four
searchers, and they may become ten or a dozen.

The set elements an application may or may not be interested
in, however, i.e. the channels, which are {A,B,C,D} in this
example, are not just four, but about 150, each of which has
about 1000 documents.

What is the best way to organize this?

(a) Set up different cores for each application, i.e. going
multi-core, thereby incurring a good deal of redundancy, but
simplifying searches?

(b) Apply filter queries to select documents from only, say,
60, 80 or 110 out of 150 channels (see the sketch after this list).

(c) Something else I'm not aware of.
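
For option (b), a sketch of what a per-application filter query might
look like (field name hypothetical):

  fq=channel:(A OR C OR D)

The channel list is fixed per application, so the filter would be
computed once and then served from the filterCache.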

Am I right in suspecting that multi-core makes less sense with
increasing overlaps and hence redundancy?

Michael Ludwig


Re: What are the Unicode encodings supported by Solr?

2009-05-08 Thread Michael Ludwig

KK schrieb:


I'd like to know about the different Unicode [or any other?] encodings
supported by Solr for posting docs [through SolrJ in my case]. Is it
just UTF-8 and UCN that are supported, or are other character encodings
like NCR (decimal), NCR (hex) etc. supported as well?


Any numerical character reference (NCR), decimal or hexadecimal, is
valid UTF-8 as long as it maps to a valid Unicode character.


I found that for most of the pages the encoding is UTF-8 [in this case
searching works fine], but for others the encoding is some other
character encoding [like NCR (dec), NCR (hex), or maybe something else;
I don't have much idea on this].


Whatever the encoding is, your application needs to know what it is when
dealing with bytes read from the network.


So when I fetch the page content through Java methods using
InputStreamReaders, and after stripping various tags, what I obtain
is raw text with some encoding that is not supported by Solr.


Did you make sure to not rely on your platform default encoding
(Charset) when constructing the InputStreamReader? If in doubt, take
a look at the InputStreamReader constructors.
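
A minimal sketch of the explicit variant (URL hypothetical; in practice
you would take the charset from the Content-Type response header rather
than hardcode it):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class ReadExplicitCharset {
    public static void main(String[] args) throws Exception {
        // new InputStreamReader(in) would silently use the platform default
        BufferedReader r = new BufferedReader(new InputStreamReader(
                new URL("http://example.com/").openStream(), "UTF-8"));
        for (String line; (line = r.readLine()) != null; ) {
            System.out.println(line);
        }
        r.close();
    }
}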

Michael Ludwig


Re: Multi-index Design

2009-05-06 Thread Michael Ludwig

Matt Weber schrieb:


http://wiki.apache.org/solr/MultipleIndexes


Thanks, Matt. Your explanation and the pointer to the Wiki have
clarified things for me.

Michael Ludwig


Re: schema.xml: default values for @indexed and @stored

2009-05-06 Thread Michael Ludwig

Otis Gospodnetic schrieb:

Attribute values for fields should be inherited from attribute values
of their field types.


Thanks, that answers my question pertaining to @indexed and @stored in
the fieldtype and field elements in schema.xml.

Michael Ludwig


Re: unable to run the solr in tomcat 5.0

2009-05-06 Thread Michael Ludwig

uday kumar maddigatla schrieb:

Hi,

I'm new to this Solr. I got the Solr distribution and placed the war
file in tomcat/webapps.

After that I don't know what to do. I got confused while reading the
installation notes given in the wiki.


It might be easier for you to follow the instructions in the tutorial
and run Solr in Jetty as per the distribution, which works out of the
box:

http://lucene.apache.org/solr/tutorial.html

Michael Ludwig


Re: unable to run the solr in tomcat 5.0

2009-05-06 Thread Michael Ludwig

uday kumar maddigatla schrieb:


The link shows how things are done in Jetty. But I'm using Tomcat.

If I run the command given in the link, it tries to post
the indexes at port number 8983. But in my case my Tomcat is running
on 8080.

Where do I change the port?


That's a basic Tomcat question. The answer is: In your Tomcat's
server.xml configuration file. Look here:

http://tomcat.apache.org/tomcat-6.0-doc/config/

Then, look for the port parameter here:

http://tomcat.apache.org/tomcat-6.0-doc/config/http.html

You could also change the port in the address bar of your browser.
Or even do a string replacement s/8983/8080/g on the Solr doc you're
viewing.
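
For illustration, the relevant fragment of server.xml looks something
like this (attributes vary by Tomcat version):

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443" />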

Michael Ludwig


Re: unable to run the solr in tomcat 5.0

2009-05-06 Thread Michael Ludwig

uday kumar maddigatla schrieb:


My intention is to use 8080 as the port.

Is there any other way that Solr will post the files to port 8080?


Solr doesn't post, it listens.

Use the curl utility as indicated in the documentation.

http://wiki.apache.org/solr/UpdateXmlMessages
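
A minimal sketch (host, port and file name hypothetical):

  curl http://localhost:8080/solr/update -H "Content-Type: text/xml" \
       --data-binary @docs.xml
  curl http://localhost:8080/solr/update -H "Content-Type: text/xml" \
       --data-binary "<commit/>"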

Michael Ludwig


Re: unable to run the solr in tomcat 5.0

2009-05-06 Thread Michael Ludwig

uday kumar maddigatla schrieb:


When I try to use the command java -jar post.jar *.*, it is trying to
post files to Solr, which is there on port 8983.


The post.jar seems to be hardcoded to port 8983, which is why I pointed
you to the curl utility, which lets you specify any port and address you
can dream up.

Seriously, read the docs, it'll help you :-)

Michael Ludwig


Re: Multi-index Design

2009-05-05 Thread Michael Ludwig

Chris Masters schrieb:


 - flatten the searchable objects as much as I can - use a type field
   to distinguish - into a single index
 - use multi-core approach to segregate domains of data


Some newbie questions:

(1) What is a type field? Is it to designate different types of
documents, e.g. product descriptions and forum postings?

(2) Would I include such a type field in the data I send to the update
facility and maybe configure Solr to take special action depending on
the value of the update field?

(3) Like, write the processing results to a domain dedicated to that
type of data that I could limit my search to, as per Otis' post?

(4) And is that what's called a core here?

(5) Or, failing (3), and lumping everything together in one search
domain (core?), would I use that type field to limit my search to
a particular type of data?

Michael Ludwig


schema.xml: default values for @indexed and @stored

2009-05-04 Thread Michael Ludwig

From the apache-solr-1.3.0\example\solr\conf\schema.xml file:

<!-- since fields of this type are by default not stored or indexed,
     any data added to them will be ignored outright -->
<fieldtype name="ignored" stored="false" indexed="false"
           class="solr.StrField" />

So for both fieldtype/@stored and fieldtype/@indexed, the default is
true, correct?

And does the fieldtype configuration constitute a default for field
so that field/@stored and field/@indexed take their effective values
according to field/@type?

Or do these default to true regardless of what's specified in the
respective fieldtype?

Michael Ludwig


Re: Problem adding unicoded docs to Solr through SolrJ

2009-04-30 Thread Michael Ludwig

ahmed baseet schrieb:


I tried something stupid, but it works. I first converted the
whole string to a byte array and then used that byte array to create a
new UTF-8 encoded string like this:

// Encode in Unicode UTF-8
byte[] utfEncodeByteArray = textOnly.getBytes();


This yields a sequence of bytes using the platform's default charset,
which may not be UTF-8. Check:

* String#getBytes()
* String#getBytes(String charsetName)


String utfString = new String(utfEncodeByteArray,
Charset.forName("UTF-8"));


Note that strings in Java are always internally encoded in UTF-16, so it
doesn't make much sense to call it utfString, especially if you think
that it is encoded in UTF-8, which it is not.

The above operation is only guaranteed to succeed without losing data
(resulting in ? in the output) when the sequence of bytes is valid as
UTF-8, i.e. in this case when your platform encoding, which you've
relied upon, is UTF-8.
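
The fix is a single argument; a sketch (the 4 vs. 5 in the comment
assumes a windows-1252 default):

import java.io.UnsupportedEncodingException;

public class GetBytesDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String textOnly = "Käse";
        byte[] platformBytes = textOnly.getBytes();    // platform default
        byte[] utf8Bytes = textOnly.getBytes("UTF-8"); // explicit, portable
        // 4 vs. 5 on a windows-1252 machine: ä is one byte vs. two
        System.out.println(platformBytes.length + " vs. " + utf8Bytes.length);
    }
}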


then passed the utfString to the function for posting to Solr, and it
works perfectly.
But is there any intelligent way of doing all this, like going straight
from a default-encoded string to a UTF-8 encoded string, without going
via a byte array?


It is a feature of the java.lang.String that you don't need to know the
encoding, as the string contains characters, not bytes. Only for input
and output you are concerned with encoding. So where you're dealing with
encodings, you're dealing with bytes.

And when dealing with bytes on the wire, you're likely concerned with
encodings, for example when the page you read via HTTP comes with a
Content-Type header specifying the encoding, or when you send documents
to the Solr indexer.

For more intelligent ways, you could take a look at the class
java.nio.charset.Charset and the methods encode, decode, newEncoder,
newDecoder.
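
A small sketch of that route (class name hypothetical):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class CharsetDemo {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        ByteBuffer bytes = utf8.encode("Käse"); // characters -> bytes
        CharBuffer chars = utf8.decode(bytes);  // bytes -> characters
        System.out.println(chars);              // Käse
    }
}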

Michael Ludwig


Re: UTF8 compatibility

2009-04-29 Thread Michael Ludwig

Muhammed Sameer schrieb:


We run post.jar periodically, i.e. every 15 minutes, to commit the
changes. Is this approach correct?


Sounds reasonable to me.


SimplePostTool: WARNING: Make sure your XML documents are encoded in
UTF-8, other encodings are not currently supported


That's just to remind you not to try and post documents in another
encoding. This seems to be a limitation of the SimplePostTool, not of
Solr. I guess the reason is that in order for Solr to work quickly and
reliably, it relies on the Content-Type of the request to determine the
encoding. If, for example, you send XML encoded in ISO-8859-1, you have
to specify that in two places:

* XML declaration: <?xml version="1.0" encoding="ISO-8859-1"?>
* HTTP header: Content-Type: text/xml; charset=ISO-8859-1

The SimplePostTool, however, being just what the name says, may not
bother to read the encoding from the document and bring the HTTP content
type header in line. Instead, it explicitly requests UTF-8, probably in
the interest of simplicity.

Well, that's just my theory. Can anyone confirm?
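
To illustrate the two places with curl (file name hypothetical):

  # inside latin1.xml: <?xml version="1.0" encoding="ISO-8859-1"?>
  curl http://localhost:8983/solr/update \
       -H "Content-Type: text/xml; charset=ISO-8859-1" \
       --data-binary @latin1.xml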


So I tried to run the test_utf8.sh script and got the following output
{code}
Solr server is up.
HTTP GET is accepting UTF-8
HTTP POST is accepting UTF-8
HTTP POST defaults to UTF-8
ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic 
multilingual plane
{code}

Are these errors normal or do I need to change something ?


I'm seeing the same output, don't worry, just some tests. It is possible
to have Solr index documents containing characters outside of the BMP
(Basic Multilingual Plane), which can be verified posting something like
this:

<add>
  <doc>
    <field name="id">1001</field>
    <field name="title">BMP plus 1 &#x10000;</field>
  </doc>
</add>

Maybe the test script output says that such characters cannot be used
for querying. Hardly relevant if you consider that the BMP comprises
even languages such as Telugu, Bopomofo and French.

Best,

Michael Ludwig


Re: Performance and number of search results

2009-04-29 Thread Michael Ludwig

Wouter Samaey schrieb:


Can someone please comment on the performance impact of the number of
search results?
Is there a big difference between querying for 1 result, 10, 20, or
even 100?


Probably not, but YMMV, as the question is very general.

Consider that for fast queries the HTTP round trip may well be the
determining factor. Or XML parsing. If you've stored a lot of data in
Solr and request all of it to be returned, the difference between 1 and
100 results may be the difference between 1 and 100 KB payload.

If you think it matters, the best thing for you would be to do some
profiling for your specific scenario.

The rule of thumb here is probably: Get what you need.
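
In request terms, that simply means asking for no more than you
display, for example (values hypothetical):

  ...&q=foo&rows=10&fl=id,title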

Michael Ludwig


Re: Problem adding unicoded docs to Solr through SolrJ

2009-04-29 Thread Michael Ludwig

ahmed baseet schrieb:


public void postToSolrUsingSolrj(String rawText, String pageId) {



doc.addField("features", rawText);



In the above, the param rawText is just the HTML stripped of all
its tags, JS, CSS etc., and pageId is the URL for that page. When I'm
using this for English pages it's working perfectly fine, but the
problem comes up when I'm trying to index some non-English pages.


Maybe you're constructing a string without specifying the encoding, so
Java uses your default platform encoding?

String(byte[] bytes)
  Constructs a new String by decoding the specified array of
  bytes using the platform's default charset.

String(byte[] bytes, Charset charset)
  Constructs a new String by decoding the specified array of bytes using
  the specified charset.


Now what I did is just extract the raw text from that HTML page and
manually create an XML page like this:

<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="id">UTF2TEST</field>
    <field name="name">Test with some UTF-8 encoded characters</field>
    <field name="features">*some tamil unicode text here*</field>
  </doc>
</add>

and posted this from the command line using the post.jar file. Now
searching gives me the result, but unlike last time the browser shows
the indexed text in Tamil itself and not the raw Unicode.


Now that's perfect, isn't it?


I tried doing something like this also,



// Encode in Unicode UTF-8
 utfEncodedText = new String(rawText.getBytes("UTF-8"));

but even this didn't help either.


No encoding specified, so the default platform encoding is used, which
is likely not what you want. Consider the following example:

package milu;
import java.nio.charset.Charset;
public class StringAndCharset {
  public static void main(String[] args) {
byte[] bytes = { 'K', (byte) 195, (byte) 164, 's', 'e' };
System.out.println(Charset.defaultCharset().displayName());
System.out.println(new String(bytes));
System.out.println(new String(bytes, Charset.forName("UTF-8")));
  }
}

Output:

windows-1252
KÃ¤se (bad)
Käse (good)

Michael Ludwig


Highlighting using XML instead of strings?

2009-04-29 Thread Michael Ludwig

http://wiki.apache.org/solr/HighlightingParameters

I can specify the strings to highlight matched text with using
hl.simple.pre and hl.simple.post, for example <b> and </b>.
The result looks like this:

  <str>&lt;b&gt;Eumel&lt;/b&gt; NDR Ländermagazine</str>

However, what if as the result of favouring XML over strings,
I rather want something like this:

  <str><b>Eumel</b> NDR Ländermagazine</str>

There could be a parameter hl.xml which I could use to request
modified XML like this:

  hl.xml=em
  hl.xml=b

This would allow smoother processing with technologies like XSLT.
Is such a feature available?
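
For example, with real elements a stylesheet could match the markup
directly; a sketch, assuming the proposed hl.xml parameter existed:

<xsl:template match="str/b">
  <span class="hit"><xsl:apply-templates/></span>
</xsl:template>

With the escaped variant, the stylesheet would first have to re-parse
the string content.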

Michael Ludwig