Re: problem in setting field attribute in schema.xml

2011-05-25 Thread Michael Lackhoff

On 25.05.2011 15:47, Vignesh Raj wrote:

It's very strange. I tried the same thing just now and am getting the same result.
I have set both indexed=false and stored=false.
But if I search for a keyword using my default search, I still get
results in these fields as well.
But if I specify field:value, it shows 0 results.

Can anyone explain?


I guess you copy the field to your default search field.

-Michael


Re: problem in setting field attribute in schema.xml

2011-05-26 Thread Michael Lackhoff

On 26.05.2011 12:52, Romi wrote:

I have done it: I deleted the old indexes and created new ones, but I am still
able to find the document via *:*, and get no result when I search for it as
field:value. A really surprising result. :-O


I really don't understand your problem. This is not at all surprising 
but the expected behaviour:
*:* just gives you every document in your index, no matter which parts of the 
document are stored or indexed; it simply returns _everything_, whereas
field:value does an actual search for an indexed value "value" 
in field "field". So it is no surprise either that you didn't get a result 
here if you didn't index "field".


-Michael


Re: problem in setting field attribute in schema.xml

2011-05-26 Thread Michael Lackhoff

On 26.05.2011 14:10, Romi wrote:

Did you mean that when I set indexed="false" and stored="true", Solr does not index
the field's value but stores its value as it is?


I don't know if you are asking me, since you do not quote anything, but 
yes, of course: this is exactly the purpose of "indexed" and "stored".


-Michael


EnglishPorterFilterFactory and PatternReplaceFilterFactory

2009-07-02 Thread Michael Lackhoff
In Germany we have a strange habit of treating umlaut letters and their
two-letter representations as equivalent. For example, 'ä' and
'ae' are expected to give the same search results. To achieve this I
added a PatternReplaceFilterFactory to the "text" fieldtype definition,
applied to both the index and the query analyzer (and more such filters
for the other umlauts).
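
A minimal sketch of such a mapping (the exact pattern/replacement values here
are just an illustration, not my original config):

  <filter class="solr.PatternReplaceFilterFactory"
          pattern="ä" replacement="ae" replace="all"/>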

This works well when I search for a name (a word that is not stemmed) but not,
e.g., with the word "Wärme":
- a search for 'wärme' works
- a search for 'waerme' does not work
- a search for 'waerm' works if I move the EnglishPorterFilterFactory after
  the PatternReplaceFilterFactory

DebugQuery for "waerme" gives a parsedquery FS:waerm.
What I don't understand is why the (existing) records are not found. If
I understand it right, there should be 'waerm' in the index as well.

By the way, the reason why I keep the EnglishPorterFilterFactory is that
the records are in many languages and the English stemming gives good
results in many cases, and I don't want (yet) to multiply my fields to
have language-specific versions.
But even if the stemming is not right because the language is not
English I think records should be found as long as the analyzers are the
same for index and query.

This is with Solr 1.3.

Can someone shed some light on what is going on and how I can achieve my
goal?

-Michael


Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory

2009-07-02 Thread Michael Lackhoff
On 02.07.2009 16:34 Walter Underwood wrote:

> First, don't use an English stemmer on German text. It will give some odd
> results.

I know, but at the moment I only have the choice between no stemmer at
all and a single stemmer, and since more than half of the records are English
(about 60% English, 30% German, some Italian, French and others) the
results are not too bad.

> Are you using the same conversions on the index and query side?

Yes, index and query look exactly the same. That is what I don't
understand. I am not complaining about a misbehaving stemmer, unless it
already does something odd with the umlauts.

> The German stemmer might already handle "typewriter umlauts". If it doesn't,
> use the pattern replace factory. You will also need to convert "ß" to "ss".

That is what I tried. And yes I also have a filter for "ß" to "ss". It
just doesn't work as expected.

> You really do need separate fields for each language.

Eventually. But right now I have to get a small application ready very
soon, and people don't find what they expect.

> Handling these characters is language-specific. The typewriter umlaut
> conversion is wrong for English. It is correct, but rare, to see a diaeresis
> in English when vowels are pronounced separately, like "coöperate". In
> Swedish, it is not OK to convert "ö" to another letter or combination
> of letters.

It is just for German users, and at the moment it would be totally OK to
have "coöperate" indexed as "cooeperate". I know it is wrong and it will
be fixed, but given the tight schedule all I want at the moment is the
combination of some stemming (perhaps 70% right or more) and "typewriter
umlauts" (perhaps 90% correct; you gave examples for the missing 10%).

Do I have any chance?

-Michael



Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory

2009-07-02 Thread Michael Lackhoff
On 02.07.2009 17:28 Erick Erickson wrote:

> I'm shooting a bit in the dark here, but I'd guess that these are
> actually understandable results.

Perhaps not too much in the dark.

> That is, your implicit assumption, it seems to me, is that 'wärme' and
> 'waerme' should go through the stemmer and
> become 'wärm' and 'waerm', so that you can then do the substitution
> and produce the same output. I don't think that's a valid
> assumption.

Sounds very reasonable. Will see what I can make out of all this to keep
our librarians happy...

Yonik Seeley wrote:

> Also, check out MappingCharFilterFactory in Solr 1.4
> and mapping-ISOLatin1Accent.txt in example/solr/conf

Thanks for the hint, I'm looking forward to the 1.4 release ;-) At the
moment we are on 1.3 though; I hope to upgrade soon, but probably not
soon enough for this app.

-Michael


Preparing the ground for a real multilang index

2009-07-02 Thread Michael Lackhoff
As pointed out in the recent thread about stemmers and other language
specifics I should handle them all in their own right. But how?

The first problem is how to know the language. Sometimes I have a
language identifier within the record, sometimes I have more than one,
sometimes I have none. How should I handle the non-obvious cases?

Suppose I somehow know that record1 is English and record2 is German. Then I
need all my (relevant) fields for every language, e.g. I will have
TITLE_ENG and TITLE_GER, each with its respective stemmer. But
what about exotic languages? Use a catch-all "language" without a stemmer?
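
For illustration, a sketch of what such a schema could look like (all field
and type names here are assumptions, not an existing config):

  <fieldType name="text_eng" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_ger" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    </analyzer>
  </fieldType>

  <field name="TITLE_ENG" type="text_eng" indexed="true" stored="true"/>
  <field name="TITLE_GER" type="text_ger" indexed="true" stored="true"/>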

Now a user searches for TITLE:term and I don't know beforehand the
language of "term". Do I have to expand the query to something like
"TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ..." or is there
some sort of copyfield for analyzed fields? Then I could just copy all
the TITLE_* fields to TITLE and don't bother with the language of the query.

Are there any solutions that prevent an index with thousands of fields
and dozens of ORed query terms?

I know I will have to implement some better multilanguage support but
would also like to keep it as simple as possible.

-Michael


Re: Preparing the ground for a real multilang index

2009-07-02 Thread Michael Lackhoff
On 03.07.2009 00:49 Paul Libbrecht wrote:

[I'll try to address the other responses as well]

> I believe the proper way is for the server to compute a list of  
> accepted languages in order of preferences.
> The web-platform language (e.g. the user-setting), and the values in  
> the Accept-Language http header (which are from the browser or  
> platform).

All this is not going to help much because the main application is a
scientific search portal for books and articles with many users
searching across languages. The most typical use case is a German user
searching multilingually. So we might even get multilingual queries,
e.g. TITLE:cancer OR TITLE:krebs. There is no point here in looking at
Accept-Language headers or a language select field (it would be left on "any" in
most cases). Other popular use cases are citations (in whatever
language) cut and pasted into the search field.

> Then you expand your query for surfing waves (say) to:
> - phrase query: surfing waves exactly (^2.0)
> - two terms, no stemming: surfing waves (^1.5)
> - iterate through the languages and query for stemmed variants:
>  - english: surf wav ^1.0
>  - german: surfing wave ^0.9
>  - ...
> - then maybe even try the phonetic analyzer (matched in a separate  
> field probably)

This is an even more sophisticated variant of the multiple "OR" I came
up with. Oh well...

> I think this is a common pattern on the web where the users, browsers,  
> and servers are all somewhat multilingual.

Indeed, and often users are not even aware of it; especially in a
scientific context they use their native tongue and English almost
interchangeably -- and they expect the search engine to cope with it.

I think the best approach would be to process the data according to its
language but make no assumptions about the query language, and I am totally
at a loss how to get a clever schema.xml out of all this.

Thanks everyone for listening and I am still open for good suggestions
to deal with this problem!

-Michael


Re: Preparing the ground for a real multilang index

2009-07-07 Thread Michael Lackhoff
On 08.07.2009 00:50 Jan Høydahl wrote:

> itself and do not need to know the query language. You may then want
> to do a copyField from all your text_ fields -> text for a convenient
> one-field-to-rule-them-all search.

Would that really help? As I understand it, copyField takes the raw,
not-yet-analyzed field value. I cannot yet see the advantage of this
"text" field over the current situation with no text_ fields at all.
The copied-to text field has to be language-agnostic with no stemming at
all, so it would miss many hits. Or is there a way to combine many
differently stemmed variants into one field to be able to search against
all of them at once? That would be great indeed!

-Michael


Getting started with DIH

2009-11-08 Thread Michael Lackhoff
I would like to start using DIH to index some RSS feeds and mail folders.

To get started I tried the RSS example from the wiki but as it is Solr
complains about the missing id field. After some experimenting I found
out two ways to fill the id:

- Filling the id via schema.xml
This works but isn't very flexible. Perhaps I have other types of
records with a real id or a multivalued link-field. Then this solution
would break.

- Changing the id field to type "uuid"
Again I would like to keep real ids where I have them and not a random UUID.

What didn't work but looks like the potentially best solution is to fill
the id in my data-config by using the link twice:
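Roughly like the following two field mappings (the xpath shown here is an
assumption, modelled on the slashdot example):

  <field column="link" xpath="/RDF/item/link" />
  <field column="id"   xpath="/RDF/item/link" />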
  
  
This would be a definition just for this single data source but I don't
get any docs (also no error message). No trace of any inserts whatsoever.
Is it possible to fill the id that way?

Another question regarding MailEntityProcessor
I found this example:
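(roughly an entity along these lines; the attribute values are the wiki's
placeholders, reconstructed here, so they may not be exact):

  <entity processor="MailEntityProcessor"
          user="somebody@gmail.com"
          password="something"
          host="imap.gmail.com"
          protocol="imaps"
          fetchMailsSince="2009-09-20 00:00:00"/>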

   


But what is the dataSource (the enclosing tag around document)? That is, what
would a minimal but complete data-config.xml look like to index mails
from an IMAP server?

And finally, is it possible to combine the definitions for several
RSS-Feeds and Mail-accounts into one data-config? Or do I need a
separate config file and request handler for each of them?

-Michael


Re: Getting started with DIH

2009-11-08 Thread Michael Lackhoff
On 08.11.2009 17:03 Lucas F. A. Teixeira wrote:

> You have an example on using mail dih in solr distro

Don't know where my eyes were. Thanks!

While I was at it, I looked at the schema.xml for the RSS example: it
uses "link" as uniqueKey, which is of course fine if you only have RSS
items but not so good if you also plan to add other data sources.
So I am still interested in a good solution for my id problem:

>> What didn't work but looks like the potentially best solution is to fill
>> the id in my data-config by using the link twice:
>>  
>>  
>> This would be a definition just for this single data source but I don't
>> get any docs (also no error message). No trace of any inserts whatsoever.
>> Is it possible to fill the id that way?

and this one:

>> And finally, is it possible to combine the definitions for several
>> RSS-Feeds and Mail-accounts into one data-config? Or do I need a
>> separate config file and request handler for each of them?

Thanks
-Michael


Re: Getting started with DIH

2009-11-08 Thread Michael Lackhoff
On 08.11.2009 16:56 Michael Lackhoff wrote:

> What didn't work but looks like the potentially best solution is to fill
> the id in my data-config by using the link twice:
>   
>   
> This would be a definition just for this single data source but I don't
> get any docs (also no error message). No trace of any inserts whatsoever.
> Is it possible to fill the id that way?

Found the answer in the list archive: use TemplateTransformer:
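Roughly like this (the entity name "slashdot" in the template is an assumption):

  <field column="link" xpath="/RDF/item/link" />
  <field column="id" template="${slashdot.link}" />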
  
  

Only a minor and cosmetic problem: there are brackets around the id field
(like [http://somelink/]). For an id this doesn't really matter but I
would like to understand what is going on here. In the wiki I found only
this info:
> The rules for the template are same as the templates in 'query', 'url'
> etc
but I couldn't find any info about those either. Is this documented
somewhere?

-Michael


Re: Getting started with DIH

2009-11-08 Thread Michael Lackhoff
On 09.11.2009 06:54 Erik Hatcher wrote:

> The brackets probably come from it being transformed as an array.  Try
> saying multiValued="false" on your field specifications.

Indeed. Thanks Erik that was it.

My first steps with DIH showed me what a powerful tool it is, but
although the DIH wiki page might well be the longest in the whole wiki,
there are still many mysteries left for the uninitiated. Is there any other
documentation I might have missed?

Thanks
-Michael


Re: Getting started with DIH

2009-11-08 Thread Michael Lackhoff
On 09.11.2009 08:20 Noble Paul നോബിള്‍ नोब्ळ् wrote:

> It just started off as a single page and the features just got piled up
> and the page just got bigger.  We are thinking of cutting it down into
> smaller, more manageable pages

Oh, I like it the way it is as one page, so that the browser's full-text
search can help. It is just that the features and power seem to grow
even faster than the wiki page ;-)
E.g. I couldn't find a way to add a second RSS feed. I tried with a
second entity parallel to the slashdot one but got an exception:
"java.io.IOException: FULL", whatever that means, so I must be doing
something wrong but couldn't find a hint.

-Michael


How to import multiple RSS-feeds with DIH

2009-11-08 Thread Michael Lackhoff
[A new thread for this particular problem]

On 09.11.2009 08:44 Noble Paul നോബിള്‍ नोब्ळ् wrote:

> The tried and tested strategy is to post the question in this mailing
> list w/ your data-config.xml.

See my data-config.xml below. The first entity is the usual slashdot example
with my 'id' addition, the second a very simple additional feed. The
second example works if I delete the slashdot feed, but as I said I would
like to have them both.

-Michael


  

<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <entity name="slashdot"
            url="http://rss.slashdot.org/Slashdot/slashdot"
            processor="XPathEntityProcessor"
            forEach="/RDF/channel | /RDF/item"
            transformer="TemplateTransformer,DateFormatTransformer">
      <!-- the individual <field column="..." xpath="..."/> mappings were
           stripped from the archived message -->
    </entity>
    <entity name="heise"
            url="http://www.heise.de/newsticker/heise.rdf"
            processor="XPathEntityProcessor"
            forEach="/RDF/channel | /RDF/item"
            transformer="TemplateTransformer">
      <!-- field mappings stripped from the archived message -->
    </entity>
  </document>
</dataConfig>




Re: How to import multiple RSS-feeds with DIH

2009-11-09 Thread Michael Lackhoff
On 09.11.2009 09:46 Noble Paul നോബിള്‍ नोब्ळ् wrote:

> When you say the second example does not work, what does it mean?
> Some exception? (If yes, please post the stacktrace.)

Very mysterious. Now it works, but I am sure I got an exception before.
All I remember is something like "java.io.IOException: FULL". In the
right frame of the DIH debugging screen I got an error message from
Firefox: "the connection was reset while displaying the page".

But I don't think it is reproducible now; perhaps it was some unrelated problem
like low memory or such. Thanks anyway and sorry for the noise.

-Michael


Re: schema-based Index-time field boosting

2009-11-23 Thread Michael Lackhoff
On 23.11.2009 19:33 Chris Hostetter wrote:

> ...if there was a way to boost fields at index time that was configured in 
> the schema.xml, then every doc would get that boost on its instances of 
> those fields but the only purpose of index time boosting is to indicate 
> that one document is more significant than another doc -- if every doc 
> gets the same boost, it becomes a No-OP.
> 
> (think about the math -- field boosts become multipliers in the fieldNorm 
> -- if every doc gets the same multiplier, then there is no net effect)

Coming in a bit late but I would like a variant that is not a No-OP.
Think of something like title:searchstring^10 OR catch_all:searchstring
Of course I can always add the boosting at query time but it would make
life easier if I could define a default boost in the schema so that my
query could just be title:searchstring OR catch_all:searchstring
but still get the boost for the title field.

Thinking this further it would be even better if it was possible to
define one (or more) fallback field(s) with associated boost factor in
the schema. Then it would be enough to query for title:searchstring and
it would be automatically expanded to e.g.
title:searchstring^10 OR title_other_language:searchstring^5 OR
catchall:searchstring
or whatever you define in the schema.

-Michael




Moving from single core to multicore

2009-02-09 Thread Michael Lackhoff
Hello,

I am not that experienced but managed to get a Solr index going by
copying the "example" dir from the distribution (the released 1.3 version)
and changing the fields in schema.xml to my needs. Everything
is working very well so far.
Now I need a second index on the same machine and the natural solution
seems to be multicore (I would really like to keep the two distinct so I
didn't put everything in one index).
But I have some problems setting this up. As long as I use the multicore
sample everything works, but when I copy my schema.xml into the
multicore/core0/conf dir I only get 404 error messages when I open the
admin URL.
It looks like I cannot just copy a single-core config over to a multicore
environment, and that is o.k.; what I am missing is some guidance on what to
look out for. What are the settings that have to be adjusted for
multicore? I would like to avoid trial and error for every single
setting I have in my config.

And a related question: I would like to keep the existing data dir as
core0-datadir (/path_to_installation/example/solr/data). Is this
possible with the dataDir parameter? And if yes, what would be the
correct value? "/solr/data/" or
"/path_to_installation/example/solr/data/"? Do I need an absolute path
or is it relative to the dir where my start.jar is?

Thanks,
Michael


Re: Moving from single core to multicore

2009-02-09 Thread Michael Lackhoff
On 09.02.2009 15:40 Ryan McKinley wrote:

>> But I have some problems setting this up. As long as I try the  
>> multicore
>> sample everything works but when I copy my schema.xml into the
>> multicore/core0/conf dir I only get 404 error messages when I enter  
>> the
>> admin url.
> 
> what is the url you are hitting?
Those from the wiki: http://localhost:8983/solr/core0/select?q=*:*
> Do you see links from the index page?
Sorry, I don't know what you mean by this.

> Are there any messages in the log files?

This looks like the key. The output is a bit difficult to follow but I
found the most likely reason: the txt files were missing (stopwords.txt,
synonyms.txt ...) and then the fieldtype definitions failed. After I
copied the complete conf dir over to multicore it is almost working now.

The only remaining problems: first, I get this warning:
2009-02-09 16:27:31.177::WARN:  /solr/admin/
java.lang.IllegalStateException: STREAM
at org.mortbay.jetty.Response.getWriter(Response.java:571)
[lots more]

and second, both cores seem to reference the old single-core data. If I do a
search, both give (the same) results (from the old core); I expected them
to be empty, searching in a newly created index somewhere below the
"multicore" dir.

I couldn't find a datadir definition so I still don't know how to add a
real second core (not just two cores with the same data).

Any ideas?

Thanks so far
Michael


Re: Moving from single core to multicore

2009-02-09 Thread Michael Lackhoff
On 09.02.2009 17:01 Ryan McKinley wrote:

> Check your solrconfig.xml; you probably have something like this:
> 
>    <dataDir>${solr.data.dir:./solr/data}</dataDir>
> 
> (from the example)
> 
> either remove that or make each one point to the correct location

Thanks, that's it!
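
So each core's solrconfig.xml now points to its own directory, roughly like
this (the paths are just examples):

  <dataDir>/path_to_installation/example/solr/data</dataDir>
  <dataDir>/path_to_installation/example/multicore/core1/data</dataDir>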

Now all that is left is a more cosmetic change I would like to make:
I tried to place the solr.xml in the example dir to get rid of the
"-Dsolr.solr.home=multicore" start parameter, changed the first entry
from "core0" to "solr", and moved the core1 dir from multicore directly
under the example dir.
Idea behind all this: Use the original single core under "solr" as core0
and add a second one on the same directory level ("core1" parallel to
"solr"). Then I started solr with the old "java -jar start.jar" in the
"example" dir. But the multicore config seems to be ignored then, I get
my old single core e.g. http://localhost:8983/solr/core1/select?q=*:* is
no longer found.
As I said everything works if I leave it in the multicore subdir and
start with "-Dsolr.solr.home=multicore" but it would be nice if I could
do without that extra subdir and the extra start parameter.

--Michael



Re: Moving from single core to multicore

2009-02-10 Thread Michael Lackhoff
On 10.02.2009 02:39 Chris Hostetter wrote:

> : Now all that is left is a more cosmetic change I would like to make:
> : I tried to place the solr.xml in the example dir to get rid of the
> : "-Dsolr.solr.home=multicore" for the start and changed the first entry
> : from "core0" to "solr" and moved the core1 dir from multicore directly
> : under the example dir
> : Idea behind all this: Use the original single core under "solr" as core0
> : and add a second one on the same directory level ("core1" parallel to
> : "solr"). Then I started solr with the old "java -jar start.jar" in the
> : "example" dir. But the multicore config seems to be ignored then, I get
> 
> solr looks for conf/solr.xml relative to the "Solr Home Dir" and if it 
> doesn't find it then it looks for conf/solrconfig.xml ... if you don't set 
> the solr.solr.home system property then the Solr Home Dir defaults to 
> "./solr/"
> 
> so putting your new solr.xml file in example/solr/conf should be what you 
> are looking for.

Almost. I had to change solr.xml, otherwise everything was expected
under ./solr (it was looking for solr/solr and solr/core1).
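
A sketch of what the resulting solr.xml looks like (attribute names and values
here are an approximation, not the exact file):

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="core0" instanceDir="solr" />
      <core name="core1" instanceDir="core1" />
    </cores>
  </solr>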

  






  

The dataDir property seems to be ignored, though; I had to set it in the
solrconfig.xml of both cores.

Thanks for all your help, the support all of you are giving is really
outstanding!
--Michael



Is semicolon a character that needs escaping?

2010-09-02 Thread Michael Lackhoff
According to http://lucene.apache.org/java/2_9_1/queryparsersyntax.html
only these characters need escaping:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
but with this simple query:
TI:stroke; AND TI:journal
I got the error message:
HTTP ERROR: 400
Unknown sort order: TI:journal

My first guess was that it was a URL encoding issue but everything looks
fine:
http://localhost:8983/solr/select/?q=TI%3Astroke%3B+AND+TI%3Ajournal&version=2.2&start=0&rows=10&indent=on
As you can see, the semicolon is encoded as %3B.
There is no problem when the query ends with the semicolon:
TI:stroke;
gives no error.
The first query also works if I escape the semicolon:
TI:stroke\; AND TI:journal

From this I conclude that there is a bug either in the docs or in the
query parser or I missed something. What is wrong here?

-Michael


Re: Is semicolon a character that needs escaping?

2010-09-02 Thread Michael Lackhoff
On 03.09.2010 00:57 Ken Krugler wrote:

> The docs need to be updated, I believe. From some code I wrote back in  
> 2006...
> [...]

Thanks this explains it very well.

> But in general escaping characters in a query gets tricky - if you can  
> directly build queries versus pre-processing text sent to the query  
> parser, you'll save yourself some pain and suffering.

What do you mean by these two alternatives? That is, what exactly could
I do better?

> Also, since I did the above code the DisMaxRequestHandler has been  
> added to Solr, and it (IIRC) tries to be smart about handling this  
> type of escaping for you.

Dismax is not (yet) an option because we need the full Lucene syntax
within the query. Perhaps this will change with the new enhanced dismax
request handler, but I didn't play with it enough yet (I will with the next
release).

-Michael


Re: Is semicolon a character that needs escaping?

2010-09-02 Thread Michael Lackhoff
Hi Ken,

>>> But in general escaping characters in a query gets tricky - if you  
>>> can
>>> directly build queries versus pre-processing text sent to the query
>>> parser, you'll save yourself some pain and suffering.
>>
>> What do you mean by these two alternatives? That is, what exactly  
>> could
>> I do better?
> 
> By "can build...", I meant if you can come up with a GUI whereby the  
> user doesn't have to use special characters (other than say quoting)  
> then you can take a collection of clauses and programmatically build  
> your query, without using the query parser.

I think I have that (escaping of characters that have a special meaning
in Solr). I just didn't know that the semicolon is one of them. So it
would be nice if the docs could be updated to account for this.

Thanks again
-Michael


Re: Is semicolon a character that needs escaping?

2010-09-07 Thread Michael Lackhoff
On 08.09.2010 00:05 Chris Hostetter wrote:

> 
> : Subject: Is semicolon a character that needs escaping?
>   ...
> : >From this I conclude that there is a bug either in the docs or in the
> : query parser or I missed something. What is wrong here?
> 
> Back in Solr 1.1, the standard query parser treated ";" as a special 
> character and looked for sort instructions after it.  
> 
> Starting in Solr 1.2 (released in 2007) a "sort" param was added, and 
> semicolon was only considered a special character if you did not 
> explicitly mention a "sort" param (for back compatibility)
> 
> Starting with Solr 1.4, the default was changed so that semicolon wasn't 
> considered a meta-character even if you didn't have a sort param -- you 
> have to explicitly select the "lucenePlusSort" QParser to get this 
> behavior.
> 
> I can only assume that if you are seeing this behavior, you are either 
> using a very old version of Solr, or you have explicitly selected the 
> lucenePlusSort parser somewhere in your params/config.
> 
> This was heavily documented in CHANGES.txt for Solr 1.4 (you can find 
> mention of it when searching for either ";" or "semicolon")

I am using 1.3 without a sort param which explains it, I think. It would
be nice to update to 1.4 but we try to avoid such actions on a
production server as long as everything runs fine (the semicolon thing
was only reported recently).

Many thanks for your detailed explanation!
-Michael


Re: Confused by Solr Ranking

2010-03-09 Thread Michael Lackhoff
On 09.03.2010 16:01 Ahmet Arslan wrote:

> 
>> I kind of suspected stemming to be the reason behind this.
>> But I consider stemming to be a good feature.
> 
> This is the side effect of stemming. Stemming increases recall while harming 
> precision.

But most people want the best possible combination of both, something like:
(raw_field:word OR stemmed_field:word^0.5)
and it is nice that Solr allows such arrangements but it would be even
nicer to have some sort of automatic "take this field, transform the
contents in a couple of ways and do some boosting in the order given".
At least that would be my answer to the recent question about the one
feature I would like to see.
Or even better, allow not only a hierarchy of transformations but also a
hierarchy of fields (like in dismax, but with the full power of the
standard request handler)

-Michael



Re: exceptionhandling & error-reporting?

2010-04-06 Thread Michael Lackhoff
On 06.04.2010 17:49 Alexander Rothenberg wrote:

> On Monday 05 April 2010 20:14:44 Chris Hostetter wrote:
>> define "crashes" ? ... presumabl you are tlaking about the client crashing
>> because it can't parse theerro response, correct? ... the best suggestion
>> given the current state of Solr is to make hte client smart enough to not
>> attempt parsing of hte response unless the response code is 200.
> 
> Yes, it tries to parse the HTML output but expects JSON syntax. Because it 
> is a Perl module from CPAN, I don't really want to customize it...

You don't have to. Just wrap the call in an eval, at least that is what
I do.

-Michael


Re: Very basic questions: Indexing text

2010-06-28 Thread Michael Lackhoff
On 28.06.2010 23:00 Ahmet Arslan wrote:

>> 1) I can get my docs in the index, but when I search, it
>> returns the entire document.  I'd love to have it only
>> return the line (or two) around the search term.
> 
> Solr can generate Google-like snippets as you describe. 
> http://wiki.apache.org/solr/HighlightingParameters

I didn't know this was possible and am also interested in this feature,
but even after reading the given wiki page I cannot make out which
parameter to use. The only parameter that looks similar is
'hl.maxAlternateFieldLength', where it is possible to give a length to
return, but according to the description that is for the "no match" case.
And there is "hl.fragmentsBuilder", but with no explanation (the referenced
page SolrFragmentsBuilder does not yet exist).

Could you give an example?
E.g. let's say I have a field 'title' and a field 'fulltext' and my
search term is 'solr'. What would be the right set of parameters to get
back the whole title field but only a snippet of 50 words (or three
sentences or whatever the unit) from the fulltext field?


Thanks
-Michael


Re: Another text I cannot get into SOLR with csv

2008-01-08 Thread Michael Lackhoff

On 08.01.2008 16:11 Yonik Seeley wrote:


Ahh, wait, it looks like a single quote is used as the encapsulator for split field
values by default.
Try adding f.PUBLPLACE.encapsulator=%00
to disable the encapsulation.


Hmm. Yes, this works but:
- I didn't find anything about it in the docs (wiki). On the contrary
  it suggests that the single quote has to be explicitly set:
  f.tags.encapsulator='

(http://wiki.apache.org/solr/UpdateCSV?#head-c238cb494f800d345766acda16e08d82663127ce)
- A literal encapsulator should be possible to add by doubling
  it ' => '' but this gives the same error
- Is it possible to change the split field separator for all fields? The
  URL is getting rather long already.
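
For reference, the kind of URL this leads to looks roughly like this (file
name shortened and only one field shown; this is just an illustration):

  http://localhost:8983/solr/update/csv?stream.file=data.csv&separator=%09&f.PUBLPLACE.split=true&f.PUBLPLACE.encapsulator=%00&commit=true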



Re: Another text I cannot get into SOLR with csv

2008-01-08 Thread Michael Lackhoff

On 08.01.2008 16:55 Yonik Seeley wrote:


- A literal encapsulator should be possible to add by doubling
   it ' => '' but this gives the same error


I think you would have to triple it (the first is the encapsulator).
Regardless, don't use encapsulation on the split fields unless you
have to.


I don't want to use encapsulation; it is just that the character is 
_interpreted_ as an encapsulation character and I need a way to tell SOLR 
that it is not.



- is it possible to change the split field separator for all fields? The
   URL is getting rather long already.


if "f.myfield.separator" is missing, it uses "separator"  (standard
per-field parameters).
So if everything uses "," you don't have to specify a separator anywhere.


Oh, sorry, I meant encapsulator of course, not separator. The 
encapsulator is the problem and I would like a way shorter than
&f.myfield1.encapsulator=%00&f.myfield2.encapsulator=%00... for about 20 
fields, in addition to the parameters that are necessary to tell SOLR 
that all these are split fields.


-Michael


Re: Another text I cannot get into SOLR with csv

2008-01-08 Thread Michael Lackhoff

On 08.01.2008 19:09 Yonik Seeley wrote:


There is no shorter way, but if you update to the latest solr-dev
(changes I checked in today), the default will be no encapsulation for
split fields.


Many thanks, also for your patience!
Do you think the dev-version is ready for production?

-Michael


Some sort of join in SOLR?

2008-01-16 Thread Michael Lackhoff

Hello,

I have two sources of data for the same "things" to search. It is book 
data in a library. First there is the usual bibliographic data (author, 
title...) and then I have scanned and OCRed table of contents data about 
the same books. Both are updated independently.

Now I don't know how to best index and search this data.
- One option would be to save the data in different records. That would
  make updates easy because I don't have to worry about the fields
  from the other source. But searching would be more difficult: I have
  to do an additional search for every hit in the "contents" data to
  get the bibliographic data.
- The other option would be to save everything in one record but then
  updates would be difficult. Before I can update a record I must first
  look if there is any data from the other source, merge it into the
  record and only then update it. This option sounds very time consuming
  for a complete reindex.

The best solution would be some sort of join: Have two records in the 
index but always give both in the result no matter where the hit was.

Any ideas on how to best organize this kind of data?

-Michael



Re: Some sort of join in SOLR?

2008-01-17 Thread Michael Lackhoff

On 17.01.2008 16:53 Erick Erickson wrote:


I would *strongly* encourage you to store them together
as one document. There's no real method of doing
DB like joins in the underlying Lucene search engine.


Thanks, that was also my preference.


But that's generic advice. The question I have for you is
"What's the big deal about coordinating the sources?"
That is, you have to have something that allows you to
make a 1:1 correspondence between your data sources
or you couldn't relate them in the first place. Is it really
that onerous to check?


I don't have an index to check. Both sources come in huge text files, 
one of them daily, the other irregularly. One has the ID, the other has a 
different ID that must first be mapped to the ID of the first source. So 
there is no easy way of saying: "Give me the record for this ID from the 
other set of records". It is all buried in plain text files.



If it is, why not build an index and search it when you
want to know?


That is what I will do now: build a SQLite database with just two 
columns, ID and contents, with an index on the ID. Then, when I rebuild 
the SOLR index by processing the other data, I will look up in the SQLite DB 
whether there is a corresponding record from the other source.

My hope was that I could avoid this intermediate database.


You haven't described enough of your problem
space for me to render any opinion of whether
this is premature optimization or not, but it
sure smells like it from a distance ...


I don't think it was premature optimization. It was just the attempt to 
keep the nightly rebuild of the index as easy as possible and to avoid 
unnecessary complexity. But if it is necessary I will go this way.


-Michael


Re: Some sort of join in SOLR?

2008-01-17 Thread Michael Lackhoff

On 17.01.2008 18:32 Erick Erickson wrote:


There's some cost here, and I don't know how this
all plays with the sizes of your indexes. It may be
totally impractical.

Anyway, back to work.


I think I will have to play with the different possibilities and see 
what fits my situation best. There will be many things to learn (I am 
a newbie to SOLR, Lucene and Java) until everything plays nicely together.

As you say, back to work...

Thanks
-Michael



Re: Some sort of join in SOLR?

2008-01-17 Thread Michael Lackhoff

On 17.01.2008 23:48 Chris Hostetter wrote:

assuming these are simple delimited files, something like the unix "join" 
command can do this for you ... then your indexing code can just process 
one file linearly.  (if they aren't simple delimited files, you can 
preprocess them to strip out the excess markup and make them simple 
delimited files ... depending on what these look like, you might not even 
need much custom indexing code at all .. "join" and the CSV update 
request handler might solve all your needs)


Thanks for the hint, haven't heard of a unix "join" command yet but will 
have a look.
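
A minimal sketch of how that could work for two tab-delimited files keyed on
their first column (file names made up; join needs its input sorted on the
join field):

  sort biblio.txt > biblio.sorted
  sort toc.txt > toc.sorted
  join -t "$(printf '\t')" biblio.sorted toc.sorted > merged.txt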


-Michael



Out of heap space with simple updates

2008-01-23 Thread Michael Lackhoff
I wanted to try to do the daily update with XML updates (recently mentioned 
as the recommended way) but got an "OutOfMemoryError: Java heap 
space" after 319,000 records.
I am sending one document at a time through the HTTP update interface, 
so every request should be small enough not to run out of memory.
Do I have to commit after every few thousand records to avoid the error? 
My understanding was that I have to commit only at the very end. Or 
are there other things I could try?
How can I increase the heap size? I use the included Jetty and start 
Solr with "java -jar start.jar".

After I ran into the error a commit wasn't possible either.

What is the best way to avoid this sort of problem?

Thanks
-Michael



Re: Out of heap space with simple updates

2008-01-23 Thread Michael Lackhoff

On 23.01.2008 20:57 Chris Harris wrote:


I'm using

java -Xms512M -Xmx1500M -jar start.jar



Thanks! I did see the -X... params in recent threads but didn't know 
where to place them -- not being a Java guy at all ;-)


-Michael



Re: wildcard newbie question

2008-01-30 Thread Michael Lackhoff

On 31.01.2008 00:31 Alessandro Senserini wrote:

I have a text field type called courseTitle and it contains 


Struts 2

If I search courseTitle:strut*  I get the documents but if I search with
courseTitle:struts* I do not get any results.

Could you please explain why?


Just a guess: It might be because of stemming. Do you have the same 
effect with words that don't end in an 's' or similar?

If my guess is correct, only 'strut' is in the index, not 'struts'.

-Michael



Re: Searching for future or "null" dates

2008-09-23 Thread Michael Lackhoff
On 23.09.2008 00:30 Chris Hostetter wrote:

> : Here is what I was able to get working with your help.
> : 
> : (productId:(102685804)) AND liveDate:[* TO NOW] AND ((endDate:[NOW TO *]) OR
> : ((*:* -endDate:[* TO *])))
> : 
> : the *:* is what I was missing.
> 
> Please, PLEASE ... do yourself a favor and stop using "AND" and "OR" ...  
> food will taste better, flowers will smell fresher, and the world will be 
> a happy shiny place...
> 
> +productId:102685804 +liveDate:[* TO NOW] +(endDate:[NOW TO *] (*:* 
> -endDate:[* TO *]))

I would also like to follow your advice but don't know how to do it with
defaultOperator="AND". What I am missing is the equivalent to OR:
AND: +
NOT: -
OR: ???
I didn't find anything on the Solr or Lucene query syntax pages. If
there is such an equivalent then I guess the query would become:
productId:102685804 liveDate:[* TO NOW] (endDate:[NOW TO *] (*:*
-endDate:[* TO *]))

I switched to the AND-default because that is the default in my web
frontend so I don't have to change logic. What should I do in this
situation? Go back to the OR-default?

It is not so much this example I am after, but I have a syntax translator
in my application that must be able to handle similar expressions, and I
want to keep it simple and still have tasty food ;-)

-Michael


Re: Searching for future or "null" dates

2008-09-25 Thread Michael Lackhoff
On 26.09.2008 06:17 Chris Hostetter wrote:

> that's true, regrettably there is no prefix operator to indicate a "SHOULD" 
> clause in the Lucene query language, so if you set the default op to "AND" 
> you can't then override it on individual clauses.
> 
> this is one of the reasons I never make the default op AND.

Just for symmetry or to get rid of this restriction wouldn't it be a
good idea to add such a prefix operator?

> i'm sure your food will still taste pretty good :)

That's what my wife keeps telling me ;-)

Many thanks. I think I will leave it as is for the current application
but use OR-Default plus prefix operators for new projects.

-Michael



Re: date range query performance

2008-10-31 Thread Michael Lackhoff
On 31.10.2008 19:16 Chris Hostetter wrote:

> for the record, you don't need to index as a "StrField" to get this 
> benefit, you can still index using DateField, you just need to round your 
> dates to some less granular level .. if you always want to round down, you 
> don't even need to do the rounding yourself, just add "/SECOND" 
> or "/MINUTE" or "/HOUR" to each of your dates before sending them to solr.  
> (SOLR-741 proposes adding a config option to DateField to let this be done 
> server side)

Is this also possible for the timestamp that is automatically added to
all new/updated docs? I would like to be able to search (quickly) for
everything that was added within the last week or month or whatever. And
because I update the index only once a day, a granularity of /DAY (if that
exists) would be fine.
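
A query along these lines (assuming the field is the schema's default
"timestamp" field) would then be precise enough and still fast:

  timestamp:[NOW/DAY-7DAYS TO NOW]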

- Michael


Re: date range query performance

2008-10-31 Thread Michael Lackhoff
On 01.11.2008 06:10 Erik Hatcher wrote:

> Yeah, this should work fine:
> 
>   <field name="timestamp" type="date" indexed="true" stored="true"
>          default="NOW/DAY" multiValued="false"/>

Wow, that was fast, thanks!

-Michael


correct escapes in csv-Update files

2008-01-02 Thread Michael Lackhoff
I use UpdateCSV to feed my data into SOLR and it works very well. The
only thing I don't understand is how to properly escape the encapsulator
and the backslash.
An example with the default encapsulator ("):
"This is a text with a \"quote\""
"This gives one \ backslash"
"This gives two backslashes before the \\\"quote\""
"This gives an error \\"quote\""

So what if I want only one backslash before the quote, e.g. the
unescaped data looks like this:
Text with \"funny characters
(a real backslash before a real quote not an escaped quote)

I know this isn't common and perhaps it would be possible to find an
encapsulator that will be very, very unlikely to be found in the data
but you can never be sure.
So is there a way to correctly escape or otherwise encode all possible
combinations of special characters?

-Michael



Re: correct escapes in csv-Update files

2008-01-04 Thread Michael Lackhoff
On 03.01.2008 17:16 Yonik Seeley wrote:

> CSV doesn't use backslash escaping.
> http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
> 
> "This is text with a ""quoted"" string"

Thanks for the hint but the result is the same, that is, ""quoted""
behaves exactly like \"quoted\":
- both leave the single unescaped quote in the record: "quoted"
- both have the problem with a backslash before the escaped quote:
  "This is text with a \""quoted"" string" gives an error "invalid
  char between encapsualted token end delimiter".

So, is it possible to get a record into the index with CSV that
originally looks like this?
This is text with an unusual \"combination" of characters

A single quote is no problem: just double it (" -> "").
A single backslash is no problem: just leave it alone (\ -> \)
But what about a backslash followed by a quote (\" -> ???)

-Michael



Another text I cannot get into SOLR with csv

2008-01-04 Thread Michael Lackhoff
If the field's value is:
's-Gravenhage
I cannot get it into SOLR with CSV.
I tried to double the single quote/apostrophe or escape it in several
ways but I either get an error or another character (the "escape") in
front of the single quote. Is it not possible to have a field that
begins with an apostrophe/a single quote?
There is no error if the apostrophe is at the end of the field.
Is there anything I could try or do I have to use XML?

-Michael



Re: Another text I cannot get into SOLR with csv

2008-01-04 Thread Michael Lackhoff
On 04.01.2008 16:55 Yonik Seeley wrote:

> On Jan 4, 2008 10:25 AM, Michael Lackhoff <[EMAIL PROTECTED]> wrote:
>> If the fields value is:
>> 's-Gravenhage
>> I cannot get it into SOLR with CSV.
> 
> This one works for me fine.
> 
> $ cat t2.csv
> id,name
> 12345,"'s-Gravenhage"
> 12345,'s-Gravenhage
> 12345,"""s-Gravenhage"
> 
> $ curl http://localhost:8983/solr/update/csv?commit=true --data-binary
> @t2.csv -H 'Content-type:text/csv; charset=utf-8'

But you are cheating ;-) This works for me too but I am using a local
csv file for the update:
http://localhost:8983/solr/update/csv?stream.file=t2.csv&separator=%09&f.SIGNATURE.split=true&commit=true

Perhaps the problem is that I cannot define a charset for the stream.file?

-Michael



Re: correct escapes in csv-Update files

2008-01-04 Thread Michael Lackhoff
On 04.01.2008 17:35 Walter Underwood wrote:

> I recommend the opencsv library for Java or the csv package for Python.
> Either one can write legal CSV files.
> 
> There are lots of corner cases in CSV and some differences between
> applications, like whether newlines are allowed inside a quoted field.
> It is best to use a library for this instead of hacking at it.

I agree that it is best to use a library and I will eventually but in my
case this wouldn't help since the CSV _is_ legal (at least in my later
examples).
In one case it seems to be a bug in the SOLR CSV parser (\"" is legal
but gives an error).
In the other case the same file works if sent as post data but doesn't
if given as a local file. If it wasn't legal the post data version would
fail too.

-Michael

p.s.: I filed the bug report, Yonik.



Re: Another text I cannot get into SOLR with csv

2008-01-08 Thread Michael Lackhoff
After a long weekend I could take a deeper look into this one and it looks 
as if the problem has to do with splitting.



This one works for me fine.

$ cat t2.csv
id,name
12345,"'s-Gravenhage"
12345,'s-Gravenhage
12345,"""s-Gravenhage"

$ curl http://localhost:8983/solr/update/csv?commit=true --data-binary
@t2.csv -H 'Content-type:text/csv; charset=utf-8'


My csv-file:
DBRECORDID,PUBLPLACE
43298,"'s-Gravenhage"

The URL (giving a 400 error):
http://localhost:8983/solr/update/csv?f.PUBLPLACE.split=true&commit=true
(PUBLPLACE is defined as a multivalued field)

If I remove the "f.PUBLPLACE.split=true" parameter OR make sure that the 
apostrophe is not the first character, everything is fine.
But I need the field to be multivalued and thus need the split parameter 
(not for this record but for others) and as the example shows, some have 
an apostrophe as the first character. Any ideas how to deal with this?


-Michael


Re: SOLR 3.3.0 multivalued field sort problem

2011-08-13 Thread Michael Lackhoff
On 13.08.2011 18:03 Erick Erickson wrote:

> The problem I've always had is that I don't quite know what
> "sorting on multivalued fields" means. If your field had tokens
> a and z, would sorting on that field put the doc
> at the beginning or end of the list? Sure, you can define
> rules (first token, last token, average of all tokens (whatever
> that means)), but each solution would be wrong sometime,
> somewhere, and/or completely useless.

Of course it would need rules, but I think it wouldn't be too hard to
find rules that are at least far better than the current situation.

My wish would include an option that decides whether only the first value is
used or every value counts on its own. If the option is set to FALSE, only
the first value would be used; if it is TRUE, every value of the field
would get its own place in the result list.

So, if we have e.g.
record1: ccc and bbb
record2: aaa and zzz
it would be either
record2 (aaa)
record1 (ccc)
or
record2 (aaa)
record1 (bbb)
record1 (ccc)
record2 (zzz)

I find these two outcomes most plausible, so I would allow them if
technically possible, but whatever rule looks more plausible to the
experts: some solution is better than no solution.

-Michael


Re: SOLR 3.3.0 multivalued field sort problem

2011-08-13 Thread Michael Lackhoff
On 13.08.2011 20:31 Martijn v Groningen wrote:

> The first solution would make sense to me. Some kind of a strategy
> mechanism for this would allow anyone to define their own rules.
> Duplicating results would be confusing to me.

That is why I would only activate it on request (setting a special
option). Example use case: A library catalogue with an author sort. All
books of an author would be together, no matter how many co-authors the
book has.
So I think it could be useful (as an option) but I have no idea how
difficult it would be to implement. As I said, it would be nice to have
at least something. Any possible customization would be an extra bonus.

-Michael


Re: SOLR 3.3.0 multivalued field sort problem

2011-08-13 Thread Michael Lackhoff
On 13.08.2011 21:28 Erick Erickson wrote:

> Fair enough, but what's "first value in the list"?
> There's nothing special about "multiValued" fields,
> that is, where the schema has "multiValued=true".
> Under the covers, this is no different than just
> concatenating all the values together and putting them
> in at one go, except for some games with the
> position between one term and another
> (positionIncrementGap). Part of my confusion is
> that the term multi-valued is sometimes used to
> refer to "multiValued=true" and sometimes used
> to refer to documents with more than one
> *token* in a particular field (often as the result
> of the analysis chain)

I guess, since multivalued fields are not really different under the
hood, they should be treated the same. So, no matter if the different
values are the result of a "multiValued=true" or of the analysis chain:
if the whole thing starts with an "a" put it first, if it starts with a
"z" put it last.
Example (multivalued field):
Smith, Adam
Duck, Dagobert
=> sort as "s" (or "S")
Example tokenized field:
This is a tokenized field
=> sort as "t" (or "T")

> The second case seems to be more in the
> grouping/field collapsing arena, although
> that doesn't work on fields with more than one
> value yet either. But that seems a more sensible
> place to put the second case rather than
> overloading sorting.

It depends on how you see the meaning of sorting:
1. Sort the records based on one single value per record (and return
them in this order)
2. Sort the values of the field to sort on (and return the records
belonging to the respective values)

As long as sorting is only allowed on single value fields, both are
identical. As soon as you allow multivalued fields to be sorted on, both
interpretations mean something different and I think both have their
valid use case.
But I don't want to stress this too far.

-Michael