Re: Checking Optimal Values for BM25

2016-12-15 Thread Sascha Szott

Hi Furkan,

in order to change the BM25 parameter values k1 and b, the following XML 
snippet needs to be added to your schema.xml configuration file:



<similarity class="solr.BM25SimilarityFactory">
  <float name="k1">1.3</float>
  <float name="b">0.7</float>
</similarity>


It is even possible to specify the SimilarityFactory on individual index 
fields. See [1] for more details.
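
For example, a per-field setup could look like this (a sketch only; note that in 
Solr 6 the global similarity must be declared as SchemaSimilarityFactory for 
per-field-type settings to take effect):

<similarity class="solr.SchemaSimilarityFactory"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  ...
  <similarity class="solr.BM25SimilarityFactory">
    <float name="k1">1.3</float>
    <float name="b">0.7</float>
  </similarity>
</fieldType>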


Best
Sascha

[1] https://wiki.apache.org/solr/SchemaXml#Similarity


On 15.12.2016 at 14:58, Furkan KAMACI wrote:

Hi,

Solr's default similarity is now BM25. Its parameters are defined as

k1=1.2, b=0.75

by default. However, is there any way to check the effect of using
different coefficients in the BM25 calculation, in order to find the optimal values?

Kind Regards,
Furkan KAMACI



Re: field length within BM25 score calculation in Solr 6.3

2016-12-15 Thread Sascha Szott

Hi,

bumping my question after 10 days. Any clarification is appreciated.

Best
Sascha



Hi folks,

my Solr index consists of one document with a single-valued field "title" of type 
"text_general". The title field was indexed with the content: 1 2 3 4 5 6 7 8 9. The field 
type text_general uses a StandardTokenizer, which should result in 9 tokens. The corresponding 
length of the field title in the given document is 9.

The field type is defined as follows:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


I’ve checked that none of the nine tokens (1, 2, …, 9) is a stop word.

As expected, the query title:1 returns the given document. The BM25 score of 
the document for the given query is 0.272.

But why does Solr 6.3 state that the length of the field title is 10.24?

0.27233246 = weight(title_alt:1 in 0) [SchemaSimilarity], result of:
   0.27233246 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
 0.2876821 = idf(docFreq=1, docCount=1)
 0.94664377 = tfNorm, computed from:
   1.0 = termFreq=1.0
   1.2 = parameter k1
   0.75 = parameter b
   9.0 = avgFieldLength
   10.24 = fieldLength

In contrast, the value of avgFieldLength is correct.

The same observation can be made if the index consists of two simple documents:

doc1: title = 1 2 3 4
doc2: title = 1 2 3 4 5 6 7 8

The BM25 score calculation of doc2 is explained as:

0.14143422 = weight(title_alt:1 in 1) [SchemaSimilarity], result of:
   0.14143422 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
 0.18232156 = idf(docFreq=2, docCount=2)
 0.7757405 = tfNorm, computed from:
   1.0 = termFreq=1.0
   1.2 = parameter k1
   0.75 = parameter b
   6.0 = avgFieldLength
   10.24 = fieldLength

The value of fieldLength does not match 8.

Is there some "magic" applied to the value of the field length that goes beyond the 
standard BM25 scoring formula?

If so, what is the idea behind this modification? If not, is this a Lucene / 
Solr bug?

Best regards,
Sascha
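
The 10.24 in the explain output can be reproduced from Lucene's norm encoding. A 
minimal sketch (assuming Lucene 6.x on the classpath, where BM25Similarity 
squeezes each document's field length into a lossy one-byte norm):

import org.apache.lucene.util.SmallFloat;

public class NormDemo {
    public static void main(String[] args) {
        int fieldLength = 9;
        // index time: 1/sqrt(length) is encoded into a single byte (lossy)
        byte norm = SmallFloat.floatToByte315((float) (1.0 / Math.sqrt(fieldLength)));
        // query time: decode the byte and invert -> approximate field length
        float decoded = SmallFloat.byte315ToFloat(norm);
        System.out.println(1f / (decoded * decoded)); // prints 10.24
    }
}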







--
Sascha Szott :: KOBV/ZIB :: +49 30 84185-457


field length within BM25 score calculation in Solr 6.3

2016-12-04 Thread Sascha Szott
Hi folks,

my Solr index consists of one document with a single-valued field "title" of 
type "text_general". The title field was indexed with the content: 1 2 3 4 5 6 7 
8 9. The field type text_general uses a StandardTokenizer, which should result 
in 9 tokens. The corresponding length of the field title in the given document is 9.

The field type is defined as follows:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


I’ve checked that none of the nine tokens (1, 2, …, 9) is a stop word.

As expected, the query title:1 returns the given document. The BM25 score of 
the document for the given query is 0.272. 

But why does Solr 6.3 state that the length of the field title is 10.24?

0.27233246 = weight(title_alt:1 in 0) [SchemaSimilarity], result of:
  0.27233246 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
0.2876821 = idf(docFreq=1, docCount=1)
0.94664377 = tfNorm, computed from:
  1.0 = termFreq=1.0
  1.2 = parameter k1
  0.75 = parameter b
  9.0 = avgFieldLength
  10.24 = fieldLength

In contrast, the value of avgFieldLength is correct.

The same observation can be made if the index consists of two simple documents:

doc1: title = 1 2 3 4
doc2: title = 1 2 3 4 5 6 7 8

The BM25 score calculation of doc2 is explained as:

0.14143422 = weight(title_alt:1 in 1) [SchemaSimilarity], result of:
  0.14143422 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
0.18232156 = idf(docFreq=2, docCount=2)
0.7757405 = tfNorm, computed from:
  1.0 = termFreq=1.0
  1.2 = parameter k1
  0.75 = parameter b
  6.0 = avgFieldLength
  10.24 = fieldLength

The value of fieldLength does not match 8.

Is there some "magic" applied to the value of the field length that goes beyond the 
standard BM25 scoring formula?

If so, what is the idea behind this modification? If not, is this a Lucene / 
Solr bug?

Best regards,
Sascha






Re: Problem of facet on 170M documents

2013-11-02 Thread Sascha SZOTT
Hi Ming,

which Solr version are you using? In case you are using one of the latest
versions (4.5 or above), try the new parameter facet.threads with a
reasonable value (4 to 8 gave me a massive performance speedup when
working with large facets, i.e., nTerms >> 10^7).
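
For your query that would be, for example:

fq=source:Video&facet=true&facet.field=url&facet.limit=500&facet.threads=4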

-Sascha


Mingfeng Yang wrote:
> I have an index with 170M documents, and two of the fields for each
> doc are "source" and "url".  And I want to know the top 500 most
> frequent urls from the Video source.
> 
> So I did a facet with 
> "fq=source:Video&facet=true&facet.field=url&facet.limit=500", and
> the matching documents are about 9 millions.
> 
> The solr cluster is hosted on two ec2 instances, each with 4 cpu and
> 32G memory. 16G is allocated for the java heap.  4 master shards on one
> machine, and 4 replicas on another machine. Connected together via
> zookeeper.
> 
> Whenever I run the query above, the response just takes too long
> and the client gets timed out. Sometimes, when the end user is
> impatient, he/she may wait a few seconds for the results, then
> kill the connection and issue the same query again and
> again.  Then the server has to deal with multiple such heavy
> queries simultaneously and is so busy that we get a "no server
> hosting shard" error, probably due to lost communication between the solr
> node and zookeeper.
> 
> Is there any way to deal with such problem?
> 
> Thanks, Ming
> 


intersection of filter queries with raw query parser

2013-05-31 Thread Sascha Szott

Hi folks,

is it possible to use the raw query parser with a disjunctive filter 
query? Say, I have a field 'foo' and two values 'v1' and 'v2' (the field 
values are free text and can contain any character). What I want is to 
retrieve all documents satisfying fq=foo:(v1 OR v2). In case only one 
value (v1) is given, the query fq={!raw f=foo}v1 works as expected. But 
how can I formulate the filter query (with the raw query parser) in case 
two values are provided?


The same question was posted on Stackoverflow 
(http://stackoverflow.com/questions/5637675/solr-query-with-raw-data-and-union-multiple-facet-values) 
two years ago. But the only advice there was to give up using the raw 
query parser, which is not what I want to do.
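
One workaround sketch (untested; it relies on the lucene query parser's _query_ 
pseudo-field, so each value still goes through the raw parser individually):

fq={!lucene}_query_:"{!raw f=foo}v1" OR _query_:"{!raw f=foo}v2"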


Thanks in advance,
Sascha


Re: Does SolrCloud support distributed IDFs?

2012-10-22 Thread Sascha SZOTT

Hi Mark,

Mark Miller wrote:

Still waiting on that issue. I think Andrzej should just update it to
trunk and commit - it's optional and defaults to off. Go vote :)
Sounds like the problem is already solved and the remaining work 
consists of code integration. Can somebody estimate how much work that 
would be?


-Sascha


Does SolrCloud support distributed IDFs?

2012-10-21 Thread Sascha Szott
Hi folks,

a known limitation of the "old" distributed search feature is the lack of 
distributed/global IDFs (#SOLR-1632). Does SolrCloud bring some improvements in 
this direction?

Best regards,
Sascha


Re: indexing documents in Apache Solr using php-curl library

2012-07-02 Thread Sascha SZOTT
Hi,

perhaps it's better to use a PHP Solr client library. I used

   https://code.google.com/p/solr-php-client/

in a project of mine and it worked just fine.
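
A minimal indexing sketch with that library (assuming Solr runs on
localhost:8983 and your schema has "id" and "title" fields):

<?php
require_once 'Apache/Solr/Service.php';

$solr = new Apache_Solr_Service('localhost', 8983, '/solr/');

$doc = new Apache_Solr_Document();
$doc->id = 'doc-1';
$doc->title = 'Hello Solr';

$solr->addDocument($doc); // send the document to Solr
$solr->commit();          // make it searchable
?>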

-Sascha

Asif wrote:
> I am indexing the file using php curl library. I am stuck here with the code
> echo "Stored in: " . "upload/" . $_FILES["file"]["name"];
>  $result=move_uploaded_file($_FILES["file"]["tmp_name"],"upload/" .
> $_FILES["file"]["name"]);
>  if ($result == 1) echo "Upload done .";
> $options = getopt("f:");
> $infile = $options['f'];
> 
> $url = "http://localhost:8983/solr/update/";
> $filename = "upload/" . $_FILES["file"]["name"];
> $handle = fopen($filename, "rb");
> $contents = fread($handle, filesize($filename));
> fclose($handle);
> echo $url;
> $post_string = file_get_contents("upload/" .
> $_FILES["file"]["name"]);
> echo $contents;
> $header = array("Content-type:text/xml; charset=utf-8");
> 
> $ch = curl_init();
> 
> curl_setopt($ch, CURLOPT_URL, $url);
> curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
> curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
> curl_setopt($ch, CURLOPT_POST, 1);
> curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
> curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
> curl_setopt($ch, CURLINFO_HEADER_OUT, 1);
> 
> $data = curl_exec($ch);
> 
> if (curl_errno($ch)) {
>print "curl_error:" . curl_error($ch);
> } else {
>curl_close($ch);
>print "curl exited okay\n";
>echo "Data returned...\n";
>echo "\n";
>echo $data;
>echo "\n";
> }
> 
> Nothing is showing as a result. Moreover, there is nothing shown in the event
> log of Apache Solr. Please help me with the code.
> 



Re: Prefix query is not analysed?

2012-07-02 Thread Sascha Szott
Hi,

I suppose you are using Solr 3.6. Then take a look at

http://www.lucidimagination.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

-Sascha



Alok Bhandari  schrieb:

Thanks for reply.

If I check the debug query through solr-admin I can see that the lower case
filter is applied and 

"rawquerystring":"em_to_name:Follett'.*",
"querystring":"em_to_name:Follett'.*",
"parsedquery":"+em_to_name:follett'.*",
"parsedquery_toString":"+em_to_name:follett'.*",
"explain":{},
"QParser":"ExtendedDismaxQParser",


I can see this query. So is it the case that only tokenization is skipped
for wildcard queries, while the other filters specified are applied?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Prefix-query-is-not-analysed-tp3992435p3992450.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Prefix query is not analysed?

2012-07-02 Thread Sascha Szott
Hi,

wildcard and fuzzy queries are not analyzed.

-Sascha



Alok Bhandari  schrieb:

Hello ,

I am pushing "Chuck Follett'.?.?" into solr, and when I query this field
with the query string field:Follett'.* I am getting 0 results.

field type declared is

and the parser we are using is EdisMax.

Is it the case that the text analysis is not done for prefix queries, which
is why I am getting 0 results, or is there something fundamentally wrong with my
data/schema?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Prefix-query-is-not-analysed-tp3992435.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: querying thru solritas gives me zero results

2012-06-30 Thread Sascha Szott
Hi,

Solritas uses the dismax query parser. The dismax config parameter 'qf' 
specifies the index fields to be searched in. Make sure that 'name' is your 
default search field.

-Sascha




Giovanni Gherdovich  schrieb:

Hi all,

this morning I was very proud of myself since I managed
to set up solritas ( http://wiki.apache.org/solr/VelocityResponseWriter )
for the solr instance on my server (ubuntu natty).

This joy lasted only half a minute, since the only query
that gets more than zero results with solritas is the catchall "*:*"

for example:
http://my.server.com:8080/solr/select/?q=foobar has thousands of results,
http://my.server.com:8080/solr/itas?q=foobar has none

Here are the standard and "velocity" request handlers from my solrconfig.xml:

-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8


<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>


-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8

-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8



<requestHandler name="/itas" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="wt">velocity</str>
    <str name="v.template">browse</str>
    <str name="title">Solr cookbook example</str>
    <str name="defType">dismax</str>
    <str name="q.alt">*:*</str>
    <str name="rows">10</str>
    <str name="fl">*,score</str>
    <str name="qf">name</str>
  </lst>
</requestHandler>


-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8

any hint on how I can debug that?

cheers,
Giovanni



Re: how to retrieve a doc from its docID ?

2012-06-30 Thread Sascha Szott
Hi,

did you include the fl parameter in the Solr query URL? If that's the case make 
sure that the field name 'text' is mentioned there. You should also make sure 
that the field definition (in schema.xml) for 'text' says stored="true", 
otherwise the field will not be returned.
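
For example (assuming your unique key field is named 'id'):

http://localhost:8983/solr/select/?q=solar&fl=id,text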

-Sascha



Giovanni Gherdovich  schrieb:

Hi all,

when querying my solr instance, the answers I get
are the document IDs of my docs. Here is what one of my docs
looks like:

-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- --


<add>
  <doc>
    <field name="text">hello solar!</field>
    <field name="id">123</field>
  </doc>
</add>


-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- --

here is the response if I query for "solar" :

-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- --



<result name="response" numFound="1" start="0">
  <doc>
    <float name="score">1.0</float>
    <str name="id">123</str>
  </doc>
</result>


-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- --

That is, solr gives me the doc ID. How do I retrieve the doc's field "text"
given its id?

cheers,
Giovanni



Re: Searching for digits with strings

2012-06-27 Thread Sascha Szott
Hi,

as far as I know Solr does not provide such a feature. If you cannot make any 
assumptions on the numbers, choose an appropriate library that is able to 
transform between numerical and non-numerical representations and populate the 
search field with both versions at index-time.
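
If only a small set of numbers matters (see Upayavira's synonym suggestion 
below), a synonyms.txt sketch could look like this (assuming a 
SynonymFilterFactory in the field's analyzer chain):

0, zero, nought
1, one
2, two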

-Sascha

Alireza Salimi  schrieb:

Hi,

Well, that's the only solution I've got so far and it would work for most of
the cases, but I thought there might be some better solutions.

Thanks

On Wed, Jun 27, 2012 at 5:49 PM, Upayavira  wrote:

> How many numbers? 0-9? Or every number under the sun?
>
> You could achieve a limited number by using synonyms, 0 is a synonym for
> nought and zero, etc.
>
> Upayavira
>
> On Wed, Jun 27, 2012, at 05:22 PM, Alireza Salimi wrote:
> > Hi,
> >
> > I was wondering if there's a built in solution in Solr so that you can
> > search for documents with digits by their string representations.
> > i.e. search for 'two' would match fields which have '2' token and vice
> > versa.
> >
> > Thanks





Re: getting started

2011-06-16 Thread Sascha SZOTT

Hi Mari,

it depends ...

* How many records are stored in your MySQL databases?
* How often will updates occur?
* How many db records / index documents are changed per update?

I would suggest starting with a single Solr core first. That way, you can 
concentrate on the basics and do not need to deal with more advanced 
things like sharding. In case you encounter performance issues later on, 
you can switch to a multi-core setup.


-Sascha

Mari Masuda wrote:

Hello,

I am new to Solr and am in the beginning planning stage of a large project and 
could use some advice so as not to make a huge design blunder that I will 
regret down the road.

Currently I have about 10 MySQL databases that store information about 
different archival collections.  For example, we have data and metadata about a 
political poster collection, a television program, documents and photographs of 
and about a famous author, etc.  My job is to work with the staff archivists to 
come up with a standard metadata template so the 10 databases can be 
consolidated into one.

Currently the info in these databases is accessed through 10 different sets of 
PHP pages that were written a long time ago for PHP 4.  My plan is to write a 
new Java application that will handle both public display of the info as well 
as an administrative interface so that staff members can add or edit the 
records.

I have decided to use Solr as the search mechanism for this project.  Because the info in each of 
our 10 collections is slightly different (e.g., a record about a poster does not contain duration 
information, but a record about a TV show does) I was thinking it would be good to separate each 
collection's index into a separate Solr core so that commits coming from one collection do not bog 
down the other unrelated collections.  One reservation I have is that eventually we would like to 
be able to type in "Iraq" and find records across all of the collections at once instead 
of having to search each collection separately.  Although I don't know anything about it at this 
stage, I did Google "sharding" after reading someone's recent post on this list and it 
sounds like that may be a potential answer to my question.  Does anyone have any advice on how I 
should initially set up Solr for my situation?  I am slowly making my way through the wiki and 
RTFMing, but I wanted to see what the experts have to say because at this 
point I don't really know where to start.


Thank you very much,
Mari


Re: Search failing for matched text in large field

2011-03-23 Thread Sascha Szott

On 23.03.2011 18:52, Paul wrote:

I increased maxFieldLength and reindexed a small number of documents.
That worked -- I got the correct results. In 3 minutes!

Did you mark the field in question as stored = false?

-Sascha



I assume that if I reindex all my documents, all searches will
become even slower. Is there any way to get all the results in a way
that is quick enough that my user won't get bored waiting? Is there
some optimization of this coming in solr 3.0?

On Wed, Mar 23, 2011 at 12:15 PM, Sascha Szott  wrote:

Hi Paul,

did you increase the value of the maxFieldLength parameter in your
solrconfig.xml?

-Sascha

On 23.03.2011 17:05, Paul wrote:


I'm using solr 1.4.1.

I have a document that has a pretty big field. If I search for a
phrase that occurs near the start of that field, it works fine. If I
search for a phrase that appears even a little ways into the field, it
doesn't find it. Is there some limit to how far into a field solr will
search?

Here's the way I'm doing the search. All I'm changing is the text I'm
searching on to make it succeed or fail:


http://localhost:8983/solr/my_core/select/?q=%22search+phrase%22&hl=on&hl.fl=text

Or, if it is not related to how large the document is, what else could
it possibly be related to? Could there be some character in that field
that is stopping the search?




Re: Search failing for matched text in large field

2011-03-23 Thread Sascha Szott

Hi Paul,

did you increase the value of the maxFieldLength parameter in your 
solrconfig.xml?
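
For reference, the relevant setting in solrconfig.xml (Solr 1.4 ships with a 
default of 10000):

<maxFieldLength>2147483647</maxFieldLength>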


-Sascha

On 23.03.2011 17:05, Paul wrote:

I'm using solr 1.4.1.

I have a document that has a pretty big field. If I search for a
phrase that occurs near the start of that field, it works fine. If I
search for a phrase that appears even a little ways into the field, it
doesn't find it. Is there some limit to how far into a field solr will
search?

Here's the way I'm doing the search. All I'm changing is the text I'm
searching on to make it succeed or fail:

http://localhost:8983/solr/my_core/select/?q=%22search+phrase%22&hl=on&hl.fl=text

Or, if it is not related to how large the document is, what else could
it possibly be related to? Could there be some character in that field
that is stopping the search?


Re: Solr coding

2011-03-23 Thread Sascha Szott

Hi,

depending on your needs, take a look at Apache ManifoldCF. It adds 
document-level security on top of Solr.
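
A lightweight alternative sketch, assuming you index an additional "owner" field 
per file: have your application append a filter query to every search, e.g.

fq=owner:user2

so that each user only sees his or her own documents.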


-Sascha

On 23.03.2011 14:20, satya swaroop wrote:

Hi All,
   As per my project requirements I need to keep the search of
files private, so I need to modify the code of solr,

for example if there are 5 users and each user indexes some files as
   user1 ->  java1, c1,sap1
   user2 ->  java2, c2,sap2
   user3 ->  java3, c3,sap3
   user4 ->  java4, c4,sap4
   user5 ->  java5, c5,sap5

and if user2 searches for the keyword "java" then it should display
only the file java2 and not the other files.

so in order to keep this filtering inside solr itself, may I know where to
modify the code... I will access a database to check which files a user has
indexed and then filter the result... I don't have any cores.. I indexed all files
in a single index...

Regards,
satya



Re: Index MS office

2011-02-02 Thread Sascha Szott

Hi,

have a look at Solr's ExtractingRequestHandler:

http://wiki.apache.org/solr/ExtractingRequestHandler
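
A quick-start sketch (assuming a local Solr with the handler mapped to 
/update/extract; the file name and id are placeholders):

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@document.docx"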

-Sascha

On 02.02.2011 16:49, Thumuluri, Sai wrote:

Good Morning,

 I am planning to get started on indexing MS Office using Apache Solr -
can someone please direct me where I should start?

Thanks,
Sai Thumuluri


Re: Malformed XML with exotic characters

2011-02-01 Thread Sascha Szott

Hi Markus,

in my case the JSON response writer returns valid JSON. The same holds 
for the PHP response writer.


-Sascha

On 01.02.2011 18:44, Markus Jelsma wrote:

You can exclude the input's involvement by checking if other response writers
do work. For me, the JSONResponseWriter works perfectly with the same returned
data in some AJAX environment.

On Tuesday 01 February 2011 18:29:06 Sascha Szott wrote:

Hi folks,

I've made the same observation when working with Solr's
ExtractingRequestHandler on the command line (no browser interaction).

When issuing the following curl command

curl 'http://mysolrhost/solr/update/extract?extractOnly=true&extractFormat=text&wt=xml&resource.name=foo.pdf' 
--data-binary @foo.pdf -H 'Content-type:text/xml; charset=utf-8' > foo.xml

Solr's XML response writer returns malformed xml, e.g., xmllint gives me:

foo.xml:21: parser error : Char 0xD835 out of allowed range
foo.xml:21: parser error : PCDATA invalid Char value 55349

I'm not totally sure if this is a Tika/PDFBox issue. However, I would
expect in any case that the XML output produced by Solr is well-formed
even if the libraries used under the hood return "garbage".


-Sascha

p.s. I can provide the pdf file in question, if anybody would like to
see it in action.

On 01.02.2011 16:43, Markus Jelsma wrote:

There is an issue with the XML response writer. It cannot cope with some
very exotic characters or possibly the right-to-left writing systems.
The issue can be reproduced by indexing the content of the home page of
wikipedia as it contains a lot of exotic matter. The problem does not
affect the JSON response writer.

The problem is, I am unsure whether this is a bug in Solr or whether
Firefox itself trips over.


Here's the output of the JSONResponeWriter for a query returning the home
page:
{

   "responseHeader":{

"status":0,
"QTime":1,
"params":{

"fl":"url,content",
"indent":"true",
"wt":"json",
"q":"*:*",
"rows":"1"}},

   "response":{"numFound":6744,"start":0,"docs":[

{

 "url":"http://www.wikipedia.org/";,
 "content":"Wikipedia English The Free Encyclopedia 3 543 000+ articles
 日

本語 フリー百科事典 730 000+ 記事 Deutsch Die freie Enzyklopädie 1 181 000+ Artikel
Español La enciclopedia libre 710 000+ artículos Français L’encyclopédie
libre 1 061 000+ articles Русский Свободная энциклопедия 654 000+ статей
Italiano L’enciclopedia libera 768 000+ voci Português A enciclopédia
livre 669 000+ artigos Polski Wolna encyklopedia 769 000+ haseł
Nederlands De vrije encyclopedie 668 000+ artikelen Search  • Suchen  •
Rechercher  • Szukaj  • Ricerca  • 検索  • Buscar  • Busca  • Zoeken  •
Поиск  • Sök  • 搜尋  • Cerca  • Søk  • Haku  • Пошук  • Hledání  •
Keresés  • Căutare  • 찾기  • Tìm kiếm  • Ara • Cari  • Søg  • بحث  •
Serĉu  • Претрага  • Paieška  • Hľadať  • Suk  • جستجو • חיפוש  •
Търсене  • Poišči  • Cari  • Bilnga العربية Български Català Česky Dansk
Deutsch English Español Esperanto فارسی Français 한국어 Bahasa Indonesia
Italiano עברית Lietuvių Magyar Bahasa Melayu Nederlands 日本語 Norsk
(bokmål) Polski Português Română Русский Slovenčina Slovenščina Српски /
Srpski Suomi Svenska Türkçe Українська Tiếng Việt Volapük Winaray 中文
100 000+   العربية • Български  • Català  • Česky  • Dansk  • Deutsch  •
English  • Español  • Esperanto  • فارسی  • Français  • 한국어  • Bahasa
Indonesia  • Italiano  • עברית • Lietuvių  • Magyar  • Bahasa Melayu  •
Nederlands  • 日本語  • Norsk (bokmål) • Polski  • Português  • Русский  •
Română  • Slovenčina  • Slovenščina  • Српски / Srpski  • Suomi  •
Svenska  • Türkçe  • Українська  • Tiếng Việt  • Volapük  • Winaray  •
中文   10 000+   Afrikaans  • Aragonés  • Armãneashce  • Asturianu  •
Kreyòl Ayisyen  • Azərbaycan / آذربايجان ديلی  • বাংলা  • Беларуская (
Акадэмічная  • Тарашкевiца )  • বিষ্ণুপ্রিযা় মণিপুরী  • Bosanski  •
Brezhoneg  • Чăваш • Cymraeg  • Eesti  • Ελληνικά  • Euskara  • Frysk  •
Gaeilge  • Galego  • ગુજરાતી  • Հայերեն  • हिन्दी  • Hrvatski  • Ido  •
Íslenska  • Basa Jawa  • ಕನ್ನಡ  • ქართული  • Kurdî / كوردی  • Latina  •
Latviešu  • Lëtzebuergesch  • Lumbaart • Македонски  • മലയാളം  • मराठी
• नेपाल भाषा  • नेपाली  • Norsk (nynorsk)  • Nnapulitano • Occitan  •
Piemontèis  • Plattdüütsch  • Ripoarisch  • Runa Simi  • شاہ مکھی پنجابی
  • Shqip  • Sicilianu  • Simple English  • Sinugboanon  • Srpskohrvatski
/ Српскохрватски  • Basa Sunda  • Kiswahili  • Tagalog  • தமிழ் • తెలుగు
  • ไทย  • اردو  • Walon  • Yorùbá  • 粵語  • Žemaitėška   1 000+   Bahsa
Acèh  • Alemannisch  • አማርኛ  • Arpitan  • ܐܬܘܪܝܐ  • Avañe’ẽ  • Aymar Aru
  • Bân-lâm-gú  • Bahasa Banjar  • Basa Banyumasan  •

Re: Malformed XML with exotic characters

2011-02-01 Thread Sascha Szott

Hi folks,

I've made the same observation when working with Solr's 
ExtractingRequestHandler on the command line (no browser interaction).


When issuing the following curl command

curl 
'http://mysolrhost/solr/update/extract?extractOnly=true&extractFormat=text&wt=xml&resource.name=foo.pdf' 
--data-binary @foo.pdf -H 'Content-type:text/xml; charset=utf-8' > foo.xml


Solr's XML response writer returns malformed xml, e.g., xmllint gives me:

foo.xml:21: parser error : Char 0xD835 out of allowed range
foo.xml:21: parser error : PCDATA invalid Char value 55349

I'm not totally sure if this is a Tika/PDFBox issue. However, I would 
expect in any case that the XML output produced by Solr is well-formed 
even if the libraries used under the hood return "garbage".



-Sascha

p.s. I can provide the pdf file in question, if anybody would like to 
see it in action.



On 01.02.2011 16:43, Markus Jelsma wrote:

There is an issue with the XML response writer. It cannot cope with some very
exotic characters or possibly the right-to-left writing systems. The issue can
be reproduced by indexing the content of the home page of wikipedia as it
contains a lot of exotic matter. The problem does not affect the JSON response
writer.

The problem is, I am unsure whether this is a bug in Solr or whether
Firefox itself trips over.


Here's the output of the JSONResponeWriter for a query returning the home
page:
{
  "responseHeader":{
   "status":0,
   "QTime":1,
   "params":{
"fl":"url,content",
"indent":"true",
"wt":"json",
"q":"*:*",
"rows":"1"}},
  "response":{"numFound":6744,"start":0,"docs":[
{
 "url":"http://www.wikipedia.org/";,
 "content":"Wikipedia English The Free Encyclopedia 3 543 000+ articles 
日
本語 フリー百科事典 730 000+ 記事 Deutsch Die freie Enzyklopädie 1 181 000+ Artikel
Español La enciclopedia libre 710 000+ artículos Français L’encyclopédie libre
1 061 000+ articles Русский Свободная энциклопедия 654 000+ статей Italiano
L’enciclopedia libera 768 000+ voci Português A enciclopédia livre 669 000+
artigos Polski Wolna encyklopedia 769 000+ haseł Nederlands De vrije
encyclopedie 668 000+ artikelen Search  • Suchen  • Rechercher  • Szukaj  •
Ricerca  • 検索  • Buscar  • Busca  • Zoeken  • Поиск  • Sök  • 搜尋  • Cerca  •
Søk  • Haku  • Пошук  • Hledání  • Keresés  • Căutare  • 찾기  • Tìm kiếm  • Ara
• Cari  • Søg  • بحث  • Serĉu  • Претрага  • Paieška  • Hľadať  • Suk  • جستجو
• חיפוש  • Търсене  • Poišči  • Cari  • Bilnga العربية Български Català Česky
Dansk Deutsch English Español Esperanto فارسی Français 한국어 Bahasa Indonesia
Italiano עברית Lietuvių Magyar Bahasa Melayu Nederlands 日本語 Norsk (bokmål)
Polski Português Română Русский Slovenčina Slovenščina Српски / Srpski Suomi
Svenska Türkçe Українська Tiếng Việt Volapük Winaray 中文   100 000+   العربية
• Български  • Català  • Česky  • Dansk  • Deutsch  • English  • Español  •
Esperanto  • فارسی  • Français  • 한국어  • Bahasa Indonesia  • Italiano  • עברית
• Lietuvių  • Magyar  • Bahasa Melayu  • Nederlands  • 日本語  • Norsk (bokmål)
• Polski  • Português  • Русский  • Română  • Slovenčina  • Slovenščina  •
Српски / Srpski  • Suomi  • Svenska  • Türkçe  • Українська  • Tiếng Việt  •
Volapük  • Winaray  • 中文   10 000+   Afrikaans  • Aragonés  • Armãneashce  •
Asturianu  • Kreyòl Ayisyen  • Azərbaycan / آذربايجان ديلی  • বাংলা  • 
Беларуская
( Акадэмічная  • Тарашкевiца )  • বিষ্ণুপ্রিযা় মণিপুরী  • Bosanski  • 
Brezhoneg  • Чăваш
• Cymraeg  • Eesti  • Ελληνικά  • Euskara  • Frysk  • Gaeilge  • Galego  •
ગુજરાતી  • Հայերեն  • हिन्दी  • Hrvatski  • Ido  • Íslenska  • Basa Jawa  • 
ಕನ್ನಡ  •
ქართული  • Kurdî / كوردی  • Latina  • Latviešu  • Lëtzebuergesch  • Lumbaart
• Македонски  • മലയാളം  • मराठी  • नेपाल भाषा  • नेपाली  • Norsk (nynorsk)  • 
Nnapulitano
• Occitan  • Piemontèis  • Plattdüütsch  • Ripoarisch  • Runa Simi  • شاہ مکھی
پنجابی  • Shqip  • Sicilianu  • Simple English  • Sinugboanon  •
Srpskohrvatski / Српскохрватски  • Basa Sunda  • Kiswahili  • Tagalog  • தமிழ்
• తెలుగు  • ไทย  • اردو  • Walon  • Yorùbá  • 粵語  • Žemaitėška   1 000+   Bahsa
Acèh  • Alemannisch  • አማርኛ  • Arpitan  • ܐܬܘܪܝܐ  • Avañe’ẽ  • Aymar Aru  •
Bân-lâm-gú  • Bahasa Banjar  • Basa Banyumasan  • Башҡорт  • भोजपुरी  • Bikol
Central  • Boarisch  • བོད་ཡིག  • Chavacano de Zamboanga  • Corsu  • Deitsch  •
ދިވެހި  • Diné Bizaad  • Eald Englisc  • Emigliàn–Rumagnòl  • Эрзянь  • 
Estremeñu
• Fiji Hindi  • Føroyskt  • Furlan  • Gaelg  • Gàidhlig  • 贛語  • گیلکی  • Hak-
kâ-fa / 客家話  • Хальмг  • ʻŌlelo Hawaiʻi  • Hornjoserbsce  • Ilokano  •
Interlingua  • Interlingue  • Ирон Æвзаг  • Kapampangan  • Kaszëbsczi  •
Kernewek  • ភាសាខ្មែរ  • Kinyarwanda  • Коми  • Кыргызча  • Ladino / לאדינו  •
Ligure  • Limburgs  • Lingála  • lojban  • Malagasy  • Malti  • 文言  • Māori  •
مصرى  • مازِرونی / Mäzeruni  • Монгол  • မြန်မာဘာသာ  • Nāhuatlahtōlli  •
Nedersaksisch  • Nouormand  • Novial  • Нохчийн  • Олык Марий  • O‘zbek  • पाऴि
• Pangasinán  • ਪੰਜਾਬੀ 

Re: missing type check when working with pint field type

2011-01-18 Thread Sascha Szott

Hi Erick,

I see the point. But what are pint (plong, pfloat, pdouble) actually 
intended for (sorting is not possible, no type checking is performed)? 
Seems to me as if they were something very similar to the string type (both 
store and index the value verbatim).


-Sascha

On 18.01.2011 14:38, Erick Erickson wrote:

I suspect you missed this comment in the schema file:
***
Plain numeric field types that store and index the text
   value verbatim (and hence don't support range queries, since the
   lexicographic ordering isn't equal to the numeric ordering)
***

So what's happening is that the field is being indexed as a text type and, I
suspect, being tokenized. The error you're getting comes from trying to sort
against a tokenized field, which is undefined. At least that's my story and
I'm sticking to it

Best
Erick

On Tue, Jan 18, 2011 at 8:10 AM, Sascha Szott  wrote:


Hi folks,

I've noticed an unexpected behavior while working with the various built-in
integer field types (int, tint, pint). It seems as if the first two are
subject to type checking, while the latter is not.

I'll give you an example based on the example schema that is shipped out
with Solr. When trying to index the document


<add>
  <doc>
    <field name="id">1</field>
    <field name="foo_i">invalid_value</field>
    <field name="foo_ti">1</field>
    <field name="foo_pi">1</field>
  </doc>
</add>


Solr responds with a NumberFormatException (the same holds when setting the
value of foo_ti to "invalid_value"):

java.lang.NumberFormatException: For input string: "invalid_value"

Surprisingly, an attempt to index the document


<add>
  <doc>
    <field name="id">1</field>
    <field name="foo_i">1</field>
    <field name="foo_ti">1</field>
    <field name="foo_pi">invalid_value</field>
  </doc>
</add>


is successful. In the end, sorting on foo_pi leads to an exception, e.g.,
http://localhost:8983/solr/select?q=*:*&sort=foo_pi desc

raises an HTTP 500 error:

java.lang.StringIndexOutOfBoundsException: String index out of range: 0
at java.lang.String.charAt(String.java:686)
at
org.apache.lucene.search.FieldCache$7.parseInt(FieldCache.java:234)
at
org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:457)
at
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
at
org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:430)
at
org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:447)
at
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
at
org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:430)
at
org.apache.lucene.search.FieldComparator$IntComparator.setNextReader(FieldComparator.java:332)
at
org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.setNextReader(TopFieldCollector.java:94)
at
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:249)
at org.apache.lucene.search.Searcher.search(Searcher.java:171)
at
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:988)
at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884)
at
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
at
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
[...]


Is this a bug or did I miss something?

-Sascha





--
Sascha Szott :: KOBV/ZIB ::  :: +49 30 84185-457


missing type check when working with pint field type

2011-01-18 Thread Sascha Szott

Hi folks,

I've noticed an unexpected behavior while working with the various 
built-in integer field types (int, tint, pint). It seems as if the first 
two are subject to type checking, while the latter is not.


I'll give you an example based on the example schema that is shipped out 
with Solr. When trying to index the document



<add>
  <doc>
    <field name="id">1</field>
    <field name="foo_i">invalid_value</field>
    <field name="foo_ti">1</field>
    <field name="foo_pi">1</field>
  </doc>
</add>


Solr responds with a NumberFormatException (the same holds when setting 
the value of foo_ti to "invalid_value"):


java.lang.NumberFormatException: For input string: "invalid_value"

Surprisingly, an attempt to index the document


<add>
  <doc>
    <field name="id">1</field>
    <field name="foo_i">1</field>
    <field name="foo_ti">1</field>
    <field name="foo_pi">invalid_value</field>
  </doc>
</add>


is successful. In the end, sorting on foo_pi leads to an exception, 
e.g., http://localhost:8983/solr/select?q=*:*&sort=foo_pi desc


raises an HTTP 500 error:

java.lang.StringIndexOutOfBoundsException: String index out of range: 0
at java.lang.String.charAt(String.java:686)
at org.apache.lucene.search.FieldCache$7.parseInt(FieldCache.java:234)
at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:457)
at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:430)
at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:447)
at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:430)
at org.apache.lucene.search.FieldComparator$IntComparator.setNextReader(FieldComparator.java:332)
at org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.setNextReader(TopFieldCollector.java:94)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:249)
at org.apache.lucene.search.Searcher.search(Searcher.java:171)
at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:988)
at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884)
at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)

[...]


Is this a bug or did I miss something?

-Sascha


Re: post search using solrj

2010-12-30 Thread Sascha SZOTT

Hi Don,

you could give the HTTP method to be used as a second argument to the 
QueryRequest constructor:


[http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/request/QueryRequest.html#QueryRequest(org.apache.solr.common.params.SolrParams,%20org.apache.solr.client.solrj.SolrRequest.METHOD)]
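
A short sketch in SolrJ (assuming the 1.4-era API with CommonsHttpSolrServer; 
exception handling omitted):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrQuery query = new SolrQuery("*:*"); // add your many parameters here
QueryRequest req = new QueryRequest(query, SolrRequest.METHOD.POST);
QueryResponse rsp = req.process(server);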

-Sascha


Don Hill wrote:

Hi. I am using solrj and it has been working fine. I now have a requirement
to add more parameters - so many that I get a max URI exceeded error. Is
there any way to do an http post using SolrQuery so I don't have these issues?

don



Re: DataImportHandler in Solr 1.4.1: exception handling in FileListEntityProcessor

2010-08-11 Thread Sascha Szott

Sorry, there was a mistake in the stack trace. The correct one is:

SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' 
value: /home/doe/foo is not a directory Processing Document # 3
at 
org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370) 



-Sascha

On 11.08.2010 15:18, Sascha Szott wrote:

Hi folks,

why does FileListEntityProcessor ignore onError="continue" and abort
indexing if a directory or a file does not exist?

I'm using both XPathEntityProcessor and FileListEntityProcessor with
onError set to continue. In case a directory or file is not present an
Exception is thrown and indexing is stopped immediately.

Below you can find a stack trace that is generated in case the directory
/home/doe/foo does not exist:

SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir'
value: /home/doe/foo/bar.xml is not a directory Processing Document # 3
at
org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122)

at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)

at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)

at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)

at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)

at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)

at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)

at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)


How should I configure both processors so that missing directories and
files are ignored and the indexing process does not stop immediately?

Best,
Sascha


DataImportHandler in Solr 1.4.1: exception handling in FileListEntityProcessor

2010-08-11 Thread Sascha Szott

Hi folks,

why does FileListEntityProcessor ignore onError="continue" and abort 
indexing if a directory or a file does not exist?


I'm using both XPathEntityProcessor and FileListEntityProcessor with 
onError set to continue. In case a directory or file is not present an 
Exception is thrown and indexing is stopped immediately.


Below you can find a stack trace that is generated in case the directory 
/home/doe/foo does not exist:


SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' 
value: /home/doe/foo/bar.xml is not a directory Processing Document # 3
at 
org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)


How should I configure both processors so that missing directories and 
files are ignored and the indexing process does not stop immediately?


Best,
Sascha


Re: problem with formulating a negative query

2010-07-06 Thread Sascha Szott

Hi,

Chris Hostetter wrote:

AND, OR, and NOT are just syntactic-sugar for modifying
the MUST, MUST_NOT, and SHOULD.  The default op of "OR" only affects the
first clause of your query (R) because it doesn't have any modifiers --

Thanks for pointing that out!

-Sascha


the second clause has that NOT modifier so your query is effectively...

topic:R -topic:[* TO *]

...which by definition can't match anything.

-Hoss



Re: Is there a way to delete multiple documents using wildcard?

2010-06-30 Thread Sascha Szott

Hi,

take a look inside Solr's log file. Are there any error messages with 
respect to the update request?


Furthermore, you could try the following two commands instead:

curl "http://host:port/solr/update"; --form-string 
stream.body="uid:6-HOST*"


curl "http://host:port/solr/update"; --form-string stream.body=""

-Sascha

bbarani wrote:


Yeah, I am getting the results when I use /select handler.

I tried the below query..

/select?q=uid:6-HOST*

Got

Thanks
BB


Re: Is there a way to delete multiple documents using wildcard?

2010-06-30 Thread Sascha Szott

Hi,

does /select?q=uid:6-HOST* return any documents?

-Sascha

bbarani wrote:


Hi,

Thanks a lot for your reply..

I tried the below query

update?commit=true%20-H%20"Content-Type:%20text/xml"%20--data-binary%20'<delete><query>uid:6-HOST*</query></delete>'

But even now none of the documents are getting deleted.. Am I forming the
URL wrong?

Thanks,
BB


Re: Is there a way to delete multiple documents using wildcard?

2010-06-30 Thread Sascha Szott

Hi,

you can delete all docs that match a certain query:

<delete><query>uid:6-HOST*</query></delete>

-Sascha

bbarani wrote:


Hi,

I am trying to delete a group of documents using wildcard. Something like

update?commit=true%20-H%20"Content-Type:%20text/xml"%20--data-binary%20'6-HOST*'

I want to delete all documents whose uid starts with 6-HOST,
but this query doesn't seem to work.. Am I doing anything wrong??

Thanks,
BB


Re: problem with formulating a negative query

2010-06-30 Thread Sascha Szott

Hi Erick,

thanks for your explanations. But why are all docs being *removed* from 
the set of all docs that contain R in their topic field? This would 
correspond to a boolean AND and would conflict with the clause 
q.op=OR. This seems a bit strange to me.


Furthermore, Smiley & Pugh state in their Solr 1.4 book on pg. 102 that 
adding a subexpression containing the negative query (-[* TO *]) and 
the match-all-docs clause (*:*) is only a workaround. Why is this 
workaround necessary at all?


Best,
Sascha

Erick Erickson wrote:

This may help:
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Boolean%20operators

But the clause you specified translates roughly as "find all the
documents that contain R, then remove any of them that match
"* TO *". * TO * contains all the documents with R, so everything
you just matched is removed from your results.

HTH
Erick

On Tue, Jun 29, 2010 at 12:40 PM, Sascha Szott  wrote:


Hi Ahmet,

it works, thanks a lot!

To be honest, I have no idea what the problem is with
defType=lucene&q.op=OR&df=topic&q=R NOT [* TO *]

-Sascha


Ahmet Arslan wrote:


I have a (multi-valued) field topic in my index which does

not need to exist in every document. Now, I'm struggling
with formulating a query that returns all documents that
either have no topic field at all *or* whose topic field
value is R.



Does this work?
&defType=lucene&q.op=OR&q=topic:R (+*:* -topic:[* TO *])




Re: problem with formulating a negative query

2010-06-29 Thread Sascha Szott

Hi Ahmet,

it works, thanks a lot!

To be honest, I have no idea what the problem is with
defType=lucene&q.op=OR&df=topic&q=R NOT [* TO *]

-Sascha

Ahmet Arslan wrote:

I have a (multi-valued) field topic in my index which does
not need to exist in every document. Now, I'm struggling
with formulating a query that returns all documents that
either have no topic field at all *or* whose topic field
value is R.


Does this work?
&defType=lucene&q.op=OR&q=topic:R (+*:* -topic:[* TO *])



problem with formulating a negative query

2010-06-29 Thread Sascha Szott

Hi folks,

I have a (multi-valued) field topic in my index which does not need to 
exist in every document. Now, I'm struggling with formulating a query 
that returns all documents that either have no topic field at all *or* 
whose topic field value is R.


Unfortunately, the query

/select?q={!lucene q.op=OR df=topic}(R NOT [* TO *])

does not return any docs even though there are documents in my index 
that fulfil the specified condition as you can deduce from the queries 
listed below:


/select?q=topic:R  returns > 0 docs

/select?q=-topic:[* TO *]  returns > 0 docs

Appending the query with debugQuery=true returns:

<str name="rawquerystring">{!lucene q.op=OR df=topic}(R NOT [* TO *])</str>
<str name="querystring">{!lucene q.op=OR df=topic}(R NOT [* TO *])</str>
<str name="parsedquery">topic:R -topic:[* TO *]</str>
<str name="parsedquery_toString">topic:R -topic:[* TO *]</str>

Does anybody have a clue of what is wrong here?

Thanks in advance,
Sascha


Re: Specifiying multiple mlt.fl fields

2010-06-19 Thread Sascha Szott

Hi Darren,

try mlt.fl=field1 field2
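
With SolrJ, a sketch might look like this (assuming the MLT component is 
enabled on your handler):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.params.MoreLikeThisParams;

SolrQuery query = new SolrQuery("id:123");
query.set(MoreLikeThisParams.MLT, true);                          // mlt=true
query.set(MoreLikeThisParams.SIMILARITY_FIELDS, "field1 field2"); // mlt.fl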

Best,
Sascha

Darren Govoni wrote:

Hi,
   I read the wiki and tried about a dozen variations such as:

...&mlt.fl=field1&mlt.fl=field2

and

...&mlt.fl=field1,field2&...

to specify more than one MLT field and it won't take. What's the trick?
Also, how to do it with SolrJ?

Nothing I try works. Solr 4.0 nightly build.

Any tips, very appreciated!

Darren







Re: federated / meta search

2010-06-18 Thread Sascha Szott

Hi Joe & Markus,

sounds good! Maybe I should add a note to the Wiki page on 
federated search [1].


Thanks,
Sascha

[1] http://wiki.apache.org/solr/FederatedSearch

Joe Calderon wrote:

yes, you can use distributed search across shards with different
schemas as long as the query only references overlapping fields. i
usually test adding new fields or tokenizers on one shard and deploy
only after i've verified it's working properly
On Thu, Jun 17, 2010 at 1:10 PM, Markus Jelsma  wrote:

Hi,



Check out Solr sharding [1] capabilities. I never tested it with different 
schema's but if each node is queried with fields that it supports, it should 
return useful results.



[1]: http://wiki.apache.org/solr/DistributedSearch



Cheers.

-Original message-----
From: Sascha Szott
Sent: Thu 17-06-2010 19:44
To: solr-user@lucene.apache.org;
Subject: federated / meta search

Hi folks,

if I'm seeing it right Solr currently does not provide any support for
federated / meta searching. Therefore, I'd like to know if anyone has
already put effort into this direction. Moreover, is federated / meta
search considered a scenario Solr should be able to deal with at all, or
is it (far) beyond the scope of Solr?

To be more precise, I'll give you a short explanation of my
requirements. Assume, there are a couple of Solr instances running at
different places. The documents stored within those instances are all
from the same domain (bibliographic records), but it can not be ensured
that the schema definitions conform to 100%. But lets say, there are at
least some index fields that are present in all instances (fields with
the same name and type definition). Now, I'd like to perform a search on
all instances at the same time (with the restriction that the query
contains only those fields that overlap among the different schemas) and
combine the results in a reasonable way by utilizing the score
information associated with each hit. Please note, that due to legal
issues it is not feasible to build a single index that integrates the
documents of all Solr instances under consideration.

Thanks in advance,
Sascha






federated / meta search

2010-06-17 Thread Sascha Szott

Hi folks,

if I'm seeing it right Solr currently does not provide any support for 
federated / meta searching. Therefore, I'd like to know if anyone has 
already put effort into this direction. Moreover, is federated / meta 
search considered a scenario Solr should be able to deal with at all, or 
is it (far) beyond the scope of Solr?


To be more precise, I'll give you a short explanation of my 
requirements. Assume, there are a couple of Solr instances running at 
different places. The documents stored within those instances are all 
from the same domain (bibliographic records), but it can not be ensured 
that the schema definitions agree 100%. But let's say there are at 
least some index fields that are present in all instances (fields with 
the same name and type definition). Now, I'd like to perform a search on 
all instances at the same time (with the restriction that the query 
contains only those fields that overlap among the different schemas) and 
combine the results in a reasonable way by utilizing the score 
information associated with each hit. Please note, that due to legal 
issues it is not feasible to build a single index that integrates the 
documents of all Solr instances under consideration.


Thanks in advance,
Sascha



Re: strange results with query and hyphened words

2010-05-31 Thread Sascha Szott
Sorry Markus, I mixed up the index and query field in analysis.jsp. In 
fact, I meant that a search for profiauskunft matches profi-auskunft.


I'm not sure, whether the case you are dealing with (search for 
profi-auskunft should match profiauskunft) is appropriately addressed by 
the WordDelimiterFilter. What about using the PatternReplaceCharFilter 
at query time to eliminate all intra-word hyphens?
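
A sketch of such a char filter (the pattern is only a suggestion; it removes 
hyphens that sit between word characters):

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\w)-(?=\w)" replacement="$1"/>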


-Sascha

Sascha Szott wrote:

Hi Markus,


the default-config for index is:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>

and for query:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>

That's not true. The default configuration for query-time processing is:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="0" catenateNumbers="0"
        catenateAll="0" splitOnCaseChange="1"/>

By using this setting, a search for "profi-auskunft" will match
"profiauskunft".

It's important to note that WordDelimiterFilterFactory's catenate*
parameters should only be used in the index-time analysis stack.
Otherwise the strange behaviour you mentioned (a search for profi-auskunft
is translated into "profi followed by (auskunft or profiauskunft)") will
occur.

Best,
Sascha


-----Original Message-----
From: Sascha Szott [mailto:sz...@zib.de]
Sent: Sunday, 30 May 2010 19:01
To: solr-user@lucene.apache.org
Subject: Re: strange results with query and hyphened words

Hi Markus,

I was facing the same problem a few days ago and found an
explanation in
the mail archive that clarifies my question regarding the usage of
Solr's WordDelimiterFilterFactory:

http://markmail.org/message/qoby6kneedtwd42h

Best,
Sascha

markus.rietz...@rzf.fin-nrw.de wrote:

i am wondering why a search term with a hyphen doesn't match.

my search term is "profi-auskunft". in WordDelimiterFilterFactory i have
catenateWords, so my understanding is that profi-auskunft would search
for profiauskunft. when i use the analysis panel in solr admin i see that
profi-auskunft matches a term "profiauskunft".

the analysis will show

Query Analyzer
WhitespaceTokenizerFactory:  profi-auskunft
SynonymFilterFactory:        profi-auskunft
StopFilterFactory:           profi-auskunft

WordDelimiterFilterFactory:

  term position     1      2
  term text         profi  auskunft
                           profiauskunft
  term type         word   word
                           word
  source start,end  0,5    6,14
                           0,15

LowerCaseFilterFactory
SnowballPorterFilterFactory

why are auskunft and profiauskunft in one column? how do they get
searched?

when i search "profiauskunft" i have 230 hits; when i search for
"profi-auskunft" i get fewer hits. when i call the search with
debugQuery=on i see

body:"profi (auskunft profiauskunft)"

what does this query mean? profi and "auskunft or profiauskunft"?

Re: strange results with query and hyphened words

2010-05-31 Thread Sascha Szott

Hi Markus,


the default-config for index is:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>

and for query:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>

That's not true. The default configuration for query-time processing is:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="0" catenateNumbers="0"
        catenateAll="0" splitOnCaseChange="1"/>

By using this setting, a search for "profi-auskunft" will match 
"profiauskunft".


It's important to note that WordDelimiterFilterFactory's catenate* 
parameters should only be used in the index-time analysis stack. 
Otherwise the strange behaviour you mentioned (a search for profi-auskunft 
is translated into "profi followed by (auskunft or profiauskunft)") will 
occur.


Best,
Sascha


-----Original Message-----
From: Sascha Szott [mailto:sz...@zib.de]
Sent: Sunday, 30 May 2010 19:01
To: solr-user@lucene.apache.org
Subject: Re: strange results with query and hyphened words

Hi Markus,

I was facing the same problem a few days ago and found an
explanation in
the mail archive that clarifies my question regarding the usage of
Solr's WordDelimiterFilterFactory:

http://markmail.org/message/qoby6kneedtwd42h

Best,
Sascha

markus.rietz...@rzf.fin-nrw.de wrote:

i am wondering why a search term with a hyphen doesn't match.

my search term is "profi-auskunft". in WordDelimiterFilterFactory i have
catenateWords, so my understanding is that profi-auskunft would search
for profiauskunft. when i use the analysis panel in solr admin i see that
profi-auskunft matches a term "profiauskunft".

the analysis will show

Query Analyzer
WhitespaceTokenizerFactory:  profi-auskunft
SynonymFilterFactory:        profi-auskunft
StopFilterFactory:           profi-auskunft

WordDelimiterFilterFactory:

  term position     1      2
  term text         profi  auskunft
                           profiauskunft
  term type         word   word
                           word
  source start,end  0,5    6,14
                           0,15

LowerCaseFilterFactory
SnowballPorterFilterFactory

why are auskunft and profiauskunft in one column? how do they get
searched?

when i search "profiauskunft" i have 230 hits; when i search for
"profi-auskunft" i get fewer hits. when i call the search with
debugQuery=on i see

body:"profi (auskunft profiauskunft)"

what does this query mean? profi and "auskunft or profiauskunft"?

Re: strange results with query and hyphened words

2010-05-30 Thread Sascha Szott

Hi Markus,

I was facing the same problem a few days ago and found an explanation in 
the mail archive that clarifies my question regarding the usage of 
Solr's WordDelimiterFilterFactory:


http://markmail.org/message/qoby6kneedtwd42h

Best,
Sascha

markus.rietz...@rzf.fin-nrw.de wrote:

i am wondering why a search term with a hyphen doesn't match.

my search term is "profi-auskunft". in WordDelimiterFilterFactory i have
catenateWords, so my understanding is that profi-auskunft would search
for profiauskunft. when i use the analysis panel in solr admin i see that
profi-auskunft matches a term "profiauskunft".

the analysis will show

Query Analyzer
WhitespaceTokenizerFactory:  profi-auskunft
SynonymFilterFactory:        profi-auskunft
StopFilterFactory:           profi-auskunft

WordDelimiterFilterFactory:

  term position     1      2
  term text         profi  auskunft
                           profiauskunft
  term type         word   word
                           word
  source start,end  0,5    6,14
                           0,15

LowerCaseFilterFactory
SnowballPorterFilterFactory

why are auskunft and profiauskunft in one column? how do they get
searched?

when i search "profiauskunft" i have 230 hits; when i search for
"profi-auskunft" i get fewer hits. when i call the search with
debugQuery=on i see

body:"profi (auskunft profiauskunft)"

what does this query mean? profi and "auskunft or profiauskunft"?



Re: sort by field length

2010-05-26 Thread Sascha Szott

Hi Erick,

Erick Erickson wrote:

Ah, I may have misunderstood, I somehow got it in my mind
you were talking about the length of each term (as in string length).

But if you're looking at the field length as the count of terms, that's
another question, sorry for the confusion...

I have to ask, though, why you want to sort this way? The relevance
calculations already factor in both term frequency and field length. What's
the use-case for sorting by field length given the above?
It's not a real-world use case -- I just want to get a better 
understanding of the data I'm indexing (therefore, performance is not a 
concern). In my current use case, you can think of the field length 
as an indicator of data quality (i.e., the longer the field content, the 
worse the quality). Being able to sort the field data in order of 
decreasing length would allow me to investigate "exceptional" data items 
that are not appropriately handled by my curation process.


Best,
Sascha



Best
Erick

On Tue, May 25, 2010 at 3:40 AM, Sascha Szott  wrote:


Hi Erick,


Erick Erickson wrote:


Are you sure you want to recompute the length when sorting?
It's the classic time/space tradeoff, but I'd suggest that when
your index is big enough to make taking up some more space
a problem, it's far too big to spend the cycles calculating each
term length for sorting purposes considering you may be
sorting all the terms in your index worst-case.


Good point, thank you for the clarification. I "thought" that Lucene
internally stores the field length (e.g., in order to compute the relevance)
and getting this information at query time requires only a simple lookup.

-Sascha




But you could consider payloads for storing the length, although
that would still be redundant...

Best
Erick

On Mon, May 24, 2010 at 8:30 AM, Sascha Szott   wrote:

  Hi folks,


is it possible to sort by field length without having to (redundantly)
save
the length information in a separate index field? At first, I thought to
accomplish this using a function query, but I couldn't find an
appropriate
one.

Thanks in advance,
Sascha











Re: Faceted search not working?

2010-05-25 Thread Sascha Szott

Hi,

please note that the FacetComponent is one of the six search components 
that are automatically associated with solr.SearchHandler (this holds 
also for the QueryComponent).


Another note: By using name="components" all default components will be 
replaced by the components you explicitly mentioned (i.e., 
QueryComponent and FacetComponent in your example). To avoid this, use 
name="last-components" instead.


-Sascha

Jean-Sebastien Vachon wrote:

Is the FacetComponent loaded at all?

<arr name="components">
  <str>query</str>
  <str>facet</str>
</arr>



On 2010-05-25, at 3:32 AM, Sascha Szott wrote:


Hi Birger,

Birger Lie wrote:

I don't think the boolean fields are mapped to "on" and "off" :)

You can use true and on interchangeably.

-Sascha




-birger

-Original Message-
From: Ilya Sterin [mailto:ster...@gmail.com]
Sent: 24. mai 2010 23:11
To: solr-user@lucene.apache.org
Subject: Faceted search not working?

I'm trying to perform a faceted search without any luck.  Result set doesn't 
return any facet information...

http://localhost:8080/solr/select/?q=title:*&facet=on&facet.field=title

I'm getting the result set, but no facet information present?  Is there 
something else that needs to happen to turn faceting on?

I'm using latest Solr 1.4 release.  Data is indexed from the database using 
dataimporter.

Thanks.

Ilya Sterin









Re: Highlighting is not happening

2010-05-25 Thread Sascha Szott

Hi,

to accomplish that, use the highlighting parameters hl.simple.pre and 
hl.simple.post.
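
For example, set as defaults of your request handler (or passed as plain 
request parameters), they define the markup that is wrapped around each 
match:

<lst name="defaults">
  <str name="hl">true</str>
  <str name="hl.simple.pre">&lt;b&gt;</str>
  <str name="hl.simple.post">&lt;/b&gt;</str>
</lst>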


By the way, there are a plenty of other parameters that affect 
highlighting. Take a look at:


http://wiki.apache.org/solr/HighlightingParameters

-Sascha

Doddamani, Prakash wrote:

Hey,

I thought the highlights would happen in the fields of the documents
returned from Solr :)
But it gives a new list of highlighting below, sorry for the confusion.

I was wondering whether there is a way that the returned fields themselves
contain bold characters.

E.g.: if searched for "query", the returned response which contains
"query" should show it in bold.



Regards
Prakash

-Original Message-----
From: Sascha Szott [mailto:sz...@zib.de]
Sent: Monday, May 24, 2010 10:55 PM
To: solr-user@lucene.apache.org
Subject: Re: Highlighting is not happening

Hi Prakash,

can you provide

1. the definition of the relevant field
2. your query
3. the definition of the relevant request handler
4. a field value that is stored in your index and should be highlighted

-Sascha

Doddamani, Prakash wrote:

Thanks Sascha,

The "type" for fields for which I am searching are all "text" , and I
am using solr.TextField




  
  
  
  
  
  
  
  


  
  
  
  
  
  
  

  

Regards
Prakash


-----Original Message-
From: Sascha Szott [mailto:sz...@zib.de]
Sent: Monday, May 24, 2010 10:29 PM
To: solr-user@lucene.apache.org
Subject: Re: Highlighting is not happening

Hi Prakash,

more importantly, check the field type and its associated analyzer. In
case you use a "non-tokenized" type (e.g., string), highlighting will
not appear if only a partial field match exists (only exact matches,
i.e. the query coincides with the field value, will be highlighted).
If that's not your intent, you should at least define a tokenizer for
the field type.

Best,
Sascha

Doddamani, Prakash wrote:

Hey Daren,
Yes, the fields for which I am searching are stored and indexed, also
they are returned from the query. Also it is not coming if the entire
search keyword is part of the field.

Thanks
Prakash

-Original Message-
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
Sent: Monday, May 24, 2010 9:32 PM
To: solr-user@lucene.apache.org
Subject: Re: Highlighting is not happening

Check that the field you are highlighting on is "stored". It won't
work otherwise.


Now, this also means that the field is returned from the query. For
large text fields to be highlighted only, this means the entire text
is returned for each result.


There is a pending feature to address this, that allows you to tell
Solr to NOT return a specific field (to avoid unnecessary transfer of
large text fields in this scenario).

Darren


Hi

I am using the dismax request handler, and I wanted to highlight the
search field, so I added

<str name="hl">true</str>

I was expecting that if I search for the keyword "Akon", the resultant
docs would show "Akon" in bold wherever it is available.

But I am not seeing them getting bold; could someone tell me the real
place where I should tune? If I pass hl=true explicitly, it does not
work either.

I have added the request handler:

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">name^20.0 coming^5 playing^4 keywords^0.1</str>
    <str name="bf">rord(isclassic)^0.5 ord(listeners)^0.3</str>
    <str name="fl">name, coming, playing, keywords, score</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="hl">true</str>
    <str name="hl.fragsize">0</str>
    <str name="hl.fragmenter">regex</str>
  </lst>
</requestHandler>

regards
prakash







Re: sort by field length

2010-05-25 Thread Sascha Szott

Hi Erick,

Erick Erickson wrote:

Are you sure you want to recompute the length when sorting?
It's the classic time/space tradeoff, but I'd suggest that when
your index is big enough to make taking up some more space
a problem, it's far too big to spend the cycles calculating each
term length for sorting purposes considering you may be
sorting all the terms in your index worst-case.
Good point, thank you for the clarification. I "thought" that Lucene 
internally stores the field length (e.g., in order to compute the 
relevance) and getting this information at query time requires only a 
simple lookup.


-Sascha



But you could consider payloads for storing the length, although
that would still be redundant...

Best
Erick

On Mon, May 24, 2010 at 8:30 AM, Sascha Szott  wrote:


Hi folks,

is it possible to sort by field length without having to (redundantly) save
the length information in a separate index field? At first, I thought to
accomplish this using a function query, but I couldn't find an appropriate
one.

Thanks in advance,
Sascha






Re: Faceted search not working?

2010-05-25 Thread Sascha Szott

Hi Birger,

Birger Lie wrote:

I don't think the boolean fields are mapped to "on" and "off" :)

You can use true and on interchangeably.

-Sascha




-birger

-Original Message-
From: Ilya Sterin [mailto:ster...@gmail.com]
Sent: 24. mai 2010 23:11
To: solr-user@lucene.apache.org
Subject: Faceted search not working?

I'm trying to perform a faceted search without any luck.  Result set doesn't 
return any facet information...

http://localhost:8080/solr/select/?q=title:*&facet=on&facet.field=title

I'm getting the result set, but no facet information present?  Is there 
something else that needs to happen to turn faceting on?

I'm using latest Solr 1.4 release.  Data is indexed from the database using 
dataimporter.

Thanks.

Ilya Sterin




Re: Faceted search not working?

2010-05-24 Thread Sascha Szott

Hi Ilya,

Ilya Sterin wrote:

I'm trying to perform a faceted search without any luck.  Result set
doesn't return any facet information...

http://localhost:8080/solr/select/?q=title:*&facet=on&facet.field=title

I'm getting the result set, but no facet information present?  Is there
something else that needs to happen to turn faceting on?

No.

What does http://localhost:8080/solr/select/?q=title:*&fl=title&wt=xml 
return?


-Sascha



Re: Highlighting is not happening

2010-05-24 Thread Sascha Szott

Hi Prakash,

can you provide

1. the definition of the relevant field
2. your query
3. the definition of the relevant request handler
4. a field value that is stored in your index and should be highlighted

-Sascha

Doddamani, Prakash wrote:

Thanks Sascha,

The "type" for fields for which I am searching are all "text" , and I am
using solr.TextField



   
 
 
 
 
 
 
 
 
   
   
 
 
 
 
 
 
 
   
 

Regards
Prakash


-Original Message-
From: Sascha Szott [mailto:sz...@zib.de]
Sent: Monday, May 24, 2010 10:29 PM
To: solr-user@lucene.apache.org
Subject: Re: Highlighting is not happening

Hi Prakash,

more importantly, check the field type and its associated analyzer. In
case you use a "non-tokenized" type (e.g., string), highlighting will
not appear if only a partial field match exists (only exact matches,
i.e. the query coincides with the field value, will be highlighted). If
that's not your intent, you should at least define a tokenizer for the
field type.

Best,
Sascha

Doddamani, Prakash wrote:

Hey Daren,
Yes, the fields for which I am searching are stored and indexed, also
they are returned from the query. Also it is not coming if the entire
search keyword is part of the field.

Thanks
Prakash

-Original Message-
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
Sent: Monday, May 24, 2010 9:32 PM
To: solr-user@lucene.apache.org
Subject: Re: Highlighting is not happening

Check that the field you are highlighting on is "stored". It won't
work otherwise.


Now, this also means that the field is returned from the query. For
large text fields to be highlighted only, this means the entire text
is returned for each result.


There is a pending feature to address this, that allows you to tell
Solr to NOT return a specific field (to avoid unnecessary transfer of
large text fields in this scenario).

Darren


Hi

I am using the dismax request handler, and I wanted to highlight the
search field, so I added

<str name="hl">true</str>

I was expecting that if I search for the keyword "Akon", the resultant
docs would show "Akon" in bold wherever it is available.

But I am not seeing them getting bold; could someone tell me the real
place where I should tune? If I pass hl=true explicitly, it does not
work either.

I have added the request handler:

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">name^20.0 coming^5 playing^4 keywords^0.1</str>
    <str name="bf">rord(isclassic)^0.5 ord(listeners)^0.3</str>
    <str name="fl">name, coming, playing, keywords, score</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="hl">true</str>
    <str name="hl.fragsize">0</str>
    <str name="hl.fragmenter">regex</str>
  </lst>
</requestHandler>


regards
prakash









Re: Highlighting is not happening

2010-05-24 Thread Sascha Szott

Hi Prakash,

more importantly, check the field type and its associated analyzer. In 
case you use a "non-tokenized" type (e.g., string), highlighting will 
not appear if only a partial field match exists (only exact matches, 
i.e. the query coincides with the field value, will be highlighted). If 
that's not your intent, you should at least define a tokenizer for the 
field type.
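
For illustration, a minimal tokenized field type that works with 
partial-match highlighting could look like this:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>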


Best,
Sascha

Doddamani, Prakash wrote:

Hey Daren,
Yes the fields for which I am searching are stored and indexed, also
they are returned from the query,
Also it is not coming if the entire search keyword is part of the
field.

Thanks
Prakash

-Original Message-
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
Sent: Monday, May 24, 2010 9:32 PM
To: solr-user@lucene.apache.org
Subject: Re: Highlighting is not happening

Check that the field you are highlighting on is "stored". It won't work
otherwise.


Now, this also means that the field is returned from the query. For
large text fields to be highlighted only, this means the entire text is
returned for each result.


There is a pending feature to address this, that allows you to tell Solr
to NOT return a specific field (to avoid unnecessary transfer of large
text fields in this scenario).

Darren


Hi

I am using the dismax request handler, and I wanted to highlight the
search field, so I added

<str name="hl">true</str>

I was expecting that if I search for the keyword "Akon", the resultant
docs would show "Akon" in bold wherever it is available.

But I am not seeing them getting bold; could someone tell me the real
place where I should tune? If I pass hl=true explicitly, it does not
work either.

I have added the request handler:

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">name^20.0 coming^5 playing^4 keywords^0.1</str>
    <str name="bf">rord(isclassic)^0.5 ord(listeners)^0.3</str>
    <str name="fl">name, coming, playing, keywords, score</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="hl">true</str>
    <str name="hl.fragsize">0</str>
    <str name="hl.fragmenter">regex</str>
  </lst>
</requestHandler>

regards
prakash







sort by field length

2010-05-24 Thread Sascha Szott

Hi folks,

is it possible to sort by field length without having to (redundantly) 
save the length information in a separate index field? At first, I 
thought to accomplish this using a function query, but I couldn't find 
an appropriate one.
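
(For reference, the redundant workaround would be an additional field, 
say a hypothetical title_len, filled with the token count by the 
indexing client:

<field name="title_len" type="int" indexed="true" stored="false"/>

and then sorted on via sort=title_len desc.)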


Thanks in advance,
Sascha



Re: Wildcard queries

2010-05-21 Thread Sascha Szott

Hi Robert,

thanks, you're absolutely right. I should better refine my initial 
question to: What's the idea behind the fact that no *lowercasing* is 
performed on wildcarded search terms if the field in question contains a 
LowercaseFilter in its associated field type definition?


-Sascha

Robert Muir wrote:

we can use stemming as an example:

let's say your query is c?ns?st?nt?y

how will this match "consistently", which the porter stemmer
transforms to 'consistent'?
furthermore, note that i replaced the vowels with ?'s here. The porter
stemmer doesn't just rip stuff off the end, but attempts to guess
syllables as part of the process, so it cannot possibly work.

the only way it would work in this situation would be if you formed
permutations of all the possible words this wildcard would match, and
then did analysis on each form, and searched on all stems.

but, this is impossible, since the * operator allows an infinite language.

On Fri, May 21, 2010 at 10:11 AM, Sascha Szott  wrote:

Hi folks,

what's the idea behind the fact that no text analysis (e.g. lowercasing) is
performed on wildcarded search terms?

In my context this behaviour seems to be counter-intuitive (I guess that's
the case in the majority of applications) and my application needs to
lowercase any input term before sending the HTTP request to my Solr server.

Would it be easy to disable this behaviour in Solr (1.5)? I would like to
see a config parameter (per field type) that allows one to disable this "odd"
behaviour if needed. To ensure backward compatibility the "odd" behaviour
should remain the default.

Am I missing any drawbacks?

Best,
Sascha






Wildcard queries

2010-05-21 Thread Sascha Szott

Hi folks,

what's the idea behind the fact that no text analysis (e.g. lowercasing) 
is performed on wildcarded search terms?


In my context this behaviour seems to be counter-intuitive (I guess 
that's the case in the majority of applications) and my application 
needs to lowercase any input term before sending the HTTP request to my 
Solr server.


Would it be easy to disable this behaviour in Solr (1.5)? I would like 
to see a config parameter (per field type) that allows one to disable this 
"odd" behaviour if needed. To ensure backward compatibility the "odd" 
behaviour should remain the default.


Am I missing any drawbacks?

Best,
Sascha



Re: How to tell which field matched?

2010-05-15 Thread Sascha Szott

Hi,

I'm not sure if debugQuery=on is a feasible solution in a production 
environment, as generating such extra information requires a reasonable 
amount of computation.


-Sascha

Jon Baer wrote:

Does the standard debug component (?debugQuery=on) give you what you need?

http://wiki.apache.org/solr/SolrRelevancyFAQ#Why_does_id:archangel_come_before_id:hawkgirl_when_querying_for_.22wings.22

- Jon

On May 14, 2010, at 4:03 PM, Tim Garton wrote:


All,
 I've searched around for help with something we are trying to do
and haven't come across much.  We are running solr 1.4.  Here is a
summary of the issue we are facing:

A simplified example of our schema is something like this:

    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="supplement_title" type="text" indexed="true"
           stored="true" multiValued="true"/>
    <field name="supplement_pdf_text" type="text" indexed="true"
           stored="true" multiValued="true"/>

When someone does a search we search across the title,
supplement_title, and supplement_pdf_text fields.  When we get our
results, we would like to be able to tell which field the search
matched and if it's a multiValued field, which of the multiple values
matched.  This is so that we can display results similar to:

Example Title
Example Supplement Title
Example Supplement Title 2 (your search matched this document)
Example Supplement Title 3

Example Title 2
Example Supplement Title 4
Example Supplement Title 5
Example Supplement Title 6 (your search matched this document)

etc.

How would you recommend doing this?  Is there some way to get solr to
tell us which field matched, including multiValued fields?  As a
workaround we have been using highlighting to tell which field
matched, but it doesn't get us what we want for multiValued fields and
there is a significant cost to enabling the highlighting.  Should we
design our schema in some other fashion to achieve these results?
Thanks.

-Tim






Re: Autosuggest

2010-05-15 Thread Sascha Szott

Hi,

maybe you would like to have a look at solr.ShingleFilterFactory [1] to 
expand your autosuggest to more than one term.
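
For instance, added to the analyzer of the autosuggest field 
(maxShingleSize and the surrounding filters are only an example):

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
          outputUnigrams="true"/>
</analyzer>

This indexes word n-grams ("new", "new york", "new york city") in 
addition to the single terms.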


-Sascha

[1] 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory


Blargy wrote:


Thanks for your help and especially your analyzer.. probably saved me a
full-import or two  :)





Re: Solr Schema Question

2010-04-17 Thread Sascha Szott

Hi Serdar,

take a look at Solr's DataImportHandler:

http://wiki.apache.org/solr/DataImportHandler

Best,
Sascha

Serdar Sahin wrote:

Hi,

I am rather new to Solr and have a question.

We have around 200.000 txt files which are placed into the file cloud.
The file path is something similar to this:

file/97/8f/840/fa4-1.txt
file/a6/9d/ab0/ca2-2.txt etc.

and we also store the metadata (like title, description, tags etc)
about these files in the mysql server. So, what I want to do is to
index title, description, tags and other data from mysql, and also get
the txt file from file server, and link them as one record for
searching, but I could not figure out how to automatize this process.
I can give the path from the sql query like, Select id, title,
description, file_path, and then solr can use this path to retrieve the
txt file, but I don't know whether it is possible or not.

What is the best way to index these files with their tag title and
description without coding in Java (Perl is ok). These txt files are
large, between 100kb-10mb, so the last option is to store them in the
database.

Thanks,

Serdar




Re: StreamingUpdateSolrServer hangs

2010-04-16 Thread Sascha Szott

Hi Yonik,

thanks for your fast reply.

Yonik Seeley wrote:

Thanks for the report Sascha.
So after the hang, it never recovers?  Some amount of hanging could be
visible if there was a commit on the Solr server or something else to
cause the solr requests to block for a while... but it should return
to normal on its own...
In my case the whole application hangs and never recovers (CPU 
utilization goes down to near 0%). Interestingly, the problem 
reproducibly occurs only if SUSS is created with *more than 2* threads.



Looking at the stack trace, it looks like threads are blocked waiting
to get an http connection.
I forgot to mention that my index app has exclusive access to the Solr 
instance. Therefore, concurrent searches against the same Solr instance 
while indexing are excluded.



I'm traveling all next week, but I'll open a JIRA issue for this now.

Thank you!


Anything that would help us reproduce this is much appreciated.

Are there any others who have experienced the same problem?

-Sascha



On Fri, Apr 16, 2010 at 8:57 AM, Sascha Szott  wrote:

Hi Yonik,

Yonik Seeley wrote:


Stephen, were you running stock Solr 1.4, or did you apply any of the
SolrJ patches?
I'm trying to figure out if anyone still has any problems, or if this
was fixed with SOLR-1711:


I'm using the latest trunk version (rev. 934846) and constantly running into
the same problem. I'm using StreamingUpdateSolrServer with 3 threads and a
queue size of 20 (not really knowing if this configuration is optimal). My
multi-threaded application indexes 200k data items (bibliographic metadata
in Dublin Core format) and constantly hangs after running for some time.

Below you can find the thread dump of one of my index threads (after the app
hangs all dumps are the same)

"thread 19" prio=10 tid=0x7fe8c0415800 nid=0x277d waiting on condition
[0x42d05000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for<0x7fe8cdcb7598>  (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:254)
at
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer.request(StreamingUpdateSolrServer.java:216)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:64)
at
de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:29)
at
de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:10)
at
de.kobv.ked.index.AbstractIndexThread.addIndexDocument(AbstractIndexThread.java:59)
at de.kobv.ked.rss.RssThread.indiziere(RssThread.java:30)
at de.kobv.ked.rss.RssThread.run(RssThread.java:58)



and of the three SUSS threads:

"pool-1-thread-3" prio=10 tid=0x7fe8c7b7f000 nid=0x2780 in Object.wait()
[0x409ac000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on<0x7fe8cdcb6f10>  (a
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
- locked<0x7fe8cdcb6f10>  (a
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
at
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner.run(StreamingUpdateSolrServer.java:153)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)

"pool-1-thread-2" prio=10 tid=0x7fe8c7afa000 nid=0x277f in Object.wait()
[0x40209000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on<0x7fe8cdcb6f10>  (a
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(M

Re: StreamingUpdateSolrServer hangs

2010-04-16 Thread Sascha Szott

Hi Yonik,

Yonik Seeley wrote:

Stephen, were you running stock Solr 1.4, or did you apply any of the
SolrJ patches?
I'm trying to figure out if anyone still has any problems, or if this
was fixed with SOLR-1711:
I'm using the latest trunk version (rev. 934846) and constantly running 
into the same problem. I'm using StreamingUpdateSolrServer with 3 threads 
and a queue size of 20 (not really knowing if this configuration is 
optimal). My multi-threaded application indexes 200k data items 
(bibliographic metadata in Dublin Core format) and constantly hangs 
after running for some time.


Below you can find the thread dump of one of my index threads (after the 
app hangs all dumps are the same)


"thread 19" prio=10 tid=0x7fe8c0415800 nid=0x277d waiting on 
condition [0x42d05000]

   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x7fe8cdcb7598> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)

at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
	at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
	at 
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:254)
	at 
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer.request(StreamingUpdateSolrServer.java:216)
	at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)

at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:64)
	at 
de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:29)
	at 
de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:10)
	at 
de.kobv.ked.index.AbstractIndexThread.addIndexDocument(AbstractIndexThread.java:59)

at de.kobv.ked.rss.RssThread.indiziere(RssThread.java:30)
at de.kobv.ked.rss.RssThread.run(RssThread.java:58)



and of the three SUSS threads:

"pool-1-thread-3" prio=10 tid=0x7fe8c7b7f000 nid=0x2780 in 
Object.wait() [0x409ac000]

   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
	- waiting on <0x7fe8cdcb6f10> (a 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
	at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
	- locked <0x7fe8cdcb6f10> (a 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
	at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
	at 
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
	at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
	at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
	at 
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner.run(StreamingUpdateSolrServer.java:153)
	at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:619)

"pool-1-thread-2" prio=10 tid=0x7fe8c7afa000 nid=0x277f in 
Object.wait() [0x40209000]

   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
	- waiting on <0x7fe8cdcb6f10> (a 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
	at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
	- locked <0x7fe8cdcb6f10> (a 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
	at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
	at 
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
	at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
	at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
	at 
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner.run(StreamingUpdateSolrServer.java:153)
	at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:619)

"pool-1-thread-1" prio=10 tid=0x7fe8c79f2800 nid=0x277e in 
Object.wait() [0x42e06000]

   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
	- waiting on <0x7fe8cdcb6f10> (a 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
	at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.

How to sort facet values lexicographically in descending order?

2010-03-11 Thread Sascha Szott

Hi folks,

is there a way to sort facet values lexicographically in descending 
order? If it's not possible right now, are there any feasible 
workarounds to accomplish this?


Note: I've seen issue SOLR-1672, but it does not solve my problem since 
it deals with facet counts only.


Best,
Sascha



Re: (default) maximum chars per field

2010-02-05 Thread Sascha Szott

markus.rietz...@rzf.fin-nrw.de wrote:

ok,
i was looking for all types of "max" but somehow didn't saw the 
maxFieldLength.
this is a global parameter, right? can this be defined on a field basis?
It's a global parameter counting the maximum number of tokens(!) - not 
the number of characters or bytes - per field. If a field's content 
exceeds that number, the remaining tokens are truncated without any notice.
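
In solrconfig.xml the parameter lives in the index section, e.g. (the 
value below effectively disables truncation):

<indexDefaults>
  <maxFieldLength>2147483647</maxFieldLength>
</indexDefaults>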


-Sascha



global would be enough at the moment.

thank you


-Ursprüngliche Nachricht-
Von: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Gesendet: Freitag, 5. Februar 2010 11:35
An: solr-user@lucene.apache.org
Betreff: Re: (default) maximum chars per field

On Fri, Feb 5, 2010 at 3:56 PM,
  wrote:


hi,
what is the default maximum charsize per field? i found a maxChars
parameter for copyField but i don't think that this is what i am
looking for.

we have indexed some documents via tika/solrcell. only the beginning of
these documents can be searched. where can i define the maximum size of
a document/field that will be indexed? at the moment we do the updates
via xml upload. is there a maxsize for this xml? in solrconfig.xml i have
found "multipartUploadLimitInKB=2048000", that means 2 GB would be the
max size to post. that would be enough...



Increase maxFieldLength in your solrconfig.xml. The default is 10KB.

--
Regards,
Shalin Shekhar Mangar.





Re: Deploying Solr 1.3 in JBoss 5

2010-02-05 Thread Sascha Szott
at 
org.jboss.deployers.plugins.deployers.DeployersImpl.doDeploy(DeployersImpl.java:1440)
at 
org.jboss.deployers.plugins.deployers.DeployersImpl.doInstallParentFirst(DeployersImpl.java:1158)
at 
org.jboss.deployers.plugins.deployers.DeployersImpl.doInstallParentFirst(DeployersImpl.java:1179)
at 
org.jboss.deployers.plugins.deployers.DeployersImpl.install(DeployersImpl.java:1099)
at 
org.jboss.dependency.plugins.AbstractControllerContext.install(AbstractControllerContext.java:348)
at 
org.jboss.dependency.plugins.AbstractController.install(AbstractController.java:1633)
at 
org.jboss.dependency.plugins.AbstractController.incrementState(AbstractController.java:935)
at 
org.jboss.dependency.plugins.AbstractController.resolveContexts(AbstractController.java:1083)
at 
org.jboss.dependency.plugins.AbstractController.resolveContexts(AbstractController.java:985)
at 
org.jboss.dependency.plugins.AbstractController.change(AbstractController.java:823)
at 
org.jboss.dependency.plugins.AbstractController.change(AbstractController.java:553)
at 
org.jboss.deployers.plugins.deployers.DeployersImpl.process(DeployersImpl.java:782)
at 
org.jboss.deployers.plugins.main.MainDeployerImpl.process(MainDeployerImpl.java:702)
at 
org.jboss.system.server.profileservice.repository.MainDeployerAdapter.process(MainDeployerAdapter.java:117)
at 
org.jboss.system.server.profileservice.repository.ProfileDeployAction.install(ProfileDeployAction.java:70)
at 
org.jboss.system.server.profileservice.repository.AbstractProfileAction.install(AbstractProfileAction.java:53)
at 
org.jboss.system.server.profileservice.repository.AbstractProfileService.install(AbstractProfileService.java:403)
at 
org.jboss.dependency.plugins.AbstractControllerContext.install(AbstractControllerContext.java:348)
at 
org.jboss.dependency.plugins.AbstractController.install(AbstractController.java:1633)
at 
org.jboss.dependency.plugins.AbstractController.incrementState(AbstractController.java:935)
at 
org.jboss.dependency.plugins.AbstractController.resolveContexts(AbstractController.java:1083)
at 
org.jboss.dependency.plugins.AbstractController.resolveContexts(AbstractController.java:985)
at 
org.jboss.dependency.plugins.AbstractController.install(AbstractController.java:775)
at 
org.jboss.dependency.plugins.AbstractController.install(AbstractController.java:540)
at 
org.jboss.system.server.profileservice.repository.AbstractProfileService.registerProfile(AbstractProfileService.java:308)
at 
org.jboss.system.server.profileservice.ProfileServiceBootstrap.start(ProfileServiceBootstrap.java:256)
at 
org.jboss.bootstrap.AbstractServerImpl.start(AbstractServerImpl.java:461)
at org.jboss.Main.boot(Main.java:221)
at org.jboss.Main$1.run(Main.java:556)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.jboss.xb.binding.JBossXBException: Failed to create a
new SAX parser
at 
org.jboss.xb.binding.parser.sax.SaxJBossXBParser.<init>(SaxJBossXBParser.java:97)
at 
org.jboss.xb.binding.UnmarshallerImpl.<init>(UnmarshallerImpl.java:56)
at 
org.jboss.xb.binding.UnmarshallerFactory$UnmarshallerFactoryImpl.newUnmarshaller(UnmarshallerFactory.java:96)
... 73 more
Caused by: java.lang.ClassCastException:
org.apache.xerces.parsers.XIncludeAwareParserConfiguration cannot be
cast to org.apache.xerces.xni.parser.XMLParserConfiguration


It seems I'm not the only one on the web, but I'm still working on it.

Just to gather some statistics, did anyone manage to deploy solr 1.3 to JBoss 5?

I should update the wiki page too, maybe.

Thank you very much, Sascha.

Bye

L.M.



On 2 February 2010 18:02, Sascha Szott  wrote:

Luca Molteni wrote:


Actually, if I hard-code the value, it gives me the same error...
interesting.


According to the error message:

The content of element type "env-entry" must match
"(description?,env-entry-name,env-entry-value?,env-entry-type)"

Maybe it helps to change the order of elements within env-entry
(env-entry-value before env-entry-type)?

-Sascha




On 2 February 2010 17:14, Sascha Szottwrote:


Hi,

I'm not sure if that's a Solr issue. However, what happens if you set
env-entry-value to C:/mypath/solr instead of ${solr.home.myhome}?

-Sascha

Am 02.02.2010 15:20, schrieb Luca Molteni:


Hello list,

I'm having some problem deploying solr to JBoss 5.

The problem is with environment variables:

Following this page of the wiki:  http://wiki.apache.org/solr/SolrJBoss

I've added to the web.xml of WEB-INF of solr

<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-type>java.lang.String</env-entry-type>
  <env-entry-value>${solr.home.myhome}</env-entry-value>
</env-entry>

Since I'm using lots of instances of solr in the same container.

This variable should be expanded by jboss itself in a path using
properties-services.xml:

 
solr.home

Re: Deploying Solr 1.3 in JBoss 5

2010-02-02 Thread Sascha Szott

Luca Molteni wrote:

Actually, if I hard-code the value, it gives me the same error... interesting.

According to the error message:

The content of element type "env-entry" must match
"(description?,env-entry-name,env-entry-value?,env-entry-type)"

Maybe it helps to change the order of elements within env-entry 
(env-entry-value before env-entry-type)?


-Sascha




On 2 February 2010 17:14, Sascha Szott  wrote:

Hi,

I'm not sure if that's a Solr issue. However, what happens if you set
env-entry-value to C:/mypath/solr instead of ${solr.home.myhome}?

-Sascha

Am 02.02.2010 15:20, schrieb Luca Molteni:


Hello list,

I'm having some problem deploying solr to JBoss 5.

The problem is with environment variables:

Following this page of the wiki:  http://wiki.apache.org/solr/SolrJBoss

I've added to the web.xml of WEB-INF of solr

<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-type>java.lang.String</env-entry-type>
  <env-entry-value>${solr.home.myhome}</env-entry-value>
</env-entry>

Since I'm using lots of instances of solr in the same container.

This variable should be expanded by jboss itself in a path using
properties-services.xml:

 
<attribute name="Properties">
   solr.home.myhome=C:/mypath/solr
</attribute>
 

Unfortunately, during deployment of the solr application, it gives me
this error:

Caused by: org.jboss.xb.binding.JBossXBException: Failed to parse
source: The content of element type "env-entry" must match
"(description?,env-entry-name,env-entry-value?,env-entry-type)". @

vfsfile:/C:/pathtojboss/server/solrrepo/deploy/Solrrepo/solr-mysolr.war/WEB-INF/web.xml[146,14]
at
org.jboss.xb.binding.parser.sax.SaxJBossXBParser.parse(SaxJBossXBParser.java:203)

... 33 more
Caused by: org.xml.sax.SAXException: The content of element type
"env-entry" must match
"(description?,env-entry-name,env-entry-value?,env-entry-type)". @

vfsfile:/C:/pathtojboss/server/solrrepo/deploy/Solrrepo/solr-mysolr.war/WEB-INF/web.xml[146,14]
at
org.jboss.xb.binding.parser.sax.SaxJBossXBParser$MetaDataErrorHandler.error(SaxJBossXBParser.java:426)


Notice that the same .war and properties-services.xml works flawlessly
in JBoss 4.2.3

Any ideas?

Thank you very much.

L.M.




Re: java.lang.NullPointerException with MySQL DataImportHandler

2010-02-02 Thread Sascha Szott

Hi,

since some of the fields used in your DIH configuration aren't mandatory 
(e.g., keywords and tags are defined as nullable in your db table 
schema), add a default value to all optional fields in your schema 
configuration (e.g., default = ""). Note, that Solr does not understand 
the db-related concept of null values.
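
In schema.xml that would look something like this (the field types are 
assumed):

<field name="keywords" type="text" indexed="true" stored="true" default=""/>
<field name="tags" type="text" indexed="true" stored="true" default=""/>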


Solr's log output

SolrInputDocument[{keywords=keywords(1.0)={Dolce}, name=name(1.0)={Dolce
& Gabbana D&G Neckties designer Tie for men 543},
productID=productID(1.0)={220213}}]

indicates that there aren't any tags or descriptions stored for the item 
with productId 220213. Since no default value is specified, Solr raises 
an error when creating the index document.


-Sascha

Jean-Michel Philippon-Nadeau wrote:

Hi,

Thanks for the reply.

On Tue, 2010-02-02 at 16:57 +0100, Sascha Szott wrote:

* the output of MySQL's describe command for all tables/views referenced
in your DIH configuration


mysql> describe products;
+----------------+------------------+------+-----+---------+----------------+
| Field          | Type             | Null | Key | Default | Extra          |
+----------------+------------------+------+-----+---------+----------------+
| productID      | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| skuCode        | varchar(320)     | YES  | MUL | NULL    |                |
| upcCode        | varchar(320)     | YES  | MUL | NULL    |                |
| name           | varchar(320)     | NO   |     | NULL    |                |
| description    | text             | NO   |     | NULL    |                |
| keywords       | text             | YES  |     | NULL    |                |
| disqusThreadID | varchar(50)      | NO   |     | NULL    |                |
| tags           | text             | YES  |     | NULL    |                |
| createdOn      | int(10) unsigned | NO   |     | NULL    |                |
| lastUpdated    | int(10) unsigned | NO   |     | NULL    |                |
| imageURL       | varchar(320)     | YES  |     | NULL    |                |
| inStock        | tinyint(1)       | YES  | MUL | 1       |                |
| active         | tinyint(1)       | YES  |     | 1       |                |
+----------------+------------------+------+-----+---------+----------------+
13 rows in set (0.00 sec)

mysql>  describe product_soldby_vendor;
+-+--+--+-+-+---+
| Field   | Type | Null | Key | Default | Extra |
+-+--+--+-+-+---+
| productID   | int(10) unsigned | NO   | MUL | NULL|   |
| productVendorID | int(10) unsigned | NO   | MUL | NULL|   |
| price   | double   | NO   | | NULL|   |
| currency| varchar(5)   | NO   | | NULL|   |
| buyURL  | varchar(320) | NO   | | NULL|   |
+-+--+--+-+-+---+
5 rows in set (0.00 sec)

mysql> describe products_vendors_subcategories;
+----------------------------+------------------+------+-----+---------+----------------+
| Field                      | Type             | Null | Key | Default | Extra          |
+----------------------------+------------------+------+-----+---------+----------------+
| productVendorSubcategoryID | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| productVendorCategoryID    | int(10) unsigned | NO   |     | NULL    |                |
| labelEnglish               | varchar(320)     | NO   |     | NULL    |                |
| labelFrench                | varchar(320)     | NO   |     | NULL    |                |
+----------------------------+------------------+------+-----+---------+----------------+
4 rows in set (0.00 sec)

mysql> describe products_vendors_categories;
+-------------------------+------------------+------+-----+---------+----------------+
| Field                   | Type             | Null | Key | Default | Extra          |
+-------------------------+------------------+------+-----+---------+----------------+
| productVendorCategoryID | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| labelEnglish            | varchar(320)     | NO   |     | NULL    |                |
| labelFrench             | varchar(320)     | NO   |     | NULL    |                |
+-------------------------+------------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

mysql>  describe product_vendor_in_subcategory;
+---+--+--+-+-+---+
| Field | Type | Null | Key | Default | Extra |
+---+--+--+-+-+---+
| productVendorID   | int(10) unsigned | NO   | MUL | NULL|   |
| productCategoryID | int(10) unsigned | NO   | MUL | NULL|   |
+---+--+--+-+-+---+
2 rows in set (0.00 sec)

mysql>  describe products_vendors_countries;
++--+--+-+-++
| Field   

Re: Deploying Solr 1.3 in JBoss 5

2010-02-02 Thread Sascha Szott

Hi,

I'm not sure if that's a Solr issue. However, what happens if you set 
env-entry-value to C:/mypath/solr instead of ${solr.home.myhome}?


-Sascha

Am 02.02.2010 15:20, schrieb Luca Molteni:

Hello list,

I'm having some problem deploying solr to JBoss 5.

The problem is with environment variables:

Following this page of the wiki:  http://wiki.apache.org/solr/SolrJBoss

I've added to the web.xml of WEB-INF of solr

<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-type>java.lang.String</env-entry-type>
  <env-entry-value>${solr.home.myhome}</env-entry-value>
</env-entry>

Since I'm using lots of instances of solr in the same container.

This variable should be expanded by jboss itself in a path using
properties-services.xml:

 
<attribute name="Properties">
   solr.home.myhome=C:/mypath/solr
</attribute>
 

Unfortunately, during deployment of the solr application, it gives me
this error:

Caused by: org.jboss.xb.binding.JBossXBException: Failed to parse
source: The content of element type "env-entry" must match
"(description?,env-entry-name,env-entry-value?,env-entry-type)". @
vfsfile:/C:/pathtojboss/server/solrrepo/deploy/Solrrepo/solr-mysolr.war/WEB-INF/web.xml[146,14]
at 
org.jboss.xb.binding.parser.sax.SaxJBossXBParser.parse(SaxJBossXBParser.java:203)

... 33 more
Caused by: org.xml.sax.SAXException: The content of element type
"env-entry" must match
"(description?,env-entry-name,env-entry-value?,env-entry-type)". @
vfsfile:/C:/pathtojboss/server/solrrepo/deploy/Solrrepo/solr-mysolr.war/WEB-INF/web.xml[146,14]
at 
org.jboss.xb.binding.parser.sax.SaxJBossXBParser$MetaDataErrorHandler.error(SaxJBossXBParser.java:426)


Notice that the same .war and properties-services.xml works flawlessly
in JBoss 4.2.3

Any ideas?

Thank you very much.

L.M.


--
Sascha Szott
Kooperativer Bibliotheksverbund Berlin-Brandenburg (KOBV)
c/o Konrad-Zuse-Zentrum fuer Informationstechnik Berlin (ZIB)
Takustr. 7, D-14195 Berlin
Zimmer 4357
Telefon: (030) 841 85 - 457
Telefax: (030) 841 85 - 269
E-Mail: sz...@zib.de
WWW: http://www.kobv.de



Re: java.lang.NullPointerException with MySQL DataImportHandler

2010-02-02 Thread Sascha Szott

Hi,

can you post

* the output of MySQL's describe command for all tables/views referenced 
in your DIH configuration

* the DIH configuration file (i.e., data-config.xml)
* the schema definition (i.e., schema.xml)

-Sascha

Jean-Michel Philippon-Nadeau wrote:

Hi,

It is my first install of Solr. The setup has been pretty
straightforward and yet, the performance is very impressive.

I am running into an issue with my MySQL DataImportHandler. I've
followed the quick-start in order to write the necessary config and so
far everything seemed to work.

However, I am missing some fields in my index. I've switched all fields
to stored="true" temporarily in my schema to troubleshoot the issue. I
only have 3 fields listed in search results while I should have 8.

Could this be caused by ampersands or illegal entities in my database?
How can I see if DIH is importing correctly all my rows into the index?

What follows is the warning I have in my catalina.log.

Thank you very much,

Jean-Michel

===

Feb 2, 2010 12:21:07 AM org.apache.solr.handler.dataimport.SolrWriter
upload
WARNING: Error creating document :
SolrInputDocument[{keywords=keywords(1.0)={Dolce}, name=name(1.0)={Dolce
& Gabbana D&G Neckties designer Tie for men 543},
productID=productID(1.0)={220213}}]
java.lang.NullPointerException
 at
org.apache.lucene.util.StringHelper.intern(StringHelper.java:36)
 at org.apache.lucene.document.Field.<init>(Field.java:341)
 at org.apache.lucene.document.Field.<init>(Field.java:305)
 at
org.apache.solr.schema.FieldType.createField(FieldType.java:210)
 at
org.apache.solr.schema.SchemaField.createField(SchemaField.java:94)
 at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:246)
 at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
 at
org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:75)
 at org.apache.solr.handler.dataimport.DataImportHandler
$1.upload(DataImportHandler.java:292)
 at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:392)
 at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
 at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
 at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
 at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
 at org.apache.solr.handler.dataimport.DataImporter
$1.run(DataImporter.java:370)





Re: How to display Highlight with VelocityResponseWriter?

2010-01-13 Thread Sascha Szott
Hi Qiuyan,

> Thanks a lot. It works now. When i added the line
> #set($hl = $response.highlighting)
> i got the highlighting. But i wonder if there's any document that
> describes the usage of that. I mean i didn't know the name of those
> methods. Actually i just managed to guess it.
Solritas (aka VelocityResponseWriter) binds a number of objects into a so
called VelocityContext (consult [1] for a complete list). You can think of
a map that allows you to access objects by symbolic names, e.g., an
instance of QueryResponse is stored under response (that's why you write
$response in your template).

Since $response is an instance of QueryResponse you can call all methods
on it the API [2] provides. Furthermore, Velocity incorporates a
JavaBean-like introspection mechanism that lets you write
$response.highlighting instead of $response.getHighlighting() (only a bit
of syntactic sugar).

-Sascha

[1] http://wiki.apache.org/solr/VelocityResponseWriter#line-93
[2]
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/response/QueryResponse.html

> Quoting Sascha Szott :
>
>> Qiuyan,
>>
>>> with highlight can also be displayed in the web gui. I've added
>>> <str name="hl">true</str> into the standard responseHandler and it already
>>> works, i.e without velocity. But the same line doesn't take effect in
>>> itas. Should i configure anything else? Thanks in advance.
>> First of all, just a few notes on the /itas request handler in your
>> solrconfig.xml:
>>
>> 1. The entry
>>
>> <arr name="last-components">
>>   <str>highlight</str>
>> </arr>
>>
>> is obsolete, since the highlighting component is a default search
>> component [1].
>>
>> 2. Note that since you didn't specify a value for hl.fl highlighting
>> will only affect the fields listed inside of qf.
>>
>> 3. Why did you override the default value of hl.fragmenter? In most
>> cases the default fragmenting algorithm (gap) works fine - and maybe
>> in yours as well?
>>
>>
>> To make sure all your hl related settings are correct, can you post
>> an xml output (change the wt parameter to xml) for a search with
>> highlighted results.
>>
>> And finally, can you post the vtl code snippet that should produce
>> the highlighted output.
>>
>> -Sascha
>>
>> [1] http://wiki.apache.org/solr/SearchComponent
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>



Re: How to display Highlight with VelocityResponseWriter?

2010-01-11 Thread Sascha Szott

Qiuyan,


with highlight can also be displayed in the web gui. I've added
<str name="hl">true</str> into the standard responseHandler and it already
works, i.e. without velocity. But the same line doesn't take effect in
itas. Should i configure anything else? Thanks in advance.
First of all, just a few notes on the /itas request handler in your 
solrconfig.xml:


1. The entry


<arr name="last-components">
  <str>highlight</str>
</arr>


is obsolete, since the highlighting component is a default search 
component [1].


2. Note that since you didn't specify a value for hl.fl highlighting 
will only affect the fields listed inside of qf.


3. Why did you override the default value of hl.fragmenter? In most 
cases the default fragmenting algorithm (gap) works fine - and maybe in 
yours as well?



To make sure all your hl related settings are correct, can you post an 
xml output (change the wt parameter to xml) for a search with 
highlighted results.


And finally, can you post the vtl code snippet that should produce the 
highlighted output.


-Sascha

[1] http://wiki.apache.org/solr/SearchComponent








Re: solrJ and spell check queries

2010-01-03 Thread Sascha Szott

Hi,

Jay Fisher wrote:

I'm trying to find a way to formulate the following query in solrJ. This is
the only way I can get the desired result but I can't figure out how to get
solrJ to generate the same query string. It always generates a url that
starts with select and I need it to start with spell. If there is an
alternative url string that will work please let me know.

http://solr-server/spell/?indent=on&q=shert&wt=json&spellcheck=true&spellcheck.collate=true

In case you hook SpellCheckComponent directly into the standard request 
handler, i.e., /select,


http://solr-server/select?indent=on&q=shert&wt=json&spellcheck=true&spellcheck.collate=true

should work.
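
A sketch of that wiring in solrconfig.xml (assuming a searchComponent 
named "spellcheck" is already defined):

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>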

-Sascha




Re: how to do a Parent/Child Mapping using entities

2009-12-30 Thread Sascha Szott

Hi,


Thanks Sascha for your post, i find it interesting, but in my case i
don't want to use an additional field, i want to be able with the same
schema to do a simple query like: "q=res_url:some url", and a query like
the other one;
You could easily write your own query parser (QParserPlugin, in Solr's 
terminology) that internally translates queries like


 q = res_url:url AND res_rank:rank

into
q = res_ranked_url:"rank url"

thus hiding the res_ranked_url field from the user/client.

I'm not sure, but maybe it's possible to utilize the order of values 
within the multi-valued field res_url directly in the newly created 
parser. This seems like the cleanest solution to me.


-Sascha


in other words; is there any solution to make two or more multivalued fields
in the same document linked with each other, e.g.,
in this result:

<doc>
  <int name="id">1</int>
  <str name="keyword">Key1</str>
  <arr name="res_url">
    <str>url1</str>
    <str>url2</str>
    <str>url3</str>
    <str>url4</str>
  </arr>
  <arr name="res_rank">
    <int>1</int>
    <int>2</int>
    <int>3</int>
    <int>4</int>
  </arr>
</doc>

i would like to make solr understand that for this document, value:url1 of
"res_url" field is linked to value:1 of "res_rank" field, and all of them
are linked to the commen field "keyword".
I think that i should use a custom field analyser or something like that;
but i don't know what to do.

but thanks for all; and any supplied help will be lovable.


Sascha Szott wrote:


Hi,

you could create an additional index field res_ranked_url that contains
the concatenated value of an url and its corresponding rank, e.g.,

res_rank + " " + res_url

Then, q=res_ranked_url:"1 url1" retrieves all documents with url1 as the
first url.

A drawback of this workaround is that you have to use a phrase query
thus preventing wildcard searches for urls.

-Sascha



Hello everybody, i would like to know how to create index supporting a
parent/child mapping and then querying the child to get the results.
in other words; imagine that we have a database containing 2
tables: Keyword[id(int), value(string)] and Result[id(int), res_url(text),
res_text(text), res_date(date), res_rank(int)]
For indexing, i used the DataImportHandler to import data and it works
well, and my query response seems good (q=*:*) (imagine that we have only
these two keywords and their results):


<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <int name="id">1</int>
      <str name="keyword">Key1</str>
      <arr name="res_url">
        <str>url1</str>
        <str>url2</str>
        <str>url3</str>
        <str>url4</str>
      </arr>
      <arr name="res_rank">
        <int>1</int>
        <int>2</int>
        <int>3</int>
        <int>4</int>
      </arr>
    </doc>
    <doc>
      <int name="id">2</int>
      <str name="keyword">Key2</str>
      <arr name="res_url">
        <str>url1</str>
        <str>url5</str>
        <str>url8</str>
        <str>url7</str>
      </arr>
      <arr name="res_rank">
        <int>1</int>
        <int>2</int>
        <int>3</int>
        <int>4</int>
      </arr>
    </doc>
  </result>
</response>

but the problem is when i type a query kind of this: "q=res_url:url2 AND
res_rank:1", and this to say that i want to search for the keywords in
which the url (url2) is ranked at the first position, i have a result
like this:


<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">res_url:url2 AND res_rank:1</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <int name="id">1</int>
      <str name="keyword">Key1</str>
      <arr name="res_url">
        <str>url1</str>
        <str>url2</str>
        <str>url3</str>
        <str>url4</str>
      </arr>
      <arr name="res_rank">
        <int>1</int>
        <int>2</int>
        <int>3</int>
        <int>4</int>
      </arr>
    </doc>
  </result>
</response>


But this is not true; because the url present in the 1st position in the
results of the keyword "key1" is url1 and not url2.
So what i want to say is : is there any solution to make the values of
the
"multivalued" fields linked;
so in our case we can see that the previous result say that:
   - url1 is present in 1st position of "key1" results
   - url2 is present in 2nd position of "key1" results
   - url3 is present in 3rd position of "key1" results
   - url4 is present in 4th position of "key1" results

and i would like that solr consider this when executing queries.

Any helps please; and thanks for all :)







Re: how to do a Parent/Child Mapping using entities

2009-12-29 Thread Sascha Szott

Hi,

you could create an additional index field res_ranked_url that contains 
the concatenated value of an url and its corresponding rank, e.g.,


res_rank + " " + res_url

Then, q=res_ranked_url:"1 url1" retrieves all documents with url1 as the 
first url.


A drawback of this workaround is that you have to use a phrase query 
thus preventing wildcard searches for urls.
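
If the index is built with the DataImportHandler anyway, the combined 
field could be produced there, e.g. with a TemplateTransformer (entity 
name and query are hypothetical):

<entity name="result" transformer="TemplateTransformer"
        query="select id, res_url, res_rank from Result">
  <field column="res_ranked_url"
         template="${result.res_rank} ${result.res_url}"/>
</entity>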


-Sascha



Hello everybody, i would like to know how to create index supporting a
parent/child mapping and then querying the child to get the results.
in other words; imagine that we have a database containing 2
tables: Keyword[id(int), value(string)] and Result[id(int), res_url(text),
res_text(text), res_date(date), res_rank(int)]
For indexing, i used the DataImportHandler to import data and it works well,
and my query response seems good (q=*:*) (imagine that we have only these
two keywords and their results):

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <int name="id">1</int>
      <str name="keyword">Key1</str>
      <arr name="res_url">
        <str>url1</str>
        <str>url2</str>
        <str>url3</str>
        <str>url4</str>
      </arr>
      <arr name="res_rank">
        <int>1</int>
        <int>2</int>
        <int>3</int>
        <int>4</int>
      </arr>
    </doc>
    <doc>
      <int name="id">2</int>
      <str name="keyword">Key2</str>
      <arr name="res_url">
        <str>url1</str>
        <str>url5</str>
        <str>url8</str>
        <str>url7</str>
      </arr>
      <arr name="res_rank">
        <int>1</int>
        <int>2</int>
        <int>3</int>
        <int>4</int>
      </arr>
    </doc>
  </result>
</response>

but the problem is when i type a query kind of this: "q=res_url:url2 AND
res_rank:1", and this to say that i want to search for the keywords in which
the url (url2) is ranked at the first position, i have a result like this:


<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">res_url:url2 AND res_rank:1</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <int name="id">1</int>
      <str name="keyword">Key1</str>
      <arr name="res_url">
        <str>url1</str>
        <str>url2</str>
        <str>url3</str>
        <str>url4</str>
      </arr>
      <arr name="res_rank">
        <int>1</int>
        <int>2</int>
        <int>3</int>
        <int>4</int>
      </arr>
    </doc>
  </result>
</response>

But this is not true; because the url present in the 1st position in the
results of the keyword "key1" is url1 and not url2.
So what i want to say is : is there any solution to make the values of the
"multivalued" fields linked;
so in our case we can see that the previous result say that:
  - url1 is present in 1st position of "key1" results
  - url2 is present in 2nd position of "key1" results
  - url3 is present in 3rd position of "key1" results
  - url4 is present in 4th position of "key1" results

and i would like that solr consider this when executing queries.

Any helps please; and thanks for all :)




Re: Optimize not having any effect on my index

2009-12-18 Thread Sascha Szott

Hi Aleksander,

Aleksander Stensby wrote:

So i tried with curl:
curl http://server:8983/solr/update --data-binary '<optimize/>' -H
'Content-type:text/xml; charset=utf-8'

No difference here either... Am I doing anything wrong? Do i need to issue a
commit after the optimize?
Did you restart the Solr server instance after the optimize operation 
was completed?


BTW: You could initiate the optimization operation by POSTing 
optimize=true directly, i.e.,


curl http://server:8983/solr/update --form-string optimize=true


-Sascha



Re: Exception from Spellchecker

2009-12-15 Thread Sascha Szott

Hi Rafael,

Rafael Pappert wrote:

I try to enable the spellchecker in my 1.4.0 solr (running with tomcat 6 on 
debian).
But I always get the following exception, when I try to open 
http://localhost:8080/spell?:


The spellcheck=true parameter is missing in your request. Try

http://localhost:8080/spell?q=&spellcheck=true

-Sascha



RE: search on tomcat server

2009-12-07 Thread Sascha Szott
Hi Jill,

just to make sure your index contains at least one document, what is the
output of



Best,
Sascha

Jill Han wrote:
> In fact, I just followed the instructions titled "Tomcat On Windows".
> Here are the updates on my computer
> 1. -Dsolr.solr.home=C:\solr\example
> 2. change dataDir to C:\solr\example\data in
> solrconfig.xml at C:\solr\example\conf
> 3. created solr.xml at C:\Tomcat 5.5\conf\Catalina\localhost
> 
> <Context docBase="..." crossContext="true">
>   <Environment name="solr/home" type="java.lang.String" value="c:/solr/example" override="true"/>
> </Context>
> 
>
> I restarted Tomcat, went to http://localhost:8080/solr/admin/
> Entered video in Query String field, and got
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">0</int>
>     <lst name="params">
>       <str name="rows">10</str>
>       <str name="start">0</str>
>       <str name="indent">on</str>
>       <str name="q">video</str>
>       <str name="version">2.2</str>
>     </lst>
>   </lst>
>   <result numFound="0" start="0"/>
> </response>
> My questions are
> 1. is the setting correct?
> 2. where does Solr start searching for words entered in the Query String field?
> 3. how can I make the result page look like a general search result page,
> e.g., showing "not found", or a URL if found, instead of returning XML?
>
>
> Thanks a lot for your helps,
>
> Jill
>
> -Original Message-
> From: William Pierce [mailto:evalsi...@hotmail.com]
> Sent: Friday, December 04, 2009 12:56 PM
> To: solr-user@lucene.apache.org
> Subject: Re: search on tomcat server
>
> Have you gone through the solr tomcat wiki?
>
> http://wiki.apache.org/solr/SolrTomcat
>
> I found this very helpful when I did our solr installation on tomcat.
>
> - Bill
>
> --
> From: "Jill Han" 
> Sent: Friday, December 04, 2009 8:54 AM
> To: 
> Subject: RE: search on tomcat server
>
>> I went through all the links on
>> http://wiki.apache.org/solr/#Search_and_Indexing
>> and still have no clue as to how to proceed.
>> 1. do I have to do some implementation in order to get Solr to search
>> docs on the Tomcat server?
>> 2. if I have files, such as .doc, .docx, .pdf, .jsp, .html, etc. under
>> Windows XP in c:/tomcat/webapps/test1, /webapps/test2,
>>   what should I do to make Solr search those directories?
>> 3. since I am using Tomcat instead of Jetty, is there any demo that shows
>> the Solr searching features and a real search result?
>>
>> Thanks,
>> Jill
>>
>>
>> -Original Message-
>> From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
>> Sent: Monday, November 30, 2009 10:40 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: search on tomcat server
>>
>> On Mon, Nov 30, 2009 at 9:55 PM, Jill Han  wrote:
>>
>>> I got solr running on the tomcat server,
>>> http://localhost:8080/solr/admin/
>>>
>>> After I enter a search word, such as, solr, then hit Search button, it
>>> will go to
>>>
>>> http://localhost:8080/solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on
>>>
>>>  and display
>>>
>>> <response>
>>>   <lst name="responseHeader">
>>>     <int name="status">0</int>
>>>     <int name="QTime">0</int>
>>>     <lst name="params">
>>>       <str name="rows">10</str>
>>>       <str name="start">0</str>
>>>       <str name="indent">on</str>
>>>       <str name="q">solr</str>
>>>       <str name="version">2.2</str>
>>>     </lst>
>>>   </lst>
>>>   <result numFound="0" start="0"/>
>>> </response>
>>>
>>>  My question is what is the next step to search files on tomcat
>>> server?
>>>
>>>
>>>
>> Looks like you have not added any documents to Solr. See the "Indexing
>> Documents" section at http://wiki.apache.org/solr/#Search_and_Indexing
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>



How to instruct MoreLikeThisHandler to sort results

2009-12-03 Thread Sascha Szott

Hi Folks,

is there any way to instruct the MoreLikeThisHandler to sort results? I was 
wondering why the MLT handler recognizes faceting parameters, among others, 
but ignores the sort parameter.


Best,
Sascha



Re: Indexing file content with custom field

2009-12-02 Thread Sascha Szott

Piero,

it sounds like you're looking for an integration of Solr Cell and Solr's DIH 
facility -- a feature that isn't implemented yet (but the issue is 
already addressed in SOLR-1358).


As a workaround, you could store the extracted contents in plain text 
files (either by using Solr Cell or Apache Tika directly, which is under 
the hood of Solr Cell). Afterwards, you could use DIH's 
XPathEntityProcessor (to read the metadata in your XML files) in 
conjunction with DIH's PlainTextEntityProcessor (to read the previously 
created text files).


Another workaround would be to pass the metadata content as literal 
parameters along with the /update/extract request, as described in [1]. 
This would require you to write a small program that constructs and 
sends appropriate POST requests by parsing your XML metadata files.
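
For illustration, such an extract request could look like this (field names 
and file name are only examples; each literal.* parameter becomes an index 
field):

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&literal.title=My+Title&commit=true" -F "myfile=@document.pdf"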


Best,
Sascha

[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Literals

Rodolico Piero wrote:

Hi,

I need to index the contents of a file (doc, pdf, etc.) and a set of
custom metadata specified in XML, like a standard request to Solr.
From the documentation I can extract the contents of a file with the
request "/update/extract" (Tika) and index metadata with a second
request "/update" by passing the XML. How do I do it all in a single
request? (without using curl but using an http Java lib or SolrJ). For


  
 
 content of the extracted file (text) 

  

So I can search it either by metadata or by full text on the content.
Sorry for my English ...

Thanks a lot.

 


Piero

 








Re: Hierarchical xml

2009-12-02 Thread Sascha Szott

Pooja,

have a look at Solr's DataImportHandler. XPathEntityProcessor [1] should 
suit your needs.


Best,
Sascha

[1] http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor
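
For example, a minimal data-config.xml sketch for an XML file like the one 
quoted below (file path, entity name and tag names are only assumptions):

<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <entity name="person" processor="XPathEntityProcessor"
            url="/path/to/person.xml" forEach="/person">
      <field column="name" xpath="/person/name"/>
      <field column="collegename" xpath="/person/colleges/college/collegename"/>
      <field column="year" xpath="/person/colleges/college/year"/>
    </entity>
  </document>
</dataConfig>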

Pooja Verlani schrieb:

Hi,
I want to index an xml like the following:

<person>
  <name>John</name>
  <dob>1979-29-17T28:14:48Z</dob>
  <colleges>
    <college>
      <collegename>ABC College</collegename>
      <year>1998</year>
    </college>
    <college>
      <collegename>PQRS College</collegename>
      <year>2001</year>
    </college>
    <college>
      <collegename>XYZ College</collegename>
      <year>2003</year>
    </college>
  </colleges>
</person>


I am not able to judge what the schema should look like.
Also, if I flatten such an xml and make collegename & year multivalued,
like this:

collegename: ABC College, PQRS College, XYZ College
year: 1998, 2001, 2003

in such a scenario I can't make a correspondence between ABC College & year
1998.

In case someone has an efficient way out, do share.
Thanks in anticipation.

Regards,
Pooja





[Solved] Re: VelocityResponseWriter/Solritas character encoding issue

2009-11-27 Thread Sascha Szott

Hi Erik,

I've finally solved the problem. Unfortunately, the parameter 
v.contentType was not described in the Solr wiki (I've fixed that now). 
The point is, you must specify (in your solrconfig.xml)


   <str name="v.contentType">text/xml;charset=UTF-8</str>

in order to receive correctly UTF-8 encoded HTML. That's it!
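
For context, a sketch of where this setting might live in solrconfig.xml 
(handler name and the other defaults are only an example of a typical 
Solritas setup):

   <requestHandler name="/browse" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="wt">velocity</str>
       <str name="v.template">browse</str>
       <str name="v.contentType">text/xml;charset=UTF-8</str>
     </lst>
   </requestHandler>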

Best,
Sascha

Erik Hatcher schrieb:

Sascha,

Can you give me a test document that causes an issue?  (maybe send me a 
Solr XML document in private e-mail).   I'll see what I can do once I 
can see the issue first hand.


Erik


On Nov 18, 2009, at 2:48 PM, Sascha Szott wrote:


Hi,

I've played around with Solr's VelocityResponseWriter (which is indeed 
a very useful feature for rapid prototyping). I've realized that 
Velocity uses ISO-8859-1 as default character encoding. I've changed 
this setting to UTF-8 in my velocity.properties file (inside the conf 
directory), i.e.,


  input.encoding=UTF-8
  output.encoding=UTF-8

and checked that the settings were successfully loaded.

Within the main Velocity template, browse.vm, the character encoding 
is set to UTF-8 as well, i.e.,

   <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>

After starting Solr (which is deployed in a Tomcat 6 server on a 
Ubuntu machine), I ran into some character encoding problems.


Due to the change of input.encoding to UTF-8, no problems occur when 
non-ASCII characters are present in the query string, e.g. German 
umlauts. But unfortunately, something is wrong with the encoding of 
characters in the html page that is generated by 
VelocityResponseWriter. The non-ASCII characters aren't displayed 
properly (for example, FF prints a black diamond with a white question 
mark). If I manually set the encoding to ISO-8859-1, the non-ASCII 
characters are displayed correctly. Does anybody have a clue?


Thanks in advance,
Sascha






Re: VelocityResponseWriter/Solritas character encoding issue

2009-11-18 Thread Sascha Szott
Hi Lance,

Lance Norskog wrote:
> What platform are you using? Windows does not use UTF-8 by default,
> and this can cause subtle problems. If you can do the same thing on
> other platforms (Linux, Mac) that would help narrow down the problem.
My Solr server runs in a Tomcat server on a Ubuntu Linux machine.

-Sascha
>
> On Wed, Nov 18, 2009 at 8:15 AM, Sascha Szott  wrote:
>> Hi Erik,
>>
>> Erik Hatcher wrote:
>>>
>>> Can you give me a test document that causes an issue?  (maybe send me
>>> a
>>> Solr XML document in private e-mail).   I'll see what I can do once I
>>> can
>>> see the issue first hand.
>>
>> Thank you! Just try the utf8-example.xml file in the exampledoc
>> directory.
>> After having indexed the document, the output of the script test_utf8.sh
>> suggests to me that everything works correctly:
>>
>>  Solr server is up.
>>  HTTP GET is accepting UTF-8
>>  HTTP POST is accepting UTF-8
>>  HTTP POST does not default to UTF-8
>>  HTTP GET is accepting UTF-8 beyond the basic multilingual plane
>>  HTTP POST is accepting UTF-8 beyond the basic multilingual plane
>>  HTTP POST + URL params is accepting UTF-8 beyond the basic
>> multilingual
>>
>> If I'm using the standard QueryResponseWriter and the query q=umlauts,
>> the
>> responding xml page contains properly printed non-ASCII characters. The
>> same
>> query against the VelocityResponseWriter returns a lot of Unicode
>> replacement characters (u+FFFD) instead.
>>
>> -Sascha
>>
>>>
>>> On Nov 18, 2009, at 2:48 PM, Sascha Szott wrote:
>>>
>>>> Hi,
>>>>
>>>> I've played around with Solr's VelocityResponseWriter (which is indeed
>>>> a
>>>> very useful feature for rapid prototyping). I've realized that
>>>> Velocity uses
>>>> ISO-8859-1 as default character encoding. I've changed this setting to
>>>> UTF-8
>>>> in my velocity.properties file (inside the conf directory), i.e.,
>>>>
>>>>  input.encoding=UTF-8
>>>>  output.encoding=UTF-8
>>>>
>>>> and checked that the settings were successfully loaded.
>>>>
>>>> Within the main Velocity template, browse.vm, the character encoding
>>>> is
>>>> set to UTF-8 as well, i.e.,
>>>>
>>>>   <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
>>>>
>>>> After starting Solr (which is deployed in a Tomcat 6 server on a
>>>> Ubuntu
>>>> machine), I ran into some character encoding problems.
>>>>
>>>> Due to the change of input.encoding to UTF-8, no problems occur when
>>>> non-ASCII characters are present in the query string, e.g. German
>>>> umlauts.
>>>> But unfortunately, something is wrong with the encoding of characters
>>>> in the
>>>> html page that is generated by VelocityResponseWriter. The non-ASCII
>>>> characters aren't displayed properly (for example, FF prints a black
>>>> diamond
>>>> with a white question mark). If I manually set the encoding to
>>>> ISO-8859-1,
>>>> the non-ASCII characters are displayed correctly. Does anybody have a
>>>> clue?
>>>>
>>>> Thanks in advance,
>>>> Sascha
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>



Re: VelocityResponseWriter/Solritas character encoding issue

2009-11-18 Thread Sascha Szott

Hi Erik,

Erik Hatcher wrote:
Can you give me a test document that causes an issue?  (maybe send me a 
Solr XML document in private e-mail).   I'll see what I can do once I 
can see the issue first hand.
Thank you! Just try the utf8-example.xml file in the exampledoc 
directory. After having indexed the document, the output of the script 
test_utf8.sh suggests to me that everything works correctly:


 Solr server is up.
 HTTP GET is accepting UTF-8
 HTTP POST is accepting UTF-8
 HTTP POST does not default to UTF-8
 HTTP GET is accepting UTF-8 beyond the basic multilingual plane
 HTTP POST is accepting UTF-8 beyond the basic multilingual plane
 HTTP POST + URL params is accepting UTF-8 beyond the basic multilingual

If I'm using the standard QueryResponseWriter and the query q=umlauts, 
the responding xml page contains properly printed non-ASCII characters. 
The same query against the VelocityResponseWriter returns a lot of 
Unicode replacement characters (u+FFFD) instead.


-Sascha



On Nov 18, 2009, at 2:48 PM, Sascha Szott wrote:


Hi,

I've played around with Solr's VelocityResponseWriter (which is indeed 
a very useful feature for rapid prototyping). I've realized that 
Velocity uses ISO-8859-1 as default character encoding. I've changed 
this setting to UTF-8 in my velocity.properties file (inside the conf 
directory), i.e.,


  input.encoding=UTF-8
  output.encoding=UTF-8

and checked that the settings were successfully loaded.

Within the main Velocity template, browse.vm, the character encoding 
is set to UTF-8 as well, i.e.,

   <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>

After starting Solr (which is deployed in a Tomcat 6 server on a 
Ubuntu machine), I ran into some character encoding problems.


Due to the change of input.encoding to UTF-8, no problems occur when 
non-ASCII characters are present in the query string, e.g. German 
umlauts. But unfortunately, something is wrong with the encoding of 
characters in the html page that is generated by 
VelocityResponseWriter. The non-ASCII characters aren't displayed 
properly (for example, FF prints a black diamond with a white question 
mark). If I manually set the encoding to ISO-8859-1, the non-ASCII 
characters are displayed correctly. Does anybody have a clue?


Thanks in advance,
Sascha











VelocityResponseWriter/Solritas character encoding issue

2009-11-18 Thread Sascha Szott

Hi,

I've played around with Solr's VelocityResponseWriter (which is indeed a 
very useful feature for rapid prototyping). I've realized that Velocity 
uses ISO-8859-1 as default character encoding. I've changed this setting 
to UTF-8 in my velocity.properties file (inside the conf directory), i.e.,


   input.encoding=UTF-8
   output.encoding=UTF-8

and checked that the settings were successfully loaded.

Within the main Velocity template, browse.vm, the character encoding is 
set to UTF-8 as well, i.e.,

   <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>

After starting Solr (which is deployed in a Tomcat 6 server on a Ubuntu 
machine), I ran into some character encoding problems.


Due to the change of input.encoding to UTF-8, no problems occur when 
non-ASCII characters are present in the query string, e.g. German 
umlauts. But unfortunately, something is wrong with the encoding of 
characters in the html page that is generated by VelocityResponseWriter. 
The non-ASCII characters aren't displayed properly (for example, FF 
prints a black diamond with a white question mark). If I manually set 
the encoding to ISO-8859-1, the non-ASCII characters are displayed 
correctly. Does anybody have a clue?


Thanks in advance,
Sascha









Re: Indexing multiple documents in Solr/SolrCell

2009-11-17 Thread Sascha Szott

Kerwin,

Kerwin wrote:

Our approach is similar to what you have mentioned in the jira issue except
that we have all metadata in the xml and not in the database. I am therefore
using a custom XmlUpdateRequestHandler to parse the XML and then calling
Tika from within the XML Loader to parse the content. Until now this seems
to work.
When and in which Solr version do you expect the jira issue to be
addressed?
That's a good question. Since I'm not a Solr committer, I cannot give 
any estimate on when it will be released (hopefully in Solr 1.5).


-Sascha


On Mon, Nov 16, 2009 at 5:02 PM, Sascha Szott  wrote:


Hi,

the problem you've described -- an integration of DataImportHandler (to
traverse the XML file and get the document urls) and Solr Cell (to extract
content afterwards) -- is already addressed in issue SOLR-1358 (
https://issues.apache.org/jira/browse/SOLR-1358).

Best,
Sascha


Kerwin wrote:


Hi,

I am new to this forum and would like to know if the function described
below has been developed or exists in Solr. If it does not exist, is it a
good idea and can I contribute?

We need to index multiple documents with different formats. So we use Solr
with Tika (Solr Cell).

Question:
Can you index both metadata and content for multiple documents iteratively
in Solr?
For example, I have an XML with metadata and links to the documents'
content. There are many documents in this XML and I would like to index
them all without firing multiple URLs.

Example of XML

<documents>
  <document>
    <id>34122</id>
    <author>Michael</author>
    <size>3MB</size>
    <link>URL of the document</link>
  </document>
  ...
</documents>

I need to index all these documents by sending this XML in a single URL.
The collection of documents to be indexed could be on a file system.

I have altered the Solr code to be able to do this but is there an already
existing feature?






Re: Indexing multiple documents in Solr/SolrCell

2009-11-16 Thread Sascha Szott

Hi,

the problem you've described -- an integration of DataImportHandler (to 
traverse the XML file and get the document urls) and Solr Cell (to 
extract content afterwards) -- is already addressed in issue SOLR-1358 
(https://issues.apache.org/jira/browse/SOLR-1358).


Best,
Sascha

Kerwin wrote:

Hi,

I am new to this forum and would like to know if the function described
below has been developed or exists in Solr. If it does not exist, is it a
good idea and can I contribute?

We need to index multiple documents with different formats. So we use Solr
with Tika (Solr Cell).

Question:
Can you index both metadata and content for multiple documents iteratively
in Solr?
For example, I have an XML with metadata and links to the documents'
content. There are many documents in this XML and I would like to index them
all without firing multiple URLs.

Example of XML

<documents>
  <document>
    <id>34122</id>
    <author>Michael</author>
    <size>3MB</size>
    <link>URL of the document</link>
  </document>
  ...
</documents>

I need to index all these documents by sending this XML in a single URL. The
collection of documents to be indexed could be on a file system.

I have altered the Solr code to be able to do this but is there an already
existing feature?





Re: [DIH] concurrent requests to DIH

2009-11-12 Thread Sascha Szott
Hi Avlesh,

Avlesh Singh wrote:
>>
>> 1. Is it considered as good practice to set up several DIH request
>> handlers, one for each possible parameter value?
>>
> Nothing wrong with this. My assumption is that you want to do this to
> speed
> up indexing. Each DIH instance would block all others, once a Lucene
> commit
> for the former is performed.
Thanks for this clarification.

> 2. In case the range of parameter values is broad, it's not convenient to
>> define separate request handlers for each value. But this entails a
>> limitation (as far as I see): It is not possible to fire several requests
>> to the same DIH handler (with different parameter values) at the same
>> time.
>>
> Nope.
>
> I had done a similar exercise in my quest to write a
> ParallelDataImportHandler. This thread might be of interest to you -
> http://www.lucidimagination.com/search/document/a9b26ade46466ee/queries_regarding_a_paralleldataimporthandler.
> Though there is a ticket in JIRA, I haven't been able to contribute this
> back. If you think this is what you need, lemme know.
Actually, I've already read this thread. In my opinion, both support for
batch processing and multi-threading are important extensions of DIH's
current capabilities, though issue SOLR-1352 mainly targets the latter. Is
your PDIH implementation able to deal with batch processing right now?

Best,
Sascha

> On Thu, Nov 12, 2009 at 6:35 AM, Sascha Szott  wrote:
>
>> Hi all,
>>
>> I'm using the DIH in a parameterized way by passing request parameters
>> that are used inside of my data-config. All imports end up in the same
>> index.
>>
>> 1. Is it considered as good practice to set up several DIH request
>> handlers, one for each possible parameter value?
>>
>> 2. In case the range of parameter values is broad, it's not convenient
>> to
>> define separate request handlers for each value. But this entails a
>> limitation (as far as I see): It is not possible to fire several requests
>> to the same DIH handler (with different parameter values) at the same
>> time. However, in case several request handlers would be used (as in
>> 1.),
>> concurrent requests (to the different handlers) are possible. So, how to
>> overcome this limitation?
>>
>> Best,
>> Sascha
>>
>



Re: [DIH] blocking import operation

2009-11-12 Thread Sascha Szott
Noble Paul wrote:
Yes, open an issue. This is a trivial change.
I've opened JIRA issue SOLR-1554.

-Sascha

>
> On Thu, Nov 12, 2009 at 5:08 AM, Sascha Szott  wrote:
>> Noble,
>>
>> Noble Paul wrote:
>>> DIH imports are really long running. There is a good chance that the
>>> connection times out or breaks in between.
>> Yes, you're right, I missed that point (in my case imports take no
>> longer
>> than a minute).
>>
>>> how about a callback?
>> Thanks for the hint. There was a discussion on adding a callback url to
>> DIH a month ago, but it seems that no issue was raised. So, up to now it's
>> its
>> only possible to implement an appropriate Solr EventListener. Should we
>> open an issue for supporting callback urls?
>>
>> Best,
>> Sascha
>>
>>>
>>> On Tue, Nov 10, 2009 at 12:12 AM, Sascha Szott  wrote:
>>>> Hi all,
>>>>
>>>> currently, DIH's import operation(s) only works asynchronously.
>>>> Therefore,
>>>> after submitting an import request, DIH returns immediately, while the
>>>> import process (in case a large amount of data needs to be indexed)
>>>> continues asynchronously behind the scenes.
>>>>
>>>> So, what is the recommended way to check if the import process has
>>>> already
>>>> finished? Or still better, is there any method / workaround that will
>>>> block
>>>> the import operation's caller until the operation has finished?
>>>>
>>>> In my application, the DIH receives some URL parameters which are used
>>>> for
>>>> determining the database name that is used within data-config.xml,
>>>> e.g.
>>>>
>>>> http://localhost:8983/solr/dataimport?command=full-import&dbname=foo
>>>>
>>>> Since only one DIH, /dataimport, is defined, but several databases need
>>>> to be indexed, it is required to issue this command several times, e.g.
>>>>
>>>> http://localhost:8983/solr/dataimport?command=full-import&dbname=foo
>>>>
>>>> ... wait until /dataimport?command=status says "Indexing completed"
>>>> (but
>>>> without using a loop that checks it again and again) ...
>>>>
>>>> http://localhost:8983/solr/dataimport?command=full-import&dbname=bar&clean=false
>>>>
>>>>
>>>> A suitable solution, at least IMHO, would be to have an additional DIH
>>>> parameter which determines whether the import call is blocking or
>>>> non-blocking (the latter being the default). As far as I see, this could be accomplished
>>>> since
>>>> Solr can execute more than one import operation at a time (it starts a
>>>> new
>>>> thread for each). Perhaps, my question is somehow related to the
>>>> discussion
>>>> [1] on ParallelDataImportHandler.
>>>>
>>>> Best,
>>>> Sascha
>>>>
>>>> [1] http://www.lucidimagination.com/search/document/a9b26ade46466ee
>>>>
>>
>
>
>
> --
> -
> Noble Paul | Principal Engineer| AOL | http://aol.com
>



[DIH] concurrent requests to DIH

2009-11-11 Thread Sascha Szott
Hi all,

I'm using the DIH in a parameterized way by passing request parameters
that are used inside of my data-config. All imports end up in the same
index.

1. Is it considered as good practice to set up several DIH request
handlers, one for each possible parameter value? (see the sketch below)

2. In case the range of parameter values is broad, it's not convenient to
define separate request handlers for each value. But this entails a
limitation (as far as I see): It is not possible to fire several requests
to the same DIH handler (with different parameter values) at the same
time. However, in case several request handlers would be used (as in 1.),
concurrent requests (to the different handlers) are possible. So, how to
overcome this limitation?
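
For illustration, the multiple-handler setup from point 1 might look like
this in solrconfig.xml (handler names and the dbname default are only
examples; data-config.xml would then refer to ${dataimporter.request.dbname}):

<requestHandler name="/dataimport-foo"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="dbname">foo</str>
  </lst>
</requestHandler>

<requestHandler name="/dataimport-bar"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="dbname">bar</str>
  </lst>
</requestHandler>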

Best,
Sascha


Re: [DIH] blocking import operation

2009-11-11 Thread Sascha Szott
Noble,

Noble Paul wrote:
> DIH imports are really long running. There is a good chance that the
> connection times out or breaks in between.
Yes, you're right, I missed that point (in my case imports take no longer
than a minute).

> how about a callback?
Thanks for the hint. There was a discussion on adding a callback url to
DIH a month ago, but it seems that no issue was raised. So, up to now it's
only possible to implement an appropriate Solr EventListener. Should we
open an issue for supporting callback urls?

Best,
Sascha

>
> On Tue, Nov 10, 2009 at 12:12 AM, Sascha Szott  wrote:
>> Hi all,
>>
>> currently, DIH's import operation(s) only works asynchronously.
>> Therefore,
>> after submitting an import request, DIH returns immediately, while the
>> import process (in case a large amount of data needs to be indexed)
>> continues asynchronously behind the scenes.
>>
>> So, what is the recommended way to check if the import process has
>> already
>> finished? Or still better, is there any method / workaround that will
>> block
>> the import operation's caller until the operation has finished?
>>
>> In my application, the DIH receives some URL parameters which are used
>> for
>> determining the database name that is used within data-config.xml, e.g.
>>
>> http://localhost:8983/solr/dataimport?command=full-import&dbname=foo
>>
>> Since only one DIH, /dataimport, is defined, but several databases need
>> to be indexed, it is required to issue this command several times, e.g.
>>
>> http://localhost:8983/solr/dataimport?command=full-import&dbname=foo
>>
>> ... wait until /dataimport?command=status says "Indexing completed" (but
>> without using a loop that checks it again and again) ...
>>
>> http://localhost:8983/solr/dataimport?command=full-import&dbname=bar&clean=false
>>
>>
>> A suitable solution, at least IMHO, would be to have an additional DIH
>> parameter which determines whether the import call is blocking or
>> non-blocking (the latter being the default). As far as I see, this could be accomplished
>> since
>> Solr can execute more than one import operation at a time (it starts a
>> new
>> thread for each). Perhaps, my question is somehow related to the
>> discussion
>> [1] on ParallelDataImportHandler.
>>
>> Best,
>> Sascha
>>
>> [1] http://www.lucidimagination.com/search/document/a9b26ade46466ee
>>


[DIH] blocking import operation

2009-11-09 Thread Sascha Szott

Hi all,

currently, DIH's import operation(s) only works asynchronously. 
Therefore, after submitting an import request, DIH returns immediately, 
while the import process (in case a large amount of data needs to be 
indexed) continues asynchronously behind the scenes.


So, what is the recommended way to check if the import process has 
already finished? Or still better, is there any method / workaround that 
will block the import operation's caller until the operation has finished?


In my application, the DIH receives some URL parameters which are used 
for determining the database name that is used within data-config.xml, e.g.


http://localhost:8983/solr/dataimport?command=full-import&dbname=foo

Since only one DIH, /dataimport, is defined, but several databases need 
to be indexed, it is required to issue this command several times, e.g.


http://localhost:8983/solr/dataimport?command=full-import&dbname=foo

... wait until /dataimport?command=status says "Indexing completed" (but 
without using a loop that checks it again and again) ...


http://localhost:8983/solr/dataimport?command=full-import&dbname=bar&clean=false


A suitable solution, at least IMHO, would be to have an additional DIH 
parameter which determines whether the import call is blocking or 
non-blocking (the latter being the default). As far as I see, this could be accomplished 
since Solr can execute more than one import operation at a time (it 
starts a new thread for each). Perhaps, my question is somehow related 
to the discussion [1] on ParallelDataImportHandler.


Best,
Sascha

[1] http://www.lucidimagination.com/search/document/a9b26ade46466ee



Re: [DIH] SqlEntityProcessor does not recognize onError attribute

2009-11-09 Thread Sascha Szott

Hi,

Noble Paul നോബിള്‍ नोब्ळ् wrote:

On Mon, Nov 9, 2009 at 4:24 PM, Sascha Szott  wrote:

Hi all,

as stated in the Solr wiki, Solr 1.4 allows one to specify an onError
attribute for *each* entity listed in the data config file (it is considered
as one of the default attributes).

Unfortunately, the SqlEntityProcessor does not recognize the attribute's
value -- i.e., in case an SQL exception is thrown somewhere inside the
constructor of ResultSetIterators (which is an inner class of
JdbcDataSource), Solr's import exits immediately, even though onError is set
to continue or skip.

Why are database related exceptions (e.g., table does not exist, or an
error in query syntax occurs) not being covered by the onError attribute? In
my opinion, use cases exist that will profit from such an exception handling
inside of Solr (for example, in cases where the existence of certain
database tables or views is not predictable).

We thought DB errors are not to be ignored because errors such as
table does not exist can be really serious.
In principle, I agree with you, though I would consider it as a 
programmer's responsibility to be aware of it (in case he/she sets 
onError to skip or continue).



Should I raise a JIRA issue about this?

Raise an issue; it can be fixed.

I've created issue SOLR-1549.

Best,
Sascha



[DIH] SqlEntityProcessor does not recognize onError attribute

2009-11-09 Thread Sascha Szott

Hi all,

as stated in the Solr wiki, Solr 1.4 allows one to specify an onError 
attribute for *each* entity listed in the data config file (it is 
considered as one of the default attributes).


Unfortunately, the SqlEntityProcessor does not recognize the attribute's 
value -- i.e., in case an SQL exception is thrown somewhere inside the 
constructor of ResultSetIterators (which is an inner class of 
JdbcDataSource), Solr's import exits immediately, even though onError is 
set to continue or skip.


Why are database related exceptions (e.g., table does not exist, or an 
error in query syntax occurs) not being covered by the onError 
attribute? In my opinion, use cases exist that will profit from such an 
exception handling inside of Solr (for example, in cases where the 
existence of certain database tables or views is not predictable).
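
For illustration, this is how the onError attribute is meant to be attached 
to an entity in data-config.xml (entity name and query are hypothetical):

<entity name="optionalData" query="select * from OPTIONAL_VIEW" onError="skip">
  ...
</entity>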


Should I raise a JIRA issue about this?

-Sascha




Re: How to use DataImportHandler with ExtractingRequestHandler?

2009-09-03 Thread Sascha Szott

Hi Khai,

a few weeks ago, I was facing the same problem.

In my case, this workaround helped (assuming you're using Solr 1.3): 
For each row, extract the content from the corresponding pdf file using 
a parser library of your choice (I suggest Apache PDFBox or Apache Tika 
in case you need to process other file types as well), put it between

   <root> and </root>

and store it in a text file. To keep the relationship between a file and 
its corresponding database row, use the primary key as the file name.


Within data-config.xml use the XPathEntityProcessor as follows (replace 
dbRow and primaryKey respectively):

<entity name="fileContent" dataSource="file" processor="XPathEntityProcessor"
        forEach="/root" url="/path/to/textfiles/${dbRow.primaryKey}.xml">
  <field column="text" xpath="/root"/>
</entity>

And, by the way, in Solr 1.4 you do not have to put your content between 
xml tags: use the PlainTextEntityProcessor instead of XPathEntityProcessor.
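
A minimal sketch of the Solr 1.4 variant (names are again only examples; 
PlainTextEntityProcessor reads the whole file into its implicit plainText 
column):

<entity name="fileContent" dataSource="file"
        processor="PlainTextEntityProcessor"
        url="/path/to/textfiles/${dbRow.primaryKey}.txt">
  <field column="plainText" name="text"/>
</entity>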


Best,
Sascha

Khai Doan schrieb:

Hi all,

My name is Khai.  I have a table in a relational database.  I have
successfully used DataImportHandler to import this data into Apache Solr.
However, one of the columns stores the location of a PDF file.  How can I
configure DataImportHandler to use ExtractingRequestHandler to extract the
content of the PDF?

Thanks!

Khai Doan





Re: Building documents using content residing both in database tables and text files

2009-08-11 Thread Sascha Szott

Hi Noble,

Noble Paul wrote:

isn't it possible to do this by having two datasources (one Jdbc and
another File) and two entities. The outer entity can read from a DB
and the inner entity can read from a file.

Yes, it is. Here's my db-data-config.xml file (paths and credentials 
elided):

<dataConfig>
  <dataSource name="db" driver="..." url="..." user="..." password="..."/>
  <dataSource name="file" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="record" dataSource="db" query="select ID, TITLE from T">
      <field column="ID" name="id"/>
      <field column="TITLE" name="title"/>
      <entity name="content" dataSource="file" processor="XPathEntityProcessor"
              forEach="/root" url="/path/to/${record.ID}/content.xml">
        <field column="text" xpath="/root"/>
      </entity>
    </entity>
  </document>
</dataConfig>

Only one additional adjustment has to be made: Since I'm using Solr 1.3 
and it comes without PlainTextEntityProcessor, I have to transform my 
plain text files into xml files by surrounding the content with a root 
element. That's all!



On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott wrote:

Hello,

is it possible (and if it is, how can I accomplish it) to configure DIH to
build up index documents by using content that resides in different data
sources?

Here is an example scenario:
Let's assume we have a table T with two columns, ID (which is the primary
key of T) and TITLE. Furthermore, each record in T is assigned a directory
containing text files that were generated out of pdf documents by using
Tika. A directory name is built by using the ID of the record in T
associated to that directory, e.g. all text files associated to a record
with id = 101 are stored in directory 101.

Is there a way to configure DIH such that it uses ID, TITLE and the content
of all related text files when building a document (the documents should
have three fields: id, title, and text)?

Furthermore, as you may have noticed, a second question arises naturally:
Will there be any integration of Solr Cell and DIH in an upcoming release,
so that it would be possible to directly use the pdf documents instead of
the extracted text files that were generated outside of Solr?


This is something I wish to see. But there has been no user request
yet. You can raise an issue and it can be looked at.

I've raised issue SOLR-1358.

Best,
Sascha



Building documents using content residing both in database tables and text files

2009-08-11 Thread Sascha Szott

Hello,

is it possible (and if it is, how can I accomplish it) to configure DIH 
to build up index documents by using content that resides in different 
data sources?


Here is an example scenario:
Let's assume we have a table T with two columns, ID (which is the 
primary key of T) and TITLE. Furthermore, each record in T is assigned a 
directory containing text files that were generated out of pdf documents 
by using Tika. A directory name is built by using the ID of the record 
in T associated to that directory, e.g. all text files associated to a 
record with id = 101 are stored in directory 101.


Is there a way to configure DIH such that it uses ID, TITLE and the 
content of all related text files when building a document (the 
documents should have three fields: id, title, and text)?


Furthermore, as you may have noticed, a second question arises 
naturally: Will there be any integration of Solr Cell and DIH in an 
upcoming release, so that it would be possible to directly use the pdf 
documents instead of the extracted text files that were generated 
outside of Solr?


Best,
Sascha