Re: [ANNOUNCE] Solr wiki editing change

2013-03-25 Thread Andrzej Bialecki

On 3/25/13 4:18 AM, Steve Rowe wrote:

The wiki at http://wiki.apache.org/solr/ has come under attack by spammers more 
frequently of late, so the PMC has decided to lock it down in an attempt to 
reduce the work involved in tracking and removing spam.

 From now on, only people who appear on 
http://wiki.apache.org/solr/ContributorsGroup will be able to 
create/modify/delete wiki pages.

Please request either on the solr-user@lucene.apache.org or on 
d...@lucene.apache.org to have your wiki username added to the 
ContributorsGroup page - this is a one-time step.


Please add AndrzejBialecki to this group. Thank you!

--
Best regards,
Andrzej Bialecki
http://www.sigram.com, blog http://www.sigram.com/blog
 ___.,___,___,___,_._. __<><
[___||.__|__/|__||\/|: Information Retrieval, System Integration
___|||__||..\|..||..|: Contact: info at sigram dot com



Re: What is the "docs" number in Solr explain query results for fieldnorm?

2012-05-25 Thread Andrzej Bialecki

On 25/05/2012 20:13, Tom Burton-West wrote:

Hello all,

I am trying to understand the output of Solr explain for a one word query.
I am querying on the "ocr" field with no stemming/synonyms or stopwords.
And no query or index time boosting.

The query is "ocr:the"

The document (result below) which contains two words, "The Aeroplane", scores
higher than documents with 50 or more occurrences of the word "the".
Since the idf is the same I am assuming this is a result of length norms.

The explain (debugQuery) shows the following for fieldnorm:
  0.625 = fieldNorm(field=ocr, doc=16624)
What does the "doc=16624" mean? It certainly cannot represent the
length of the field (as an integer), since there are only two terms in the
field.
It can't represent the number of docs with the query term either (the idf output
shows the word "the" occurs in 16,219 docs).


Hi Tom,

This is an internal document number within a Lucene index. This number 
is not useful at the Solr API level, because you can't use it there to 
actually do anything. At the Lucene level (e.g. in Luke) you could 
navigate to this number and, for example, retrieve the stored fields of this 
document.


As shown in the Explanations, it can only be used to correlate the 
parts of the query that matched the same document.
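
For illustration, a minimal Lucene-level sketch of looking up the stored fields
behind such a number (assuming Lucene 3.x, a hypothetical index path and a
stored "id" field; note that internal doc numbers are only stable until the
next merge/commit):

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class PeekAtDoc {
  public static void main(String[] args) throws Exception {
    // Open the index read-only; the path is a placeholder.
    IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/solr/data/index")));
    try {
      Document doc = reader.document(16624);  // internal doc number taken from the Explanation
      System.out.println(doc.get("id"));      // print a stored field, e.g. the unique key
    } finally {
      reader.close();
    }
  }
}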


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Andrzej Bialecki

On 08/02/2012 09:17, Ted Dunning wrote:

This is true with Lucene as it stands.  It would be much faster if there
were a specialized in-memory index such as is typically used with high
performance search engines.


This could be implemented in Lucene trunk as a Codec. The challenge 
though is to come up with the right data structures.


There has been some interesting research on optimizations for in-memory 
inverted indexes, but it usually involves changing the query evaluation 
algos as well - for reference:


http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/1202502
http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf
http://research.google.com/pubs/archive/37365.pdf

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Solr Lucene Index Version

2011-12-08 Thread Andrzej Bialecki

On 08/12/2011 14:50, Jamie Johnson wrote:

Mark,

Agreed that Replication wouldn't help, I was dreaming that there was
some intermediate format used in replication.

Ideally you are right, I could just reindex the data and go on with
life, but my case is not so simple.  Currently we have some set of
processes which is run against the raw artifact to index things of
interest within the text document.  I don't believe (and I need to
check with the folks who wrote this) that I have an easy way to do
this currently but this would be my preference.

Andrzej,

Isn't the codec stuff merged with trunk now?  Admittedly I know very
little about Lucene's index format but I'd be willing to be a guinea
pig if you needed a tester.


The bulk of the work described in LUCENE-2621 has been done by Robert Muir 
(big thanks!!) and merged with trunk, but I think there may still be 
some parts missing - see LUCENE-3622.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Solr Lucene Index Version

2011-12-08 Thread Andrzej Bialecki

On 08/12/2011 05:00, Mark Miller wrote:

Replication just copies the index, so I'm not sure how this would help offhand?

With SolrCloud this is a breeze - just fire up another replica for a shard and 
the current index will replicate to it.

If you were willing to export the data to some portable format and then pull 
it back in, why not just store the original data and reindex?


This was actually one of the situations that motivated that jira issue - 
there are scenarios where reindexing, or keeping the original data, is 
very costly, in terms of space, time, I/O, pre-processing costs, 
curating, merging, etc, etc...


The good news is that once the recent work on the codecs is merged with 
the trunk, we can revisit this issue and implement it with much less 
effort than before - we could even start by modifying SimpleTextCodec to 
be more lenient, and proceed from there.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Backup with lukeall XMLExporter.

2011-10-05 Thread Andrzej Bialecki

On 05/10/2011 19:21, Luis Cappa Banda wrote:

Hello.

I've been looking for information trying to find an easy way to do index
backups with Solr, and I've read that lukeall has an application called
XMLExporter that creates an XML dump from a Lucene index with its complete
information. I've got some questions about this alternative:

1. Does it also contain the information from fields configured as
stored=false?
2. Can I load this generated XML file with curl to reindex? If not, is there any
other solution?

Thank you very much.



It does not provide a complete copy of the index information, it only 
dumps general information about the index plus the stored fields of 
documents. Non-stored fields are not available. There is no counterpart 
tool to take this XML dump and turn it into an index.


I'm working on a tool like what you had in mind, and I will be 
presenting results of this work at the Eurocon in Barcelona. However, 
it's still very much incomplete, and it depends on cutting edge features 
(LUCENE-2621).


In any case, if you're using Lucene then you can safely take a backup of 
the index if it's open readonly. With Solr you can use the replication 
mechanism to pull in a copy of the index from a running Solr instance.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Can I delete the stored value?

2011-07-10 Thread Andrzej Bialecki

On 7/10/11 2:33 PM, Simon Willnauer wrote:

Currently there is no easy way to do this. I would need to think about how
you could force the index to drop those, so the answer here is: no, you
can't!

simon

On Sat, Jul 9, 2011 at 11:11 AM, Gabriele Kahlout
  wrote:

I've stored the contents of some pages I no longer need. How can I now
delete the stored content without re-crawling the pages (i.e. using
updateDocument)? I cannot just remove the field, since I still want the
field to be indexed; I just don't want to store anything with it.
My understanding is that field.setValue("") won't do, since that would
affect the indexed value as well.


You could pump the content of your index through a FilterIndexReader - 
i.e. implement a subclass of FilterIndexReader that removes stored 
fields under some conditions, and then use IndexWriter.addIndexes with 
this reader.


See LUCENE-1812 for another practical application of this concept.
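
A rough sketch of the FilterIndexReader idea against the Lucene 3.x API (the
class name is hypothetical; treat this as an outline, not a tested tool):

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;

// Wraps an IndexReader and hides selected stored fields;
// indexed (inverted) data passes through untouched.
public class StoredFieldStrippingReader extends FilterIndexReader {
  private final Set<String> fieldsToStrip;

  public StoredFieldStrippingReader(IndexReader in, Set<String> fieldsToStrip) {
    super(in);
    this.fieldsToStrip = fieldsToStrip;
  }

  @Override
  public Document document(int n, FieldSelector selector) throws IOException {
    Document original = super.document(n, selector);
    Document stripped = new Document();
    for (Fieldable f : original.getFields()) {
      if (!fieldsToStrip.contains(f.name())) {
        stripped.add(f);   // keep everything except the stored fields we want to drop
      }
    }
    return stripped;
  }
}

You would then write the result to a new index with IndexWriter.addIndexes(...),
passing the wrapped reader, as described above.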

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Feed index with analyzer output

2011-07-05 Thread Andrzej Bialecki

On 7/5/11 1:37 PM, Lox wrote:

Ok,

the very short question is:
Is there a way to submit the analyzer response so that Solr already knows
what to do with that response? (that is, which fields are to be treated as
payloads, which are tokens, etc...)


Check this issue: http://issues.apache.org/jira/browse/SOLR-1535


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Performance loss - querying more than 64 cores (randomly)

2011-06-16 Thread Andrzej Bialecki

On 6/16/11 5:31 PM, Mark Schoy wrote:

Thanks for your answers.

Andrzej was right with his assumption. Solr only needs about 9GB memory but
the system needs the rest of it for disc IO:

64 cores: 64*100MB index size = 6.4GB + 9GB Solr cache + about 600MB OS =
16GB

Conclusion: My system can buffer the data of exactly 64 cores. Every
additional core can't be buffered and performance decreases.


Glad to be of help... You could formulate this conclusion in a different 
way, too: if you specify too large a heap size then you stifle the OS 
disk buffers - Solr won't be able to use that excess memory, yet it also 
won't be available for OS-level disk IO. Therefore reducing the heap 
size may actually increase your performance.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Performance loss - querying more than 64 cores (randomly)

2011-06-16 Thread Andrzej Bialecki

On 6/16/11 3:22 PM, Mark Schoy wrote:

Hi,

I set up a Solr instance with 512 cores. Each core has 100k documents and 15
fields. Solr is running on a CPU with 4 cores (2.7Ghz) and 16GB RAM.

Now I've done some benchmarks with JMeter. On each thread iteration JMeter
queries another core at random. Here are the results (duration: 180 seconds
each):

Randomly queried cores | queries per second
1 | 2016
2 | 2001
4 | 1978
8 | 1958
16 | 2047
32 | 1959
64 | 1879
128 | 1446
256 | 1009
512 | 428

Why are the queries per second constant up to 64 cores, and why does
performance then decrease rapidly?

Solr only uses 10GB of the 16GB memory so I think it is not a memory issue.



This may be an OS-level disk buffer issue. With limited disk buffer 
space, the more random IO occurs from different files, the higher the 
churn rate, and if the buffers are full then the churn rate may increase 
dramatically (and performance will drop). Modern OSes try to 
keep as much data in memory as possible, so the memory usage itself is 
not that informative - but check what the page-in/page-out rates are when 
you start hitting 32 vs. 64 cores.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Lucid Works

2011-04-08 Thread Andrzej Bialecki

On 4/8/11 9:55 PM, Andy wrote:


--- On Fri, 4/8/11, Andrzej Bialecki  wrote:



:) If you don't need the new functionality in 4.x, you don't
need the performance improvements,


What performance improvements does 4.x have over 3.1?


Ah... well, many - take a look at the CHANGES.txt.




reindexing cycles are long (indexes tend to stay around)
then 3.1 is a safer bet. If you need a dozen or so new
exciting features (e.g. results grouping) or top
performance, or if you need LucidWorks with Click and other
goodies, then use 4.x and be prepared for an occasional full
reindex.


So using 4.x would require occasional full reindex but using 3.1 would not? 
Could you explain? I thought 4.x comes with NRT indexing. So why is full 
reindex necessary?


Well, as long as you don't upgrade then of course the index format 
is stable and you can manage it incrementally. But across upgrades the 
4.x index format is not stable - if you upgrade to a 
newer Lucene / LucidWorks of 4.x vintage, it may be the case that 
even though the indexes before and after the upgrade are both of 4.x vintage, 
they are still incompatible.


At some point there may be tools to transparently convert indexes from 
one 4.x to another 4.x format, but they are not there yet.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Lucid Works

2011-04-08 Thread Andrzej Bialecki

On 4/8/11 4:58 PM, Mark wrote:

Doesn't look like you allow new members to post questions in that forum.


There's a "Create new account" link there, you simply need to register 
and log in.




I have just one last question ;)

We are deciding whether to upgrade our 1.4 production environment to 4.x
or 3.1. What were you decisions when deciding to release 4.x over 3.1?


Based on the details that you provided I'd say "it depends" :) If you 
don't need the new functionality in 4.x, you don't need the performance 
improvements, and if your full reindexing cycles are long (indexes tend 
to stay around) then 3.1 is a safer bet. If you need a dozen or so new 
exciting features (e.g. results grouping) or top performance, or if you 
need LucidWorks with Click and other goodies, then use 4.x and be 
prepared for an occasional full reindex.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Lucid Works

2011-04-08 Thread Andrzej Bialecki

On 4/7/11 10:16 PM, Mark wrote:

Andrzej,

Thanks for the info. I have a question regarding stability though. How
are you able to guarantee the stability of this release when 4.0 is
still a work in progress? I believe the last version Lucid released was
1.4 so why did you choose to release a 4.x version as opposed to 3.1?


To include all the goodies from 4.0, of course ;) LucidWorks uses a 
version from the trunk that behaves well in tests and with necessary 
patches applied - see also below.




Is the source code included with your distribution so that we may be
able to do some further patching on it?


Yes, after installing it's in solr-src/ . So if any issue pops up you 
can apply a patch, recompile the libs and replace them.




Thanks again and hopefully I'll be joining you at that conference.


Great :)

PS. Questions like this are best asked on the Lucid forum 
http://www.lucidimagination.com/forum/ .


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Lucid Works

2011-04-07 Thread Andrzej Bialecki

On 4/7/11 9:43 PM, Mark wrote:

I noticed that the Lucid Works distribution now says it is up to date with 4.X
versions. Does this mean 1.4 or 4.0/trunk?

If it's truly 4.0, does that mean it includes the collapse component?


Yes it does.


Also, are the click scoring tools proprietary or were they just a
contrib/patch that was applied?


At the moment it's proprietary. I will have a talk at the Lucene 
Revolution conference that describes the Click tools in detail.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Detecting an empty index during start-up

2011-03-25 Thread Andrzej Bialecki

On 3/25/11 11:25 AM, David McLaughlin wrote:

Thanks Chris. I dug into the SolrCore code and after reading some of the
code I ended up going with core.getNewestSearcher(true) and this fixed the
problem.


FYI, openNew=true is not implemented and can result in an 
UnsupportedOperationException. For now it's better to pass openNew=false 
and be prepared to get a null.
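
A minimal sketch of that pattern (assuming you already have a handle on the
SolrCore; the wrapper class and method names here are hypothetical):

import org.apache.solr.core.SolrCore;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.RefCounted;

public class IndexEmptyCheck {
  /** Returns true if the core's newest searcher sees an empty index (or no searcher exists yet). */
  public static boolean isEmpty(SolrCore core) {
    RefCounted<SolrIndexSearcher> ref = core.getNewestSearcher(false); // openNew=false, see above
    if (ref == null) {
      return true; // no searcher yet - treat as empty / not ready
    }
    try {
      return ref.get().getIndexReader().numDocs() == 0;
    } finally {
      ref.decref(); // always release the reference
    }
  }
}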


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: a bug of solr distributed search

2010-10-25 Thread Andrzej Bialecki
On 2010-10-25 13:37, Toke Eskildsen wrote:
> On Mon, 2010-10-25 at 11:50 +0200, Andrzej Bialecki wrote:
>> * there is an exact solution to this problem, namely to make two
>> distributed calls instead of one (first call to collect per-shard IDFs
>> for given query terms, second call to submit a query rewritten with the
>> global IDF-s). This solution is implemented in SOLR-1632, with some
>> caching to reduce the cost for common queries.
> 
> I must admit that I have not tried the patch myself. Looking at
> https://issues.apache.org/jira/browse/SOLR-1632
> i see that the last comment is from LiLi with a failed patch, but as
> there are no further comments it is unclear if the problem is general or
> just with LiLi's setup. I might be a bit harsh here, but the other
> comments for the JIRA issue also indicate that one would have to be
> somewhat adventurous to run this in production. 

Oh, definitely this is not production quality yet - there are known
bugs, for example, that I need to fix, and then it needs to be
forward-ported to trunk. It shouldn't be too much work to bring it back
into usable state.

>> * another reason is that in many many cases the difference between using
>> exact global IDF and per-shard IDFs is not that significant. If shards
>> are more or less homogenous (e.g. you assign documents to shards by
>> hash(docId)) then term distributions will be also similar.
> 
> While I agree on the validity of the solution, it does put some serious
> constraints on the shard-setup.

True. But this is the simplest setup that just may be enough.

> 
>> To summarize, I would qualify your statement with: "...if the
>> composition of your shards is drastically different". Otherwise the cost
>> of using global IDF is not worth it, IMHO.
> 
> Do you know of any studies of the differences in ranking with regard to
> indexing-distribution by hashing, logical grouping and distributed IDF?

Unfortunately, this information is surprisingly scarce - research
predating year 2000 is often not applicable, and most current research
concentrates on P2P systems, which are really a different ball of wax.
Here are a few papers that I found that are related to this issue:

* Global Term Weights in Distributed Environments, H. Witschel, 2007
(Elsevier)

* KLEE: A Framework for Distributed Top-k Query Algorithms, S. Michel,
P. Triantafillou, G. Weikum, VLDB'05 (ACM)

* Exploring the Stability of IDF Term Weighting, Xin Fu and  Miao Chen,
2008 (Springer Verlag)

* A Comparison of Techniques for Estimating IDF Values to Generate
Lexical Signatures for the Web, M. Klein, M. Nelson, WIDM'08 (ACM)

* Comparison of different Collection Fusion Models in Distributed
Information Retrieval, Alexander Steidinger - this paper gives a nice
comparison framework for different strategies for joining partial
results; apparently we use the most primitive strategy explained there,
based on raw scores...

These papers likely don't fully answer your question, but at least they
provide a broader picture of the issue...

-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: a bug of solr distributed search

2010-10-25 Thread Andrzej Bialecki
On 2010-10-25 11:22, Toke Eskildsen wrote:
> On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote: 
>> But it shows a problem of distributed search without common idf.
>> A doc will get a different score in different shards.
> 
> Bingo.
> 
> I really don't understand why this fundamental problem with sharding
> isn't mentioned more often. Every time the advice "use sharding" is
> given, it should be followed with a "but be aware that it will make
> relevance ranking unreliable".

The reason is twofold, I think:

* there is an exact solution to this problem, namely to make two
distributed calls instead of one (first call to collect per-shard IDFs
for given query terms, second call to submit a query rewritten with the
global IDFs). This solution is implemented in SOLR-1632, with some
caching to reduce the cost for common queries. However, this means that
now for every query you need to make two calls instead of one, which
potentially doubles the time to return results (for simple common
queries - for rare complex queries the time will still be dominated by
the query runtime on shard servers).

* another reason is that in many, many cases the difference between using
exact global IDF and per-shard IDFs is not that significant. If shards
are more or less homogeneous (e.g. you assign documents to shards by
hash(docId)) then term distributions will also be similar. So then the
question is whether you can accept an N% variance in scores across
shards, or whether you want to bear the cost of an additional
distributed RPC for every query...

To summarize, I would qualify your statement with: "...if the
composition of your shards is drastically different". Otherwise the cost
of using global IDF is not worth it, IMHO.
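
For reference, the variance comes from plugging shard-local statistics into
Lucene's default IDF formula; a small illustrative sketch (the numbers are
made up):

public class IdfDrift {
  // Same formula as Lucene's DefaultSimilarity.idf()
  static double idf(int docFreq, int numDocs) {
    return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
  }

  public static void main(String[] args) {
    System.out.println(idf(12000, 1000000));  // a term as seen by shard A
    System.out.println(idf(200, 950000));     // the same term as seen by shard B
    System.out.println(idf(12200, 1950000));  // what a global IDF (SOLR-1632) would use
  }
}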

-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Different analyzers for dfferent documents in different languages?

2010-09-22 Thread Andrzej Bialecki

On 2010-09-22 15:30, Bernd Fehling wrote:

Actually, this is one of the biggest disadvantages of Solr for multilingual 
content.
Solr is field-based, which means you have to know the language _before_ you feed
the content to a specific field and process the content for that field.
This results in having separate fields for each language.
E.g. for Europe this will be 24 to 26 languages for each title, keyword, 
description, ...

I guess when they started with Lucene/Solr they never had multilingual content 
in mind.

The alternative is to have a separate index for each language.
Therefore you also have to know the language of the content _before_ feeding it 
to the core.
E.g. again for Europe you end up with 24 to 26 cores.

Another option is to "see" the multilingual fields (title, keywords, 
description, ...) as a "subdocument". Write a filter class as a sub-pipeline, 
use language and encoding detection as the first step in that pipeline, then 
go on with all other linguistic processing within that pipeline and return the 
processed content back to the field for further filtering and storing.

Many solutions, but nothing out of the box :-)


Take a look at SOLR-1536, it contains an example of a tokenizing chain 
that could use a language detector to create different fields (or 
tokenize differently) based on this decision.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-06 Thread Andrzej Bialecki

On 2010-09-06 22:03, Dennis Gearon wrote:

What is a 'simple MOD'?


md5(docId) % numShards
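
i.e. something along these lines (a hedged sketch of the idea, not actual
SolrCloud code):

import java.security.MessageDigest;

public class SimpleModMapping {
  /** Map a document id to a shard number; the MD5 step just spreads the ids evenly. */
  public static int shardFor(String docId, int numShards) throws Exception {
    byte[] md5 = MessageDigest.getInstance("MD5").digest(docId.getBytes("UTF-8"));
    int h = ((md5[0] & 0xFF) << 24) | ((md5[1] & 0xFF) << 16)
          | ((md5[2] & 0xFF) << 8) | (md5[3] & 0xFF);
    return (h & 0x7FFFFFFF) % numShards; // mask the sign bit so the result is non-negative
  }
}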

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-06 Thread Andrzej Bialecki

On 2010-09-06 16:41, Yonik Seeley wrote:

On Mon, Sep 6, 2010 at 10:18 AM, MitchK  wrote:
[...consistent hashing...]

But it doesn't solve the problem at all - correct me if I am wrong, but: if
you add a new server, let's call it IP3-1, and IP3-1 is nearer to the
current resource X, then doc x will be indexed at IP3-1 - even if IP2-1
holds the older version.
Am I right?


Right.  You still need code to handle migration.

Consistent hashing is a way for everyone to be able to agree on the
mapping, and for the mapping to change incrementally.  i.e. you add a
node and it only changes the docid->node mapping of a limited percent
of the mappings, rather than changing the mappings of potentially
everything, as a simple MOD would do.


Another strategy to avoid excessive reindexing is to keep splitting the 
largest shards, and then your mapping becomes a regular MOD plus a list 
of these additional splits. Really, there's an infinite number of ways 
you could implement this...




For SolrCloud, I don't think we'll end up using consistent hashing -
we don't need it (although some of the concepts may still be useful).


I imagine there could be situations where a simple MOD won't do ;) so I 
think it would be good to hide this strategy behind an 
interface/abstract class. It costs nothing, and gives you flexibility in 
how you implement this mapping.
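
Something as small as this would do (a hypothetical sketch, not an actual
SolrCloud interface):

// Pluggable doc-to-shard mapping; implementations could be a simple MOD,
// consistent hashing, a lookup table of split shards, etc.
public interface ShardAssignmentStrategy {
  int shardFor(String docId, int numShards);
}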


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to retrieve the full corpus

2010-09-06 Thread Andrzej Bialecki

On 2010-09-06 17:15, Yonik Seeley wrote:

On Mon, Sep 6, 2010 at 10:52 AM, Roland Villemoes  
wrote:

How can I retrieve all words from a Solr core?
I need a list of all the words and how often they occur in the index.


http://wiki.apache.org/solr/TermsComponent

It doesn't currently stream though, so requesting *all* at once might
take too much memory.  One workaround is to page via terms.lower and
terms.limit.
Perhaps we should consider adding streaming to the terms component
though.  Would you mind opening a JIRA issue?


This would be nice also for building a spellchecker in another core 
(instead of using the current sub-index hack).
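
Until streaming is added, one workaround besides paging the TermsComponent is
to read the terms dictionary directly at the Lucene level - a minimal sketch
(assumptions: Lucene 2.9/3.x, direct read-only access to the index directory,
and a hypothetical field name "text"; note this reports document frequencies,
i.e. in how many documents each term occurs):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class DumpTerms {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/solr/data/index")));
    TermEnum te = reader.terms(new Term("text", "")); // positioned at the first term of the field
    try {
      do {
        Term t = te.term();
        if (t == null || !"text".equals(t.field())) break; // end of this field's terms
        System.out.println(t.text() + "\t" + te.docFreq());
      } while (te.next());
    } finally {
      te.close();
      reader.close();
    }
  }
}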



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-06 Thread Andrzej Bialecki

(I adjusted the subject to better reflect the content of this discussion).

On 2010-09-06 14:37, MitchK wrote:


Thanks for your detailed feedback, Andrzej!


From what I understood, SOLR-1301 becomes obsolete once Solr becomes
cloud-ready, right?


Who knows... I certainly didn't expect this code to become so popular ;) 
so even after SolrCloud becomes available it's likely that some people 
will continue to use it. But SolrCloud should solve the original problem 
that I tried to solve with this patch.



Looking into the future: eventually, when SolrCloud arrives we will be
able to index straight to a SolrCloud cluster, assigning documents to
shards through a hashing schema (e.g. 'md5(docId) % numShards')


Hm, let's say the md5(docId) would produce a value of 10 (it won't, but
let's assume it).
If I got a constant number of shards, the doc will be published to the same
shard again and again.

i.e.: 10 % numShards(5) = 2 ->  doc 10 will be indexed at shard 2.

A few days later the rest of the cluster is available, now it looks like

10 % numShards(10) ->   1 ->  doc 10 will be indexed at shard 1... and what
about the older version at shard 2? I am no expert when it comes to
cloudComputing and the other stuff.


There are several possible solutions to this, and they all boil down to
the way you assign documents to shards... Keep in mind that nodes 
(physical machines) can manage several shards, and the aggregate 
collection of all unique shards across all nodes forms your whole index 
- so there's also a related, but different issue, of how to assign 
shards to nodes.


Here are some scenarios for how you can solve the doc-to-shard mapping 
problem (note: I removed the issue of replication from the picture to 
make this clearer):


a) keep the number of shards constant no matter how large the cluster 
is. The mapping schema is then as simple as the one above. In this 
scenario you create relatively small shards, so that a single physical 
node can manage dozens of shards (each shard using one core, or perhaps 
a more lightweight structure like MultiReader). This is also known as 
micro-sharding. As the number of documents grows the size of each shard 
will grow until you have to reduce the number of shards per node, 
ultimately ending up with a single shard per node. After that, if your 
collection continues to grow, you have to modify your hashing schema to 
split some shards (and reindex some shards, or use an index splitter tool).


b) use consistent hashing as the mapping schema to assign documents to a 
changing number of shards. There are many explanations of this schema on 
the net, here's one that is very simple:


http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/

In this case, you can grow/shrink the number of shards (and their size) 
as you see fit, incurring only a small reindexing cost.
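
A bare-bones sketch of such a ring (an illustration only - hypothetical class,
not SolrCloud code): each shard is hashed onto the ring at several points, and
a document goes to the first shard point at or after its own hash, wrapping
around at the end.

import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class ShardRing {
  private final TreeMap<Long, String> ring = new TreeMap<Long, String>();
  private final int pointsPerShard;

  public ShardRing(int pointsPerShard) { this.pointsPerShard = pointsPerShard; }

  public void addShard(String shardName) {
    for (int i = 0; i < pointsPerShard; i++) {
      ring.put(hash(shardName + "#" + i), shardName);
    }
  }

  public void removeShard(String shardName) {
    for (int i = 0; i < pointsPerShard; i++) {
      ring.remove(hash(shardName + "#" + i));
    }
  }

  /** Walk clockwise from the doc's hash to the first shard point (assumes at least one shard). */
  public String shardFor(String docId) {
    SortedMap<Long, String> tail = ring.tailMap(hash(docId));
    return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
  }

  private static long hash(String s) {
    try {
      byte[] md5 = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
      long h = 0;
      for (int i = 0; i < 8; i++) {
        h = (h << 8) | (md5[i] & 0xFF);
      }
      return h;
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}

Adding a shard only remaps the documents whose hashes fall just before the new
shard's points, which is where the small reindexing cost comes from.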



If you can point me to one or another reference where I can read about it,
it would help me a lot, since I only want to understand how it works at the
moment.


http://wiki.apache.org/solr/SolrCloud ...



The problem with Solr is its lack of documentation in some classes and the
lack of encapsulating some very complex things into separate methods or
extra classes. Of course, this is because it costs some extra time to do so,
but it makes understanding and modifying things very complicated if you do
not understand what's going on from a theoretical point of view.


In this case the lack of good docs and user-level API can be blamed on 
the fact that this functionality is still under heavy development.




Since the cloud-feature will be complex, a lack of documentation and no
understanding of the theory behind the code will make contributing back
very, very complicated.


For now, yes, it's an issue - though as soon as SolrCloud gets committed 
I'm sure people will follow up with user-level convenience components 
that will make it easier.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: anyone use hadoop+solr?

2010-09-06 Thread Andrzej Bialecki

On 2010-09-04 19:53, MitchK wrote:


Hi,

this topic started a few months ago; however, there are some questions from
my side that I couldn't answer by looking at the SOLR-1301 issue or the
wiki pages.

Let me try to explain my thoughts:
Given: a Hadoop-cluster, a solr-search-cluster and nutch as a
crawling-engine which also performs LinkRank and webgraph-related tasks.

Once a list of documents is created by Nutch, you put the list + the
LinkRank values etc. into a Solr+Hadoop job as described in
SOLR-1301 to index or reindex the given documents.


There is no out of the box integration between Nutch and SOLR-1301, so 
there is some step that you omitted from this chain... e.g. "export from 
Nutch segments to CSV".




When the shards are built, they will be sent over the network to the
solr-search-cluster.
Is this description correct?


Not really. SOLR-1301 doesn't deal with how you deploy the results of 
indexing. It simply creates the shards on HDFS. SOLR-1301 just creates 
the index data - it doesn't deal with serving the data...




What gets me thinking is:
Assume I've got a document X on machine Y in shard Y...
When I reindex that document X together with lots of other documents that
are present or not present in shard Y... and I put the resulting shard on a
machine Z, how does machine Y notice that it has got an older version of
document X than machine Z?

Furthermore: go on and assume that shard Y was replicated to three other
machines; how do they all notice that their version of document X is not
the newest available one?
In such an environment we do not have a master (right?), so: how to
keep the index as consistent as possible?


It's not possible to do it like this, at least for now...

Looking into the future: eventually, when SolrCloud arrives we will be 
able to index straight to a SolrCloud cluster, assigning documents to 
shards through a hashing schema (e.g. 'md5(docId) % numShards'). Since 
shards would be created in a consistent way, newer versions of 
documents would end up in the same shards and they would replace the 
older versions of the same documents - thus the problem would be solved. 
An additional benefit of this model is that it's not a disruptive and 
copy-intensive operation like SOLR-1301 (where you have to do "create 
new indexes, deploy them and switch") but rather a regular online update 
that is already supported in Solr.


Once this is in place, we can modify Nutch to send documents directly to 
a SolrCloud cluster. Until then, you need to build and deploy indexes 
more or less manually (or using Katta, but again Katta is not integrated 
with Nutch).


SolrCloud is not far away from hitting the trunk (right, Mark? ;) ), so 
medium-term I think this is your best bet.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Analyser depending on field's value

2010-08-16 Thread Andrzej Bialecki

On 2010-08-16 10:06, Damien Dudognon wrote:

Hi all,

I want to use a specific stopword list depending on a field's value. For example, if type 
== 1 then I use stopwords1.txt to index "text" field, else I use stopwords2.txt.

I thought of several solutions but none really satisfied me:
1) use one Solr instance per type, and therefore a distinct index per type;
2) use as many fields as types, with specific rules for each field (e.g. a field "text_1" for type "1" 
which uses "stopwords1.txt", "text_2" for other types which uses "stopwords2.txt", ...)

I am sure that there is a better solution to my problem.

If anyone has a suitable solution to suggest ... :-)


Perhaps the solution described here:

https://issues.apache.org/jira/browse/SOLR-1536

Take a look at the example that uses token types to put text into 
different fields, which can then be analyzed differently.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Auto-suggest internal terms

2010-06-03 Thread Andrzej Bialecki
On 2010-06-03 13:38, Michael Kuhlmann wrote:
> Am 03.06.2010 13:02, schrieb Andrzej Bialecki:
>> ..., and deploy this
>> index in a separate JVM (to benefit from other CPUs than the one that
>> runs your Solr core)
> 
> Every known webserver is multithreaded by default, so putting different
> Solr instances into different JVMs will be of no use.


You are right to a certain degree. Still, there are some contention
points in Lucene/Solr, in how threads are allocated on the available CPUs, and
in how the heap is used, which can make a two-JVM setup perform much better
than a single-JVM setup given the same number of threads...


-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Auto-suggest internal terms

2010-06-03 Thread Andrzej Bialecki
On 2010-06-03 09:56, Michael Kuhlmann wrote:
> The only solution without "doing any custom work" would be to perform a
> normal query for each suggestion. But you might get into performance
> troubles with that, because suggestions are typically performed much
> more often than complete searches.

Actually, that's not a bad idea - if you can trim the size of the index
(either by using shingles instead of docs, or trimming the main index -
LUCENE-1812) so that the index fits completely in RAM, and deploy this
index in a separate JVM (to benefit from other CPUs than the one that
runs your Solr core) or another machine, then I think performance would
not be a big concern, and the functionality would be just what you wanted.

> 
> The much faster solution that needs own work would be to build up a
> large TreeMap with each word as the keys, and the matching terms as the
> values.

That would consume an awful lot of RAM... see SOLR-1316 for some
measurements.


-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Importing large datasets

2010-06-02 Thread Andrzej Bialecki
On 2010-06-02 13:12, Grant Ingersoll wrote:
> 
> On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:
> 
>> On 2010-06-02 12:42, Grant Ingersoll wrote:
>>>
>>> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
>>>
>>>>
>>>> We have around 5 million items in our index and each item has a description
>>>> located on a separate physical database. These item descriptions vary in
>>>> size and for the most part are quite large. Currently we are only indexing
>>>> items and not their corresponding description and a full import takes 
>>>> around
>>>> 4 hours. Ideally we want to index both our items and their descriptions but
>>>> after some quick profiling I determined that a full import would take in
>>>> excess of 24 hours. 
>>>>
>>>> - How would I profile the indexing process to determine if the bottleneck 
>>>> is
>>>> Solr or our Database.
>>>
>>> As a data point, I routinely see clients index 5M items on normal
>>> hardware in approx. 1 hour (give or take 30 minutes).  
>>>
>>> When you say "quite large", what do you mean?  Are we talking books here or 
>>> maybe a couple pages of text or just a couple KB of data?
>>>
>>> How long does it take you to get that data out (and, from the sounds of it, 
>>> merge it with your item) w/o going to Solr?
>>>
>>>> - In either case, how would one speed up this process? Is there a way to 
>>>> run
>>>> parallel import processes and then merge them together at the end? Possibly
>>>> use some sort of distributed computing?
>>>
>>> DataImportHandler now supports multiple threads.  The absolute fastest way 
>>> that I know of to index is via multiple threads sending batches of 
>>> documents at a time (at least 100).  Often, from DBs one can split up the 
>>> table via SQL statements that can then be fetched separately.  You may want 
>>> to write your own multithreaded client to index.
>>
>> SOLR-1301 is also an option if you are familiar with Hadoop ...
>>
> 
> If the bottleneck is the DB, will that do much?
> 

Nope. But the workflow could be set up so that during night hours a DB
export takes place that results in a CSV or SolrXML file (there you
could measure the time it takes to do this export), and then indexing
can work from this file.


-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Importing large datasets

2010-06-02 Thread Andrzej Bialecki
On 2010-06-02 12:42, Grant Ingersoll wrote:
> 
> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
> 
>>
>> We have around 5 million items in our index and each item has a description
>> located on a separate physical database. These item descriptions vary in
>> size and for the most part are quite large. Currently we are only indexing
>> items and not their corresponding description and a full import takes around
>> 4 hours. Ideally we want to index both our items and their descriptions but
>> after some quick profiling I determined that a full import would take in
>> excess of 24 hours. 
>>
>> - How would I profile the indexing process to determine if the bottleneck is
>> Solr or our Database.
> 
> As a data point, I routinely see clients index 5M items on normal
> hardware in approx. 1 hour (give or take 30 minutes).  
> 
> When you say "quite large", what do you mean?  Are we talking books here or 
> maybe a couple pages of text or just a couple KB of data?
> 
> How long does it take you to get that data out (and, from the sounds of it, 
> merge it with your item) w/o going to Solr?
> 
>> - In either case, how would one speed up this process? Is there a way to run
>> parallel import processes and then merge them together at the end? Possibly
>> use some sort of distributed computing?
> 
> DataImportHandler now supports multiple threads.  The absolute fastest way 
> that I know of to index is via multiple threads sending batches of documents 
> at a time (at least 100).  Often, from DBs one can split up the table via SQL 
> statements that can then be fetched separately.  You may want to write your 
> own multithreaded client to index.

SOLR-1301 is also an option if you are familiar with Hadoop ...
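
If Hadoop is overkill, a minimal SolrJ sketch of the multi-threaded, batched
client mentioned above, using StreamingUpdateSolrServer (the URL, field names
and loop are placeholders for the real DB cursor):

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
  public static void main(String[] args) throws Exception {
    // queue up to 100 docs and push them to Solr from 4 background threads
    StreamingUpdateSolrServer server =
        new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);
    for (int i = 0; i < 5000000; i++) {           // stand-in for iterating the items table
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", i);
      doc.addField("description", "fetched from the description DB"); // placeholder value
      server.add(doc);                            // buffered and sent in the background
    }
    server.commit();
  }
}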



-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Autosuggest

2010-05-15 Thread Andrzej Bialecki
On 2010-05-15 02:46, Blargy wrote:
> 
> Thanks for your help and especially your analyzer.. probably saved me a
> full-import or two  :)
> 

Also, take a look at this issue:

https://issues.apache.org/jira/browse/SOLR-1316


-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: SOLR-1316 How To Implement this autosuggest component ???

2010-03-31 Thread Andrzej Bialecki

On 2010-03-31 06:14, Andy wrote:



--- On Tue, 3/30/10, Andrzej Bialecki  wrote:


From: Andrzej Bialecki
Subject: Re: SOLR-1316 How To Implement this autosuggest component ???
To: solr-user@lucene.apache.org
Date: Tuesday, March 30, 2010, 9:59 AM
On 2010-03-30 15:42, Robert Muir wrote:

On Mon, Mar 29, 2010 at 11:34 PM, Andy wrote:

Reading through this thread and SOLR-1316, there seems to be a lot of
different ways to implement auto-complete in Solr. I've seen the mentions
of:

EdgeNGrams
TermsComponent
Faceting
TST
Patricia Tries
RadixTree
DAWG


Another idea is you can use the Automaton support in the lucene flexible
indexing branch: to query the index directly with a DFA that represents
whatever terms you want back.
The idea is that there really isn't much gain in building a separate Pat,
Radix Tree, or DFA to do this when you can efficiently intersect a DFA with
the existing terms dictionary.

I don't really understand what autosuggest needs to do, but if you are doing
things like looking for misspellings you can easily build a DFA that
recognizes terms within some short edit distance with the support that's
there (the LevenshteinAutomata class), to quickly get back candidates.

You can intersect/concatenate/union these DFAs with prefix or suffix DFAs if
you want too, don't really understand what the algorithm should do, but I'm
happy to try to help.


The problem is a bit more complicated. There are two issues:

* simple term-level completion often produces wrong results for
multi-term queries (which are usually rewritten as "weak" phrase queries),

* the weights of suggestions should not correspond directly to IDF in
the index - much better results can be obtained when they correspond to
the frequency of terms/phrases in the query logs ...

TermsComponent and EdgeNGrams, while simple to use, suffer from both issues.



Thanks.

I actually have 2 use cases for autosuggest:

1) The "normal" one - I want to suggest search terms to users after they've 
typed a few letters. Just like Google suggest. Looks like for this use case SOLR-1316 is 
the best option. Right?


Hopefully, yes - it depends on how you intend to populate the TST. If 
you populate it from the main index, then (unless you have indexed 
phrases) there won't be any benefit over the TermsComponent. It may be 
faster, but it will take more RAM. If you populate it from a list of 
top-N queries, then SOLR-1316 is the way to go.



2) I have a field "city" with values that are entered by users. When a user is 
entering his city, I want to make suggestions based on what cities have already been 
entered so far by other users -- in order to reduce the chances of duplication. What method 
would you recommend for this use case?


If the "city" field is not analyzed then TermsComponent is easiest to 
use. If it is analyzed, but the vast majority of cities are single terms, 
then TermsComponent is ok too. If you want to assign different 
priorities to suggestions (other than a simple IDF-based priority), or 
have many city names consisting of multiple tokens, then use SOLR-1316.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: SOLR-1316 How To Implement this autosuggest component ???

2010-03-30 Thread Andrzej Bialecki

On 2010-03-30 15:42, Robert Muir wrote:

On Mon, Mar 29, 2010 at 11:34 PM, Andy  wrote:


Reading through this thread and SOLR-1316, there seems to be a lot of
different ways to implement auto-complete in Solr. I've seen the mentions
of:

EdgeNGrams
TermsComponent
Faceting
TST
Patricia Tries
RadixTree
DAWG




Another idea is you can use the Automaton support in the lucene flexible
indexing branch: to query the index directly with a DFA that represents
whatever terms you want back.
The idea is that there really isn't much gain in building a separate Pat,
Radix Tree, or DFA to do this when you can efficiently intersect a DFA with
the existing terms dictionary.

I don't really understand what autosuggest needs to do, but if you are doing
things like looking for misspellings you can easily build a DFA that
recognizes terms within some short edit distance with the support that's
there (the LevenshteinAutomata class), to quickly get back candidates.

You can intersect/concatenate/union these DFAs with prefix or suffix DFAs if
you want too, don't really understand what the algorithm should do, but I'm
happy to try to help.



The problem is a bit more complicated. There are two issues:

* simple term-level completion often produces wrong results for 
multi-term queries (which are usually rewritten as "weak" phrase queries),


* the weights of suggestions should not correspond directly to IDF in 
the index - much better results can be obtained when they correspond to 
the frequency of terms/phrases in the query logs ...


TermsComponent and EdgeNGrams, while simple to use, suffer from both issues.

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: SOLR-1316 How To Implement this autosuggest component ???

2010-03-29 Thread Andrzej Bialecki

On 2010-03-30 05:34, Andy wrote:

Reading through this thread and SOLR-1316, there seems to be a lot of different 
ways to implement auto-complete in Solr. I've seen the mentions of:

EdgeNGrams
TermsComponent
Faceting
TST
Patricia Tries
RadixTree
DAWG

Which algorithm does SOLR-1316 implement? TST is one. There are others mentioned 
in the comments on SOLR-1316, such as Patricia Tries, RadixTree, DAWG. Are 
those implemented too?

Among all those methods is there a "recommended" one? What are the pros & cons?


Only TST is implemented in SOLR-1316. The main advantage of this 
approach is that it can complete arbitrary strings - e.g. frequent 
queries. This reduces the chance of suggesting queries that yield no 
results, which is a danger in other methods.


The disadvantage is the increased RAM consumption, and the need to 
populate it (either from IndexReader - but then it's nearly equivalent 
to the TermsComponent; or from a list of frequent queries - but you need 
to build that list yourself).



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: multiple binary documents into a single solr document - Vignette/OpenText integration

2010-03-24 Thread Andrzej Bialecki

On 2010-03-24 15:58, Fábio Aragão da Silva wrote:

hello there,
I'm working on the development of a piece of code that integrates Solr
with Vignette/OpenText Content Management, meaning Vignette content
instances will be indexed in solr when published and deleted from solr
when unpublished. I'm using solr 1.4, solrj and solr cell.

I've implemented most of the code and I've run into only a single
issue so far: vignette content management supports the attachment of
multiple binary documents (such as .doc, .pdf or .xls files) to a
single content instance. I am mapping each content instance in
Vignette to a solr document, but now I have a content instance in
vignette with multiple binary files attached to it.

So my question is: is it possible to have more than one binary file
indexed into a single document in solr?

I'm a beginner in solr, but from what I understood I have two options
to index content using solrj: either to use UpdateRequest() and the
add() method to add a SolrInputDocument to the request (in case the
document doesn´t represent a binary file), or to use
ContentStreamUpdateRequest() and the addFile() method to add a binary
file to the content stream request.

I don't see a way, though, to say "this document is comprised of two
files, a word and a pdf, so index them as one document in solr using
content1 and content2 fields - or merge their content into a single
'content' field".

I tried calling addFile() twice (one call for each file) and got no
error, but nothing got indexed either.

ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("file1.doc"));
req.addFile(new File("file2.pdf"));
req.setParam("literal.id", "multiple_files_test");
req.setParam("uprefix", "attr_");
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
server.request(req);

Any thoughts on this would be greatly appreciated.


Write your own RequestHandler that uses the existing 
ExtractingRequestHandler to actually parse the streams, and then you 
combine the results arbitrarily in your handler, eventually sending an 
AddUpdateCommand to the update processor. You can obtain both the update 
processor and SolrCell instance from req.getCore().



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: wikipedia and teaching kids search engines

2010-03-24 Thread Andrzej Bialecki

On 2010-03-24 16:15, Markus Jelsma wrote:

A bit off-topic, but how about Nutch grabbing some content and having it indexed
in Solr?


The problem is not with collecting and submitting the documents, the 
problem is with parsing the Wikimedia markup embedded in XML. 
WikipediaTokenizer from Lucene contrib/ is a quick and perhaps 
acceptable solution ...


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Features not present in Solr

2010-03-23 Thread Andrzej Bialecki

On 2010-03-23 06:25, David Smiley @MITRE.org wrote:


I use Endeca and Solr.

A few notable things in Endeca but not in Solr:
1. Real-time search.




2. "related record navigation" (RRN) is what they call it.  This is the
ability to join in other records, something Lucene/Solr definitely can't do.


Could you perhaps elaborate a bit on this functionality? Your 
description sounds intriguing - it reminds me of ParallelReader, but I'm 
probably completely wrong ...



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: SOLR-1316 How To Implement this autosuggest component ???

2010-03-19 Thread Andrzej Bialecki

On 2010-03-19 13:03, stocki wrote:


hello..

I'm trying to implement the autosuggest component from this link:
http://issues.apache.org/jira/browse/SOLR-1316

but I have no idea how to do this!? Can anyone give me some tips?


Please follow the instructions outlined in the JIRA issue, in the 
comment that shows fragments of XML config files.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Update Index : Updating Specific Fields

2010-03-04 Thread Andrzej Bialecki

On 2010-03-04 07:41, Walter Underwood wrote:

No. --wunder


Or perhaps "not yet" ...

http://portal.acm.org/ft_gateway.cfm?id=1458171



On Mar 3, 2010, at 10:40 PM, Kranti™ K K Parisa wrote:


Hi,

Is there any way to update the index for only the specific fields?

Eg:
The index has ONE document consisting of 4 fields: F1, F2, F3, F4.
Now I want to update the value of field F2, so if I send the update XML to
Solr, can it keep the old field values for F1, F3, F4 and update the new value
specified for F2?

Best Regards,
Kranti K K Parisa








--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: If you could have one feature in Solr...

2010-02-28 Thread Andrzej Bialecki

On 2010-02-28 17:26, Ian Holsman wrote:

On 2/24/10 8:42 AM, Grant Ingersoll wrote:

What would it be?


most of this will be coming in 1.5,
but for me it's

- sharding.. it still seems a bit clunky

secondly.. this one isn't in 1.5.
I'd like to be able to find "interesting" terms that appear in my result
set that don't appear in the global corpus.

it's kind of like doing a facet count on *:* and then on the search term
and discount the terms that appear heavily on the global one.
(sorry.. there is a textbook definition of this.. XX distance.. but I
haven't got the books in front of me).


Kullback-Leibler divergence?
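
(For reference, a sketch of the formula - with P(t) the probability of term t
in the result set and Q(t) its probability in the global corpus:

  D_{KL}(P \| Q) = \sum_t P(t) \log \frac{P(t)}{Q(t)}

The terms contributing the largest summands would be the "interesting" ones.)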


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: term frequency vector access?

2010-02-11 Thread Andrzej Bialecki

On 2010-02-11 17:04, Mike Perham wrote:

In an UpdateRequestProcessor (processing an AddUpdateCommand), I have
a SolrInputDocument with a field 'content' that has termVectors="true"
in schema.xml.  Is it possible to get access to that field's term
vector in the URP?


No, term vectors are created much later, during the process of adding 
the document to a Lucene index (deep inside Lucene IndexWriter & co). 
That's the whole point of SOLR-1536 - certain features become available 
only when the tokenization actually occurs.


Another reason to use SOLR-1536 is when tokenization and analysis is 
costly, e.g. when doing named entity recognition, POS tagging or 
lemmatization. Theoretically you could play the TokenizerChain twice - 
once in URP, so that you can discover and capture features and modify 
the input document accordingly, and then again inside Lucene - but in 
practice this may be too costly.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Can Solr be forced to return all field tags for a document even if the field is empty?

2010-01-27 Thread Andrzej Bialecki

On 2010-01-28 03:21, Erick Erickson wrote:

This is kind of an unusual request, what higher-level
problem are you trying to solve here? Because the
field just *isn't there* in the underlying Lucene index
for that document.

I suppose you could index a "not there" token and just
throw those values out from the response...


You can also implement a SearchComponent that post-processes the results 
and, based on the schema, adds an empty node to the result for each field 
that is missing from a document.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to Split Index file.

2010-01-10 Thread Andrzej Bialecki

On 2010-01-10 01:55, Lance Norskog wrote:

Make two copies of the index. In each copy, delete the records you do
not want. Optimize.


... which is essentially what the MultiPassIndexSplitter does, only it 
avoids the initial copy (by deleting in the source index).



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: restore space between words by spell checker

2009-11-28 Thread Andrzej Bialecki

Otis Gospodnetic wrote:

I'm not sure if that can be easily done (other than going char by char and 
testing), because nothing indicates where the space might be, not even an upper 
case there.  I'd be curious to know if you find a better solution.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 

From: Andrey Klochkov 
To: solr-user 
Sent: Fri, November 27, 2009 6:09:08 AM
Subject: restore space between words by spell checker

Hi

If a user issued a misspelled query, forgetting to place space between
words, is it possible to fix it with a spell checker or by some other
mechanism?

For example, if we get query "tommyhitfiger" and have terms "tommy" and
"hitfiger" in the index, how to fix the query?


The usual approach to solving this is to index compound words, i.e. when 
producing a spellchecker dictionary add a record "tommyhitfiger" with a 
field that points to "tommy hitfiger". Details vary depending on what 
spellchecking impl. you use.
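
A minimal sketch of the dictionary-building part (just an illustration, not 
tied to any particular spellchecker implementation - the class and names 
are made up):

import java.util.*;

public class CompoundDictionary {
  // Map run-together forms back to the original spaced phrases,
  // e.g. "tommyhitfiger" -> "tommy hitfiger".
  public static Map<String, String> build(Collection<String> phrases) {
    Map<String, String> compounds = new HashMap<String, String>();
    for (String phrase : phrases) {
      String glued = phrase.replaceAll("\\s+", "");
      if (!glued.equals(phrase)) {
        compounds.put(glued, phrase);  // index 'glued', keep 'phrase' as the suggestion
      }
    }
    return compounds;
  }

  public static void main(String[] args) {
    Map<String, String> dict = build(Arrays.asList("tommy hitfiger", "hugo boss"));
    System.out.println(dict.get("tommyhitfiger"));  // -> "tommy hitfiger"
  }
}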




--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Index Splitter

2009-11-25 Thread Andrzej Bialecki

Koji Sekiguchi wrote:

Giovanni Fernandez-Kincade wrote:

You can't really use this if you have an optimized index, right?

  

For optimized index, I think you can use MultiPassIndexSplitter.


Correct - MultiPassIndexSplitter can handle any index - optimized or 
not, with or without deletions, etc. The cost for this flexibility is 
that it needs to read index files multiple times (hence "multi-pass").




--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: how to get the autocomplete feature in solr 1.4?

2009-11-23 Thread Andrzej Bialecki

Chris Hostetter wrote:

: how to get the autocomplete/autosuggest feature in the solr1.4.plz give me
: the code also...

there is no magical "one size fits all" solution for autocomplete in solr.  
if you look at the archives there have been lots of discussions about 
different ways to get autocomplete functionality, using things like the 
TermsComponent, or the LukeRequest handler, and there are lots of examples 
of using the SolrJS javascript functionality to populate an autocomplete 
box -- but you'll have to figure out what solution works best for your 
goals.


Also, take a look at SOLR-1316 - there are patches there that implement 
such a component using prefix trees.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: leading and trailing wildcard query

2009-11-05 Thread Andrzej Bialecki

A. Steven Anderson wrote:

No thoughts on this? Really!?

I would hate to admit to my Oracle DBE that Solr can't be customized to do a
common query that a relational database can do. :-(


On Wed, Nov 4, 2009 at 6:01 PM, A. Steven Anderson <
a.steven.ander...@gmail.com> wrote:


I've scoured the archives and JIRA , but the answer to my question is just
not clear to me.

With all the new Solr 1.4 features, is there any way  to do a leading and
trailing wildcard query on an *untokenized* field?

e.g. q=myfield:*abc* would return a doc with myfield=xxxabcxxx

Yes, I know how expensive such a query would be, but we have the user
requirement, nonetheless.

If not, any suggestions on how to implement a custom solution using Solr?
Using an external data structure?


You can use ReversedWildcardFilterFactory, which creates additional 
reversed tokens (in your case, a single additional token :) ) _and_ 
also triggers setAllowLeadingWildcards in the QueryParser. It won't 
help much with performance though, due to the trailing wildcard in 
your original query. Please see the discussion in SOLR-1321 (this will 
be available in 1.4, but it should be easy to patch 1.3 to use it).


If you really need to support such queries efficiently you should 
implement full permuterm indexing, i.e. a token filter that rotates 
tokens and adds all rotations (with a special marker for the word 
boundary), and a query plugin that detects such query terms and 
rotates the query term appropriately.
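
To illustrate just the rotation part (a toy sketch, not the actual filter 
code - in a real TokenFilter each rotation would be emitted as an extra 
token at the same position, i.e. with positionIncrement=0):

import java.util.*;

public class Permuterm {
  // Append a boundary marker and generate all rotations of the marked term,
  // e.g. "abc" -> [abc$, bc$a, c$ab, $abc].
  public static List<String> rotations(String term) {
    String marked = term + "$";
    List<String> out = new ArrayList<String>(marked.length());
    for (int i = 0; i < marked.length(); i++) {
      out.add(marked.substring(i) + marked.substring(0, i));
    }
    return out;
  }

  public static void main(String[] args) {
    // A query like *abc* would then be rewritten to the prefix query abc*,
    // which matches e.g. the rotation "abcxxx$xxx" of "xxxabcxxx".
    System.out.println(rotations("xxxabcxxx"));
  }
}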



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Solr Cell on web-based files?

2009-10-27 Thread Andrzej Bialecki

Grant Ingersoll wrote:
You might try remote streaming with Solr (see 
http://wiki.apache.org/solr/SolrConfigXml).  Otherwise, look into a 
crawler such as Nutch or Droids or Heretrix.


Additionally, Nutch can be configured to send the crawled/parsed 
documents to Solr for indexing.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: QTime always a multiple of 50ms ?

2009-10-23 Thread Andrzej Bialecki

Jérôme Etévé wrote:

Hi all,

 I'm using Solr trunk from 2009-10-12 and I noticed that the QTime
result is always a multiple of roughly 50ms, regardless of the used
handler.

For instance, for the update handler, I get :

INFO: [idx1] webapp=/solr path=/update/ params={} status=0 QTime=0
INFO: [idx1] webapp=/solr path=/update/ params={} status=0 QTime=104
INFO: [idx1] webapp=/solr path=/update/ params={} status=0 QTime=52
...

Is this a known issue ?


It may be an issue with System.currentTimeMillis() resolution on some 
platforms (e.g. Windows)?
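
A quick standalone check of the timer granularity on a given machine 
(nothing Solr-specific, just plain Java):

public class ClockGranularity {
  public static void main(String[] args) {
    // Align to a tick boundary first, then measure the size of the next tick.
    long t0 = System.currentTimeMillis();
    while (System.currentTimeMillis() == t0) { /* spin */ }
    long t1 = System.currentTimeMillis();
    long t2;
    while ((t2 = System.currentTimeMillis()) == t1) { /* spin */ }
    System.out.println("currentTimeMillis() granularity: " + (t2 - t1) + " ms");
  }
}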



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Is negative boost possible?

2009-10-13 Thread Andrzej Bialecki

Yonik Seeley wrote:

On Mon, Oct 12, 2009 at 12:03 PM, Andrzej Bialecki  wrote:

Solr never discarded non-positive hits, and now Lucene 2.9 no longer
does either.

Hmm ... The code that I pasted in my previous email uses
Searcher.search(Query, int), which in turn uses search(Query, Filter, int),
and it doesn't return any results if only the first clause is present (the
one with negative boost) even though it's a matching clause.

I think this is related to the fact that in TopScoreDocCollector:48 the
pqTop.score is initialized to 0, and then all results that have lower score
than this are discarded. Perhaps this should be initialized to
Float.MIN_VALUE?


Hmmm, You're actually seeing this with Lucene 2.9?
The HitQueue (subclass of PriorityQueue) is pre-populated with
sentinel objects with scores of -Inf, not zero.


Uhh, sorry, you are right - an early 2.9-dev version of the jar sneaked 
in on my classpath .. I verified now that 2.9.0 returns both positive 
and negative scores with the default TopScoreDocCollector.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Passing request to another handler

2009-10-13 Thread Andrzej Bialecki

Chris Hostetter wrote:

: What's the canonical way to pass an update request to another handler? I'm
: implementing a handler that has to dispatch its result to different update
: handlers based on its internal processing.

I've always written my delegating RequestHandlers so that they take in the 
names (or paths) of the handlers they are going to delegate to as init 
params.


Yeah, this is where I started ...



the other approach i've seen is to make the delegating handler instantiate 
the sub-handlers directly so that it can have the exact instances it wants 
configured the way it wants them.


... and this is where I ended now :)



It really comes down to what your goal is: if you want your code to be 
totally in control, instantiate new instances.  If you want the person 
creating the solrconfig.xml to be in control, let them tell you the name 
of a handler (with its defaults/invariants configured in a way you can't 
control) to delegate to.


Indeed - thanks.
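
For reference, a rough sketch of the two variants (assuming 'core', 'req', 
'rsp' and 'initArgs' are in scope inside the delegating handler, and using 
"/update/extract" only as an example path; imports and error handling omitted):

// (a) delegate to a handler configured in solrconfig.xml, looked up by name:
SolrRequestHandler byName = core.getRequestHandler("/update/extract");
if (byName != null) {
  byName.handleRequest(req, rsp);
}

// (b) instantiate and configure the sub-handler directly:
ExtractingRequestHandler mine = new ExtractingRequestHandler();
mine.init(initArgs);  // NamedList prepared by the delegating handler
mine.inform(core);    // ExtractingRequestHandler is SolrCoreAware
mine.handleRequest(req, rsp);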

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Is negative boost possible?

2009-10-12 Thread Andrzej Bialecki

Yonik Seeley wrote:

On Mon, Oct 12, 2009 at 5:58 AM, Andrzej Bialecki  wrote:

BTW, standard Collectors collect only results
with positive scores, so if you want to collect results with negative scores
as well then you need to use a custom Collector.


Solr never discarded non-positive hits, and now Lucene 2.9 no longer
does either.


Hmm ... The code that I pasted in my previous email uses 
Searcher.search(Query, int), which in turn uses search(Query, Filter, 
int), and it doesn't return any results if only the first clause is 
present (the one with negative boost) even though it's a matching clause.


I think this is related to the fact that in TopScoreDocCollector:48 the 
pqTop.score is initialized to 0, and then all results that have lower 
score than this are discarded. Perhaps this should be initialized to 
Float.MIN_VALUE?



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Is negative boost possible?

2009-10-12 Thread Andrzej Bialecki

Yonik Seeley wrote:

On Sun, Oct 11, 2009 at 6:04 PM, Lance Norskog  wrote:

And the other important
thing to know about boost values is that the dynamic range is about
6-8 bits


That's an index-time boost - an 8 bit float with 5 bits of mantissa
and 3 bits of exponent.
Query time boosts are normal 32 bit floats.


To be more specific: index-time float encoding does not permit negative 
numbers (see SmallFloat), but query-time boosts can be negative, and 
they DO affect the score - see below. BTW, standard Collectors collect 
only results with positive scores, so if you want to collect results 
with negative scores as well then you need to use a custom Collector.


---
BeanShell 2.0b4 - by Pat Niemeyer (p...@pat.net)
bsh % import org.apache.lucene.search.*;
bsh % import org.apache.lucene.index.*;
bsh % import org.apache.lucene.store.*;
bsh % import org.apache.lucene.document.*;
bsh % import org.apache.lucene.analysis.*;
bsh % tq = new TermQuery(new Term("a", "b"));
bsh % print(tq);
a:b
bsh % tq.setBoost(-1);
bsh % print(tq);
a:b^-1.0
bsh % q = new BooleanQuery();
bsh % tq1 = new TermQuery(new Term("a", "c"));
bsh % tq1.setBoost(10);
bsh % q.add(tq1, BooleanClause.Occur.SHOULD);
bsh % q.add(tq, BooleanClause.Occur.SHOULD);
bsh % print(q);
a:c^10.0 a:b^-1.0
bsh % dir = new RAMDirectory();
bsh % w = new IndexWriter(dir, new WhitespaceAnalyzer());
bsh % doc = new Document();
bsh % doc.add(new Field("a", "b c d", Field.Store.YES, 
Field.Index.ANALYZED));

bsh % w.addDocument(doc);
bsh % w.close();
bsh % r = IndexReader.open(dir);
bsh % is = new IndexSearcher(r);
bsh % td = is.search(q, 10);
bsh % sd = td.scoreDocs;
bsh % print(sd.length);
1
bsh % print(is.explain(q, 0));
0.1373985 = (MATCH) sum of:
  0.15266499 = (MATCH) weight(a:c^10.0 in 0), product of:
0.99503726 = queryWeight(a:c^10.0), product of:
  10.0 = boost
  0.30685282 = idf(docFreq=1, numDocs=1)
  0.32427183 = queryNorm
0.15342641 = (MATCH) fieldWeight(a:c in 0), product of:
  1.0 = tf(termFreq(a:c)=1)
  0.30685282 = idf(docFreq=1, numDocs=1)
  0.5 = fieldNorm(field=a, doc=0)
  -0.0152664995 = (MATCH) weight(a:b^-1.0 in 0), product of:
-0.099503726 = queryWeight(a:b^-1.0), product of:
  -1.0 = boost
  0.30685282 = idf(docFreq=1, numDocs=1)
  0.32427183 = queryNorm
0.15342641 = (MATCH) fieldWeight(a:b in 0), product of:
  1.0 = tf(termFreq(a:b)=1)
  0.30685282 = idf(docFreq=1, numDocs=1)
  0.5 = fieldNorm(field=a, doc=0)

bsh %


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Passing request to another handler

2009-10-11 Thread Andrzej Bialecki

Shalin Shekhar Mangar wrote:

On Fri, Oct 9, 2009 at 10:53 PM, Andrzej Bialecki  wrote:


Hi,

What's the canonical way to pass an update request to another handler? I'm
implementing a handler that has to dispatch its result to different update
handlers based on its internal processing.



An update request? There's always only one UpdateHandler registered in Solr.


Hm, yes - to be more specific, what I meant is that I need to 
pre-process an update request and then pass it on either to my own 
handler (which performs an update) or to the ExtractingRequestHandler 
(which also performs an update).






Getting a handler from SolrCore.getRequestHandler(handlerName) makes the
implementation dependent on deployment paths defined in solrconfig.xml.
Using SolrCore.getRequestHandlers(handler.class) often returns the
LazyRequestHandlerWrapper, from which it's not possible to retrieve the
wrapped instance of the handler ..



You must know the name of the handler you are going to invoke. Or if you are
sure that there is only one instance, knowing the class name will let you
know the handler name. Then the easiest way to invoke it would be to use a


I do know the class name - ExtractingRequestHandler. But when I invoke 
SolrCore.getRequestHandler(Class) I get an empty map, because this 
handler is registered as lazy, and this means that it's represented as 
LazyRequestHandlerWrapper.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Passing request to another handler

2009-10-09 Thread Andrzej Bialecki

Hi,

What's the canonical way to pass an update request to another handler? 
I'm implementing a handler that has to dispatch its result to different 
update handlers based on its internal processing.


Getting a handler from SolrCore.getRequestHandler(handlerName) makes the 
implementation dependent on deployment paths defined in solrconfig.xml. 
Using SolrCore.getRequestHandlers(handler.class) often returns the 
LazyRequestHandlerWrapper, from which it's not possible to retrieve the 
wrapped instance of the handler ..


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Where to place ReversedWildcardFilterFactory in Chain

2009-10-01 Thread Andrzej Bialecki

Chantal Ackermann wrote:

Thanks, Mark!
But I suppose it does matter where in the index chain it goes? I would 
guess it is applied to the tokens, so I suppose I should put it at the 
very end - after WordDelimiter and Lowercase have been applied.



Is that correct?

 >>   <filter class="solr.WordDelimiterFilterFactory"
 >>           splitOnCaseChange="1" splitOnNumerics="1"
 >>           stemEnglishPossessive="1" generateWordParts="1"
 >>           generateNumberParts="1" catenateAll="1"
 >>           preserveOriginal="1" />
 >>   <filter class="solr.LowerCaseFilterFactory"/>
 >>   <filter class="solr.ReversedWildcardFilterFactory"/>


Yes. Care should be taken that the query analyzer chain produces the 
same forward tokens, because the code in QueryParser that optionally 
reverses tokens acts on tokens that it receives _after_ all other query 
analyzers have run on the query.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Adding data from nutch to a Solr index

2009-09-30 Thread Andrzej Bialecki

Sönke Goldbeck wrote:

Alright, first post to this list and I hope the question
is not too stupid or misplaced ...

what I currently have:
- a nicely working Solr 1.3 index with information about some
entities e.g. organisations, indexed from an RDBMS. Many of these
entities have an URL pointing at further information, e.g. the
website of an institute or company.

- an installation of nutch 0.9 with which I can crawl for the
URLs that I can extract from the RDBMS mentioned above and put
into a seed file

- tutorials about how to put crawled and indexed data from
nutch 1.0 (which I could install w/o problems) into a separate
Solr index


what I want:
- combine the indexed information from the RDBMS and the website
in one Solr index so that I can search both in one and with the
capability of using all the Solr features. E.g. having the following
(example) fields in one document:


  
  
  
  
  <...>



I believe that this kind of document merging is not possible (at least 
not easily) - you have to assemble the whole document before you index 
it in Solr.


If these documents use the same primary key (I guess they do, otherwise 
how would you merge them...) then you can do the merging in your 
front-end application, which would have to submit the main query to 
Solr, and then for each Solr document on the list of results it would 
retrieve a Nutch document (using NutchBean API).


(The not so easy way involves writing a SearchComponent that does the 
latter part of that process on the Solr side.)


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Number of terms in a SOLR field

2009-09-30 Thread Andrzej Bialecki

Fergus McMenemie wrote:

Fergus McMenemie wrote:

Hi all,

I am attempting to test some changes I made to my DIH based
indexing process. The changes only affect the way I 
describe my fields in data-config.xml, there should be no
changes to the way the data is indexed or stored.

As a QA check I was wanting to compare the results from
indexing the same data before/after the change. I was looking
for a way of getting counts of terms in each field. I 
guess Luke etc must allow this but how?
Luke uses brute force approach - it traverses all terms, and counts 
terms per field. This is easy to implement yourself - just get 
IndexReader.terms() enumeration and traverse it.


Thanks Andrzej 


This is just a one off QA check. How do I get Luke to display
terms and counts?


1. get Luke 0.9.9
2. open index with Luke
3. Look at the Overview panel, you will see the list titled "Available 
fields and term counts per field".



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Number of terms in a SOLR field

2009-09-29 Thread Andrzej Bialecki

Fergus McMenemie wrote:

Hi all,

I am attempting to test some changes I made to my DIH based
indexing process. The changes only affect the way I 
describe my fields in data-config.xml, there should be no
changes to the way the data is indexed or stored.

As a QA check I was wanting to compare the results from
indexing the same data before/after the change. I was looking
for a way of getting counts of terms in each field. I 
guess Luke etc must allow this but how?


Luke uses brute force approach - it traverses all terms, and counts 
terms per field. This is easy to implement yourself - just get 
IndexReader.terms() enumeration and traverse it.
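
A minimal sketch using the Lucene 2.x API of that time (run it with the 
index directory path as the argument):

import java.util.Map;
import java.util.TreeMap;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class FieldTermCounts {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(args[0]);  // path to the index directory
    Map<String, Integer> counts = new TreeMap<String, Integer>();
    TermEnum terms = reader.terms();
    while (terms.next()) {             // walk all terms, bucketed per field
      Term t = terms.term();
      Integer c = counts.get(t.field());
      counts.put(t.field(), c == null ? 1 : c + 1);
    }
    terms.close();
    reader.close();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      System.out.println(e.getKey() + "\t" + e.getValue());
    }
  }
}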



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to get a stack trace

2009-08-08 Thread Andrzej Bialecki

Chris Hostetter wrote:
: I'm a new user of solr but I have worked a bit with Lucene before. I get 
: some out of memory exception when optimizing the index through Solr and 
: I would like to find out why. However, the only message I get on 
: standard output is: Jul 30, 2009 9:20:22 PM 
: org.apache.solr.common.SolrException log SEVERE: 
: java.lang.OutOfMemoryError: Java heap space
: 
: Is there a way to get a stack trace for this exception? I had a look 
: into the java.util.logging options and didn't find anything.


FWIW #1: OutOfMemoryError is a java "Error" not an "Exception" ... 
Exceptions and Errors are both Throwable, but an Error is not an 
Exception. This is a really important distinction (see below)


FWIW #2: when dealing with an OOM, a stack trace is almost never useful.  
as mentioned in other threads, a heapdump is the most useful diagnostic 
tool


FWIW #3: the formatting of Throwables in log files is 100% dependent 
on the configuration of the log manager -- the client code doing 
the logging just specifies the Throwable object -- it's up to the 
Formatter to decide how to output it.


Ok .. on to the meat of the issue...

OOM Errors are a particularly devious class of errors: they don't 
necessarily have stack traces (depending on your VM impl, and the state 
of the VM when it tries to log the OOM) 

   http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4753347
   http://blogs.sun.com/alanb/entry/outofmemoryerror_looks_a_bit_better

...on any *Exception* you should get a detailed stacktrace in the logs 
(unless you have really screwed up LogManager configs), but when dealing 
with *Errors* like OutOfMemoryError, all bets are off as to what the VM 
can give you.


I had some success in debugging this type of problem when I would 
generate a heap dump on OOM (via the -XX:+HeapDumpOnOutOfMemoryError JVM 
flag) and then use a tool like HAT to find the largest objects and the 
references to them.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Language Detection for Analysis?

2009-08-07 Thread Andrzej Bialecki

Otis Gospodnetic wrote:

Bradford,

If I may:

Have a look at http://www.sematext.com/products/language-identifier/index.html
And/or http://www.sematext.com/products/multilingual-indexer/index.html


.. and a Nutch plugin with similar functionality:

http://lucene.apache.org/nutch/apidocs-1.0/org/apache/nutch/analysis/lang/LanguageIdentifier.html

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Solr Search problem w/ phrase searches, text type, w/ escaped characters

2009-08-03 Thread Andrzej Bialecki

Peter Keane wrote:

I've used Luke to figure out what is going on, and I see in the fields that
fail to match, a "null_1".  Could someone tell me what that is?  I see some
null_100s there as well, which seem to separate field values.  Clearly the
null_1s are causing the search to fail.


You used the "Reconstruct" function to obtain the field values for 
unstored fields, right? null_NNN is Luke's way of telling you that the 
tokens that should be on these positions are absent, because they were 
removed by analyzer during indexing, and there is no stored value of 
this field from which you could recover the original text. In other 
words, they are holes in the token stream, of length NNN.


Such holes may be also produced by artificially increasing the token 
positions, hence the null_100 that serves to separate multiple field 
values so that e.g. phrase queries don't match unrelated text.


Phrase queries that you can construct using QueryParser can't match two 
tokens separated by a hole, unless you set a slop value > 0.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: what crawler do you use for Solr indexing?

2009-03-10 Thread Andrzej Bialecki

Sean Timm wrote:

We too use Heritrix. We tried Nutch first but Nutch was not finding all
of the documents that it was supposed to. When Nutch and Heritrix were
both set to crawl our own site to a depth of three, Nutch missed some
pages that were linked directly from the seed. We ended up with 10%-20%
fewer pages in the Nutch crawl.


FWIW, from a private conversation with Sean it seems that this was 
likely related to the default configuration in Nutch, which collects 
only the first 1000 outlinks from a page. This is an arbitrary and 
configurable limit, introduced as a way to limit the impact of spam 
pages and to limit the size of LinkDb. If a page hits this limit then 
indeed the symptoms that you observe are missing (dropped) links.




--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: what crawler do you use for Solr indexing?

2009-03-06 Thread Andrzej Bialecki

Tony Wang wrote:

Hi Hoss,

But I cannot find documents about the integration of Nutch and Solr in
anywhere. Could you give me some clue? thanks


Tony, I suggest that you follow Hoss's advice and ask these questions on 
nutch-user. This integration is built into Nutch, and not Solr, so it's 
less likely that people on this list know what you are talking about.


This integration is quite fresh, too, so there are almost no docs except 
on the mailing list. Eventually someone is going to create some docs, 
and if you keep asking questions on nutch-user you will contribute to 
the creation of such docs ;)



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Integrating Solr and Nutch

2009-02-27 Thread Andrzej Bialecki

Tony Wang wrote:

I heard Nutch 1.0 will have an easy way to integrate with Solr, but I
haven't found any documentation on that yet. anyone?


Indeed, this integration is already supported in Nutch trunk (soon to be 
released). Please download a nightly package and test it.


You will need to reindex your segments using the solrindex command, and 
change the searcher configuration. See nutch-default.xml for details.
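
If I remember correctly, the invocation looks roughly like this (check the 
usage printed by bin/nutch in your build for the exact arguments):

bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb crawl/segments/*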


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Redhat vs FreeBSD vs other unix flavors

2009-02-27 Thread Andrzej Bialecki

Otis Gospodnetic wrote:

You should be fine on either Linux or FreeBSD (or any other UNIX
flavour).  Running on Solaris would probably give you access to
goodness like dtrace, but you can live without it.


There's dtrace on FreeBSD, too.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Please help me integrate Nutch with Solr

2008-12-29 Thread Andrzej Bialecki

Tony Wang wrote:

Thanks Otis.

I've just downloaded
NUTCH-442_v8.patch<https://issues.apache.org/jira/secure/attachment/12391810/NUTCH-442_v8.patch>from
https://issues.apache.org/jira/browse/NUTCH-442, but the patching process
gave me lots errors, see below:


This patch will be integrated within a couple days - please monitor this 
issue, and when it's done just download the patched code.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Community Logo Preferences

2008-11-27 Thread Andrzej Bialecki

https://issues.apache.org/jira/secure/attachment/12394268/apache_solr_c_red.jpg
https://issues.apache.org/jira/secure/attachment/12394350/solr.s4.jpg
https://issues.apache.org/jira/secure/attachment/12394376/solr_sp.png
https://issues.apache.org/jira/secure/attachment/12394267/apache_solr_c_blue.jpg


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: TextProfileSigature using deduplication

2008-11-20 Thread Andrzej Bialecki

Mark Miller wrote:
Thanks for sharing Marc, thats very nice to know. I'll take your 
experience as a starting point for some wiki recommendations.


Sounds like we should add a switch to order alpha as well.


On the general note of near-duplicate detection ... I found this paper 
in the proceedings of SIGIR-08, which presents an interesting and 
relatively simple algorithm that yields excellent results. Who has some 
spare CPU cycles to implement this? ;)


http://ilpubs.stanford.edu:8090/860/

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: TextProfileSigature using deduplication

2008-11-18 Thread Andrzej Bialecki

Marc Sturlese wrote:

Hey there, I've been testing and checking the source of the
TextProfileSignature.java to avoid similar entries at indexing time.
What I understood is that it is useful for huge texts where the frequency of
the tokens (the words lowercased, just with numbers and letters in that case)
is important. If you want to detect duplicates in short texts, without
giving a lot of importance to the frequencies, it doesn't work...
The hash will be made just from the terms whose frequency is higher than a
QUANTUM (whose value is computed as a function of the max frequency among all
the terms). So it will say that:

aaa sss ddd fff ggg hhh aaa kkk lll ooo
aaa xxx iii www qqq aaa jjj eee zzz nnn

are duplicates because the quantum here would be 2 and the frequency of aaa
would be 2 as well. So, to make the hash, just the term aaa would be used.

In this case:
aaa sss ddd fff ggg hhh kkk lll ooo
apa sss ddd fff ggg hhh kkk lll ooo

Here the quantum would be 1 and the frequencies of all terms would be 1, so all
terms would be used for the hash. It will consider these two strings not
similar.

As I understood the algorithm there's no way to make it understand that in
my second case both strings are similar. I wish I were wrong...

I have my own deduplication system to detect that, but I use String comparison,
so it is really slow... I would like to know if there is any tuning
possibility to do that with TextProfileSignature.


Don't know if I should post this here or in the developers forum...


Hi Marc,

TextProfileSignature is a rather crude implementation of approximate 
similarity, and as you pointed out it's best suited for large texts. The 
original purpose of this Signature was to deduplicate web pages in large 
amounts of crawled pages (in Nutch), where it worked reasonably well. 
Its advantage is also that it's easy to compute and doesn't require 
multiple passes over the corpus.


As it is implemented now, it breaks badly in the case you describe. You 
could modify this implementation to include also word-level ngrams, i.e. 
sequences of more than 1 word, up to N (e.g. 5) - this should work in 
your case.
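
A rough sketch of that modification - just the n-gram counting part, not 
the actual TextProfileSignature code (here with n-grams of up to maxN words):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NGramProfile {
  // Count single words plus 2..maxN word n-grams; the signature would then
  // be built from the most frequent entries of this map, as in the original.
  public static Map<String, Integer> profile(List<String> words, int maxN) {
    Map<String, Integer> freq = new HashMap<String, Integer>();
    for (int n = 1; n <= maxN; n++) {
      for (int i = 0; i + n <= words.size(); i++) {
        StringBuilder gram = new StringBuilder();
        for (int j = i; j < i + n; j++) {
          if (j > i) gram.append(' ');
          gram.append(words.get(j));
        }
        String key = gram.toString();
        Integer c = freq.get(key);
        freq.put(key, c == null ? 1 : c + 1);
      }
    }
    return freq;  // plays the role of the term/frequency profile the signature is built from
  }
}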


Ultimately, what you are probably looking for is a shingle-based 
algorithm, but it's relatively costly and requires multiple passes.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: maxFieldLength

2008-11-07 Thread Andrzej Bialecki

Dan A. Dickey wrote:

I just came across the maxFieldLength setting for the mainIndex
in solrconfig.xml and have a question or two about it.
The default value is 10000.

I'm extracting text from pdf documents and
storing them into a text field.  Is the length of this text field limited
to 10000 characters?  Many pdf documents are megabytes in size.
Does this mean that only the first 10000 characters are getting indexed?

Is there a good way to index the whole document, or do I just simply
need to increase the size of maxFieldLength?  What performance
ramifications would something like this have?


maxFieldLength is counted in tokens, not chars, so you should be pretty 
safe unless your documents contain a lot of text.


You can of course set this value to whatever you want, including 
Integer.MAX_VALUE. This has performance consequences - terms found at 
large positions will increase the length of posting lists, which leads 
to increased memory/CPU consumption during decoding and traversing of 
the lists. Also, the overall increased number of positions will have an 
impact on the index size.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Advice on analysis/filtering?

2008-10-16 Thread Andrzej Bialecki

Jarek Zgoda wrote:

Wiadomość napisana w dniu 2008-10-16, o godz. 16:21, przez Grant Ingersoll:

I'm trying to create a search facility for documents in "broken" 
Polish (by broken I mean "not language rules compliant"),


Can you explain what you mean here a bit more?  I don't know Polish, 


Hi guys,

I do speak Polish :) maybe I can help here a bit.


Some documents (around 15% of the whole pile) contain texts entered by 
children from primary schools, and that implies many syntactic and 
orthographic errors.


document text: "włatcy móch" (in proper Polish this would be "władcy 
much")
example terms that should match: "włatcy much", "wlatcy moch", 
"wladcy much"


These examples can be classified as "sounds like", and typically 
soundexing algorithms are used to address this problem, in order to 
generate initial suggestions. After that you can use other heuristic 
rules to select the most probable correct forms.


AFAIK, there are no (public) soundex implementations for Polish, in 
particular in Java, although there was some research work done on the 
construction of a specifically Polish soundex. You could also use the 
Daitch-Mokotoff soundex, which comes close enough.



Taking word "włatcy" from my example, I'd like to find documents 
containing words


"wlatcy" (latin-2 accentuations stripped from original), 


This step is trivial.
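
(For example, with plain JDK classes - note that the Polish 'ł' has no 
Unicode decomposition, so it has to be mapped by hand:)

import java.text.Normalizer;

public class StripPolishAccents {
  public static String strip(String s) {
    // 'ł'/'Ł' do not decompose into a base letter plus a combining mark,
    // so replace them explicitly before the generic NFD + strip-marks step.
    String mapped = s.replace('ł', 'l').replace('Ł', 'L');
    String decomposed = Normalizer.normalize(mapped, Normalizer.Form.NFD);
    return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
  }

  public static void main(String[] args) {
    System.out.println(strip("włatcy móch"));  // -> "wlatcy moch"
  }
}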

"władcy" (proper form of this noun) and "wladcy" (latin-2 
accents stripped from proper form).


And this one is not. It requires using something like soundexing in 
order to look up possible similar terms. However ... in this process you 
inevitably collect false positives, and you don't have any way in the 
input text to determine that they should be rejected. You can only make 
this decision based on some external knowledge of Polish, such as:


* a morpho-syntactic analyzer that will determine which combinations of 
suggestions are more correct and more probable,


* a language model that for any given soundexed phrase can generate the 
most probable original phrases.


Also, knowing the context in which a query is asked may help, but 
usually you don't have this information (queries are short).


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Adding bias to Distributed search feature?

2008-09-15 Thread Andrzej Bialecki

Lance Norskog wrote:

Thanks!  We made variants of this and a couple of other files.

As to why we have the same document in different shards with different
contents: once you hit a certain index size and ingest rate, it is easiest
to create a series of indexes and leave the older ones alone. In the future,
please consider this as a legitimate use case instead of simply a mistake.


You may be interested in implementing something like this:

"Compact Features for Detection of Near-Duplicates in Distributed 
Retrieval", Yaniv Bernstein, Milad Shokouhi, and Justin Zobel


It sounds straightforward, and relieves you from the need to 
de-duplicate your collection.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Extending Solr with custom filter

2008-09-12 Thread Andrzej Bialecki

Jarek Zgoda wrote:

Exactly like that.

Wiadomość napisana w dniu 2008-09-12, o godz. 17:27, przez sunnyfr:



ok .. that?

  
 
   




I recommend using Stempelator (or Morfologik) for Polish stemming and 
lemmatization. It provides a superset of Stempel features, namely in 
addition to the algorithmic stemming it provides a dictionary-based 
stemming, and these two methods nicely complement each other.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Solr Logo thought

2008-08-06 Thread Andrzej Bialecki

Stephen Weiss wrote:
My issue with the logos presented was they made solr look like a school 
project instead of the powerful tool that it is.  The tricked out font 
or whatever just usually doesn't play well with the business types... 
they want serious-looking software.  First impressions are everything.  
While the fiery colors are appropriate for something named Solr, you can 
play with that without getting silly - take a look at:


http://www.ascsolar.com/images/asc_solar_splash_logo.gif
http://www.logostick.com/images/EOS_InvestmentingLogo_lg.gif

(Luckily there are many businesses that do solar energy!)

They have the same elements but with a certain simplicity and elegance.

I know probably some people don't care if it makes the boss or client 
happy, but, these are the kinds of seemingly insignificant things that 
make people choose a bad, proprietary piece of junk over something solid 
and open-source... it's all about appearances!  The people making the 
decision often have little else to go on, unfortunately.


I concur. IMHO you should at least consider how the logo looks like when:

* it's reduced to black & white (e.g. when sending faxes or making copies)

* resized to favicon.ico size,

* resized to an A0 poster size

Many OSS projects for some obscure reason love to use color gradients, 
often with broad hue spans - but such gradients rarely look good in 
print, exhibiting the banding problem, and they are very easy to corrupt 
when transferring images from one medium to another. If we absolutely 
must use gradients, then at least we should create some logo variants 
without gradients - see the SVG file I created in SOLR-84 .


For these reasons I suggest:

* not using gradients

* not using small intricate elements that get lost in small-size logos - 
or coming up with reduced-complexity variants for the smaller sizes


* avoiding large splashes of uniform strong color - these look bad on 
large logos, like poster-sized ones.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-07 Thread Andrzej Bialecki

Doug Cutting wrote:

Ning,

I am also interested in starting a new project in this area.  The 
approach I have in mind is slightly different, but hopefully we can come 
to some agreement and collaborate.


I'm interested in this too.

My current thinking is that the Solr search API is the appropriate 
model.  Solr's facets are an important feature that require low-level 
support to be practical.  Thus a useful distributed search system should 
support facets from the outset, rather than attempt to graft them on 
later.  In particular, I believe this requirement mandates disjoint shards.


I agree - shards should be disjoint also because if we eventually want 
to manage multiple replicas of each shard across the cluster (for 
reliability and performance) then overlapping documents would complicate 
both the query dispatching process and the merging of partial result sets.



My primary difference with your proposal is that I would like to support 
online indexing.  Documents could be inserted and removed directly, and 
shards would synchronize changes amongst replicas, with an "eventual 
consistency" model.  Indexes would not be stored in HDFS, but directly 
on the local disk of each node.  Hadoop would perhaps not play a role. 
In many ways this would resemble CouchDB, but with explicit support for 
sharding and failover from the outset.


It's true that searching over HDFS is slow - but I'd hate to lose all 
other HDFS benefits and have to start from scratch ... I wonder what 
would be the performance of FsDirectory over an HDFS index that is 
"pinned" to a local disk, i.e. a full local replica is available, with 
block size of each index file equal to the file size.



A particular client should be able to provide a consistent read/write 
view by bonding to particular replicas of a shard.  Thus a user who 
makes a modification should be able to generally see that modification 
in results immediately, while other users, talking to different 
replicas, may not see it until synchronization is complete.


This requires that we use versioning, and that we have a "shard manager" 
that knows the latest versions of each shard among the whole active set 
- or that clients discover this dynamically by querying the shard 
servers every now and then.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: big perf-difference between solr-server vs. SOlrJ req.process(solrserver)

2008-01-02 Thread Andrzej Bialecki

Otis Gospodnetic wrote:

Maybe I'm not following your situation 100%, but it sounded like
pulling the values of purely stored fields is the slow part.
*Perhaps* using a non-Lucene data store just for the saved fields
would be faster.


For this purpose Nutch uses external files in Hadoop MapFile format. 
MapFile-s offer quick search & get by key (using binary search over an 
in-memory index of keys).


The benefit of this solution is that the bulky content is decoupled from 
Lucene indexes, and it can be put in a physically different location 
(e.g. a dedicated page content server).
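
A rough sketch of this kind of store with the Hadoop API of that time (the 
directory name and the Text key/value types are just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class ContentStore {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // MapFile requires keys to be appended in sorted order.
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, "content_store", Text.class, Text.class);
    writer.append(new Text("docA"), new Text("stored content of docA"));
    writer.append(new Text("docB"), new Text("stored content of docB"));
    writer.close();

    // Random access by key - binary search over the in-memory key index.
    MapFile.Reader reader = new MapFile.Reader(fs, "content_store", conf);
    Text value = new Text();
    if (reader.get(new Text("docB"), value) != null) {
      System.out.println(value);
    }
    reader.close();
  }
}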


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: multilingual list of stopwords

2007-10-18 Thread Andrzej Bialecki

Lukas Vlcek wrote:

Hi,

I haven't heard of a multilingual stop words list before. What would be the
purpose of it? This seems too odd to me :-)


That's because multilingual stopword list doesn't make sense ;)

One example that I'm familiar with: words "is" and "by" in English and 
in Swedish. Both words are stopwords in English, but they are content 
words in Swedish (ice and village, respectively). Similarly, "till" in 
Swedish is a stopword (to, towards), but it's a content word in English.


So, as Lukas correctly suggested, you should first perform language 
identification, and then apply the correct stopword list.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: solr, snippets and stored field in nutch...

2007-10-15 Thread Andrzej Bialecki

Mike Klaas wrote:

On 11-Oct-07, at 4:34 PM, Ravish Bhagdev wrote:


Hi Mike,

Thanks for your reply :)

I am not an expert in either! But I understand that Nutch stores
contents, albeit in a separate data structure (they call it a segment, as
discussed in the thread). What I meant was that this seems like a
much more efficient way of presenting summaries or snippets (of course,
for apps that only need these) than using a stored field, which is the only
option in Solr - not only resulting in a huge index size but also reducing
retrieval speed because of this increase in size (this is
admittedly a guess, I would like to know if that is not the case).  Also, for
queries only requesting ids/urls, the segments would never be touched,
even for the first n results...


Let me add a few comments, as someone who is pretty familiar with Nutch.

Indeed, there is a strong separation of data stores in Nutch - in order 
to get the maximum possible performance Lucene indexes are not used for 
data storage - they contain only bare essentials needed to compute the 
score, plus an "id" of a data record stored elsewhere. Confusingly, this 
location is called "segment", and it consists of a bunch of Hadoop 
MapFile-s and SequenceFile-s - there are data files with "content", 
"parse_data" and "parse_text" among others.


When results are returned to the client (in this case - Nutch front-end 
machine) they contain only the score and this id (plus optionally some 
other data needed for online de-duplication). In other words, Nutch 
doesn't transmit the whole "document" to the client, only the parts that 
are needed to prepare the presentation of the requested portion of hits.


Nutch stores plain text versions of documents in segments, in the 
"parse_text" file, and retrieves this data on demand, i.e. when a client 
requests a summary to be presented. Nutch front-end uses Hadoop RPC to 
communicate with back-end servers, and can retrieve either one or 
several summaries in one call, which reduces the network traffic.


In a similar way the original binary content of a document can be 
requested if needed, and it will be retrieved from the "content" MapFile 
in a "segment".


The advantage of this approach is that you can keep the index size to a 
minimum (it contains mostly unstored fields), and that you can associate 
arbitrary binary data with a Lucene document. The downside is the 
increased cost to manage many data files - but this cost is largely 
hidden in Nutch behind specialized *Reader facades.




It doesn't slow down querying, but it does slow down document retrieval 
(*if you are never going to request the summaries for those documents).  
That is the case I was referring to below.


This is the case for which Nutch architecture is optimized.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com