Re: [ANNOUNCE] Solr wiki editing change
On 3/25/13 4:18 AM, Steve Rowe wrote: The wiki at http://wiki.apache.org/solr/ has come under attack by spammers more frequently of late, so the PMC has decided to lock it down in an attempt to reduce the work involved in tracking and removing spam. From now on, only people who appear on http://wiki.apache.org/solr/ContributorsGroup will be able to create/modify/delete wiki pages. Please request either on the solr-user@lucene.apache.org or on d...@lucene.apache.org to have your wiki username added to the ContributorsGroup page - this is a one-time step. Please add AndrzejBialecki to this group. Thank you! -- Best regards, Andrzej Bialecki http://www.sigram.com, blog http://www.sigram.com/blog ___.,___,___,___,_._. __<>< [___||.__|__/|__||\/|: Information Retrieval, System Integration ___|||__||..\|..||..|: Contact: info at sigram dot com
Re: What is the "docs" number in Solr explain query results for fieldnorm?
On 25/05/2012 20:13, Tom Burton-West wrote: Hello all, I am trying to understand the output of Solr explain for a one-word query. I am querying on the "ocr" field with no stemming/synonyms or stopwords, and no query- or index-time boosting. The query is "ocr:the". The document (result below), which contains the two words "The Aeroplane", scores higher than documents with 50 or more occurrences of the word "the". Since the idf is the same, I am assuming this is a result of length norms. The explain (debugQuery) shows the following for fieldnorm: 0.625 = fieldNorm(field=ocr, doc=16624) What does the "doc=16624" mean? It certainly cannot represent the length of the field (as an integer), since there are only two terms in the field. It also can't represent the number of docs containing the query term (the idf output shows the word "the" occurs in 16,219 docs). Hi Tom, This is an internal document number within a Lucene index. This number is useless at the level of the Solr APIs because you can't use it to actually do anything. At the Lucene level (e.g. in Luke) you could navigate to this number and, for example, retrieve the stored fields of this document. As shown in the Explanations, it can only be used to correlate parts of the query that matched the same document number. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
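[Editor's note] The 0.625 value itself comes from norm quantization: classic Lucene stores each field norm in a single byte (a 3-bit-mantissa / 5-bit-exponent "SmallFloat" encoding), so lengthNorm = 1/sqrt(2) ≈ 0.707 for a two-term field gets rounded down to the nearest representable value, 0.625. A minimal Python sketch of that encoding, modeled on Lucene's SmallFloat.byte315 routines (treat the exact method names and edge-case handling as assumptions):

```python
import math
import struct

def float_to_byte315(f):
    """Compress a float into one byte (3-bit mantissa, 5-bit exponent),
    mirroring Lucene's SmallFloat.floatToByte315."""
    bits = struct.unpack('>i', struct.pack('>f', f))[0]
    smallfloat = bits >> (24 - 3)
    if smallfloat <= ((63 - 15) << 3):
        return 0 if bits <= 0 else 1   # underflow: 0 or smallest positive
    if smallfloat >= ((63 - 15) << 3) + 0x100:
        return 0xFF                     # overflow: largest representable
    return smallfloat - ((63 - 15) << 3)

def byte315_to_float(b):
    """Decode the single-byte norm back into a float."""
    if b == 0:
        return 0.0
    bits = (b << (24 - 3)) + ((63 - 15) << 24)
    return struct.unpack('>f', struct.pack('>i', bits))[0]

length_norm = 1.0 / math.sqrt(2)       # two-term field -> ~0.7071
stored = float_to_byte315(length_norm)
print(byte315_to_float(stored))        # -> 0.625, the value in the explain output
```

The lossy round-trip is exactly why many different field lengths collapse onto the same small set of fieldNorm values.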
Re: is there any practice to load index into RAM to accelerate solr performance?
On 08/02/2012 09:17, Ted Dunning wrote: This is true with Lucene as it stands. It would be much faster if there were a specialized in-memory index such as is typically used with high performance search engines. This could be implemented in Lucene trunk as a Codec. The challenge though is to come up with the right data structures. There has been some interesting research on optimizations for in-memory inverted indexes, but it usually involves changing the query evaluation algos as well - for reference: http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/1202502 http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf http://research.google.com/pubs/archive/37365.pdf -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Solr Lucene Index Version
On 08/12/2011 14:50, Jamie Johnson wrote: Mark, Agreed that Replication wouldn't help, I was dreaming that there was some intermediate format used in replication. Ideally you are right, I could just reindex the data and go on with life, but my case is not so simple. Currently we have some set of processes which are run against the raw artifact to index things of interest within the text document. I don't believe (and I need to check with the folks who wrote this) that I have an easy way to do this currently, but this would be my preference. Andrzej, Isn't the codec stuff merged with trunk now? Admittedly I know very little about Lucene's index format, but I'd be willing to be a guinea pig if you needed a tester. The bulk of the work described in LUCENE-2621 has been done by Robert Muir (big thanks!!) and merged with trunk, but I think there may still be some parts missing - see LUCENE-3622. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Solr Lucene Index Version
On 08/12/2011 05:00, Mark Miller wrote: Replication just copies the index, so I'm not sure how this would help offhand? With SolrCloud this is a breeze - just fire up another replica for a shard and the current index will replicate to it. If you were willing to export the data to some portable format and then pull it back in, why not just store the original data and reindex? This was actually one of the situations that motivated that JIRA issue - there are scenarios where reindexing, or keeping the original data, is very costly in terms of space, time, I/O, pre-processing costs, curating, merging, etc., etc... The good news is that once the recent work on the codecs is merged with trunk, we can revisit this issue and implement it with much less effort than before - we could even start by modifying SimpleTextCodec to be more lenient, and proceed from there. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Backup with lukeall XMLExporter.
On 05/10/2011 19:21, Luis Cappa Banda wrote: Hello. I've been looking for information trying to find an easy way to do index backups with Solr, and I've read that lukeall has an application called XMLExporter that creates an XML dump from a Lucene index with its complete information. I've got some questions about this alternative: 1. Does it also contain the information from fields configured as stored=false? 2. Can I load this generated XML file with curl to reindex? If not, any other solution? Thank you very much. It does not provide a complete copy of the index information; it only dumps general information about the index plus the stored fields of documents. Non-stored fields are not available. There is no counterpart tool to take this XML dump and turn it into an index. I'm working on a tool like the one you had in mind, and I will be presenting the results of this work at Eurocon in Barcelona. However, it's still very much incomplete, and it depends on cutting-edge features (LUCENE-2621). In any case, if you're using Lucene then you can safely take a backup of the index if it's open read-only. With Solr you can use the replication mechanism to pull in a copy of the index from a running Solr instance. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Can I delete the stored value?
On 7/10/11 2:33 PM, Simon Willnauer wrote: Currently there is no easy way to do this. I would need to think about how you could force the index to drop those, so the answer here is: no, you can't! simon On Sat, Jul 9, 2011 at 11:11 AM, Gabriele Kahlout wrote: I've stored the contents of some pages I no longer need. How can I now delete the stored content without re-crawling the pages (i.e. using updateDocument)? I cannot just remove the field, since I still want the field to be indexed; I just don't want to store anything with it. My understanding is that field.setValue("") won't do, since that would affect the indexed value as well. You could pump the content of your index through a FilterIndexReader - i.e. implement a subclass of FilterIndexReader that removes stored fields under some conditions, and then use IndexWriter.addIndexes with this reader. See LUCENE-1812 for another practical application of this concept. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Feed index with analyzer output
On 7/5/11 1:37 PM, Lox wrote: Ok, the very short question is: Is there a way to submit the analyzer response so that solr already knows what to do with that response? (that is, which field are to be treated as payloads, which are tokens, etc...) Check this issue: http://issues.apache.org/jira/browse/SOLR-1535 -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Performance loss - querying more than 64 cores (randomly)
On 6/16/11 5:31 PM, Mark Schoy wrote: Thanks for your answers. Andrzej was right with his assumption. Solr only needs about 9GB of memory, but the system needs the rest of it for disk IO: 64 cores: 64*100MB index size = 6.4GB + 9GB Solr cache + about 600MB OS = 16GB. Conclusion: my system can buffer the data of exactly 64 cores. Every additional core can't be buffered, and performance decreases. Glad to be of help... You could formulate this conclusion in a different way, too: if you specify too large a heap size then you stifle the OS disk buffers - Solr won't be able to use the excess memory, and it won't be available for OS-level disk IO either. Therefore reducing the heap size may actually increase your performance. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
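[Editor's note] The back-of-the-envelope budget above can be written out explicitly; the numbers are the ones quoted in this thread:

```python
# Memory budget from the benchmark machine (decimal MB, per the thread).
total_ram_mb = 16_000     # 16 GB of RAM
solr_heap_mb = 9_000      # what Solr itself actually needs
os_overhead_mb = 600      # ~600 MB OS footprint
index_per_core_mb = 100   # 100 MB of index data per core

# Whatever the JVM heap and OS don't claim is left for the disk buffer cache.
buffer_cache_mb = total_ram_mb - solr_heap_mb - os_overhead_mb   # 6400

# How many cores' worth of index data the buffer cache can hold:
max_cached_cores = buffer_cache_mb // index_per_core_mb
print(max_cached_cores)   # -> 64, exactly where the benchmark starts degrading
```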
Re: Performance loss - querying more than 64 cores (randomly)
On 6/16/11 3:22 PM, Mark Schoy wrote: Hi, I set up a Solr instance with 512 cores. Each core has 100k documents and 15 fields. Solr is running on a CPU with 4 cores (2.7GHz) and 16GB RAM. Now I've done some benchmarks with JMeter. On each thread iteration, JMeter queries another core at random. Here are the results (duration: 180 seconds each):

Randomly queried cores | queries per second
  1 | 2016
  2 | 2001
  4 | 1978
  8 | 1958
 16 | 2047
 32 | 1959
 64 | 1879
128 | 1446
256 | 1009
512 |  428

Why are queries per second constant up to 64 cores, and why does performance then degrade rapidly? Solr only uses 10GB of the 16GB of memory, so I think it is not a memory issue. This may be an OS-level disk buffer issue. With limited disk buffer space, the more random IO occurs from different files, the higher the churn rate is; and if the buffers are full then the churn rate may increase dramatically (and performance will drop accordingly). Modern OSes try to keep as much data in memory as possible, so the memory usage itself is not that informative - but check the pagein/pageout rates when you start hitting 32 vs. 64 cores. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Lucid Works
On 4/8/11 9:55 PM, Andy wrote: --- On Fri, 4/8/11, Andrzej Bialecki wrote: :) If you don't need the new functionality in 4.x, you don't need the performance improvements, What performance improvements does 4.x have over 3.1? Ah... well, many - take a look at the CHANGES.txt. reindexing cycles are long (indexes tend to stay around) then 3.1 is a safer bet. If you need a dozen or so new exciting features (e.g. results grouping) or top performance, or if you need LucidWorks with Click and other goodies, then use 4.x and be prepared for an occasional full reindex. So using 4.x would require an occasional full reindex but using 3.1 would not? Could you explain? I thought 4.x comes with NRT indexing, so why is a full reindex necessary? Well, as long as you don't upgrade, the index format is of course stable and you can manage the index incrementally. But the 4.x index format itself is not yet stable - if you upgrade to a newer Lucene / LucidWorks of 4.x vintage, it may be that even though the indexes before and after the upgrade are both of 4.x vintage, they are still incompatible. At some point there may be tools to transparently convert indexes from one 4.x format to another, but they are not there yet. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Lucid Works
On 4/8/11 4:58 PM, Mark wrote: Doesn't look like you allow new members to post questions in that forum. There's a "Create new account" link there; you simply need to register and log in. I have just one last question ;) We are deciding whether to upgrade our 1.4 production environment to 4.x or 3.1. What were your reasons for deciding to release 4.x over 3.1? Based on the details that you provided I'd say "it depends" :) If you don't need the new functionality in 4.x, you don't need the performance improvements, and if your full reindexing cycles are long (indexes tend to stay around), then 3.1 is a safer bet. If you need a dozen or so new exciting features (e.g. results grouping) or top performance, or if you need LucidWorks with Click and other goodies, then use 4.x and be prepared for an occasional full reindex. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Lucid Works
On 4/7/11 10:16 PM, Mark wrote: Andrzej, Thanks for the info. I have a question regarding stability though. How are you able to guarantee the stability of this release when 4.0 is still a work in progress? I believe the last version Lucid released was 1.4, so why did you choose to release a 4.x version as opposed to 3.1? To include all the goodies from 4.0, of course ;) LucidWorks uses a version from trunk that behaves well in tests, with the necessary patches applied - see also below. Is the source code included with your distribution so that we may be able to do some further patching on it? Yes, after installing it's in solr-src/ . So if any issue pops up you can apply a patch, recompile the libs and replace them. Thanks again, and hopefully I'll be joining you at that conference. Great :) PS. Questions like this are best asked on the Lucid forum http://www.lucidimagination.com/forum/ . -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Lucid Works
On 4/7/11 9:43 PM, Mark wrote: I noticed that the Lucid Works distribution now says it is up to date with 4.x versions. Does this mean 1.4 or 4.0/trunk? If it's truly 4.0, does that mean it includes the collapse component? Yes it does. Also, are the click scoring tools proprietary, or was this just a contrib/patch that was applied? At the moment they're proprietary. I will be giving a talk at the Lucene Revolution conference that describes the Click tools in detail. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Detecting an empty index during start-up
On 3/25/11 11:25 AM, David McLaughlin wrote: Thanks Chris. I dug into the SolrCore code and after reading some of the code I ended up going with core.getNewestSearcher(true) and this fixed the problem. FYI, openNew=true is not implemented and can result in an UnsupportedOperationException. For now it's better to pass openNew=false and be prepared to get a null. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: a bug of solr distributed search
On 2010-10-25 13:37, Toke Eskildsen wrote: > On Mon, 2010-10-25 at 11:50 +0200, Andrzej Bialecki wrote: >> * there is an exact solution to this problem, namely to make two >> distributed calls instead of one (first call to collect per-shard IDFs >> for given query terms, second call to submit a query rewritten with the >> global IDF-s). This solution is implemented in SOLR-1632, with some >> caching to reduce the cost for common queries. > > I must admit that I have not tried the patch myself. Looking at > https://issues.apache.org/jira/browse/SOLR-1632 > i see that the last comment is from LiLi with a failed patch, but as > there are no further comments it is unclear if the problem is general or > just with LiLi's setup. I might be a bit harsh here, but the other > comments for the JIRA issue also indicate that one would have to be > somewhat adventurous to run this in production. Oh, definitely this is not production quality yet - there are known bugs, for example, that I need to fix, and then it needs to be forward-ported to trunk. It shouldn't be too much work to bring it back into usable state. >> * another reason is that in many many cases the difference between using >> exact global IDF and per-shard IDFs is not that significant. If shards >> are more or less homogenous (e.g. you assign documents to shards by >> hash(docId)) then term distributions will be also similar. > > While I agree on the validity of the solution, it does put some serious > constraints on the shard-setup. True. But this is the simplest setup that just may be enough. > >> To summarize, I would qualify your statement with: "...if the >> composition of your shards is drastically different". Otherwise the cost >> of using global IDF is not worth it, IMHO. > > Do you know of any studies of the differences in ranking with regard to > indexing-distribution by hashing, logical grouping and distributed IDF? 
Unfortunately, this information is surprisingly scarce - research predating year 2000 is often not applicable, and most current research concentrates on P2P systems, which are really a different ball of wax. Here are a few papers that I found that are related to this issue: * Global Term Weights in Distributed Environments, H. Witschel, 2007 (Elsevier) * KLEE: A Framework for Distributed Top-k Query Algorithms, S. Michel, P. Triantafillou, G. Weikum, VLDB'05 (ACM) * Exploring the Stability of IDF Term Weighting, Xin Fu and Miao Chen, 2008 (Springer Verlag) * A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web, M. Klein, M. Nelson, WIDM'08 (ACM) * Comparison of different Collection Fusion Models in Distributed Information Retrieval, Alexander Steidinger - this paper gives a nice comparison framework for different strategies for joining partial results; apparently we use the most primitive strategy explained there, based on raw scores... These papers likely don't fully answer your question, but at least they provide a broader picture of the issue... -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: a bug of solr distributed search
On 2010-10-25 11:22, Toke Eskildsen wrote: > On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote: >> But it shows a problem of distributed search without common idf. >> A doc will get different scores in different shards. > > Bingo. > > I really don't understand why this fundamental problem with sharding > isn't mentioned more often. Every time the advice "use sharding" is > given, it should be followed with a "but be aware that it will make > relevance ranking unreliable". The reason is twofold, I think: * there is an exact solution to this problem, namely to make two distributed calls instead of one (the first call to collect per-shard IDFs for the given query terms, the second call to submit a query rewritten with the global IDFs). This solution is implemented in SOLR-1632, with some caching to reduce the cost for common queries. However, this means that now for every query you need to make two calls instead of one, which potentially doubles the time to return results (for simple common queries - for rare complex queries the time will still be dominated by the query runtime on the shard servers). * another reason is that in many, many cases the difference between using the exact global IDF and per-shard IDFs is not that significant. If shards are more or less homogeneous (e.g. you assign documents to shards by hash(docId)) then term distributions will also be similar. So then the question is whether you can accept an N% variance in scores across shards, or whether you want to bear the cost of an additional distributed RPC for every query... To summarize, I would qualify your statement with: "...if the composition of your shards is drastically different". Otherwise the cost of using global IDF is not worth it, IMHO. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
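[Editor's note] The scoring variance under discussion is easy to demonstrate with the classic Lucene IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)): the same term gets a different weight on each shard unless the doc frequencies are aggregated first. The shard numbers below are made up for illustration:

```python
import math

def idf(num_docs, doc_freq):
    # Classic Lucene (DefaultSimilarity-style) idf: 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))

# Two shards with very different term distributions (hypothetical numbers):
shards = [
    {"num_docs": 1000, "doc_freq": 400},   # the term is common on this shard
    {"num_docs": 1000, "doc_freq": 50},    # the term is rare on this shard
]

# Per-shard IDFs: the same matching document scores differently per shard.
local_idfs = [idf(s["num_docs"], s["doc_freq"]) for s in shards]

# Global IDF: what the extra distributed call in SOLR-1632 effectively
# computes, by summing doc counts and doc frequencies across shards first.
total_docs = sum(s["num_docs"] for s in shards)
total_freq = sum(s["doc_freq"] for s in shards)
global_idf = idf(total_docs, total_freq)
```

With homogeneous shards (hash-based assignment) the per-shard doc frequencies converge, so `local_idfs` cluster around `global_idf` and the extra round-trip buys little.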
Re: Different analyzers for dfferent documents in different languages?
On 2010-09-22 15:30, Bernd Fehling wrote: Actually, this is one of the biggest disadvantages of Solr for multilingual content. Solr is field-based, which means you have to know the language _before_ you feed the content to a specific field and process the content for that field. This results in having separate fields for each language. E.g. for Europe this will be 24 to 26 languages for each title, keyword, description, ... I guess when they started with Lucene/Solr they never had multilingual content in mind. The alternative is to have a separate index for each language. Again you have to know the language of the content _before_ feeding it to the core. E.g., again for Europe, you end up with 24 to 26 cores. Another option is to "see" the multilingual fields (title, keywords, description, ...) as a "subdocument": write a filter class as a subpipeline, use language and encoding detection as the first step in that pipeline, then go on with all other linguistic processing within that pipeline and return the processed content back to the field for further filtering and storing. Many solutions, but nothing out of the box :-) Take a look at SOLR-1536; it contains an example of a tokenizing chain that could use a language detector to create different fields (or tokenize differently) based on this decision. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
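[Editor's note] The "subdocument" pipeline idea can also be sketched as a client-side pre-processing step: detect the language first, then route each multilingual field into a language-suffixed field, so the schema can attach a different analyzer (stopwords, stemmer) per language. The field-naming convention and the toy detector below are purely illustrative; a real setup would use a proper language identifier:

```python
def detect_language(text):
    # Toy stand-in for a real language identifier (e.g. an n-gram classifier).
    words = text.lower().split()
    return "de" if any(w in words for w in ("der", "die", "das")) else "en"

def route_multilingual(doc, multilingual_fields=("title", "description")):
    """Copy each multilingual field into a language-suffixed field, so each
    suffixed field can have its own analyzer chain in the schema."""
    routed = dict(doc)
    for name in multilingual_fields:
        if name in routed:
            lang = detect_language(routed[name])
            routed[f"{name}_{lang}"] = routed.pop(name)
    return routed

doc = {"id": "1", "title": "Die Vermessung der Welt"}
print(route_multilingual(doc))   # -> {'id': '1', 'title_de': 'Die Vermessung der Welt'}
```

This keeps a single core and a single document shape, at the cost of one `title_xx` field per supported language in the schema.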
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
On 2010-09-06 22:03, Dennis Gearon wrote: What is a 'simple MOD'? md5(docId) % numShards -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
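[Editor's note] A minimal sketch of that mapping (the function name is illustrative):

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    """'Simple MOD' assignment: hash the unique id, take the remainder.
    Deterministic, so the same doc always lands on the same shard -- as
    long as num_shards never changes."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Same id -> same shard, every time:
assert shard_for("doc-42", 8) == shard_for("doc-42", 8)

# But changing num_shards remaps most documents, which is why the
# neighboring thread discusses consistent hashing:
moved = sum(shard_for(f"doc-{i}", 8) != shard_for(f"doc-{i}", 9)
            for i in range(1000))
```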
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
On 2010-09-06 16:41, Yonik Seeley wrote: On Mon, Sep 6, 2010 at 10:18 AM, MitchK wrote: [...consistent hashing...] But it doesn't solve the problem at all - correct me if I am wrong, but: if you add a new server, let's call it IP3-1, and IP3-1 is nearer to the current resource X, then doc x will be indexed at IP3-1 - even if IP2-1 holds the older version. Am I right? Right. You still need code to handle migration. Consistent hashing is a way for everyone to be able to agree on the mapping, and for the mapping to change incrementally. I.e. you add a node and it only changes the docid->node mapping of a limited percentage of the mappings, rather than changing the mappings of potentially everything, as a simple MOD would do. Another strategy to avoid excessive reindexing is to keep splitting the largest shards, and then your mapping becomes a regular MOD plus a list of these additional splits. Really, there's an infinite number of ways you could implement this... For SolrCloud, I don't think we'll end up using consistent hashing - we don't need it (although some of the concepts may still be useful). I imagine there could be situations where a simple MOD won't do ;) so I think it would be good to hide this strategy behind an interface/abstract class. It costs nothing, and gives you flexibility in how you implement this mapping. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: How to retrieve the full corpus
On 2010-09-06 17:15, Yonik Seeley wrote: On Mon, Sep 6, 2010 at 10:52 AM, Roland Villemoes wrote: How can I retrieve all words from a Solr core? I need a list of all the words and how often they occur in the index. http://wiki.apache.org/solr/TermsComponent It doesn't currently stream though, so requesting *all* at once might take too much memory. One workaround is to page via terms.lower and terms.limit. Perhaps we should consider adding streaming to the terms component though. Would you mind opening a JIRA issue? This would be nice also for building a spellchecker in another core (instead of using the current sub-index hack). -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
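[Editor's note] Until streaming exists, the terms.lower / terms.limit paging loop looks roughly like this, sketched against a simulated vocabulary rather than a live TermsComponent. In real requests you would also set terms.lower.incl=false after the first page so the boundary term isn't repeated; the helper names here are illustrative:

```python
# Simulated term dictionary: term -> document frequency.
VOCAB = {"aardvark": 3, "apple": 10, "banana": 7, "cherry": 2, "zebra": 1}

def terms_page(lower, limit):
    """Mimic one TermsComponent request: up to `limit` (term, df) pairs,
    in term order, strictly after `lower` (i.e. terms.lower.incl=false)."""
    terms = sorted(t for t in VOCAB if t > lower)
    return [(t, VOCAB[t]) for t in terms[:limit]]

def all_terms(page_size=2):
    """Page through the whole dictionary, feeding the last term of each
    page back in as the next terms.lower."""
    lower, out = "", []
    while True:
        page = terms_page(lower, page_size)
        if not page:
            return out
        out.extend(page)
        lower = page[-1][0]

print(all_terms())   # every (term, df) pair, in term order
```

Memory use per request stays bounded by `page_size`, which is exactly what the workaround above buys you.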
SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
(I adjusted the subject to better reflect the content of this discussion). On 2010-09-06 14:37, MitchK wrote: Thanks for your detailed feedback Andrzej! From what I understood, SOLR-1301 becomes obsolete once Solr becomes cloud-ready, right? Who knows... I certainly didn't expect this code to become so popular ;) so even after SolrCloud becomes available it's likely that some people will continue to use it. But SolrCloud should solve the original problem that I tried to solve with this patch. Looking into the future: eventually, when SolrCloud arrives, we will be able to index straight to a SolrCloud cluster, assigning documents to shards through a hashing schema (e.g. 'md5(docId) % numShards') Hm, let's say md5(docId) would produce a value of 10 (it won't, but let's assume it). If I have a constant number of shards, the doc will be published to the same shard again and again, i.e.: 10 % numShards(5) = 2 -> doc 10 will be indexed at shard 2. A few days later the rest of the cluster is available; now it looks like 10 % numShards(10) = 1 -> doc 10 will be indexed at shard 1... and what about the older version at shard 2? I am no expert when it comes to cloud computing and the other stuff. There are several possible solutions to this, and they all boil down to the way you assign documents to shards... Keep in mind that nodes (physical machines) can manage several shards, and the aggregate collection of all unique shards across all nodes forms your whole index - so there's also a related, but different, issue of how to assign shards to nodes. Here are some scenarios for how you can solve the doc-to-shard mapping problem (note: I removed the issue of replication from the picture to make this clearer): a) keep the number of shards constant no matter how large the cluster is. The mapping schema is then as simple as the one above. 
In this scenario you create relatively small shards, so that a single physical node can manage dozens of shards (each shard using one core, or perhaps a more lightweight structure like MultiReader). This is also known as micro-sharding. As the number of documents grows the size of each shard will grow until you have to reduce the number of shards per node, ultimately ending up with a single shard per node. After that, if your collection continues to grow, you have to modify your hashing schema to split some shards (and reindex some shards, or use an index splitter tool). b) use consistent hashing as the mapping schema to assign documents to a changing number of shards. There are many explanations of this schema on the net, here's one that is very simple: http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/ In this case, you can grow/shrink the number of shards (and their size) as you see fit, incurring only a small reindexing cost. If you can point me to one or another reference where I can read about it, it would help me a lot, since I only want to understand how it works at the moment. http://wiki.apache.org/solr/SolrCloud ... The problem with Solr is its lack of documentation in some classes and the lack of capsulating some very complex things into different methods or extra-classes. Of course, this is because it costs some extra time to do so, but it makes understanding and modifying things very complicated if you do not understand whats going on from a theoretical point of view. In this case the lack of good docs and user-level API can be blamed on the fact that this functionality is still under heavy development. Since the cloud-feature will be complex, a lack of documentation and no understanding of the theory behind the code will make contributing back very, very complicated. 
For now, yes, it's an issue - though as soon as SolrCloud gets committed I'm sure people will follow up with user-level convenience components that will make it easier. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
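[Editor's note] The consistent-hashing scenario b) above can be sketched compactly with virtual points on a hash ring; the replica count and the use of md5 are arbitrary illustrative choices, in the spirit of the article linked above:

```python
import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """Consistent hashing: each shard owns many virtual points on a ring;
    a doc belongs to the first shard point at or after its own hash."""

    def __init__(self, shards, points_per_shard=64):
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(points_per_shard)
        )
        self._hashes = [h for h, _ in self._ring]

    def shard_for(self, doc_id):
        idx = bisect.bisect(self._hashes, _hash(doc_id)) % len(self._ring)
        return self._ring[idx][1]

four = HashRing(["shard1", "shard2", "shard3", "shard4"])
five = HashRing(["shard1", "shard2", "shard3", "shard4", "shard5"])

docs = [f"doc-{i}" for i in range(1000)]
moved = sum(four.shard_for(d) != five.shard_for(d) for d in docs)
# Only about 1/5 of the documents move, all of them onto the new shard,
# versus roughly 4/5 remapped by a plain MOD scheme.
```

This is the "small reindexing cost" mentioned above: only the documents captured by the new shard's points need to move.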
Re: anyone use hadoop+solr?
On 2010-09-04 19:53, MitchK wrote: Hi, this topic started a few months ago; however, there are some questions on my side that I couldn't answer by looking at the SOLR-1301 issue or the wiki pages. Let me try to explain my thoughts: Given: a Hadoop cluster, a Solr search cluster, and Nutch as a crawling engine which also performs LinkRank and webgraph-related tasks. Once a list of documents is created by Nutch, you put the list + the LinkRank values etc. into a Solr+Hadoop job, as described in SOLR-1301, to index or reindex the given documents. There is no out-of-the-box integration between Nutch and SOLR-1301, so there is some step that you omitted from this chain... e.g. "export from Nutch segments to CSV". When the shards are built, they will be sent over the network to the Solr search cluster. Is this description correct? Not really. SOLR-1301 doesn't deal with how you deploy the results of indexing. It simply creates the shards on HDFS. SOLR-1301 just creates the index data - it doesn't deal with serving the data... What makes me think is: assume I've got a document X on machine Y in shard Y... When I reindex that document X together with lots of other documents that are present or not present in shard Y, and I put the resulting shard on a machine Z, how does machine Y notice that it has got an older version of document X than machine Z? Furthermore: go on and assume that shard Y was replicated to three other machines; how do they all notice that their version of document X is not the newest available one? In such an environment we do not have a master (right?). So far: how to keep the index as consistent as possible? It's not possible to do it like this, at least for now... Looking into the future: eventually, when SolrCloud arrives, we will be able to index straight to a SolrCloud cluster, assigning documents to shards through a hashing schema (e.g. 'md5(docId) % numShards'). 
Since shards would be created in a consistent way, newer versions of documents would end up in the same shards and replace the older versions of the same documents - thus the problem would be solved. An additional benefit of this model is that it's not a disruptive and copy-intensive operation like SOLR-1301 (where you have to "create new indexes, deploy them and switch") but rather a regular online update that is already supported in Solr. Once this is in place, we can modify Nutch to send documents directly to a SolrCloud cluster. Until then, you need to build and deploy indexes more or less manually (or using Katta, but again Katta is not integrated with Nutch). SolrCloud is not far away from hitting the trunk (right, Mark? ;) ), so medium-term I think this is your best bet.
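The hashing schema mentioned above can be sketched in plain Java. This is a JDK-only illustration of 'md5(docId) % numShards'; the class and method names are made up, and real SolrCloud routing may differ in detail:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ShardHash {
    // Assign a document to a shard by hashing its id - the
    // 'md5(docId) % numShards' schema described above.
    static int shardFor(String docId, int numShards) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(docId.getBytes(StandardCharsets.UTF_8));
            // Interpret the digest as a non-negative integer, then take the modulus.
            return new BigInteger(1, digest).mod(BigInteger.valueOf(numShards)).intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always available in the JDK
        }
    }

    public static void main(String[] args) {
        int shard = shardFor("doc-42", 4);
        // Stable: the same id always lands in the same shard, so a newer
        // version of a document overwrites the older one in place.
        System.out.println(shard == shardFor("doc-42", 4)); // true
        System.out.println(shard >= 0 && shard < 4);        // true
    }
}
```

The stability of the mapping is the whole point: because the shard is a pure function of the document id, reindexed documents always land on top of their previous versions.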
Re: Analyser depending on field's value
On 2010-08-16 10:06, Damien Dudognon wrote: Hi all, I want to use a specific stopword list depending on a field's value. For example, if type == 1 then I use stopwords1.txt to index the "text" field, else I use stopwords2.txt. I thought of several solutions but none really satisfied me: 1) use one Solr instance per type, and therefore a distinct index per type; 2) use as many fields as types, with specific rules for each field (e.g. a field "text_1" for type "1" which uses "stopwords1.txt", "text_2" for other types which uses "stopwords2.txt", ...) I am sure that there is a better solution to my problem. If anyone has a suitable solution to suggest ... :-) Perhaps the solution described here: https://issues.apache.org/jira/browse/SOLR-1536 Take a look at the example that uses token types to put text into different fields, which can then be analyzed differently.
Re: Auto-suggest internal terms
On 2010-06-03 13:38, Michael Kuhlmann wrote: > Am 03.06.2010 13:02, schrieb Andrzej Bialecki: >> ..., and deploy this >> index in a separate JVM (to benefit from other CPUs than the one that >> runs your Solr core) > > Every known webserver is multithreaded by default, so putting different > Solr instances into different JVMs will be of no use. You are right to a certain degree. Still, there are some contention points in Lucene/Solr, in how threads are allocated on available CPUs, and how the heap is used, which can make a two-JVM setup perform much better than a single-JVM setup given the same number of threads...
Re: Auto-suggest internal terms
On 2010-06-03 09:56, Michael Kuhlmann wrote: > The only solution without "doing any custom work" would be to perform a > normal query for each suggestion. But you might get into performance > troubles with that, because suggestions are typically performed much > more often than complete searches. Actually, that's not a bad idea - if you can trim the size of the index (either by using shingles instead of docs, or trimming the main index - LUCENE-1812) so that the index fits completely in RAM, and deploy this index in a separate JVM (to benefit from other CPUs than the one that runs your Solr core) or another machine, then I think performance would not be a big concern, and the functionality would be just what you wanted. > > The much faster solution that needs own work would be to build up a > large TreeMap with each word as the keys, and the matching terms as the > values. That would consume an awful lot of RAM... see SOLR-1316 for some measurements.
Re: Importing large datasets
On 2010-06-02 13:12, Grant Ingersoll wrote: > > On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote: > >> On 2010-06-02 12:42, Grant Ingersoll wrote: >>> >>> On Jun 1, 2010, at 9:54 PM, Blargy wrote: >>> >>>> >>>> We have around 5 million items in our index and each item has a description >>>> located on a separate physical database. These item descriptions vary in >>>> size and for the most part are quite large. Currently we are only indexing >>>> items and not their corresponding description and a full import takes >>>> around >>>> 4 hours. Ideally we want to index both our items and their descriptions but >>>> after some quick profiling I determined that a full import would take in >>>> excess of 24 hours. >>>> >>>> - How would I profile the indexing process to determine if the bottleneck >>>> is >>>> Solr or our Database. >>> >>> As a data point, I routinely see clients index 5M items on normal >>> hardware in approx. 1 hour (give or take 30 minutes). >>> >>> When you say "quite large", what do you mean? Are we talking books here or >>> maybe a couple pages of text or just a couple KB of data? >>> >>> How long does it take you to get that data out (and, from the sounds of it, >>> merge it with your item) w/o going to Solr? >>> >>>> - In either case, how would one speed up this process? Is there a way to >>>> run >>>> parallel import processes and then merge them together at the end? Possibly >>>> use some sort of distributed computing? >>> >>> DataImportHandler now supports multiple threads. The absolute fastest way >>> that I know of to index is via multiple threads sending batches of >>> documents at a time (at least 100). Often, from DBs one can split up the >>> table via SQL statements that can then be fetched separately. You may want >>> to write your own multithreaded client to index. >> >> SOLR-1301 is also an option if you are familiar with Hadoop ... >> > > If the bottleneck is the DB, will that do much? > Nope. 
But the workflow could be set up so that during night hours a DB export takes place that results in a CSV or SolrXML file (there you could measure the time it takes to do this export), and then indexing can work from this file.
Re: Importing large datasets
On 2010-06-02 12:42, Grant Ingersoll wrote: > > On Jun 1, 2010, at 9:54 PM, Blargy wrote: > >> >> We have around 5 million items in our index and each item has a description >> located on a separate physical database. These item descriptions vary in >> size and for the most part are quite large. Currently we are only indexing >> items and not their corresponding description and a full import takes around >> 4 hours. Ideally we want to index both our items and their descriptions but >> after some quick profiling I determined that a full import would take in >> excess of 24 hours. >> >> - How would I profile the indexing process to determine if the bottleneck is >> Solr or our Database. > > As a data point, I routinely see clients index 5M items on normal > hardware in approx. 1 hour (give or take 30 minutes). > > When you say "quite large", what do you mean? Are we talking books here or > maybe a couple pages of text or just a couple KB of data? > > How long does it take you to get that data out (and, from the sounds of it, > merge it with your item) w/o going to Solr? > >> - In either case, how would one speed up this process? Is there a way to run >> parallel import processes and then merge them together at the end? Possibly >> use some sort of distributed computing? > > DataImportHandler now supports multiple threads. The absolute fastest way > that I know of to index is via multiple threads sending batches of documents > at a time (at least 100). Often, from DBs one can split up the table via SQL > statements that can then be fetched separately. You may want to write your > own multithreaded client to index. SOLR-1301 is also an option if you are familiar with Hadoop ...
Re: Autosuggest
On 2010-05-15 02:46, Blargy wrote: > > Thanks for your help and especially your analyzer.. probably saved me a > full-import or two :) > Also, take a look at this issue: https://issues.apache.org/jira/browse/SOLR-1316
Re: SOLR-1316 How To Implement this autosuggest component ???
On 2010-03-31 06:14, Andy wrote: --- On Tue, 3/30/10, Andrzej Bialecki wrote: From: Andrzej Bialecki Subject: Re: SOLR-1316 How To Implement this autosuggest component ??? To: solr-user@lucene.apache.org Date: Tuesday, March 30, 2010, 9:59 AM On 2010-03-30 15:42, Robert Muir wrote: On Mon, Mar 29, 2010 at 11:34 PM, Andy wrote: Reading through this thread and SOLR-1316, there seem to be a lot of different ways to implement auto-complete in Solr. I've seen mentions of: EdgeNGrams TermsComponent Faceting TST Patricia Tries RadixTree DAWG Another idea is you can use the Automaton support in the lucene flexible indexing branch: to query the index directly with a DFA that represents whatever terms you want back. The idea is that there really isn't much gain in building a separate Pat, Radix Tree, or DFA to do this when you can efficiently intersect a DFA with the existing terms dictionary. I don't really understand what autosuggest needs to do, but if you are doing things like looking for misspellings you can easily build a DFA that recognizes terms within some short edit distance with the support that's there (the LevenshteinAutomata class), to quickly get back candidates. You can intersect/concatenate/union these DFAs with prefix or suffix DFAs if you want too, don't really understand what the algorithm should do, but I'm happy to try to help. The problem is a bit more complicated. There are two issues: * simple term-level completion often produces wrong results for multi-term queries (which are usually rewritten as "weak" phrase queries), * the weights of suggestions should not correspond directly to IDF in the index - much better results can be obtained when they correspond to the frequency of terms/phrases in the query logs ... TermsComponent and EdgeNGrams, while simple to use, suffer from both issues. Thanks. I actually have 2 use cases for autosuggest: 1) The "normal" one - I want to suggest search terms to users after they've typed a few letters.
Just like Google suggest. Looks like for this use case SOLR-1316 is the best option. Right? Hopefully, yes - it depends on how you intend to populate the TST. If you populate it from the main index, then (unless you have indexed phrases) there won't be any benefit over the TermsComponent. It may be faster, but it will take more RAM. If you populate it from a list of top-N queries, then SOLR-1316 is the way to go. 2) I have a field "city" with values that are entered by users. When a user is entering his city, I want to make suggestions based on what cities have already been entered so far by other users -- in order to reduce the chance of duplication. What method would you recommend for this use case? If the "city" field is not analyzed then TermsComponent is easiest to use. If it is analyzed, but the vast majority of cities are single terms, then TermsComponent is ok too. If you want to assign different priorities to suggestions (other than a simple IDF-based priority), or have many city names consisting of multiple tokens, then use SOLR-1316.
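For the simple not-analyzed "city" case, what TermsComponent does server-side amounts to a prefix scan over a sorted terms dictionary. A minimal JDK-only sketch of that idea (illustrative names only, not Solr code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class PrefixSuggest {
    // Prefix completion over a sorted term set - roughly what
    // TermsComponent does against the index's terms dictionary.
    static List<String> suggest(NavigableSet<String> terms, String prefix, int max) {
        List<String> out = new ArrayList<>();
        // tailSet(prefix) jumps straight to the first term >= prefix,
        // so we only walk the matching range.
        for (String t : terms.tailSet(prefix)) {
            if (!t.startsWith(prefix) || out.size() >= max) break;
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableSet<String> cities = new TreeSet<>(
            Arrays.asList("berlin", "bern", "boston", "warsaw"));
        System.out.println(suggest(cities, "ber", 10)); // [berlin, bern]
    }
}
```

A TST or the SOLR-1316 component becomes preferable once you need weighted suggestions or multi-token entries, which a plain sorted-set scan cannot rank.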
Re: SOLR-1316 How To Implement this autosuggest component ???
On 2010-03-30 15:42, Robert Muir wrote: On Mon, Mar 29, 2010 at 11:34 PM, Andy wrote: Reading through this thread and SOLR-1316, there seem to be a lot of different ways to implement auto-complete in Solr. I've seen mentions of: EdgeNGrams TermsComponent Faceting TST Patricia Tries RadixTree DAWG Another idea is you can use the Automaton support in the lucene flexible indexing branch: to query the index directly with a DFA that represents whatever terms you want back. The idea is that there really isn't much gain in building a separate Pat, Radix Tree, or DFA to do this when you can efficiently intersect a DFA with the existing terms dictionary. I don't really understand what autosuggest needs to do, but if you are doing things like looking for misspellings you can easily build a DFA that recognizes terms within some short edit distance with the support that's there (the LevenshteinAutomata class), to quickly get back candidates. You can intersect/concatenate/union these DFAs with prefix or suffix DFAs if you want too, don't really understand what the algorithm should do, but I'm happy to try to help. The problem is a bit more complicated. There are two issues: * simple term-level completion often produces wrong results for multi-term queries (which are usually rewritten as "weak" phrase queries), * the weights of suggestions should not correspond directly to IDF in the index - much better results can be obtained when they correspond to the frequency of terms/phrases in the query logs ... TermsComponent and EdgeNGrams, while simple to use, suffer from both issues.
Re: SOLR-1316 How To Implement this autosuggest component ???
On 2010-03-30 05:34, Andy wrote: Reading through this thread and SOLR-1316, there seem to be a lot of different ways to implement auto-complete in Solr. I've seen mentions of: EdgeNGrams TermsComponent Faceting TST Patricia Tries RadixTree DAWG Which algorithm does SOLR-1316 implement? TST is one. There are others mentioned in the comments on SOLR-1316, such as Patricia Tries, RadixTree, DAWG. Are those implemented too? Among all those methods is there a "recommended" one? What are the pros & cons? Only TST is implemented in SOLR-1316. The main advantage of this approach is that it can complete arbitrary strings - e.g. frequent queries. This reduces the chance of suggesting queries that yield no results, which is a danger in other methods. The disadvantage is the increased RAM consumption, and the need to populate it (either from IndexReader - but then it's nearly equivalent to the TermsComponent; or from a list of frequent queries - but you need to build that list yourself).
Re: multiple binary documents into a single solr document - Vignette/OpenText integration
On 2010-03-24 15:58, Fábio Aragão da Silva wrote: hello there, I'm working on a piece of code that integrates Solr with Vignette/OpenText Content Management, meaning Vignette content instances will be indexed in Solr when published and deleted from Solr when unpublished. I'm using Solr 1.4, SolrJ and Solr Cell. I've implemented most of the code and I've run into only a single issue so far: Vignette content management supports the attachment of multiple binary documents (such as .doc, .pdf or .xls files) to a single content instance. I am mapping each content instance in Vignette to a Solr document, but now I have a content instance in Vignette with multiple binary files attached to it. So my question is: is it possible to have more than one binary file indexed into a single document in Solr? I'm a beginner in Solr, but from what I understood I have two options to index content using SolrJ: either use UpdateRequest() and the add() method to add a SolrInputDocument to the request (in case the document doesn't represent a binary file), or use ContentStreamUpdateRequest() and the addFile() method to add a binary file to the content stream request. I don't see a way, though, to say "this document is comprised of two files, a Word file and a PDF, so index them as one document in Solr using content1 and content2 fields - or merge their content into a single 'content' field". I tried calling addFile() twice (one call for each file); there was no error, but nothing got indexed either. ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract"); req.addFile(new File("file1.doc")); req.addFile(new File("file2.pdf")); req.setParam("literal.id", "multiple_files_test"); req.setParam("uprefix", "attr_"); req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true); server.request(req); Any thoughts on this would be greatly appreciated.
Write your own RequestHandler that uses the existing ExtractingRequestHandler to actually parse the streams, and then you combine the results arbitrarily in your handler, eventually sending an AddUpdateCommand to the update processor. You can obtain both the update processor and SolrCell instance from req.getCore().
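The combining step such a handler would perform can be illustrated in isolation: given the text extracted from each attached file (by whatever means, e.g. SolrCell/Tika), merge it into a single 'content' field of one document. A hypothetical JDK-only sketch, with the document modeled as a plain map rather than a real SolrInputDocument:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeExtracted {
    // Merge the per-file extracted texts into one 'content' field,
    // producing a single logical document for the content instance.
    static Map<String, String> merge(String id, List<String> extractedTexts) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", id);
        doc.put("content", String.join("\n", extractedTexts));
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc = merge("multiple_files_test",
            Arrays.asList("text of file1.doc", "text of file2.pdf"));
        System.out.println(doc.get("content"));
    }
}
```

In the real handler the extracted texts would come from running each content stream through the extracting parser, and the merged map would become the fields of the AddUpdateCommand's document.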
Re: wikipedia and teaching kids search engines
On 2010-03-24 16:15, Markus Jelsma wrote: A bit off-topic, but how about Nutch grabbing some content and having it indexed in Solr? The problem is not with collecting and submitting the documents, the problem is with parsing the Wikimedia markup embedded in XML. WikipediaTokenizer from Lucene contrib/ is a quick and perhaps acceptable solution ...
Re: Features not present in Solr
On 2010-03-23 06:25, David Smiley @MITRE.org wrote: I use Endeca and Solr. A few notable things in Endeca but not in Solr: 1. Real-time search. 2. "related record navigation" (RRN) is what they call it. This is the ability to join in other records, something Lucene/Solr definitely can't do. Could you perhaps elaborate a bit on this functionality? Your description sounds intriguing - it reminds me of ParallelReader, but I'm probably completely wrong ...
Re: SOLR-1316 How To Implement this autosuggest component ???
On 2010-03-19 13:03, stocki wrote: hello.. I try to implement the autosuggest component from this link: http://issues.apache.org/jira/browse/SOLR-1316 but I have no idea how to do this!? Can anyone give me some tips? Please follow the instructions outlined in the JIRA issue, in the comment that shows fragments of XML config files.
Re: Update Index : Updating Specific Fields
On 2010-03-04 07:41, Walter Underwood wrote: No. --wunder Or perhaps "not yet" ... http://portal.acm.org/ft_gateway.cfm?id=1458171 On Mar 3, 2010, at 10:40 PM, Kranti™ K K Parisa wrote: Hi, Is there any way to update the index for only specific fields? E.g.: the index has ONE document consisting of 4 fields, F1, F2, F3, F4. Now I want to update the value of field F2, so if I send the update xml to SOLR, can it keep the old field values for F1, F3, F4 and update the new value specified for F2? Best Regards, Kranti K K Parisa
Re: If you could have one feature in Solr...
On 2010-02-28 17:26, Ian Holsman wrote: On 2/24/10 8:42 AM, Grant Ingersoll wrote: What would it be? most of this will be coming in 1.5, but for me it's - sharding.. it still seems a bit clunky secondly.. this one isn't in 1.5. I'd like to be able to find "interesting" terms that appear in my result set but don't appear in the global corpus. it's kind of like doing a facet count on *:* and then on the search term, and discounting the terms that appear heavily in the global one. (sorry.. there is a textbook definition of this.. XX distance.. but I haven't got the books in front of me). Kullback-Leibler divergence?
Re: term frequency vector access?
On 2010-02-11 17:04, Mike Perham wrote: In an UpdateRequestProcessor (processing an AddUpdateCommand), I have a SolrInputDocument with a field 'content' that has termVectors="true" in schema.xml. Is it possible to get access to that field's term vector in the URP? No, term vectors are created much later, during the process of adding the document to a Lucene index (deep inside Lucene IndexWriter & co). That's the whole point of SOLR-1536 - certain features become available only when the tokenization actually occurs. Another reason to use SOLR-1536 is when tokenization and analysis are costly, e.g. when doing named entity recognition, POS tagging or lemmatization. Theoretically you could play the TokenizerChain twice - once in the URP, so that you can discover and capture features and modify the input document accordingly, and then again inside Lucene - but in practice this may be too costly.
Re: Can Solr be forced to return all field tags for a document even if the field is empty?
On 2010-01-28 03:21, Erick Erickson wrote: This is kind of an unusual request; what higher-level problem are you trying to solve here? Because the field just *isn't there* in the underlying Lucene index for that document. I suppose you could index a "not there" token and just throw those values out from the response... You can also implement a SearchComponent that post-processes results and, based on the schema, adds an empty node to the result for each missing field.
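The post-processing idea can be sketched independently of the SearchComponent API: for each schema field absent from a returned document, add an empty value. A JDK-only illustration with the document modeled as a map (the names here are made up, not Solr's):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FillMissingFields {
    // What a custom SearchComponent could do in its process() step:
    // ensure every schema field appears in each result document,
    // filling in an empty value where the stored field is absent.
    static Map<String, Object> fill(Map<String, Object> doc, List<String> schemaFields) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        for (String f : schemaFields) {
            out.putIfAbsent(f, "");   // only added when the field is missing
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "1");
        System.out.println(fill(doc, Arrays.asList("id", "title"))); // {id=1, title=}
    }
}
```

The real component would read the field list from the IndexSchema and rewrite the documents in the response before they are serialized.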
Re: How to Split Index file.
On 2010-01-10 01:55, Lance Norskog wrote: Make two copies of the index. In each copy, delete the records you do not want. Optimize. ... which is essentially what the MultiPassIndexSplitter does, only it avoids the initial copy (by deleting in the source index).
Re: restore space between words by spell checker
Otis Gospodnetic wrote: I'm not sure if that can be easily done (other than going char by char and testing), because nothing indicates where the space might be, not even an upper case there. I'd be curious to know if you find a better solution. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: Andrey Klochkov To: solr-user Sent: Fri, November 27, 2009 6:09:08 AM Subject: restore space between words by spell checker Hi If a user issued a misspelled query, forgetting to place space between words, is it possible to fix it with a spell checker or by some other mechanism? For example, if we get query "tommyhitfiger" and have terms "tommy" and "hitfiger" in the index, how to fix the query? The usual approach to solving this is to index compound words, i.e. when producing a spellchecker dictionary add a record "tommyhitfiger" with a field that points to "tommy hitfiger". Details vary depending on what spellchecking impl. you use.
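The dictionary-building step described above can be sketched in plain Java: for each known phrase, add a concatenated key that maps back to the space-separated form. This is only an illustration of the idea, not any particular spellchecker implementation:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CompoundDictionary {
    // While producing the spellchecker dictionary, also add compound
    // entries: the phrase with spaces removed, pointing back to the
    // original space-separated phrase.
    static Map<String, String> buildCompounds(List<String> phrases) {
        Map<String, String> compounds = new HashMap<>();
        for (String phrase : phrases) {
            compounds.put(phrase.replace(" ", ""), phrase);
        }
        return compounds;
    }

    public static void main(String[] args) {
        Map<String, String> dict = buildCompounds(
            Arrays.asList("tommy hitfiger", "new york"));
        // A space-less query now resolves to the corrected phrase.
        System.out.println(dict.get("tommyhitfiger")); // tommy hitfiger
    }
}
```

In a real setup the phrase list would come from query logs or indexed shingles, and the lookup would run through the spellchecker's fuzzy matching rather than an exact map hit.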
Re: Index Splitter
Koji Sekiguchi wrote: Giovanni Fernandez-Kincade wrote: You can't really use this if you have an optimized index, right? For an optimized index, I think you can use MultiPassIndexSplitter. Correct - MultiPassIndexSplitter can handle any index - optimized or not, with or without deletions, etc. The cost of this flexibility is that it needs to read the index files multiple times (hence "multi-pass").
Re: how to get the autocomplete feature in solr 1.4?
Chris Hostetter wrote: : how to get the autocomplete/autosuggest feature in the solr1.4.plz give me : the code also... there is no magical "one size fits all" solution for autocomplete in Solr. If you look at the archives there have been lots of discussions about different ways to get autocomplete functionality, using things like the TermsComponent or the LukeRequestHandler, and there are lots of examples of using the SolrJS javascript functionality to populate an autocomplete box -- but you'll have to figure out what solution works best for your goals. Also, take a look at SOLR-1316; there are patches there that implement such a component using prefix trees.
Re: leading and trailing wildcard query
A. Steven Anderson wrote: No thoughts on this? Really!? I would hate to admit to my Oracle DBE that Solr can't be customized to do a common query that a relational database can do. :-( On Wed, Nov 4, 2009 at 6:01 PM, A. Steven Anderson < a.steven.ander...@gmail.com> wrote: I've scoured the archives and JIRA, but the answer to my question is just not clear to me. With all the new Solr 1.4 features, is there any way to do a leading and trailing wildcard query on an *untokenized* field? e.g. q=myfield:*abc* would return a doc with myfield=xxxabcxxx Yes, I know how expensive such a query would be, but we have the user requirement nonetheless. If not, any suggestions on how to implement a custom solution using Solr? Using an external data structure? You can use ReversedWildcardFilterFactory, which creates additional tokens (in your case, a single additional token :) ) that are reversed, _and_ also triggers setAllowLeadingWildcards in the QueryParser - it won't help much with performance though, due to the trailing wildcard in your original query. Please see the discussion in SOLR-1321 (this will be available in 1.4, but it should be easy to patch 1.3 to use it). If you really need to support such queries efficiently you should implement full permuterm indexing, i.e. a token filter that rotates tokens and adds all rotations (with a special marker for the beginning of the word), and a query plugin that detects such query terms and rotates the query term appropriately.
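Permuterm indexing rotates each term (with an end-of-word marker, conventionally '$') and indexes every rotation; a query such as *abc* can then be rotated so the wildcard falls at the end, where an ordinary prefix lookup serves it. A minimal sketch of the rotation step (a standalone illustration, not an actual Lucene TokenFilter):

```java
import java.util.ArrayList;
import java.util.List;

public class Permuterm {
    // Produce all rotations of term + '$' (the end-of-word marker).
    // All of these would be indexed; a query *abc* is rewritten as the
    // prefix query "abc*" against the rotated terms.
    static List<String> rotations(String term) {
        String t = term + "$";
        List<String> out = new ArrayList<>();
        for (int i = 0; i < t.length(); i++) {
            out.add(t.substring(i) + t.substring(0, i));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(rotations("abc")); // [abc$, bc$a, c$ab, $abc]
    }
}
```

The trade-off is index size: every term is stored once per character, which is why the filter approach only makes sense when leading-and-trailing wildcard queries genuinely must be fast.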
Re: Solr Cell on web-based files?
Grant Ingersoll wrote: You might try remote streaming with Solr (see http://wiki.apache.org/solr/SolrConfigXml). Otherwise, look into a crawler such as Nutch, Droids, or Heritrix. Additionally, Nutch can be configured to send the crawled/parsed documents to Solr for indexing.
Re: QTime always a multiple of 50ms ?
Jérôme Etévé wrote: Hi all, I'm using Solr trunk from 2009-10-12 and I noticed that the QTime result is always a multiple of roughly 50ms, regardless of the handler used. For instance, for the update handler, I get: INFO: [idx1] webapp=/solr path=/update/ params={} status=0 QTime=0 INFO: [idx1] webapp=/solr path=/update/ params={} status=0 QTime=104 INFO: [idx1] webapp=/solr path=/update/ params={} status=0 QTime=52 ... Is this a known issue? It may be an issue with System.currentTimeMillis() resolution on some platforms (e.g. Windows)?
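The clock-resolution hypothesis is easy to check on a given machine: spin until System.currentTimeMillis() changes twice and report the difference, which is the observable tick of the clock. A small JDK-only probe:

```java
public class TimerGranularity {
    // Measure the observable granularity of System.currentTimeMillis()
    // by busy-waiting until the reported value changes twice; the gap
    // between two consecutive changes is one clock tick.
    static long granularityMillis() {
        long t0 = System.currentTimeMillis();
        long t1;
        while ((t1 = System.currentTimeMillis()) == t0) { /* spin */ }
        long t2;
        while ((t2 = System.currentTimeMillis()) == t1) { /* spin */ }
        return t2 - t1;
    }

    public static void main(String[] args) {
        // 1 ms on most modern systems; historically coarser (e.g. ~10-16 ms)
        // on some platforms, which would quantize QTime values.
        System.out.println(granularityMillis() + " ms");
    }
}
```

If this prints a coarse tick, the quantized QTime values are explained; if it prints 1 ms, the ~50 ms steps would have to come from somewhere else in the request path.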
Re: Is negative boost possible?
Yonik Seeley wrote: On Mon, Oct 12, 2009 at 12:03 PM, Andrzej Bialecki wrote: Solr never discarded non-positive hits, and now Lucene 2.9 no longer does either. Hmm ... The code that I pasted in my previous email uses Searcher.search(Query, int), which in turn uses search(Query, Filter, int), and it doesn't return any results if only the first clause is present (the one with the negative boost) even though it's a matching clause. I think this is related to the fact that in TopScoreDocCollector:48 the pqTop.score is initialized to 0, and then all results that have a lower score than this are discarded. Perhaps this should be initialized to Float.MIN_VALUE? Hmmm, you're actually seeing this with Lucene 2.9? The HitQueue (subclass of PriorityQueue) is pre-populated with sentinel objects with scores of -Inf, not zero. Uhh, sorry, you are right - an early 2.9-dev version of the jar sneaked in on my classpath ... I verified now that 2.9.0 returns both positive and negative scores with the default TopScoreDocCollector.
Re: Passing request to another handler
Chris Hostetter wrote: : What's the canonical way to pass an update request to another handler? I'm : implementing a handler that has to dispatch its result to different update : handlers based on its internal processing. I've always written my delegating RequestHandlers so that they take in the names (or paths) of the handlers they are going to delegate to as init params. Yeah, this is where I started ... the other approach I've seen is to make the delegating handler instantiate the sub-handlers directly so that it can have the exact instances it wants, configured the way it wants them. ... and this is where I ended up now :) It really comes down to what your goal is: if you want your code to be totally in control, instantiate new instances. If you want the person creating the solrconfig.xml to be in control, let them tell you the name of a handler (with its defaults/invariants configured in a way you can't control) to delegate to. Indeed - thanks.
Re: Is negative boost possible?
Yonik Seeley wrote: On Mon, Oct 12, 2009 at 5:58 AM, Andrzej Bialecki wrote: BTW, standard Collectors collect only results with positive scores, so if you want to collect results with negative scores as well then you need to use a custom Collector. Solr never discarded non-positive hits, and now Lucene 2.9 no longer does either. Hmm ... The code that I pasted in my previous email uses Searcher.search(Query, int), which in turn uses search(Query, Filter, int), and it doesn't return any results if only the first clause is present (the one with the negative boost), even though it's a matching clause. I think this is related to the fact that in TopScoreDocCollector:48 pqTop.score is initialized to 0, and then all results that have a lower score than this are discarded. Perhaps this should be initialized to Float.MIN_VALUE?
Re: Is negative boost possible?
Yonik Seeley wrote: On Sun, Oct 11, 2009 at 6:04 PM, Lance Norskog wrote: And the other important thing to know about boost values is that the dynamic range is about 6-8 bits. That's an index-time boost - an 8-bit float with 5 bits of mantissa and 3 bits of exponent. Query-time boosts are normal 32-bit floats. To be more specific: index-time float encoding does not permit negative numbers (see SmallFloat), but query-time boosts can be negative, and they DO affect the score - see below. BTW, standard Collectors collect only results with positive scores, so if you want to collect results with negative scores as well then you need to use a custom Collector.

--- BeanShell 2.0b4 - by Pat Niemeyer (p...@pat.net)
bsh % import org.apache.lucene.search.*;
bsh % import org.apache.lucene.index.*;
bsh % import org.apache.lucene.store.*;
bsh % import org.apache.lucene.document.*;
bsh % import org.apache.lucene.analysis.*;
bsh % tq = new TermQuery(new Term("a", "b"));
bsh % print(tq);
a:b
bsh % tq.setBoost(-1);
bsh % print(tq);
a:b^-1.0
bsh % q = new BooleanQuery();
bsh % tq1 = new TermQuery(new Term("a", "c"));
bsh % tq1.setBoost(10);
bsh % q.add(tq1, BooleanClause.Occur.SHOULD);
bsh % q.add(tq, BooleanClause.Occur.SHOULD);
bsh % print(q);
a:c^10.0 a:b^-1.0
bsh % dir = new RAMDirectory();
bsh % w = new IndexWriter(dir, new WhitespaceAnalyzer());
bsh % doc = new Document();
bsh % doc.add(new Field("a", "b c d", Field.Store.YES, Field.Index.ANALYZED));
bsh % w.addDocument(doc);
bsh % w.close();
bsh % r = IndexReader.open(dir);
bsh % is = new IndexSearcher(r);
bsh % td = is.search(q, 10);
bsh % sd = td.scoreDocs;
bsh % print(sd.length);
1
bsh % print(is.explain(q, 0));
0.1373985 = (MATCH) sum of:
  0.15266499 = (MATCH) weight(a:c^10.0 in 0), product of:
    0.99503726 = queryWeight(a:c^10.0), product of:
      10.0 = boost
      0.30685282 = idf(docFreq=1, numDocs=1)
      0.32427183 = queryNorm
    0.15342641 = (MATCH) fieldWeight(a:c in 0), product of:
      1.0 = tf(termFreq(a:c)=1)
      0.30685282 = idf(docFreq=1, numDocs=1)
      0.5 = fieldNorm(field=a, doc=0)
  -0.0152664995 = (MATCH) weight(a:b^-1.0 in 0), product of:
    -0.099503726 = queryWeight(a:b^-1.0), product of:
      -1.0 = boost
      0.30685282 = idf(docFreq=1, numDocs=1)
      0.32427183 = queryNorm
    0.15342641 = (MATCH) fieldWeight(a:b in 0), product of:
      1.0 = tf(termFreq(a:b)=1)
      0.30685282 = idf(docFreq=1, numDocs=1)
      0.5 = fieldNorm(field=a, doc=0)
bsh %
Re: Passing request to another handler
Shalin Shekhar Mangar wrote: On Fri, Oct 9, 2009 at 10:53 PM, Andrzej Bialecki wrote: Hi, What's the canonical way to pass an update request to another handler? I'm implementing a handler that has to dispatch its result to different update handlers based on its internal processing. An update request? There's always only one UpdateHandler registered in Solr. Hm, yes - to be more specific, what I meant is that I need to pre-process an update request and then pass it on either to my own handler (which performs an update) or to the ExtractingRequestHandler (which also performs an update). Getting a handler from SolrCore.getRequestHandler(handlerName) makes the implementation dependent on deployment paths defined in solrconfig.xml. Using SolrCore.getRequestHandlers(handler.class) often returns the LazyRequestHandlerWrapper, from which it's not possible to retrieve the wrapped instance of the handler... You must know the name of the handler you are going to invoke. Or if you are sure that there is only one instance, knowing the class name will let you know the handler name. Then the easiest way to invoke it would be to use a ... I do know the class name - ExtractingRequestHandler. But when I invoke SolrCore.getRequestHandlers(Class) I get an empty map, because this handler is registered as lazy, and this means that it's represented as a LazyRequestHandlerWrapper.
Passing request to another handler
Hi, What's the canonical way to pass an update request to another handler? I'm implementing a handler that has to dispatch its result to different update handlers based on its internal processing. Getting a handler from SolrCore.getRequestHandler(handlerName) makes the implementation dependent on deployment paths defined in solrconfig.xml. Using SolrCore.getRequestHandlers(handler.class) often returns the LazyRequestHandlerWrapper, from which it's not possible to retrieve the wrapped instance of the handler...
Re: Where to place ReversedWildcardFilterFactory in Chain
Chantal Ackermann wrote: Thanks, Mark! But I suppose it does matter where in the index chain it goes? I would guess it is applied to the tokens, so I suppose I should put it at the very end - after WordDelimiter and Lowercase have been applied. Is that correct?

> <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1" generateNumberParts="1" catenateAll="1" preserveOriginal="1" />

Yes. Care should be taken that the query analyzer chain produces the same forward tokens, because the code in QueryParser that optionally reverses tokens acts on tokens that it receives _after_ all other query analyzers have run on the query.
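A hedged schema.xml sketch following this advice (the field type name and attribute values are illustrative; verify the filter attributes against your Solr version). The reversal filter goes last, and only on the index side, so the query chain produces the same forward tokens:

```xml
<fieldType name="text_rev" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- last in the chain, index side only -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"/>
  </analyzer>
  <analyzer type="query">
    <!-- same forward chain, without the reversal filter -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```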
Re: Adding data from nutch to a Solr index
Sönke Goldbeck wrote: Alright, first post to this list and I hope the question is not too stupid or misplaced ... what I currently have: - a nicely working Solr 1.3 index with information about some entities, e.g. organisations, indexed from an RDBMS. Many of these entities have a URL pointing at further information, e.g. the website of an institute or company. - an installation of Nutch 0.9 with which I can crawl the URLs that I can extract from the RDBMS mentioned above and put into a seed file - tutorials about how to put crawled and indexed data from Nutch 1.0 (which I could install w/o problems) into a separate Solr index what I want: - combine the indexed information from the RDBMS and the website in one Solr index so that I can search both in one and with the capability of using all the Solr features. E.g. having the following (example) fields in one document: <...> I believe that this kind of document merging is not possible (at least not easily) - you have to assemble the whole document before you index it in Solr. If these documents use the same primary key (I guess they do, otherwise how would you merge them...) then you can do the merging in your front-end application, which would have to submit the main query to Solr, and then for each Solr document on the list of results it would retrieve a Nutch document (using the NutchBean API). (The not-so-easy way involves writing a SearchComponent that does the latter part of that process on the Solr side.)
Re: Number of terms in a SOLR field
Fergus McMenemie wrote: Fergus McMenemie wrote: Hi all, I am attempting to test some changes I made to my DIH-based indexing process. The changes only affect the way I describe my fields in data-config.xml; there should be no changes to the way the data is indexed or stored. As a QA check I was wanting to compare the results from indexing the same data before/after the change. I was looking for a way of getting counts of terms in each field. I guess Luke etc. must allow this, but how? Luke uses a brute-force approach - it traverses all terms, and counts terms per field. This is easy to implement yourself - just get the IndexReader.terms() enumeration and traverse it. Thanks Andrzej. This is just a one-off QA check. How do I get Luke to display terms and counts? 1. get Luke 0.9.9 2. open the index with Luke 3. Look at the Overview panel; you will see the list titled "Available fields and term counts per field".
Re: Number of terms in a SOLR field
Fergus McMenemie wrote: Hi all, I am attempting to test some changes I made to my DIH-based indexing process. The changes only affect the way I describe my fields in data-config.xml; there should be no changes to the way the data is indexed or stored. As a QA check I was wanting to compare the results from indexing the same data before/after the change. I was looking for a way of getting counts of terms in each field. I guess Luke etc. must allow this, but how? Luke uses a brute-force approach - it traverses all terms, and counts terms per field. This is easy to implement yourself - just get the IndexReader.terms() enumeration and traverse it.
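The brute-force traversal described above can be sketched as follows. With a real index the loop would be driven by IndexReader.terms() (the Lucene 2.x TermEnum); here the enumeration is faked with a (field, text) array so the counting logic stands on its own:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the per-field term counting Luke performs. Each array entry
// stands for one step of TermEnum.next(); pair[0] plays the role of
// Term.field().
class TermCounts {
    static Map<String, Integer> countPerField(String[][] fieldTermPairs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] pair : fieldTermPairs) {
            counts.merge(pair[0], 1, Integer::sum);  // one more term in this field
        }
        return counts;
    }
    public static void main(String[] args) {
        String[][] terms = {{"title", "solr"}, {"title", "lucene"}, {"body", "index"}};
        System.out.println(countPerField(terms)); // {title=2, body=1}
    }
}
```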
Re: How to get a stack trace
Chris Hostetter wrote: : I'm a new user of solr but I have worked a bit with Lucene before. I get : some out of memory exception when optimizing the index through Solr and : I would like to find out why. However, the only message I get on : standard output is: Jul 30, 2009 9:20:22 PM : org.apache.solr.common.SolrException log SEVERE: : java.lang.OutOfMemoryError: Java heap space : : Is there a way to get a stack trace for this exception? I had a look : into the java.util.logging options and didn't find anything. FWIW #1: OutOfMemoryError is a java "Error", not an "Exception" ... Exceptions and Errors are both Throwable, but an Error is not an Exception. This is a really important distinction (see below). FWIW #2: when dealing with an OOM, a stack trace is almost never useful. As mentioned in other threads, a heap dump is the most useful diagnostic tool. FWIW #3: the formatting of Throwables in log files is 100% dependent on the configuration of the log manager -- the client code doing the logging just specifies the Throwable object -- it's up to the Formatter to decide how to output it. Ok .. on to the meat of the issue... OOM Errors are a particularly devious class of errors: they don't necessarily have stack traces (depending on your VM impl, and the state of the VM when it tries to log the OOM) http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4753347 http://blogs.sun.com/alanb/entry/outofmemoryerror_looks_a_bit_better ...on any *Exception* you should get a detailed stack trace in the logs (unless you have really screwed up LogManager configs), but when dealing with *Errors* like OutOfMemoryError, all bets are off as to what the VM can give you. I had some success in debugging this type of problem when I would generate a heap dump on OOM (it's a JVM flag) and then use a tool like HAT to find the largest objects and references to them.
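For reference, a heap dump on OOM can be requested with standard HotSpot flags and then inspected with jhat; the dump path, the start.jar invocation, and the dump file name below are illustrative:

```shell
# HotSpot flags to capture a heap dump when an OOM occurs
java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/var/tmp \
     -jar start.jar

# analyze the dump afterwards (jhat ships with the JDK)
jhat /var/tmp/java_pid12345.hprof
```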
Re: Language Detection for Analysis?
Otis Gospodnetic wrote: Bradford, If I may: Have a look at http://www.sematext.com/products/language-identifier/index.html And/or http://www.sematext.com/products/multilingual-indexer/index.html .. and a Nutch plugin with similar functionality: http://lucene.apache.org/nutch/apidocs-1.0/org/apache/nutch/analysis/lang/LanguageIdentifier.html
Re: Solr Search probem w/ phrase searches, text type, w/ escaped characters
Peter Keane wrote: I've used Luke to figure out what is going on, and I see in the fields that fail to match a "null_1". Could someone tell me what that is? I see some null_100s there as well, which seem to separate field values. Clearly the null_1s are causing the search to fail. You used the "Reconstruct" function to obtain the field values for unstored fields, right? null_NNN is Luke's way of telling you that the tokens that should be at these positions are absent, because they were removed by an analyzer during indexing, and there is no stored value of this field from which you could recover the original text. In other words, they are holes in the token stream, of length NNN. Such holes may also be produced by artificially increasing the token positions, hence the null_100 that serves to separate multiple field values so that e.g. phrase queries don't match unrelated text. Phrase queries that you can construct using QueryParser can't match two tokens separated by a hole, unless you set a slop value > 0.
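A rough illustration of the holes described above. This is not Luke's code; tokens are modeled here as (text, positionIncrement) pairs, and any increment greater than 1 leaves a hole that Luke would render as null_N:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how reconstructing an unstored field surfaces position holes:
// an increment of N > 1 means N-1 positions have no token (removed by an
// analyzer, or introduced by a position increment gap between values).
class PositionHoles {
    static List<String> reconstruct(Object[][] tokens) {
        List<String> out = new ArrayList<>();
        for (Object[] t : tokens) {
            int inc = (Integer) t[1];
            if (inc > 1) out.add("null_" + (inc - 1)); // hole left by missing tokens
            out.add((String) t[0]);
        }
        return out;
    }
    public static void main(String[] args) {
        // "quick fox" after a stopword filter removed one token (increment 2),
        // then a positionIncrementGap of 100 before the next field value.
        Object[][] toks = {{"quick", 1}, {"fox", 2}, {"jumps", 101}};
        System.out.println(reconstruct(toks)); // [quick, null_1, fox, null_100, jumps]
    }
}
```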
Re: what crawler do you use for Solr indexing?
Sean Timm wrote: We too use Heritrix. We tried Nutch first but Nutch was not finding all of the documents that it was supposed to. When Nutch and Heritrix were both set to crawl our own site to a depth of three, Nutch missed some pages that were linked directly from the seed. We ended up with 10%-20% fewer pages in the Nutch crawl. FWIW, from a private conversation with Sean it seems that this was likely related to the default configuration in Nutch, which collects only the first 1000 outlinks from a page. This is an arbitrary and configurable limit, introduced as a way to limit the impact of spam pages and to limit the size of LinkDb. If a page hits this limit then indeed the symptoms that you observe are missing (dropped) links.
Re: what crawler do you use for Solr indexing?
Tony Wang wrote: Hi Hoss, But I cannot find documents about the integration of Nutch and Solr anywhere. Could you give me some clue? thanks Tony, I suggest that you follow Hoss's advice and ask these questions on nutch-user. This integration is built into Nutch, and not Solr, so it's less likely that people on this list know what you are talking about. This integration is quite fresh, too, so there are almost no docs except on the mailing list. Eventually someone is going to create some docs, and if you keep asking questions on nutch-user you will contribute to the creation of such docs ;)
Re: Integrating Solr and Nutch
Tony Wang wrote: I heard Nutch 1.0 will have an easy way to integrate with Solr, but I haven't found any documentation on that yet. anyone? Indeed, this integration is already supported in Nutch trunk (soon to be released). Please download a nightly package and test it. You will need to reindex your segments using the solrindex command, and change the searcher configuration. See nutch-default.xml for details.
Re: Redhat vs FreeBSD vs other unix flavors
Otis Gospodnetic wrote: You should be fine on either Linux or FreeBSD (or any other UNIX flavour). Running on Solaris would probably give you access to goodness like dtrace, but you can live without it. There's dtrace on FreeBSD, too.
Re: Please help me integrate Nutch with Solr
Tony Wang wrote: Thanks Otis. I've just downloaded NUTCH-442_v8.patch<https://issues.apache.org/jira/secure/attachment/12391810/NUTCH-442_v8.patch>from https://issues.apache.org/jira/browse/NUTCH-442, but the patching process gave me lots of errors, see below: This patch will be integrated within a couple of days - please monitor this issue, and when it's done just download the patched code.
Re: [VOTE] Community Logo Preferences
https://issues.apache.org/jira/secure/attachment/12394268/apache_solr_c_red.jpg https://issues.apache.org/jira/secure/attachment/12394350/solr.s4.jpg https://issues.apache.org/jira/secure/attachment/12394376/solr_sp.png https://issues.apache.org/jira/secure/attachment/12394267/apache_solr_c_blue.jpg
Re: TextProfileSigature using deduplication
Mark Miller wrote: Thanks for sharing Marc, that's very nice to know. I'll take your experience as a starting point for some wiki recommendations. Sounds like we should add a switch to order alpha as well. On the general note of near-duplicate detection ... I found this paper in the proceedings of SIGIR-08, which presents an interesting and relatively simple algorithm that yields excellent results. Who has some spare CPU cycles to implement this? ;) http://ilpubs.stanford.edu:8090/860/
Re: TextProfileSigature using deduplication
Marc Sturlese wrote: Hey there, I've been testing and checking the source of TextProfileSignature.java to avoid similar entries at indexing time. What I understood is that it is useful for huge texts, where the frequency of the tokens (the words lowercased, keeping just numbers and letters in that case) is important. If you want to detect duplicates in text that is not huge, without giving a lot of importance to the frequencies, it doesn't work... The hash will be made just with the terms whose frequency is higher than a QUANTUM (whose value is computed as a function of the max frequency among all the terms). So it will say that:

aaa sss ddd fff ggg hhh aaa kkk lll ooo
aaa xxx iii www qqq aaa jjj eee zzz nnn

are duplicates, because quantum here would be 2 and the frequency of aaa would be 2 as well. So, to make the hash, just the term aaa would be used. In this case:

aaa sss ddd fff ggg hhh kkk lll ooo
apa sss ddd fff ggg hhh kkk lll ooo

Here quantum would be 1 and the frequencies of all terms would be 1, so all terms would be used for the hash. It will consider these two strings not similar. As I understood the algorithm, there's no way to make it understand that in my second case both strings are similar. I wish I were wrong... I have my own duplication system to detect that, but I use String comparison, so it works really slowly... Would like to know if there is any tuning possibility to do that with TextProfileSignature. Don't know if I should post this here or in the developers forum... Hi Marc, TextProfileSignature is a rather crude implementation of approximate similarity, and as you pointed out it's best suited for large texts. The original purpose of this Signature was to deduplicate web pages in large amounts of crawled pages (in Nutch), where it worked reasonably well. Its advantage is also that it's easy to compute and doesn't require multiple passes over the corpus. As it is implemented now, it breaks badly in the case you describe. You could modify this implementation to include also word-level ngrams, i.e. sequences of more than 1 word, up to N (e.g. 5) - this should work in your case. Ultimately, what you are probably looking for is a shingle-based algorithm, but it's relatively costly and requires multiple passes.
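A hedged sketch of the suggested word-level ngram variant. This is not the actual TextProfileSignature code; the quantum formula and the Jaccard comparison below are illustrative choices, but they show how shared ngrams keep two near-identical short texts similar:

```java
import java.util.*;

// Sketch: build a profile from word-level ngrams (here up to 2 words)
// instead of single terms, so two short texts differing in one word
// still share most profile entries.
class NgramProfile {
    static Set<String> profile(String text, int maxN) {
        String[] words = text.toLowerCase().split("\\s+");
        Map<String, Integer> freq = new HashMap<>();
        for (int n = 1; n <= maxN; n++)
            for (int i = 0; i + n <= words.length; i++)
                freq.merge(String.join(" ", Arrays.copyOfRange(words, i, i + n)), 1, Integer::sum);
        int max = Collections.max(freq.values());
        int quantum = (int) Math.round(Math.sqrt(max)); // same flavor of cutoff as the original
        Set<String> prof = new TreeSet<>();
        for (Map.Entry<String, Integer> e : freq.entrySet())
            if (e.getValue() >= quantum) prof.add(e.getKey());
        return prof;
    }
    // Jaccard overlap of two profiles as a crude similarity measure.
    static double similarity(Set<String> a, Set<String> b) {
        Set<String> inter = new TreeSet<>(a); inter.retainAll(b);
        Set<String> union = new TreeSet<>(a); union.addAll(b);
        return union.isEmpty() ? 0 : (double) inter.size() / union.size();
    }
    public static void main(String[] args) {
        Set<String> p1 = profile("aaa sss ddd fff ggg hhh kkk lll ooo", 2);
        Set<String> p2 = profile("apa sss ddd fff ggg hhh kkk lll ooo", 2);
        System.out.println(similarity(p1, p2)); // high overlap despite one changed word
    }
}
```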
Re: maxFieldLength
Dan A. Dickey wrote: I just came across the maxFieldLength setting for the mainIndex in solrconfig.xml and have a question or two about it. The default value is 10000. I'm extracting text from PDF documents and storing them into a text field. Is the length of this text field limited to 10000 characters? Many PDF documents are megabytes in size. Does this mean that only the first 10000 characters are getting indexed? Is there a good way to index the whole document, or do I just simply need to increase the size of maxFieldLength? What performance ramifications would something like this have? maxFieldLength is counted in tokens, not chars, so you should be pretty safe unless your documents contain a lot of text. You can of course set this value to whatever you want, including Integer.MAX_VALUE. This has performance consequences - terms found at large positions will increase the length of posting lists, which leads to increased memory/CPU consumption during decoding and traversing of the lists. Also, the overall increased number of positions will have an impact on the index size.
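For reference, a hedged solrconfig.xml sketch of raising the limit; the surrounding element layout may differ between Solr versions, and Integer.MAX_VALUE is spelled out as its literal value:

```xml
<mainIndex>
  <!-- counted in tokens, not characters; 2147483647 = Integer.MAX_VALUE -->
  <maxFieldLength>2147483647</maxFieldLength>
</mainIndex>
```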
Re: Advice on analysis/filtering?
Jarek Zgoda wrote: Message written on 2008-10-16, at 16:21, by Grant Ingersoll: I'm trying to create a search facility for documents in "broken" Polish (by broken I mean "not language rules compliant"), Can you explain what you mean here a bit more? I don't know Polish, Hi guys, I do speak Polish :) maybe I can help here a bit. Some documents (around 15% of the whole pile) contain texts entered by children from primary schools, and that implies many syntactic and orthographic errors. document text: "włatcy móch" (in proper Polish this would be "władcy much") example terms that should match: "włatcy much", "wlatcy moch", "wladcy much" These examples can be classified as "sounds like", and typically soundexing algorithms are used to address this problem, in order to generate initial suggestions. After that you can use other heuristic rules to select the most probable correct forms. AFAIK, there are no (public) soundex implementations for Polish, in particular in Java, although there was some research work done on the construction of a specifically Polish soundex. You could also use the Daitch-Mokotoff soundex, which comes close enough. Taking the word "włatcy" from my example, I'd like to find documents containing the words "wlatcy" (Latin-2 accents stripped from the original), This step is trivial. "władcy" (the proper form of this noun) and "wladcy" (Latin-2 accents stripped from the proper form). And this one is not. It requires using something like soundexing in order to look up possible similar terms. However ... in this process you inevitably collect false positives, and you don't have any way in the input text to determine that they should be rejected. You can only make this decision based on some external knowledge of Polish, such as: * a morpho-syntactic analyzer that will determine which combinations of suggestions are more correct and more probable, * a language model that for any given soundexed phrase can generate the most probable original phrases. Also, knowing the context in which a query is asked may help, but usually you don't have this information (queries are short).
Re: Adding bias to Distributed search feature?
Lance Norskog wrote: Thanks! We made variants of this and a couple of other files. As to why we have the same document in different shards with different contents: once you hit a certain index size and ingest rate, it is easiest to create a series of indexes and leave the older ones alone. In the future, please consider this a legitimate use case instead of simply a mistake. You may be interested in implementing something like this: "Compact Features for Detection of Near-Duplicates in Distributed Retrieval", Yaniv Bernstein, Milad Shokouhi, and Justin Zobel. It sounds straightforward, and relieves you of the need to de-duplicate your collection.
Re: Extending Solr with custom filter
Jarek Zgoda wrote: Exactly like that. Message written on 2008-09-12, at 17:27, by sunnyfr: ok .. that? I recommend using Stempelator (or Morfologik) for Polish stemming and lemmatization. It provides a superset of Stempel features: in addition to the algorithmic stemming it provides dictionary-based stemming, and these two methods nicely complement each other.
Re: Solr Logo thought
Stephen Weiss wrote: My issue with the logos presented was that they made Solr look like a school project instead of the powerful tool that it is. The tricked-out font or whatever just usually doesn't play well with the business types... they want serious-looking software. First impressions are everything. While the fiery colors are appropriate for something named Solr, you can play with that without getting silly - take a look at: http://www.ascsolar.com/images/asc_solar_splash_logo.gif http://www.logostick.com/images/EOS_InvestmentingLogo_lg.gif (Luckily there are many businesses that do solar energy!) They have the same elements but with a certain simplicity and elegance. I know probably some people don't care if it makes the boss or client happy, but these are the kinds of seemingly insignificant things that make people choose a bad, proprietary piece of junk over something solid and open-source... it's all about appearances! The people making the decision often have little else to go on, unfortunately. I concur. IMHO you should at least consider how the logo looks when: * it's reduced to black & white (e.g. when sending faxes or making copies) * it's resized to favicon.ico size * it's resized to an A0 poster size Many OSS projects for some obscure reason love to use color gradients, often with broad hue spans - but such gradients rarely look good in print, exhibiting the banding problem, and they are very easy to corrupt when transferring images from one medium to another. If we absolutely must use gradients, then at least we should create some logo variants without gradients - see the SVG file I created in SOLR-84. For these reasons I suggest: * not using gradients * not using small intricate elements that get lost in logos of small size - or coming up with logos of reduced complexity for smaller-size versions * avoiding large splashes of uniform strong color - these look bad on large logos, like poster-sized ones.
Re: Lucene-based Distributed Index Leveraging Hadoop
Doug Cutting wrote: Ning, I am also interested in starting a new project in this area. The approach I have in mind is slightly different, but hopefully we can come to some agreement and collaborate. I'm interested in this too. My current thinking is that the Solr search API is the appropriate model. Solr's facets are an important feature that require low-level support to be practical. Thus a useful distributed search system should support facets from the outset, rather than attempt to graft them on later. In particular, I believe this requirement mandates disjoint shards. I agree - shards should be disjoint also because if we eventually want to manage multiple replicas of each shard across the cluster (for reliability and performance) then overlapping documents would complicate both the query dispatching process and the merging of partial result sets. My primary difference with your proposal is that I would like to support online indexing. Documents could be inserted and removed directly, and shards would synchronize changes amongst replicas, with an "eventual consistency" model. Indexes would not be stored in HDFS, but directly on the local disk of each node. Hadoop would perhaps not play a role. In many ways this would resemble CouchDB, but with explicit support for sharding and failover from the outset. It's true that searching over HDFS is slow - but I'd hate to lose all other HDFS benefits and have to start from scratch ... I wonder what would be the performance of FsDirectory over an HDFS index that is "pinned" to a local disk, i.e. a full local replica is available, with block size of each index file equal to the file size. A particular client should be able to provide a consistent read/write view by bonding to particular replicas of a shard. Thus a user who makes a modification should be able to generally see that modification in results immediately, while other users, talking to different replicas, may not see it until synchronization is complete. 
This requires that we use versioning, and that we have a "shard manager" that knows the latest version of each shard among the whole active set - or that clients discover this dynamically by querying the shard servers every now and then.

-- 
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
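The "shard manager" idea sketched in this thread - a registry that knows which version each replica of a shard has reached - can be illustrated with a toy class. All names here are hypothetical; no such API exists in the projects discussed:

```python
class ShardManager:
    """Toy registry of shard replica versions: replicas report the version
    they have synchronized to, and clients ask either for the newest
    version of a shard or for replicas fresh enough to bond to."""

    def __init__(self):
        self.versions = {}  # (shard, replica) -> last reported version

    def report(self, shard, replica, version):
        self.versions[(shard, replica)] = version

    def latest(self, shard):
        # newest version any replica of this shard has reached
        return max(v for (s, _), v in self.versions.items() if s == shard)

    def replicas_at_least(self, shard, version):
        # replicas a client could bond to for a read-your-writes view
        return sorted(r for (s, r), v in self.versions.items()
                      if s == shard and v >= version)
```

A client that bonds to a replica returned by `replicas_at_least` for the version it just wrote would see its own modification immediately, while clients on stale replicas would not - the "eventual consistency" behaviour described above.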
Re: big perf-difference between solr-server vs. SOlrJ req.process(solrserver)
Otis Gospodnetic wrote:
> Maybe I'm not following your situation 100%, but it sounded like pulling the values of purely stored fields is the slow part. *Perhaps* using a non-Lucene data store just for the saved fields would be faster.

For this purpose Nutch uses external files in the Hadoop MapFile format. MapFile-s offer quick lookup and retrieval by key, using binary search over an in-memory index of keys. The benefit of this solution is that the bulky content is decoupled from the Lucene indexes, and it can be put in a physically different location (e.g. a dedicated page content server).

-- 
Andrzej Bialecki
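The MapFile lookup scheme mentioned here - binary search over an in-memory index of sorted keys - is a Java/Hadoop mechanism, but the idea can be sketched minimally in a few lines (a toy stand-in, not the real MapFile API, which keeps only every Nth key in memory and seeks within an on-disk data file):

```python
import bisect

class TinyMapFile:
    """Minimal stand-in for Hadoop's MapFile: records sorted by key,
    looked up with binary search over an in-memory list of keys."""

    def __init__(self, records):
        records = list(records)  # must already be sorted by key
        self.keys = [k for k, _ in records]
        self.values = [v for _, v in records]

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None  # key not present
```

Because lookups touch only the key index, the bulky values can live anywhere - which is exactly why this layout suits a dedicated content server separate from the search index.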
Re: multilingual list of stopwords
Lukas Vlcek wrote:
> Hi, I haven't heard of a multilingual stop words list before. What would be the purpose of it? This seems too odd to me :-)

That's because a multilingual stopword list doesn't make sense ;) One example that I'm familiar with: the words "is" and "by" in English and in Swedish. Both are stopwords in English, but they are content words in Swedish ("ice" and "village", respectively). Similarly, "till" in Swedish is a stopword ("to", "towards"), but it's a content word in English. So, as Lukas correctly suggested, you should first perform language identification, and then apply the correct stopword list.

-- 
Andrzej Bialecki
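The identify-then-filter pipeline suggested here can be sketched with toy stopword lists (real lists are far longer, and real language identification uses character n-gram models rather than stopword overlap - this is only an illustration of why the language must be chosen first):

```python
# Toy per-language stopword lists; illustrative only.
STOPWORDS = {
    "en": {"the", "is", "by", "to", "a", "of"},
    "sv": {"och", "att", "som", "en", "till", "det"},
}

def guess_language(tokens):
    # crude language id: pick the language whose stopword list
    # covers the most tokens
    return max(STOPWORDS,
               key=lambda lang: sum(t in STOPWORDS[lang] for t in tokens))

def remove_stopwords(tokens):
    lang = guess_language(tokens)
    return [t for t in tokens if t not in STOPWORDS[lang]]
```

Note what happens to "by": it is dropped from an English sentence but kept in a Swedish one, where it is a content word ("village") - a single merged list would get one of the two cases wrong.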
Re: solr, snippets and stored field in nutch...
Mike Klaas wrote:
> On 11-Oct-07, at 4:34 PM, Ravish Bhagdev wrote:
>> Hi Mike, thanks for your reply :) I am not an expert in either! But I understand that Nutch stores contents, albeit in a separate data structure (the "segment" discussed in this thread). What I meant was that this seems like a much more efficient way of presenting summaries or snippets (for apps that need them, of course) than using a stored field, which is the only option in Solr - not only resulting in a huge index size, but also reducing retrieval speed because of that increase in size (this is admittedly a guess; I'd like to know if that is not the case). Also, for queries requesting only ids/urls, the segments would never be touched, even for the first n results...

Let me add a few comments, as someone who is pretty familiar with Nutch.

Indeed, there is a strong separation of data stores in Nutch: to get the maximum possible performance, Lucene indexes are not used for data storage - they contain only the bare essentials needed to compute the score, plus an "id" of a data record stored elsewhere. Confusingly, this location is called a "segment", and it consists of a bunch of Hadoop MapFile-s and SequenceFile-s - among others, data files named "content", "parse_data" and "parse_text".

When results are returned to the client (in this case, the Nutch front-end machine), they contain only the score and this id (plus, optionally, some other data needed for online de-duplication). In other words, Nutch doesn't transmit the whole "document" to the client, only the parts needed to prepare the presentation of the requested portion of hits.

Nutch stores plain-text versions of documents in segments, in the "parse_text" file, and retrieves this data on demand, i.e. when a client requests a summary to be presented. The Nutch front-end uses Hadoop RPC to communicate with the back-end servers, and can retrieve one or several summaries in a single call, which reduces network traffic.
In a similar way, the original binary content of a document can be requested if needed; it is retrieved from the "content" MapFile in a "segment". The advantage of this approach is that you can keep the index size to a minimum (it contains mostly unstored fields), and that you can associate arbitrary binary data with a Lucene document. The downside is the increased cost of managing many data files - but this cost is largely hidden in Nutch behind specialized *Reader facades. It doesn't slow down querying, and it affects document retrieval only when summaries are actually requested - if you never request the summaries, the segments are not touched at all. That is the case I was referring to below, and the case for which the Nutch architecture is optimized.

-- 
Andrzej Bialecki
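The split described in this thread - an index that carries only ids and scores, with summaries fetched on demand in one batched call - can be sketched as a toy. Every name below (`TinySearchNode`, `postings`, `summaries`) is illustrative, not a Nutch API:

```python
class TinySearchNode:
    """Toy sketch of the Nutch-style split: the 'index' returns only
    (doc_id, score) hits, while plain text lives in a separate store
    (standing in for a segment's "parse_text" file), fetched on demand
    and batched, mimicking one RPC returning several summaries."""

    def __init__(self):
        self.postings = {}    # term -> list of (doc_id, score)
        self.parse_text = {}  # doc_id -> plain text of the document

    def add(self, doc_id, text, score=1.0):
        self.parse_text[doc_id] = text
        for term in set(text.lower().split()):
            self.postings.setdefault(term, []).append((doc_id, score))

    def search(self, term):
        # hits carry only ids and scores - no stored content travels
        return self.postings.get(term.lower(), [])

    def summaries(self, doc_ids, width=30):
        # one batched call returns several summaries at once
        return {d: self.parse_text[d][:width] for d in doc_ids}
```

A query that needs only ids/urls calls `search` and never touches `parse_text` at all - the access pattern the architecture is optimized for.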