Re: questions about autocommit committing documents
Hi Andy, Andy-152 wrote: <autoCommit><maxDocs>10000</maxDocs><maxTime>1000</maxTime></autoCommit> has been commented out. - With autoCommit commented out, does it mean that every new document indexed to Solr is being auto-committed individually? Or that they are not being auto-committed at all? I am not sure whether there is a default value, but if not, commenting it out means that you have to send a commit explicitly. - If I enable autoCommit and set maxDocs to 10000, does it mean that my new documents won't be available for searching until 10,000 new documents have been added? Yes, that's correct. However, you can still do a commit explicitly if you want to. - When I add a new document to Solr, do I need to call commit explicitly? If so, how do I do that? Looking at the Solr tutorial (http://lucene.apache.org/solr/tutorial.html), the command used to index documents (java -jar post.jar solr.xml monitor.xml) doesn't include any explicit call to commit the documents. So I'm not sure if it's necessary. Thanks Committing is necessary: an added document is not visible at query time until a commit has been issued. Kind regards, Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/questions-about-autocommit-committing-documents-tp1582487p1582676.html Sent from the Solr - User mailing list archive at Nabble.com.
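For reference, a sketch of what the autoCommit section in solrconfig.xml typically looks like, together with the explicit commit message you can POST to the update handler. The exact values are illustrative, not taken from Andy's configuration:

```xml
<!-- solrconfig.xml: auto-commit after 10,000 added docs or after 1,000 ms -->
<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>1000</maxTime>
</autoCommit>

<!-- alternatively, POST an explicit commit message to the /update handler -->
<commit/>
```

With autoCommit disabled, nothing becomes searchable until a `<commit/>` (or a commit from a client such as SolrJ) is sent.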
Re: questions about autocommit committing documents
First: usually you do not use post.jar for updating your index. It's a simple tool; normally you use features like the CSV or XML update request handlers. Have a look at UpdateCSV and UpdateXMLMessages in the wiki - there you can find examples of how to commit explicitly. With post.jar you need to either set -Dcommit=yes or append a <commit/> message, I think. Hope this helps. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/questions-about-autocommit-committing-documents-tp1582487p1582846.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Swapping cores with SolrJ
Hi Shaun, I think it will be easier to fix this problem if we get more information about what is going on in your application. Could you please provide the CoreAdminResponse returned by car.process() for us? Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Swapping-cores-with-SolrJ-tp1472154p1473435.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr CoreAdmin create ignores dataDir Parameter
Frank, have a look at SOLR-646. Do you think a workaround using the <dataDir> tag in solrconfig.xml could help? I am thinking about something like <dataDir>${solr./data/corename}</dataDir>, just for illustration. Unfortunately I am not very skilled in working with Solr's variables, and therefore I do not know which variables are available. If we find a solution, we should add it as a suggestion on the wiki's CoreAdmin page. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-CoreAdmin-create-ignores-dataDir-Parameter-tp1451665p1454705.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
What if we do not care about the version of a document at index time? When it comes to distributed search, we currently decide how to aggregate documents based on their uniqueKey. But what if we additionally decided on uniqueKey plus an indexingDate, so that we only aggregate the last indexed version of a document? The concept could look like this: when Solr aggregates the documents for a response, it could record which shard responded with an older version of document x. Now a crawler can crawl through our SolrCloud, asking each shard whether it noticed a "shard y got an older version of doc x" case. The crawler aggregates that information. After it has finished crawling, it sends delete-by-query requests to those shards which hold older versions of documents than they should. For better understanding, I will call these stored document versions that are older than the newest version ODVs (Old Document Versions). So, what can happen: before the crawler can visit shard A - which noticed that shard y stores an ODV of doc x - shard A can go down. That's okay, because either another shard noticed the same thing, or shard A will be available again later on. If that information is stored on disk, it will also still be available. If it was stored in RAM, the information is lost... however, you could replicate that information over more than one shard, right? :-) Another case: shard y can go down - so someone has to take care of storing the noticed ODV information, so that the document can be deleted when shard y comes back. Pros: - You can do something like consistent hashing in connection with a concept where each node has to care for its neighbour nodes, because only the neighbour nodes can store ODVs. - Using the described concept, you can run nightly batches looking for ODVs in the neighbour nodes. - ODVs will be found at request time, so we can avoid responding with ODVs instead of newer versions. Cons: - We are wasting disk space. 
- This works only for smaller clusters, not for large ones where the number of machines changes very frequently ... This is just another idea - and it is very, very lazy. I must emphasize that I assume that neighbour machines do not go down very frequently. Of course, it is not a question of whether a machine crashes, but when it crashes - but I assume that the same server does not crash every hour. :-) Thoughts? Kind regards Andrzej Bialecki wrote: On 2010-09-06 16:41, Yonik Seeley wrote: On Mon, Sep 6, 2010 at 10:18 AM, MitchK mitc...@web.de wrote: [...consistent hashing...] But it doesn't solve the problem at all, correct me if I am wrong, but: if you add a new server, let's call it IP3-1, and IP3-1 is nearer to the current resource X, then doc x will be indexed at IP3-1 - even if IP2-1 holds the older version. Am I right? Right. You still need code to handle migration. Consistent hashing is a way for everyone to be able to agree on the mapping, and for the mapping to change incrementally. I.e. you add a node and it only changes the docid-node mapping of a limited percentage of the mappings, rather than changing the mappings of potentially everything, as a simple MOD would do. Another strategy to avoid excessive reindexing is to keep splitting the largest shards, and then your mapping becomes a regular MOD plus a list of these additional splits. Really, there's an infinite number of ways you could implement this... For SolrCloud, I don't think we'll end up using consistent hashing - we don't need it (although some of the concepts may still be useful). I imagine there could be situations where a simple MOD won't do ;) so I think it would be good to hide this strategy behind an interface/abstract class. It costs nothing, and gives you flexibility in how you implement this mapping. -- Best regards, Andrzej Bialecki 
Information Retrieval, Semantic Web, Embedded Unix, System Integration - http://www.sigram.com - Contact: info at sigram dot com -- View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1434329.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
I must add something to my last post: when saying it could be used together with techniques like consistent hashing, I mean that it could be used at indexing time for indexing documents, since I assume that the number of shards does not change frequently and therefore an ODV case occurs relatively infrequently. Furthermore, the overhead of searching for and removing those ODV documents is relatively low. -- View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1434364.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: anyone use hadoop+solr?
Thanks for your detailed feedback, Andrzej! From what I understood, SOLR-1301 becomes obsolete once Solr becomes cloud-ready, right? Looking into the future: eventually, when SolrCloud arrives we will be able to index straight to a SolrCloud cluster, assigning documents to shards through a hashing schema (e.g. 'md5(docId) % numShards') Hm, let's say md5(docId) produces a value of 7 (it won't, but let's assume it). If I have a constant number of shards, the doc will be assigned to the same shard again and again, i.e.: 7 % numShards(5) = 2 - the doc will be indexed at shard 2. A few days later the rest of the cluster is available, and now it looks like 7 % numShards(10) = 7 - the doc will be indexed at shard 7... and what about the older version at shard 2? I am no expert when it comes to cloud computing and the other stuff. If you can point me to one or another reference where I can read about it, it would help me a lot, since at the moment I only want to understand how it works. The problem with Solr is its lack of documentation in some classes and the lack of encapsulating some very complex things into separate methods or extra classes. Of course, this is because it costs extra time to do so, but it makes understanding and modifying things very complicated if you do not understand what is going on from a theoretical point of view. Since the cloud feature will be complex, a lack of documentation and no understanding of the theory behind the code will make contributing back very, very complicated. Thank you :-) - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1425986.html Sent from the Solr - User mailing list archive at Nabble.com.
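The re-mapping problem described above can be sketched in a few lines. This is a toy illustration, not SolrCloud code - the class and method names are invented - but it shows how a plain 'md5(docId) % numShards' assignment moves a large fraction of documents to a different shard when the shard count changes, leaving stale copies behind:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Toy sketch (not SolrCloud code): shard assignment via md5(docId) % numShards,
// illustrating why a changing shard count re-maps many documents.
public class ModSharding {

    static int shardFor(String docId, int numShards) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5")
                                .digest(docId.getBytes(StandardCharsets.UTF_8));
        // use the low 4 bytes of the digest as an int, then take a non-negative modulus
        int h = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
              | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
        return Math.floorMod(h, numShards);
    }

    public static void main(String[] args) throws Exception {
        int moved = 0;
        for (int i = 0; i < 1000; i++) {
            String docId = "doc-" + i;
            if (shardFor(docId, 5) != shardFor(docId, 10)) moved++;
        }
        // when doubling from 5 to 10 shards, roughly half the docs change shards;
        // for non-doubling changes the moved fraction is even larger - and the
        // older copies stay on the old shards until someone deletes them
        System.out.println(moved + " of 1000 docs changed shards");
    }
}
```

Running it makes the point numerically: even the gentlest cluster growth (doubling) reroutes about half of all documents under a plain MOD mapping.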
Re: anyone use hadoop+solr?
Yonik, are there any discussions about SolrCloud indexing? I would be glad to join them if I can find some interesting papers about the topic. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1426469.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
Andrzej, thank you for sharing your experiences. b) use consistent hashing as the mapping schema to assign documents to a changing number of shards. There are many explanations of this schema on the net; here's one that is very simple: Boom. With the given explanation, I understand it as follows: you can use Hadoop and run some map-reduce jobs per CSV file. On the reducer side, the reducer has to take the id of the current doc and create a hash of it. Then it looks inside a SortedSet, picks the next-best server, and looks in a map to see whether this server has free capacity or not. That's cool. But it doesn't solve the problem at all - correct me if I am wrong, but: if you add a new server, let's call it IP3-1, and IP3-1 is nearer to the current resource X, then doc x will be indexed at IP3-1 - even if IP2-1 holds the older version. Am I right? Thank you for sharing the paper. I will look for more like this. In this case the lack of good docs and user-level API can be blamed on the fact that this functionality is still under heavy development. I do not only mean documentation at the user level but also inside a class, if there is some complicated stuff going on. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1426728.html Sent from the Solr - User mailing list archive at Nabble.com.
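The "sorted set, pick the next-best server" idea above can be sketched with a TreeMap acting as the hash ring. This is a toy consistent-hashing sketch, not Solr code, and the class, method, and server names are all invented. The key property it demonstrates: when a new server joins, a document either keeps its old server or moves to the new one - the old copy on the old server indeed stays behind, which is exactly the concern raised in the post:

```java
import java.util.TreeMap;

// Toy consistent-hashing ring (names invented, not Solr code): each server is
// placed on the ring at several virtual points; a doc goes to the first server
// clockwise from its own hash position.
public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int replicas; // virtual points per server, for smoother balance

    HashRing(int replicas) { this.replicas = replicas; }

    static int hash(String key) {
        // any reasonably uniform hash works for the sketch; hashCode is enough here
        return key.hashCode() & 0x7FFFFFFF;
    }

    void addServer(String server) {
        for (int i = 0; i < replicas; i++) ring.put(hash(server + "#" + i), server);
    }

    String serverFor(String docId) {
        // first ring point at or after the doc's hash; wrap to the start if none
        Integer k = ring.ceilingKey(hash(docId));
        return k == null ? ring.firstEntry().getValue() : ring.get(k);
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing(50);
        ring.addServer("IP1-1");
        ring.addServer("IP2-1");
        String before = ring.serverFor("doc-x");
        ring.addServer("IP3-1");
        String after = ring.serverFor("doc-x");
        // after is either unchanged or IP3-1 - never reshuffled to another old server
        System.out.println(before + " -> " + after);
    }
}
```

So consistent hashing limits how many docs move, but - as Yonik says in the quoted thread - it does not migrate or delete the old copies; that still needs separate code.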
Re: Show a facet filter All
Peter, take a close look at tagging and excluding filters: http://wiki.apache.org/solr/SimpleFacetParameters#LocalParams_for_faceting Another way would be to index your services_raw values as services_raw/Exclusive rental, services_raw/Fotoreport, services_raw/Live music. In this case, you can use the facet.prefix param to get all the services_raw/* values. I am not sure, but maybe even * is a valid prefix - then you would not need such extra work. If all your documents include a services_raw field, this facet wouldn't make much sense, since it would apply to all the documents, would it? Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Show-a-facet-filter-All-tp1421248p1421539.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: anyone use hadoop+solr?
Hi, this topic was started a few months ago; however, there are some questions on my side that I couldn't answer by looking at the SOLR-1301 issue or the wiki pages. Let me try to explain my thoughts. Given: a Hadoop cluster, a Solr search cluster, and Nutch as a crawling engine which also performs LinkRank and webgraph-related tasks. Once a list of documents is created by Nutch, you put the list + the LinkRank values etc. into a Solr+Hadoop job as described in SOLR-1301 to index or reindex the given documents. When the shards are built, they are sent over the network to the Solr search cluster. Is this description correct? What makes me think is: assume I have a document X on machine Y in shard Y... When I reindex that document X together with lots of other documents that may or may not be present in shard Y, and I put the resulting shard on a machine Z, how does machine Y notice that it has an older version of document X than machine Z? Furthermore: go on and assume that shard Y was replicated to three other machines - how do they all notice that their version of document X is not the newest available one? In such an environment we do not have a master (right?), so: how to keep the index as consistent as possible? Thank you for clarifying. Kind regards -- View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1418140.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: full control over norm values?
Hi Michael, have a look at SweetSpotSimilarity (Lucene). Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/full-control-over-norm-values-tp1366910p1367462.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Why it's boosted up?
Hi Scott, (so shorter fields are automatically boosted up). The theory behind that is the following (in simple words): let's say you have two documents, and each doc consists of one field (as in my example). Additionally we have a query that contains two words. Let's say doc1 consists of 10 words and doc2 consists of 20 words. The query matches both docs with both words. The idea of boosting shorter fields more strongly than longer fields is this: in doc1, 2/10 = 0.2 = 20% of the words match your query. In doc2, 2/20 = 0.1 = 10% of the words match your query. So doc1 should get a better score, because its ratio of matching words to the total number of words is greater than doc2's. This is the idea of using norms as an index-time boosting factor. NOTE: This does not mean that doc1 gets boosted by 20% and doc2 by 10%! It only illustrates the idea behind such norms. From the Similarity class's documentation of lengthNorm(): Matches in longer fields are less precise, so implementations of this method usually return smaller values when numTokens is large, and larger values when numTokens is small. However, you, as the developer of a search application, have to decide whether this theory applies to your application or not. In some cases using norms makes no sense; in others it does. If you think that norms do apply to your project, omitting them is not a good way to save disk space. Furthermore: if you think the theory does apply to the business needs of your application but its impact is currently too heavy, you can have a look at SweetSpotSimilarity in Lucene. The request is from our business team: they wish users of our product could type in a partial string of a word that exists in the title or body field. You mean something like typing note and also getting results like notebook? The correct approach for something like that is not a ShingleFilter but NGrams or edge NGrams. 
Shingles do something like this: This is my shingle sentence - This is, is my, my shingle, shingle sentence - the filter breaks the sentence up into smaller pieces. The benefit of doing so is that, if a query matches one of these shingles, you have found a short phrase without using the performance-consuming phrase-query feature. Kind regards, - Mitch scott chu wrote: In Lucene's web page, there's a paragraph: Indexing time boosts are preprocessed for storage efficiency and written to the directory (when writing the document) in a single byte (!) as follows: For each field of a document, all boosts of that field (i.e. all boosts under the same field name in that doc) are multiplied. The result is multiplied by the boost of the document, and also multiplied by a field length norm value that represents the length of that field in that doc (so shorter fields are automatically boosted up). I thought the greater the value, the stronger the boost. Then why are short fields boosted up? Isn't the norm value for short fields smaller? -- View this message in context: http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1306419.html Sent from the Solr - User mailing list archive at Nabble.com.
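The length-norm idea above can be made concrete with a few lines of arithmetic. Assuming the formula used by Lucene's DefaultSimilarity - roughly 1/sqrt(numTokens) - this sketch (class and method names invented for illustration) shows why the shorter field ends up with the larger norm and therefore the higher score:

```java
// Sketch of a Lucene-style length norm: 1 / sqrt(numTokens).
// Shorter fields get a larger norm, so - all else being equal - they score higher.
public class LengthNormDemo {

    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    public static void main(String[] args) {
        float doc1 = lengthNorm(10); // 10-word field, 2 query words match
        float doc2 = lengthNorm(20); // 20-word field, same 2 words match
        // doc1's norm is larger, so doc1 ranks above doc2 for the same raw match
        System.out.println("doc1 norm: " + doc1 + ", doc2 norm: " + doc2);
    }
}
```

This also answers Scott's question directly: the norm value for the short field is larger, not smaller, which is exactly why short fields are "boosted up".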
Re: Solr creates whitespace in dismax query
Johann, try removing the WordDelimiterFilter from the query analyzer of your fieldType. If the WordDelimiterFilter in your index analyzer is configured well, it will find everything you want. Does this solve the problem? Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-creates-whitespace-in-dismax-query-tp1317196p1318759.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Doing Shingle but also keep special single word
No, I mean that you use an additional field (indexed) for searching (e.g. whitespace-tokenized, so every word - separated by whitespace - becomes a token). So you have two fields (a shingle-token field and a single-token field), and you can search across both fields. This provides several benefits: e.g. you can boost the shingle field at query time, since a match in a shingle field means that an exact phrase matched. Additionally, you can search with single-word queries as well as multi-word queries. Furthermore, you can apply synonyms to your single-token field. If you want to keep your index as small as possible but as large as needed, try to understand Lucene's Similarity implementation to decide whether you can set the field options omitNorms=true or omitTermFreqAndPositions=true. http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/Similarity.html Keep in mind what happens if you omit one of those options. A small example of the consequences of setting omitNorms=true: doc1: this is a short example doc doc2: this is a longer example doc for presenting the effect of omitNorms If you are searching for doc while omitNorms=false, your response will look like this: doc1, doc2. This is because the norm value for doc1 is greater than the norm value for doc2, since doc1 is shorter than doc2 (have a look at the provided link). If omitNorms=true, the scores for both docs will be equal. Kind regards, - Mitch scott chu wrote: I don't quite understand the additional-field way? Do you mean making another field that stores the special words particularly, but no indexing for that field? Scott - Original Message - From: MitchK mitc...@web.de To: solr-user@lucene.apache.org Sent: Sunday, August 22, 2010 11:48 PM Subject: Re: Doing Shingle but also keep special single word Hi, the KeepWordFilter is no solution for this problem, since it would mean that one has to manage a word dictionary. 
As explained, this would be too much effort. You can easily add outputUnigrams=true and check analysis.jsp for this field; then you can see how much bigger a single field becomes with this option. However, I am quite sure that the difference between using outputUnigrams=true and indexing in a separate field is not noteworthy. I would suggest you do it the additional-field way, since this gives more flexibility in boosting the different fields. Unfortunately, I haven't understood your explanation of the use case. But it sounds a little bit like tagging? Kind regards, - Mitch iorixxx wrote: Doesn't setting outputUnigrams=true make the index size about twice what it is when it's set to false? Sure, the index will be bigger. I didn't know that this was a problem for you. But if you have a list of special single words that you want to keep, KeepWordFilter can eliminate the other tokens. So the index size will be okay. Scott - Original Message - From: Ahmet Arslan iori...@yahoo.com To: solr-user@lucene.apache.org Sent: Saturday, August 21, 2010 1:15 AM Subject: Re: Doing Shingle but also keep special single word I am building an index with the shingle filter. We know its minimum is 2-gram, but I also want to keep some special single words, e.g. IBM, Microsoft, etc. I.e. I want to do a minimum 2-gram but also want to have these single words in my index. Is it possible? Doesn't the outputUnigrams=true parameter work for you? After that you can add <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/> with keepwords.txt=IBM, Microsoft. -- View this message in context: http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1276506.html Sent from the Solr - User mailing list archive at Nabble.com. 
-- View this message in context: http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1300497.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to limit rows to which highlighting applies
Alex, it sounds like it would make sense. Use cases could be e.g. clustering or similar techniques. However, in my opinion that point of view is not the right one for such a modification. E.g. one may want several result sets: I could imagine doing a primary query (the query for the displayed results) and a query to compute clustering results. Now you want to do different things with the result sets. The primary query needs faceting, highlighting, spellcheck and much more, whereas the additional query only needs clustering or something like that. In your case, you do not want to apply highlighting to the whole set, since you do not need that information for every row. This is a general problem, and I think a solution that makes it possible to create more than one result set for a single Solr request would be applicable to more general use cases. What do you think? Kind regards, - Mitch Alex Baranau wrote: Hello Solr users and devs! Is there a way to limit the number of rows to which highlighting applies? I don't see any hl.rows or similar parameter description, so it looks like I need to enhance HighlightComponent to enable that. If it is not possible currently, do you think it's worth adding such a possibility? JFI my case, where I need this: I display only 20, 10 or 5 rows on the results page, but I need many more rows (100-500) to display additional data on the same page. Queries can be very complex and their execution time (QueryComponent) is quite big, so I do want to fetch things via a single request. However, I noticed that with an increasing number of rows, the time spent in HighlightComponent increases dramatically. For those additional hundreds of rows I don't need highlighting at all. Actually, *ideally* it would be great to have the ability to specify the fields returned for those extra rows as well. 
So I tend to think that adding these features should not be based on changing HighlightComponent's behaviour, but on changing QueryComponent, or an even bigger part, somehow so that a Solr query accepts specifying extra group(s) of rows for fetching along with params for them (params which do not influence the searching process, like formatting/highlighting, fields to return, etc.). Thus we could execute *one* search query and fetch different data for different purposes. Does this all make sense to you guys? Thank you, Alex Baranau Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase Lucene ecosystem search :: http://search-lucene.com/ http://search-hadoop.com/ -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-limit-rows-to-which-highlighting-applies-tp1274042p1275962.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Doing Shingle but also keep special single word
Hi, the KeepWordFilter is no solution for this problem, since it would mean that one has to manage a word dictionary. As explained, this would be too much effort. You can easily add outputUnigrams=true and check analysis.jsp for this field; then you can see how much bigger a single field becomes with this option. However, I am quite sure that the difference between using outputUnigrams=true and indexing in a separate field is not noteworthy. I would suggest you do it the additional-field way, since this gives more flexibility in boosting the different fields. Unfortunately, I haven't understood your explanation of the use case. But it sounds a little bit like tagging? Kind regards, - Mitch iorixxx wrote: Doesn't setting outputUnigrams=true make the index size about twice what it is when it's set to false? Sure, the index will be bigger. I didn't know that this was a problem for you. But if you have a list of special single words that you want to keep, KeepWordFilter can eliminate the other tokens. So the index size will be okay. Scott - Original Message - From: Ahmet Arslan iori...@yahoo.com To: solr-user@lucene.apache.org Sent: Saturday, August 21, 2010 1:15 AM Subject: Re: Doing Shingle but also keep special single word I am building an index with the shingle filter. We know its minimum is 2-gram, but I also want to keep some special single words, e.g. IBM, Microsoft, etc. I.e. I want to do a minimum 2-gram but also want to have these single words in my index. Is it possible? Doesn't the outputUnigrams=true parameter work for you? After that you can add <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/> with keepwords.txt=IBM, Microsoft. -- View this message in context: http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1276506.html Sent from the Solr - User mailing list archive at Nabble.com.
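To illustrate what the outputUnigrams option changes, here is a toy sketch of what a 2-gram shingle stream looks like with and without unigrams. This is not the real ShingleFilter from Lucene, just invented demo code mimicking its token output:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration (not Lucene's ShingleFilter) of 2-gram shingle output,
// with and without the outputUnigrams option.
public class ShingleDemo {

    static List<String> shingles(List<String> tokens, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) out.add(tokens.get(i));               // single word
            if (i + 1 < tokens.size())
                out.add(tokens.get(i) + " " + tokens.get(i + 1));     // 2-gram shingle
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("IBM", "sells", "servers");
        System.out.println(shingles(tokens, false)); // [IBM sells, sells servers]
        System.out.println(shingles(tokens, true));  // [IBM, IBM sells, sells, sells servers, servers]
        // with outputUnigrams=true the single word "IBM" is searchable on its own,
        // at the cost of a larger index
    }
}
```

This is why outputUnigrams=true roughly doubles the token count: every unigram is emitted alongside the shingles, which is exactly the index-size trade-off discussed above.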
Re: SolrJ Response + JSON
Hi, as I promised, I want to give some feedback on transforming SolrJ's output into JSON with the package from json.org: I needed to make a small modification to the package. Since it stores the JSON key-value pairs in a HashMap, I changed this to a LinkedHashMap to make sure that the order of the retrieved values is the same order in which they were inserted into the map. The result looks very, very pretty. It was very easy to transform SolrJ's output into the desired JSON format, and I can now add whatever I want to the response. Kind regards, - Mitch
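The HashMap-to-LinkedHashMap change above boils down to one property of the Java collections: LinkedHashMap iterates in insertion order, plain HashMap does not. A small demo (the field names are just examples, not SolrJ output):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Why swapping HashMap for LinkedHashMap fixes JSON key order:
// LinkedHashMap preserves insertion order, so serialized keys come out
// in the order they were put in. (Field names here are just examples.)
public class OrderDemo {
    public static void main(String[] args) {
        Map<String, Object> ordered = new LinkedHashMap<>();
        ordered.put("responseHeader", 0);
        ordered.put("response", "docs...");
        ordered.put("score", 1.5);
        // iteration order == insertion order
        System.out.println(ordered.keySet()); // [responseHeader, response, score]

        Map<String, Object> unordered = new HashMap<>();
        unordered.put("responseHeader", 0);
        unordered.put("response", "docs...");
        unordered.put("score", 1.5);
        // iteration order depends on hashing, not on insertion
        System.out.println(unordered.keySet());
    }
}
```

A JSON object serializer that walks such a map therefore emits the keys in insertion order only when the backing map is a LinkedHashMap.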
RE: Boosting DisMax queries with !boost component
Jonathan Rochkind wrote: qf needs to have spaces in it; unfortunately the local query parser cannot deal with that, as Erik Hatcher mentioned some months ago. By local query parser, you mean what I call the LocalParams stuff (for lack of being sure of the proper term)? Yes, that was what I meant. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Boosting-DisMax-queries-with-boost-component-tp1011294p1015619.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Boosting DisMax queries with !boost component
Hi, qf needs to have spaces in it; unfortunately the local query parser cannot deal with that, as Erik Hatcher mentioned some months ago. A solution would be to do something like this: q={!dismax qf=$yourqf}yourQuery&yourqf=title^1.0 tags^2.0 Since you are using the dismax query parser, you can add the boosting query via the bq param. Hope this helps, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Boosting-DisMax-queries-with-boost-component-tp1011294p1014242.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Nabble problems?
I have some problems with Nabble, too. Nabble sends warnings that my posts to the mailing list are still pending, while people are already answering my initial questions. Did you send a message to the Nabble support? Kind regards, - Mitch kenf_nc wrote: The Nabble.com page for Solr - User seems to be broken. I haven't seen an update on it since early this morning. However, I'm still getting email notifications, so people are seeing and responding to posts. I'm just curious: are you just using email and responding to solr-u...@lucene.apache.org? Or is there a mirror site that *is* working for the Solr User forum? -- View this message in context: http://lucene.472066.n3.nabble.com/Nabble-problems-tp1004870p1004992.html Sent from the Solr - User mailing list archive at Nabble.com.
SolrJ Response + JSON
Hello community, I need to transform SolrJ responses into JSON after another application has finished some computations on those results. I cannot do those computations on the Solr side, so I really have to translate SolrJ's output into JSON. Any experience with how to do so without writing your own JSON writer? Thank you. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/SolrJ-Response-JSON-tp1002024p1002024.html Sent from the Solr - User mailing list archive at Nabble.com.
SolrJ Response + JSON
Hello, Second try to send a mail to the mailing list... I need to translate SolrJ's response into a JSON response. I cannot query Solr directly, because I need to do some math with the returned data before I show the results to the client. Any experience with how to translate SolrJ's response into JSON without writing your own JSON writer? Thank you. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/SolrJ-Response-JSON-tp1002115p1002115.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrJ Response + JSON
Thank you Markus, Mark. Seems to be a problem with Nabble, not with the mailing list. Sorry. I can create a JSON response when I query Solr directly. But what I mean is that I query Solr through a SolrJ client (CommonsHttpSolrServer). That means my queries look a little bit like this: http://wiki.apache.org/solr/Solrj#Reading_Data_from_Solr So the response comes back as a QueryResponse object, not as a JSON string. Or am I missing something here? On 28.07.2010 15:15, Markus Jelsma wrote: Hi, I got a response to your e-mail in my box 30 minutes ago. Anyway, enable the JSONResponseWriter, if you haven't already, and query with wt=json. Can't get much easier. Cheers, On Wednesday 28 July 2010 15:08:26 MitchK wrote: Hello, Second try to send a mail to the mailing list... I need to translate SolrJ's response into a JSON response. I cannot query Solr directly, because I need to do some math with the returned data before I show the results to the client. Any experience with how to translate SolrJ's response into JSON without writing your own JSON writer? Thank you. - Mitch Markus Jelsma - Technical Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: SolrJ Response + JSON
Thank you, Chantal. I have looked at this one: http://www.json.org/java/index.html This seems to be an easy-to-understand implementation. However, I am wondering how to determine whether a SolrDocument's field is multiValued or not. Solr's JSONResponseWriter looks at the schema configuration; however, the client shouldn't have to do that. How did you solve that problem? Thanks for sharing ideas. - Mitch On 28.07.2010 15:35, Chantal Ackermann wrote: You could use org.apache.solr.handler.JsonLoader. That one uses org.apache.noggit.JSONParser internally. I've used the JacksonParser with Spring. http://json.org/ lists parsers for different programming languages. Cheers, Chantal On Wed, 2010-07-28 at 15:08 +0200, MitchK wrote: Hello, Second try to send a mail to the mailing list... I need to translate SolrJ's response into a JSON response. I cannot query Solr directly, because I need to do some math with the returned data before I show the results to the client. Any experience with how to translate SolrJ's response into JSON without writing your own JSON writer? Thank you. - Mitch
Re: SolrJ Response + JSON
Hi Chantal, thank you for the feedback. I did not see the wood for the trees! The SolrDocument javadoc says the following: http://lucene.apache.org/solr/api/org/apache/solr/common/SolrDocument.html getFieldValue(String name) - Get the value or collection of values for a given field. The magical word here is that little or :-). I will try that tomorrow and give you feedback! Are you sure that you cannot change the SOLR results at query time according to your needs? Unfortunately, it is not possible in this case. Kind regards, Mitch On 28.07.2010 16:49, Chantal Ackermann wrote: Hi Mitch On Wed, 2010-07-28 at 16:38 +0200, MitchK wrote: Thank you, Chantal. I have looked at this one: http://www.json.org/java/index.html This seems to be an easy-to-understand implementation. However, I am wondering how to determine whether a SolrDocument's field is multiValued or not. Solr's JSONResponseWriter looks at the schema configuration; however, the client shouldn't do that. How did you solve that problem? I didn't. I'm not recreating JSON from the SolrJ results. I would try to use the same classes that SolrJ uses, actually. (Writing that without having a further look at the code.) I would avoid recreating existing code as much as possible. About multivalued fields: you need instanceof checks, I guess. The field only contains a list if there really are multiple values. (That's what works for my ScriptTransformer.) Are you sure that you cannot change the SOLR results at query time according to your needs? Maybe you should ask for that, first (ask for X instead of Y...). Cheers, Chantal Thanks for sharing ideas. - Mitch On 28.07.2010 15:35, Chantal Ackermann wrote: You could use org.apache.solr.handler.JsonLoader. That one uses org.apache.noggit.JSONParser internally. I've used the JacksonParser with Spring. http://json.org/ lists parsers for different programming languages. Cheers, Chantal On Wed, 2010-07-28 at 15:08 +0200, MitchK wrote: Hello, Second try to send a mail to the mailing list... I need to translate SolrJ's response into a JSON response. I can not query Solr directly, because I need to do some math with the response data before I show the results to the client. Any experience with translating SolrJ's response into JSON without writing your own JSON writer? Thank you. - Mitch
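Pulling the pieces of this thread together - getFieldValue returning "value or collection" plus Chantal's instanceof hint - a plain-Java sketch of the conversion could look like the following. It deliberately uses a Map instead of a SolrDocument and hand-rolls the JSON, so everything here is illustrative rather than SolrJ API:

```java
import java.util.*;

public class JsonSketch {
    // Build a JSON object string from a field map, treating Collection values
    // as JSON arrays -- mirroring the "value or collection of values"
    // behaviour of SolrDocument.getFieldValue described above.
    static String toJson(Map<String, Object> doc) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            if (!first) sb.append(",");
            first = false;
            sb.append('"').append(e.getKey()).append("\":");
            Object v = e.getValue();
            if (v instanceof Collection) {             // multi-valued field
                sb.append("[");
                boolean innerFirst = true;
                for (Object item : (Collection<?>) v) {
                    if (!innerFirst) sb.append(",");
                    innerFirst = false;
                    sb.append(encode(item));
                }
                sb.append("]");
            } else {                                   // single-valued field
                sb.append(encode(v));
            }
        }
        return sb.append("}").toString();
    }

    static String encode(Object v) {
        if (v instanceof Number || v instanceof Boolean) return v.toString();
        return '"' + v.toString().replace("\"", "\\\"") + '"';
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "doc1");
        doc.put("cat", Arrays.asList("a", "b"));
        System.out.println(toJson(doc)); // {"id":"doc1","cat":["a","b"]}
    }
}
```

In real code you would iterate over the SolrDocumentList from the QueryResponse instead of a hand-built map, but the instanceof Collection branch is the part that answers the multiValued question.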
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Chantal, However, with this approach indexing time went up from 20min to more than 5 hours. This is 15x slower than the initial solution... wow. From MySQL I know that IN ()-clauses are the embodiment of endlessness - they perform very, very badly. New idea: create a method which returns the query string: returnString(theVIP) { if (theVIP != null && !theVIP.equals("")) { return "a query string to find the VIP"; } else { return "SELECT ''"; // you need to modify this so that it matches your field definition } } The main idea is to perform a blazing fast query instead of a complex IN-clause query. Does this sound like a solution??? The new approach is to query the solr index for that other database that I've already setup. This is only a bit slower than the original query (20min). (I'm using URLDataSource to be 1.4.1 conform.) Unfortunately I cannot follow you. You are querying a Solr index for a database? Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-SQL-query-sub-entity-is-executed-although-variable-is-not-set-null-or-empty-list-tp995983p998859.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Chantal, instead of:

<entity name="prog" ...>
  <field name="vip" ... /> <!-- multivalued, not required -->
  <entity name="ssc_entry" dataSource="ssc" onError="continue"
          query="select SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1 and SSC_VALUE in (${prog.vip})">
    <field column="SSC_VALUE" name="vip_ssc" />
  </entity>
</entity>

you do:

<entity name="prog" ...>
  <field name="vip" ... /> <!-- multivalued, not required -->
  <entity name="ssc_entry" dataSource="ssc" onError="continue"
          query="${yourCustomFunctionToReturnAQueryString(prog.vip, ..., ...)}">
    <field column="SSC_VALUE" name="vip_ssc" />
  </entity>
</entity>

The yourCustomFunctionToReturnAQueryString(vip, querystring1, querystring2) {
  if (vip != null && !vip.equals("")) {
    StringBuilder sb = new StringBuilder(50);
    sb.append(querystring1); // "SELECT SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1 and SSC_VALUE in ("
    sb.append(vip);          // the VIP value
    sb.append(querystring2); // just the closing ")"
    return sb.toString();
  } else {
    return "SELECT '' AS yourFieldName";
  }
}

I expect that this method is called for every vip value, if there is one. Solr DIH uses the returned query string to query the database. So, if the vip value is empty or null, you can use a different query that is blazing fast (i.e. SELECT '' AS yourFieldName - just an example to show the logic). This query should return a row with an empty string, so Solr fills the current field with an empty string. I don't know how to prevent Solr from calling your ssc_entry entity when vip is null or empty. But this would be a solution to handle empty vip strings as efficiently as possible. I realized that I have to throw an exception and add the onError attribute to the entity to make that work. I am curious: can you show how to make a method throw an exception that is accepted by the onError attribute? I hope we do not talk past each other here.
:-) Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-SQL-query-sub-entity-is-executed-although-variable-is-not-set-null-or-empty-list-tp995983p998950.html Sent from the Solr - User mailing list archive at Nabble.com.
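The branching logic sketched above, as compilable Java (the method name, parameter names, and the dummy query are made up for illustration; inside DIH this would live in a custom Evaluator, not a plain static method):

```java
public class QueryStringSketch {
    // Build the real IN-clause query only when a vip value is present;
    // otherwise fall back to a trivially cheap dummy query that yields
    // a single empty-string row (Oracle-style FROM DUAL is an assumption).
    static String buildQuery(String vip, String queryPrefix, String querySuffix) {
        if (vip != null && !vip.equals("")) {
            return queryPrefix + vip + querySuffix;
        }
        return "SELECT '' AS vip_ssc FROM DUAL";
    }

    public static void main(String[] args) {
        String prefix = "select SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1 and SSC_VALUE in (";
        System.out.println(buildQuery("'A','B'", prefix, ")"));
        System.out.println(buildQuery(null, prefix, ")"));
    }
}
```

The point of the design is that the expensive IN-clause is only ever sent to the database when it can actually match something.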
Re: a bug of solr distributed search
Good morning, https://issues.apache.org/jira/browse/SOLR-1632 - Mitch Li Li wrote: where is the link of this patch? 2010/7/24 Yonik Seeley yo...@lucidimagination.com: On Fri, Jul 23, 2010 at 2:23 PM, MitchK mitc...@web.de wrote: why do we not send the output of TermsComponent of every node in the cluster to a Hadoop instance? Since TermsComponent does the map part of the map-reduce concept, Hadoop only needs to reduce the stuff. Maybe we do not even need Hadoop for this. After reducing, every node in the cluster gets the current values to compute the idf. We can store this information in a HashMap-based SolrCache (or something like that) to provide constant-time access. To keep the values up to date, we can repeat that every x minutes. There's already a patch in JIRA that does distributed IDF. Hadoop wouldn't be the right tool for that anyway... it's for batch-oriented systems, not low-latency queries. If we had that, it would not matter whether we use doc_X from shard_A or shard_B, since they would all have the same scores. That only works if the docs are exactly the same - they may not be. -Yonik http://www.lucidimagination.com -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p995407.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Doc Lucene Doc !?
Stockii, Solr's index is a Lucene Index. Therefore, Solr documents are Lucene documents. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Doc-Lucene-Doc-tp995922p995968.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Chantal, did you try to write a custom DIH function ( http://wiki.apache.org/solr/DIHCustomFunctions )? If not, I think this will be a solution. Just check whether ${prog.vip} is an empty string or null. If so, you need to replace it with a value that can never match anything, so the vip field will always be empty for such queries. Maybe that helps? Hopefully, the variable resolver is able to resolve something like ${dih.functions.getReplacementIfNeeded(prog.vip)}. Kind regards, - Mitch Chantal Ackermann wrote: Hi, my use case is the following: In a sub-entity I request rows from a database for an input list of strings:

<entity name="prog" ...>
  <field name="vip" ... /> <!-- multivalued, not required -->
  <entity name="ssc_entry" dataSource="ssc" onError="continue"
          query="select SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1 and SSC_VALUE in (${prog.vip})">
    <field column="SSC_VALUE" name="vip_ssc" />
  </entity>
</entity>

The root entity is prog and it has an optional multivalued field called vip. When the list of vip values is empty, the SQL for the sub-entity above throws an SQLException. (Working with Oracle, which does not allow an empty expression in the in-clause.) Two things: (A) best would be not to run the query whenever ${prog.vip} is null or empty. (B) From the documentation, it is not clear that onError is only checked in the transformer runs but not when the SQL for the entity throws an exception. (Trunk version JdbcDataSource, lines 250pp.) IMHO, (A) is the better fix, and if so, (B) is the right decision. (If (A) is not easily fixable, making (B) work would be helpful.) Looking through the code, I've realized that the replacement of the variables is done in a very generic way. I've not yet seen an appropriate way to check on those variables in order to stop the processing of the entity if the variable is empty. Is there a way to do this? Or maybe there is a completely different way to get my use case working. Any help most appreciated!
Thanks, Chantal -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-SQL-query-sub-entity-is-executed-although-variable-is-not-set-null-or-empty-list-tp995983p996446.html Sent from the Solr - User mailing list archive at Nabble.com.
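For completeness, a custom DIH function is wired up in data-config.xml roughly like this (the class and function names below are hypothetical placeholders; only the <function> element and the ${dih.functions...} call syntax come from the DIHCustomFunctions wiki page mentioned above):

```xml
<dataConfig>
  <!-- register the custom evaluator under a function name -->
  <function name="getReplacementIfNeeded" class="my.pkg.ReplacementEvaluator"/>
  <document>
    <entity name="prog" query="...">
      <entity name="ssc_entry" dataSource="ssc"
              query="select SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1
                     and SSC_VALUE in (${dih.functions.getReplacementIfNeeded(prog.vip)})">
        <field column="SSC_VALUE" name="vip_ssc"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```
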
Re: a bug of solr distributed search
Okay, but then Li Li did something wrong, right? I mean, if the document exists only on one shard, it should get the same score whenever one requests it, no? Of course, this only applies if nothing changes between the requests. The only remaining problem here would be that you need distributed IDF (like in the mentioned JIRA issue) to normalize your results' scoring. But the problem mentioned in this mailing-list posting has nothing to do with that... Regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p991907.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
Yonik, why do we not send the output of TermsComponent of every node in the cluster to a Hadoop instance? Since TermsComponent does the map part of the map-reduce concept, Hadoop only needs to reduce the stuff. Maybe we do not even need Hadoop for this. After reducing, every node in the cluster gets the current values to compute the idf. We can store this information in a HashMap-based SolrCache (or something like that) to provide constant-time access. To keep the values up to date, we can repeat that every x minutes. If we had that, it would not matter whether we use doc_X from shard_A or shard_B, since they would all have the same scores. Even with large indices of 10 million or more unique terms, this would only need a few megabytes of network traffic. Kind regards, - Mitch Yonik Seeley-2-2 wrote: As the comments suggest, it's not a bug, but just the best we can do for now since our priority queues don't support removal of arbitrary elements. I guess we could rebuild the current priority queue if we detect a duplicate, but that will have an obvious performance impact. Any other suggestions? -Yonik http://www.lucidimagination.com -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p990506.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
... In addition to my previous posting: to keep this in sync we could do two things. Wait for every server, to make sure that everyone uses the same values to compute the score, and then apply them. Or: let's say that we collect the new values every 15 minutes. To merge and send them over the network, we declare that this will need 3 additional minutes (we want to keep the network traffic for such actions very low, so we do not send everything instantly). Okay, and now we add 2 additional minutes, in case 3 were not enough or something needs a little bit more time than we thought. After those 2 minutes, every node has to apply the new values. Pro: if one node breaks, we do not delay the application of the new values. Con: we need two HashMaps and both will have roughly the same size. That means we will waste some RAM for this operation, if we do not write the values to disk (which I do not suggest). Thoughts? - Mitch MitchK wrote: Yonik, why do we not send the output of TermsComponent of every node in the cluster to a Hadoop instance? Since TermsComponent does the map part of the map-reduce concept, Hadoop only needs to reduce the stuff. Maybe we do not even need Hadoop for this. After reducing, every node in the cluster gets the current values to compute the idf. We can store this information in a HashMap-based SolrCache (or something like that) to provide constant-time access. To keep the values up to date, we can repeat that every x minutes. If we had that, it would not matter whether we use doc_X from shard_A or shard_B, since they would all have the same scores. Even with large indices of 10 million or more unique terms, this would only need a few megabytes of network traffic. Kind regards, - Mitch Yonik Seeley-2-2 wrote: As the comments suggest, it's not a bug, but just the best we can do for now since our priority queues don't support removal of arbitrary elements.
I guess we could rebuild the current priority queue if we detect a duplicate, but that will have an obvious performance impact. Any other suggestions? -Yonik http://www.lucidimagination.com -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p990551.html Sent from the Solr - User mailing list archive at Nabble.com.
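The reduce-and-distribute step described above can be sketched in plain Java: collect per-shard document frequencies (what TermsComponent can report), sum them, and compute a global idf per term. The idf formula used here is Lucene's classic 1 + ln(N / (df + 1)); the method and variable names are illustrative:

```java
import java.util.*;

public class GlobalIdfSketch {
    // Merge per-shard document frequencies (the "reduce" step) and compute
    // a global idf per term, which each node could then cache locally.
    static Map<String, Double> globalIdf(List<Map<String, Integer>> shardDfs, long totalDocs) {
        // sum document frequencies across shards
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> shard : shardDfs) {
            for (Map.Entry<String, Integer> e : shard.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        // idf(term) = 1 + ln(N / (df + 1)), as in Lucene's DefaultSimilarity
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : merged.entrySet()) {
            idf.put(e.getKey(), 1.0 + Math.log((double) totalDocs / (e.getValue() + 1)));
        }
        return idf;
    }
}
```

The resulting map is exactly the kind of structure a HashMap-based SolrCache could hold to give constant-time access to global idf values between refreshes.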
Re: a bug of solr distributed search
That only works if the docs are exactly the same - they may not be. Ahm, what? Why? If the uniqueID is the same, the docs *should* be the same, shouldn't they? -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p990563.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
Li Li, this is the intended behaviour, not a bug. Otherwise you could get back the same record several times in one response, which may not be what the user intends. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983675.html Sent from the Solr - User mailing list archive at Nabble.com.
nested query and number of matched records
Hello community, I have a situation where I know that some types of documents contain very extensive information and other types give more general information. Since I don't know whether a user searches for general or extensive information (and I don't want to ask him when he uses the default search), I want to give him a response like this: 10 documents of type: short, plus 1 document, if there is one, of type: extensive. An example query would look like this: q={!dismax fq=type:short}my cool query OR {!dismax fq=type:extensive}my cool query The problem with this one will be that I cannot specify that I want to retrieve up to 10 short documents and at most one extensive one. I think this will not work, and if I want to create such a search, I need to do two different queries. But before I waste performance, I wanted to ask. Thank you! Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/nested-query-and-number-of-matched-records-tp983756p983756.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
Ah, okay. I understand your problem. Why should doc x be at position 1 when searching for the first time, and at position 8 when I search a second time - right? I am not sure, but I think you can't prevent this without custom coding or making a document's occurrence unique. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983771.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: nested query and number of matched records
Oh,... I just noticed there is no direct question ;-). How can I specify the number of returned documents in the desired way *within* one request? - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/nested-query-and-number-of-matched-records-tp983756p983773.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
I don't know much about the code. Maybe you can tell me which file you are referring to? However, from the comments one can see that the problem is known, but it was decided to let it happen because of system requirements in the Java version. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983880.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
It was already sorted by score. The problem here is the following: shard_A and shard_B both contain doc_X. If you query for something, doc_X could have a score of 1.0 at shard_A and a score of 12.0 at shard_B. You can never be sure which copy Solr sees first. In the bad case, Solr sees doc_X first at shard_A and ignores it at shard_B. That means that the doc may occur at page 10 in the pagination, although it *should* occur at page 1 or 2. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p984743.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: nested query and number of matched records
Thank you three for your feedback! Chantal, unfortunately kenf is right: faceting won't work in this special case. Parallel calls: yes, this will be the solution. However, this would lead to a second HTTP request, and I had hoped to be able to avoid that. Chantal Ackermann wrote: Sure SOLR supports this: use facets on the field type: add to your regular query: facet=true&facet.field=type see http://wiki.apache.org/solr/SimpleFacetParameters On Wed, 2010-07-21 at 15:48 +0200, kenf_nc wrote: parallel calls. Simultaneously query for type:short with rows=10 and type:extensive with rows=1 and merge your results. This would also let you separate your short docs from your extensive docs into different Solr instances if you wished... depending on your document architecture this could speed up one or the other. -- View this message in context: http://lucene.472066.n3.nabble.com/nested-query-and-number-of-matched-records-tp983756p984750.html Sent from the Solr - User mailing list archive at Nabble.com.
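The client-side merge for kenf's parallel-calls approach is only a few lines. In this plain-Java sketch, doc ids stand in for real result objects, and the 10/1 split is simply the numbers from this thread:

```java
import java.util.*;

public class MergeSketch {
    // Merge the results of the two parallel queries: keep up to 10 docs
    // from the "short" result list and at most 1 from the "extensive" one.
    static List<String> merge(List<String> shortDocs, List<String> extensiveDocs) {
        List<String> out = new ArrayList<>(shortDocs.subList(0, Math.min(10, shortDocs.size())));
        if (!extensiveDocs.isEmpty()) {
            out.add(extensiveDocs.get(0));
        }
        return out;
    }
}
```

With SolrJ, shortDocs and extensiveDocs would come from two concurrent queries using fq=type:short&rows=10 and fq=type:extensive&rows=1 respectively.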
Re: Autocomplete with NGrams
It sounds like the best solution here, right. However, I do not want to exclude the possibility of doing things in one core that one *should* do in different cores with different configurations and schema.xml files. I haven't completely read the lucidimagination article, but I would suggest doing your work in different cores, since it would make managing and configuring the different tasks easier. Furthermore, the configuration optimized for task A (a normal index where you search) may work worse or be wasteful for task B. To prevent such situations you must use multicore setups. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Autocomplete-with-NGrams-tp979312p980680.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Beginner question
Here you can find the params and their meanings for the dismax handler. You may not find anything in the wiki by searching for a parser ;). Link: http://wiki.apache.org/solr/DisMaxRequestHandler Wiki: DisMaxRequestHandler Kind regards - Mitch Erik Hatcher-4 wrote: Consider using the dismax query parser instead. It has more sophisticated capability to spread user queries across multiple fields with different weightings. Erik On Jul 20, 2010, at 4:34 AM, Bilgin Ibryam wrote: Hi all, I have two simple questions: I have an Item entity with id, name, category and description fields. The main requirement is to be able to search in all the fields with the same string and different priority per field, so matches in name appear before category matches, and they appear before description field matches in the result list. 1. I think to create an index having the same fields, because each field needs a different priority during searching. 2. And then do the search with a query like this: name:search_string^1.3 OR category:search_string^1.2 OR description:search_string^1.1 Is this the right approach to model the index and search query? Thanks in advance. Bilgin -- View this message in context: http://lucene.472066.n3.nabble.com/Beginner-question-tp980695p980819.html Sent from the Solr - User mailing list archive at Nabble.com.
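For the use case quoted above, dismax puts the per-field weights into the qf parameter instead of a hand-built boolean query. A request could look roughly like this (the boost values simply mirror the ones from Bilgin's mail):

```
q=search_string&defType=dismax&qf=name^1.3 category^1.2 description^1.1
```

With dismax, q stays the raw user input and the per-field boosts live entirely in qf.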
Problem with Solr-Mailinglist
Hello, I have tried to post this message ( http://lucene.472066.n3.nabble.com/Solr-in-an-extra-project-what-about-replication-scaling-etc-td977961.html#a977961 ) to the Solr mailing list for the fourth time, and every time I get the following response from the mailing list's server: solr-user@lucene.apache.org SMTP error from remote mail server after end of data: host mx1.eu.apache.org [192.87.106.230]: 552 spam score (7.8) exceeded threshold Why is my posting declared as spam?! Has anyone else had such problems? Thank you! - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-with-Solr-Mailinglist-tp978247p978247.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Problem with Solr-Mailinglist
Thank you both. I will do what Hoss suggested tomorrow. The mail was sent via the Nabble board and a second time via my Thunderbird client, both with the same result. So there was no more HTML code in it than in any of my other postings. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-with-Solr-Mailinglist-tp978247p979602.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Autocomplete with NGrams
Frank, have a look at Solr's example directory and look for 'multicore'. There you can see an example configuration for a multicore environment. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Autocomplete-with-NGrams-tp979312p979610.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr with hadoop
I need to revive this discussion... If you do distributed indexing correctly, what about updating the documents and what about replicating them correctly? Does this work? Or wasn't this an issue? Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p944413.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Wither field compresed=true ?
David, well, I am no committer, but I noticed that Lucene no longer supports field compression (I think this was because of the trouble it caused), and maybe this is the reason why Solr no longer makes this option available. Unfortunately, I do not have a link for it, but I think this was said in some CHANGES.txt (at Nutch, I think). Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Wither-field-compresed-true-tp926288p929985.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How I can use score value for my function
Britske, good workaround! I did not think about the possibility of using subqueries. Regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/How-I-can-use-score-value-for-my-function-tp899662p931448.html Sent from the Solr - User mailing list archive at Nabble.com.
Question about the mailinglist (junk on my behalf)
Hello community, for a few days now I have been receiving daily mails with suspicious content. They say that some of my mails were rejected because of the file types of their attachments, among other things. This surprises me a lot, because I didn't send any mails with attachments, and even the e-mail addresses that want to make me aware of my rejected mails are unknown to me. This is the first mailing list I have joined, and I know that there are a lot of bots out there crawling for e-mail addresses to send junk. However, I can't recognize any suspicious behaviour except those mails. The number of mails making me aware of this is 10 in a few days, maybe 15 but not more. And I do not get more junk than I normally get. Does anyone else receive suspicious e-mails on my behalf? Thank you. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Question-about-the-mailinglist-junk-on-my-behalf-tp927461p927461.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: MoreLikeThis (mlt) : use the match's maxScore for result score normalization
Hi Chantal, Munich? Germany seems to be soo small :-). Chantal Ackermann wrote: I only want a way to show to the user a kind of relevancy or similarity indicator (for example using a range of 10 stars) that would give a hint on how similar the mlt hit is to the input (match) item. Okay, that makes more sense. Unfortunately, as far as I know, you cannot do that with Lucene and get results that fit your needs. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/MoreLikeThis-mlt-use-the-match-s-maxScore-for-result-score-normalization-tp919598p921942.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 1.4 - Image-Highlighting and Payloads
Sebastian, sounds like an exciting project. We've found the argument TokenGroup in method highlightTerm implemented in SimpleHtmlFormatter. TokenGroup provides the method getPayload(), but the returned value is always NULL. No, Token provides this method, not TokenGroup. But this might not be the mistake. Hm, since this approach is very special, I would suggest doing something easier. You already have tools to retrieve the words and their positions from the image, right? What if you added a field to the schema.xml with a preprocessed input string? I.e., you get two fields: the page's text and the page's word positions. The word-positions field needs preprocessing outside of Solr, where you add the coordinates of the words. This preprocessing will be a little bit tricky: if the 10th word is Solr and the 30th word is too, you do not want to store solr twice with different coordinates. In fact, you want to store both coordinates for the term solr. On the Solr side you can then add this preprocessed string to a field with TermVectors. If your query hits the page, you will get all the coordinates you want. Unfortunately, highlighting must be done on the client side. Hope this helps - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-1-4-Image-Highlighting-and-Payloads-tp919266p919342.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: MoreLikeThis (mlt) : use the match's maxScore for result score normalization
Chantal, have a look at MoreLikeThis ( http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/similar/MoreLikeThis.html ) to get an idea of what the MLT's score means. The problem is that you can't compare scores. The query for the normal result response was maybe something like Bill Gates featuring Linus Torvald - The perfect OS song. The user now picks one of the returned documents and says he wants More like this - maybe because the topic was okay, but the content was not enough, or whatever... But the query that gets sent is totally different (as you can see in the link) - so comparing the scores would be like comparing apples and oranges, since they do not use the same base. What would be the use case? Why is score normalization needed? Kind regards from Germany, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/MoreLikeThis-mlt-use-the-match-s-maxScore-for-result-score-normalization-tp919598p919716.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr with hadoop
I wanted to add a JIRA issue about exactly what Otis is asking here. Unfortunately, I don't have time for it because of my exams. However, I'd like to add a question to Otis' ones: if you distribute the indexing process this way, are you able to replicate the different documents correctly? Thank you. - Mitch Otis Gospodnetic-2 wrote: Stu, Interesting! Can you provide more details about your setup? By load balance the indexing stage you mean distribute the indexing process, right? Do you simply take your content to be indexed, split it into N chunks where N matches the number of TaskNodes in your Hadoop cluster and provide a map function that does the indexing? What does the reduce function do? Does that call IndexWriter.addAllIndexes or do you do that outside Hadoop? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Stu Hood stuh...@webmail.us To: solr-user@lucene.apache.org Sent: Monday, January 7, 2008 7:14:20 PM Subject: Re: solr with hadoop As Mike suggested, we use Hadoop to organize our data en route to Solr. Hadoop allows us to load balance the indexing stage, and then we use the raw Lucene IndexWriter.addAllIndexes method to merge the data to be hosted on Solr instances. Thanks, Stu -Original Message- From: Mike Klaas mike.kl...@gmail.com Sent: Friday, January 4, 2008 3:04pm To: solr-user@lucene.apache.org Subject: Re: solr with hadoop On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote: I have a huge index base (about 110 million documents, 100 fields each). But the size of the index base is reasonable, it's about 70 Gb. All I need is to increase performance, since some queries, which match a big number of documents, are running slow. So I was thinking: are there any benefits to using Hadoop for this? And if so, what direction should I go? Has anybody done something to integrate Solr with Hadoop? Does it give any performance boost?
Hadoop might be useful for organizing your data enroute to Solr, but I don't see how it could be used to boost performance over a huge Solr index. To accomplish that, you need to split it up over two machines (for which you might find hadoop useful). -Mike -- View this message in context: http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p914589.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Re: Re: Solr and Nutch/Droids - to use or not to use?
Solr doesn't know anything about OPIC, but I suppose you can feed the OPIC score computed by Nutch into a Solr field and use it during scoring, if you want, say with a function query. Oh! Yes, that makes more sense than using the OPIC score as a doc-boost value. :-) Somewhere on the Lucene mailing lists I read that in the future it will be possible to change a field's contents without reindexing the whole document. If one stores the OPIC score (which is independent of the page's content) in a field and uses a function query to influence the score of a document, one saves the effort of reindexing the whole doc if the content did not change. Regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p902158.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Re: Re: Solr and Nutch/Droids - to use or not to use?
Otis, you are right. I wasn't aware of this. At least not with such a large data list (think of an index with 4 million docs; this would mean we'd get an external file with 4 million records). But from what I've read at search-lucene.com it seems to perform very well. Thanks for the idea! Btw: Otis, did you open a JIRA issue for the distributed indexing ability of Solr? I would like to follow the issue, if it is open. Regards - Mitch Otis Gospodnetic-2 wrote: Mitch, Yes, one day. But it sounds like you are not aware of ExternalFileField, which you can use today: http://search-lucene.com/?q=ExternalFileField&fc_project=Solr Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: MitchK mitc...@web.de To: solr-user@lucene.apache.org Sent: Thu, June 17, 2010 4:15:27 AM Subject: Re: Re: Re: Solr and Nutch/Droids - to use or not to use? Solr doesn't know anything about OPIC, but I suppose you can feed the OPIC score computed by Nutch into a Solr field and use it during scoring, if you want, say with a function query. Oh! Yes, that makes more sense than using the OPIC score as a doc-boost value. :-) Somewhere on the Lucene mailing lists I read that in the future it will be possible to change a field's contents without reindexing the whole document. If one stores the OPIC score (which is independent of the page's content) in a field and uses a function query to influence the score of a document, one saves the effort of reindexing the whole doc if the content did not change. Regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p902158.html Sent from the Solr - User mailing list archive at Nabble.com. 
-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p903148.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Document boosting troubles
Hi, first of all, are you sure that row.put('$docBoost',docBoostVal) is correct? I think it should be row.put($docBoost,docBoostVal); - unfortunately I am not sure. Hm, I think, until you solve the problem with the docBoost itself, you should use a function query. Use div(1,rank) as boost function (bf). The higher the rank value, the smaller the result. Hope this helps! - Mitch dbashford wrote: Brand new to this sort of thing, so bear with me. For the sake of simplicity, I've got a two-field document: title and rank. Title gets searched on; rank has values from 1 to 10, 1 being highest. What I'd like to do is boost results of searches on title based on the document's rank. Because it's fairly cut and dried, I was hoping to do it during indexing. I have this in my DIH transformer:

var docBoostVal = 0;
switch (rank) {
  case '1': docBoostVal = 3.0; break;
  case '2': docBoostVal = 2.6; break;
  case '3': docBoostVal = 2.2; break;
  case '4': docBoostVal = 1.8; break;
  case '5': docBoostVal = 1.5; break;
  case '6': docBoostVal = 1.2; break;
  case '7': docBoostVal = 0.9; break;
  case '8': docBoostVal = 0.7; break;
  case '9': docBoostVal = 0.5; break;
}
row.put('$docBoost', docBoostVal);

It's my understanding that with this, I can simply do the same /select queries I've been doing and expect documents to be boosted, but that doesn't seem to be happening, because I'm seeing things like this in the results:

{title:"Some title 1", rank:10, score:0.11726039},
{title:"Some title 2", rank:7, score:0.11726039},

Pretty much everything with the same score. Whatever I'm doing isn't making its way through. (To cover my bases I did try the case statement with integers rather than strings; same result.) With that not working, I started looking at other options and started playing with dismax. I'm able to add this to a query string and get results I'm somewhat expecting... 
bq=rank:1^3.0 rank:2^2.6 rank:3^2.2 rank:4^1.8 rank:5^1.5 rank:6^1.2 rank:7^0.9 rank:8^0.7 rank:9^0.5 ...but I guess I wasn't expecting it to ONLY rank based on those factors. That essentially gives me a sort by rank. I'm trying to be super inclusive with the search, so while I'm fiddling my mm=11. As expected, a query like q=red door is returning everything that contains "red" and "door". But I was hoping that items that matched "red door" exactly would sort closer to the top, and if that exact match was a rank 7, that its score wouldn't be exactly the same as all the other rank 7s. Ditto if I searched for q=The Tales Of: anything possessing all 3 terms would sort closer to the top, anything possessing two terms behind them, anything possessing 1 term behind them, and within those groups weight heavily by rank. I think I understand that the score is based entirely on the boosts I provide... so how do I get something more like what I'm looking for? Along those lines, I initially had put something like this in my defaults... <str name="bf">rank:1^10.0 rank:2^9.0 rank:3^8.0 rank:4^7.0 rank:5^6.0 rank:6^5.0 rank:7^4.0 rank:8^3.0 rank:9^2.0</str> ...but that was not working; queries fail with a syntax exception. Guessing this won't work? Thanks in advance for any help you can provide. -- View this message in context: http://lucene.472066.n3.nabble.com/Document-boosting-troubles-tp902982p903190.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Document boosting troubles
Sorry, I overlooked your other question. <str name="bf">rank:1^10.0 rank:2^9.0 rank:3^8.0 rank:4^7.0 rank:5^6.0 rank:6^5.0 rank:7^4.0 rank:8^3.0 rank:9^2.0</str> This is wrong: you need to change bf to bq. bf = boost function; bq = boost query. -- View this message in context: http://lucene.472066.n3.nabble.com/Document-boosting-troubles-tp902982p903208.html Sent from the Solr - User mailing list archive at Nabble.com.
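For reference, the same defaults entry with bq instead of bf would look like this (a sketch; the weights are the original poster's):

```xml
<str name="bq">rank:1^10.0 rank:2^9.0 rank:3^8.0 rank:4^7.0 rank:5^6.0 rank:6^5.0 rank:7^4.0 rank:8^3.0 rank:9^2.0</str>
```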
Re: solr multi-node
Antonello, here are a few links to the Solr Wiki: http://wiki.apache.org/solr/SolrReplication Solr Replication http://wiki.apache.org/solr/DistributedSearchDesign Distributed Search Design http://wiki.apache.org/solr/DistributedSearch Distributed Search http://wiki.apache.org/solr/SolrCloud Solr Cloud Hope this helps. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/solr-multi-node-tp903159p903228.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Master master?
What is the use case for such an architecture? Do you send requests to two different masters for indexing, and that's why they need to be synchronized? Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Master-master-tp884253p903233.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Document boosting troubles
Hi, One problem down, two left! =) bf == bq did the trick, thanks. Now at least if I can't get the DIH solution working, I don't have to tack that onto every query string. I would really recommend using a boost function. If your rank changes in future implementations, you do not need to redefine the bq. Besides that, I think this is not only more comfortable, but it also scales better. The bq param is more for things like "boost this category" or "boost docs of an advertisement campaign" or something like that. I am not sure, since I never worked with the DIH this way, but - from my logic - the problem could be that you do not return the row, right? If you don't, try it again after adding "return row" to your source code. Otherwise, I can't help you, since there are no more code examples available on the mailing list (from what I have seen). Maybe this mailing-list topic helps you: http://lucene.472066.n3.nabble.com/Using-DIH-s-special-commands-Help-needed-td475695.html#a475695 "Using DIH's special commands - Help needed". There are some suggestions... however, it seems like he wasn't able to solve the problem. "And still can't figure out what I need to do with my dismax querying to get scores for quality of match." I don't really understand what you mean. Can you explain it a little bit more? What, except the $docBoost, does not work as it should? Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Document-boosting-troubles-tp902982p904129.html Sent from the Solr - User mailing list archive at Nabble.com.
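For completeness, here is how the transformer from the original post would look with the suggested "return row;" added. The plain-object fallback stands in for DIH's Java Map row; this is an untested sketch, not verified against a real DIH setup:

```javascript
// DIH script-transformer sketch: map rank to a $docBoost value and
// return the row so DIH actually picks the change up.
function applyDocBoost(row, rank) {
  var boosts = { '1': 3.0, '2': 2.6, '3': 2.2, '4': 1.8, '5': 1.5,
                 '6': 1.2, '7': 0.9, '8': 0.7, '9': 0.5 };
  var docBoostVal = boosts[rank] || 0;
  if (row.put) { row.put('$docBoost', docBoostVal); } // DIH's Java Map
  else { row['$docBoost'] = docBoostVal; }            // plain-object fallback
  return row; // without this line, DIH discards the modification
}
```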
Re: DismaxRequestHandler
Joe, please, can you provide an example of what you are thinking of? Subqueries with Solr... I've never seen something like that before. Thank you! Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/DismaxRequestHandler-tp903641p904142.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Re: Re: Solr and Nutch/Droids - to use or not to use?
Otis, And again I wish I were registered. I will check JIRA and when I feel comfortable with it, I will open the issue. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p904145.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Question on dynamic fields
Barani, without more background on dynamic fields, I would say the easiest way would be to define a suffix for each of the fields you want to index into the mentioned dynamic field and to redefine your dynamic-field condition. If a suffix does not work because of other dynamic-field declarations, use a prefix: instead of *_bla to match myField_bla, you can use bla_* to match bla_myField. Hope this helps, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Question-on-dynamic-fields-tp904053p904159.html Sent from the Solr - User mailing list archive at Nabble.com.
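The prefix/suffix matching above behaves roughly like a single-wildcard glob. This is a simplified model of Solr's dynamic-field rule for illustration, not Solr's actual code:

```javascript
// A dynamic-field pattern contains one "*", at the start or the end;
// a field name matches when it carries the fixed suffix or prefix.
function matchesDynamicField(pattern, fieldName) {
  if (pattern.startsWith('*')) return fieldName.endsWith(pattern.slice(1));
  if (pattern.endsWith('*')) return fieldName.startsWith(pattern.slice(0, -1));
  return pattern === fieldName;
}
```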
Solr and Nutch/Droids - to use or not to use?
Hello community, from several discussions about Solr and Nutch, I have some questions for a virtual web search engine. I know I posted this message to the mailing list a few days ago, but the thread got hijacked and I did not get any more postings on the topic, so I am trying to reopen it - hopefully no one gets upset here :-). Please bear with me. Thank you. The requirements: I. I need a scalable solution for a growing index that becomes larger than one machine can handle. If I add more hardware, I want to improve performance linearly. II. I want to use technologies like the OPIC algorithm (the default algorithm in Nutch) or PageRank or... whatever is out there to improve the ranking of the webpages. III. I want to be able to easily add more fields to my documents. Imagine one retrieves information from a webpage's content; then I want to make it searchable. IV. While fetching my data, I want to make special searches possible. For example, I want to retrieve pictures from a webpage and index picture-related content into another search index, plus I want to save a small thumbnail of the picture itself. Btw: this is (as far as I know) not possible with Solr, because Solr was not intended for such special indexing logic. V. I want to use filter queries (i.e. the main query "christopher lee" returns 1.5 million results; for the sub-query "action", the main query would be a filter query and "action" would be the actual query - so a search within search results would be easily made available). VI. I want to be able to use different logic for different pages. Maybe I have a pool of 100 domains that I know better than others, and I have special scripts that retrieve more specific information from those 100 domains. Then I want to apply my special logic to those 100 domains, but every other domain should use the default logic. - The project is only virtual. So why am I asking? I want to learn more about web search and I would like to gain some new experience. 
What do I know about Solr + Nutch: As it is said on lucidimagination.com, Solr + Nutch does not scale if the index is too large. The article was a little bit older and I don't know whether this problem is fixed by the new distributed abilities of Solr. Furthermore, I don't want to index the pages with Nutch and reindex them with Solr. The only exception would be: if the content of a webpage gets indexed by Nutch, I want to use the already tokenized content of the body with some Solr copyField operations to extend the search (i.e. making fuzzy search possible). At the moment I don't think this is possible. I don't know much about the Droids project and how well it is documented, but from what I can read in some posts by Otis, it seems to be usable as a crawler framework. Pros for Nutch are: it is very scalable! Thanks to Hadoop and MapReduce it is a scaling monster (from what I've read). Cons: the search is not as rich as it is with Solr, and extending Nutch's search abilities *seems* to be more complicated than with Solr. Furthermore, if I want to use Solr to search Nutch's index, looking at my requirements I would need to reindex the whole thing - without the benefits of Hadoop. What I don't know at the moment is how it is possible to use algorithms like those mentioned in II. with Solr. I hope you understand the problem here - Solr *seems* to me as if it would not be the best solution for a web search engine, because of scaling reasons in indexing. Where should I dive deeper? Solr + Droids? Solr + Nutch? Nutch + howToExtendNutchToMakeSearchBetter? Thanks for the discussion! - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p900069.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr and Nutch/Droids - to use or not to use?
Thank you for the feedback, Otis. Yes, I thought that such an approach is useful if the number of pages to crawl is relatively low. However, what about using Solr + Nutch? Does the problem that this would not scale if the index becomes too large still exist? What about extending Nutch with features such as the DisMaxRequestHandler - is the amount of work larger than it would be in Solr? The big pro of Solr is that I can enhance the whole thing in a few minutes if I need more extra information to improve the search. That makes it very easy to experiment with boostings, filters, etc. As far as I know, Nutch does not offer such great features. Do you know a little bit more about that? Probably I should ask such questions on the Nutch mailing list, but at the moment I hope that I can achieve as much as I can with Solr, because I have no experience with Hadoop, and Nutch seems to require it. Thank you! - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p900480.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr and Nutch/Droids - to use or not to use?
Thanks, that really helps to find the right beginning for such a journey. :-) "* Use Solr, not Nutch's search webapp" As far as I have read, Solr can't scale if the index gets too large for one server: "The setup explained here has one significant caveat you also need to keep in mind: scale. You cannot use this kind of setup with vertical scale (collection size) that goes beyond one Solr box. The horizontal scaling (query throughput) is still possible with the standard Solr replication tools." ...from Lucidimagination.com. Is this still the case? Furthermore, as far as I have understood this blogpost: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (Lucidimagination.com: Nutch and Solr), they index the whole stuff with Nutch and reindex it into Solr - sounds like a lot of redundant work. Lucid, Sematext and the Nutch wiki are the only information sources where I can find talks about Nutch and Solr, but no one seems to talk about these facts - except this one blogpost. If you say this is wrong or contingent on the shown setup, can you tell me how to avoid these problems? A lot of questions, but it's such an exciting topic... Hopefully you can answer some of them. Again, thank you for the feedback, Otis. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p900604.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Re: Re: Solr and Nutch/Droids - to use or not to use?
Good morning! Great feedback from you all. This really helped a lot to get an impression of what is possible and what is not. What is interesting to me are some detail questions. Let's assume Solr is able to work on its own with distributed indexing, so that the client does not need to know anything about shards etc. What is interesting to me is: I. The scoring - Nutch uses special scoring implementations like the OPIC algorithm. Can Solr use such improvements, or do I need to reimplement them for Solr? II. The indexing. At the moment it really sounds like Nutch would index the whole stuff and afterwards Solr does the job again. Regarding indexing, it would make sense if Nutch computes things like the document boost (I am not sure, but I think the results of the OPIC algorithm were added to each document as a boost) and sends an indexing request to Solr afterwards. However, if Nutch indexes the page's content and Solr does it, too - I would waste some time, no? Is this the case, or did I misunderstand something here? III. I am no Java expert. However, in a few months I will start to study computer science at a university. Maybe I will find some literature to learn more about distributed software and how hashing needs to work to make distributed indexing do its job. Maybe then I can help to implement this feature in Solr. On the other hand, not much is known about Solr's distributed search concept and which classes are responsible for it - but such things one could ask on the mailing list, no? As far as I know, Elastic Search already supports distributed indexing. Maybe one can reuse the responsible implementation for Solr. Btw: I think a great benefit of using Solr + Nutch would be to extend the search. I could create several Solr cores for different kinds of search - one for picture search, one for video search etc. *and* with the help of Nutch I can index some of the needed content in special directories. 
So Solr does not need to care about indexing a picture - Nutch already does the job. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p901943.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr and Nutch/Droids - to use or not to use?
Just wanted to push the topic a little bit, because these questions come up quite often and it's very interesting to me. Thank you! - Mitch MitchK wrote: Hello community and a nice Saturday, from several discussions about Solr and Nutch, I have some questions for a virtual web search engine. The requirements: I. I need a scalable solution for a growing index that becomes larger than one machine can handle. If I add more hardware, I want to improve performance linearly. II. I want to use technologies like the OPIC algorithm (the default algorithm in Nutch) or PageRank or... whatever is out there to improve the ranking of the webpages. III. I want to be able to easily add more fields to my documents. Imagine one retrieves information from a webpage's content; then I want to make it searchable. IV. While fetching my data, I want to make special searches possible. For example, I want to retrieve pictures from a webpage and index picture-related content into another search index, plus I want to save a small thumbnail of the picture itself. Btw: this is (as far as I know) not possible with Solr, because Solr was not intended for such special indexing logic. V. I want to use filter queries (i.e. the main query "christopher lee" returns 1.5 million results; for the sub-query "action", the main query would be a filter query and "action" would be the actual query - so a search within search results would be easily made available). VI. I want to be able to use different logic for different pages. Maybe I have a pool of 100 domains that I know better than others, and I have special scripts that retrieve more specific information from those 100 domains. Then I want to apply my special logic to those 100 domains, but every other domain should use the default logic. - The project is only virtual. So why am I asking? I want to learn more about web search and I would like to gain some new experience. 
What do I know about Solr + Nutch: As it is said on lucidimagination.com, Solr + Nutch does not scale if the index is too large. The article was a little bit older and I don't know whether this problem is fixed by the new distributed abilities of Solr. Furthermore, I don't want to index the pages with Nutch and reindex them with Solr. The only exception would be: if the content of a webpage gets indexed by Nutch, I want to use the already tokenized content of the body with some Solr copyField operations to extend the search (i.e. making fuzzy search possible). At the moment I don't think this is possible. I don't know much about the Droids project and how well it is documented, but from what I can read in some posts by Otis, it seems to be usable as a crawler framework. Pros for Nutch are: it is very scalable! Thanks to Hadoop and MapReduce it is a scaling monster (from what I've read). Cons: the search is not as rich as it is with Solr, and extending Nutch's search abilities *seems* to be more complicated than with Solr. Furthermore, if I want to use Solr to search Nutch's index, looking at my requirements I would need to reindex the whole thing - without the benefits of Hadoop. What I don't know at the moment is how it is possible to use algorithms like those mentioned in II. with Solr. I hope you understand the problem here - Solr *seems* to me as if it would not be the best solution for a web search engine, because of scaling reasons in indexing. Where should I dive deeper? Solr + Droids? Solr + Nutch? Nutch + howToExtendNutchToMakeSearchBetter? Thanks for the discussion! - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp890640p894391.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr DataConfig / DIH Question
Guys??? You are in the wrong thread. Please send a new message to the mailing list; do not reply to existing posts. Thank you. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp890640p892041.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr and Nutch/Droids - to use or not to use?
Hello community and a nice Saturday, from several discussions about Solr and Nutch, I have some questions for a virtual web search engine. The requirements: I. I need a scalable solution for a growing index that becomes larger than one machine can handle. If I add more hardware, I want to improve performance linearly. II. I want to use technologies like the OPIC algorithm (the default algorithm in Nutch) or PageRank or... whatever is out there to improve the ranking of the webpages. III. I want to be able to easily add more fields to my documents. Imagine one retrieves information from a webpage's content; then I want to make it searchable. IV. While fetching my data, I want to make special searches possible. For example, I want to retrieve pictures from a webpage and index picture-related content into another search index, plus I want to save a small thumbnail of the picture itself. Btw: this is (as far as I know) not possible with Solr, because Solr was not intended for such special indexing logic. V. I want to use filter queries (i.e. the main query "christopher lee" returns 1.5 million results; for the sub-query "action", the main query would be a filter query and "action" would be the actual query - so a search within search results would be easily made available). VI. I want to be able to use different logic for different pages. Maybe I have a pool of 100 domains that I know better than others, and I have special scripts that retrieve more specific information from those 100 domains. Then I want to apply my special logic to those 100 domains, but every other domain should use the default logic. - The project is only virtual. So why am I asking? I want to learn more about web search and I would like to gain some new experience. What do I know about Solr + Nutch: As it is said on lucidimagination.com, Solr + Nutch does not scale if the index is too large. The article was a little bit older and I don't know whether this problem is fixed by the new distributed abilities of Solr. 
Furthermore, I don't want to index the pages with Nutch and reindex them with Solr. The only exception would be: if the content of a webpage gets indexed by Nutch, I want to use the already tokenized content of the body with some Solr copyField operations to extend the search (i.e. making fuzzy search possible). At the moment I don't think this is possible. I don't know much about the Droids project and how well it is documented, but from what I can read in some posts by Otis, it seems to be usable as a crawler framework. Pros for Nutch are: it is very scalable! Thanks to Hadoop and MapReduce it is a scaling monster (from what I've read). Cons: the search is not as rich as it is with Solr, and extending Nutch's search abilities *seems* to be more complicated than with Solr. Furthermore, if I want to use Solr to search Nutch's index, looking at my requirements I would need to reindex the whole thing - without the benefits of Hadoop. What I don't know at the moment is how it is possible to use algorithms like those mentioned in II. with Solr. I hope you understand the problem here - Solr *seems* to me as if it would not be the best solution for a web search engine, because of scaling reasons in indexing. Where should I dive deeper? Solr + Droids? Solr + Nutch? Nutch + howToExtendNutchToMakeSearchBetter? Thanks for the discussion! - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp890640p890640.html Sent from the Solr - User mailing list archive at Nabble.com.
conditional Document Boost
Hello out there, I am searching for a solution for conditional document boosting. While analyzing the fields of a document, I want to create a document boost based on some metrics. There are three approaches: First: I preprocess the data. The main problem with this is that I need to take care of the preprocessing part myself and can't do it out of the box (implementing an analyzer, computing the boosting value, and afterwards storing those values or sending them to Solr). Second: Using the UpdateRequestProcessor (does it work with DIH?). However, this would also be custom work, plus taking care that the used params are up to date. Third: Setting the document boost while the analyzing process is running, with the help of a TokenFilter (is this possible?). What would you do? I think what I want to do is quite the same as working with Mahout and Solr. I never worked with Mahout - but how can I use it to improve the user's search experience? Where can I use Mahout in Solr if I want to influence documents' boosts? And where in general (i.e. for classification)? References, ideas and whatever could be useful are welcome :-). Thank you. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/conditional-Document-Boost-tp871108p871108.html Sent from the Solr - User mailing list archive at Nabble.com.
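Whichever hook ends up computing it, the boost itself is just a function of the analyzed metrics. A toy sketch, with all metric names and weights invented for illustration:

```javascript
// Toy conditional document boost: combine hypothetical analysis
// metrics into one multiplicative boost, clamped so outliers can't
// dominate the ranking.
function computeDocBoost(metrics) {
  let boost = 1.0;
  if (metrics.titleMatchesQueryTerms) boost *= 2.0;
  if (metrics.inboundLinks > 100) boost *= 1.5;
  return Math.min(boost, 2.0); // clamp
}
```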
Re: sort by function
Where is your query? You don't search for anything - the q param is empty. You have two options (untested): remove the q param, or search for something specific. I think removing is not a good idea. Instead, searching for *:* would retrieve ALL documents that match your filter query. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/sort-by-function-tp814380p839167.html Sent from the Solr - User mailing list archive at Nabble.com.
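A sketch of what the amended request could look like; the fq value and the sort function here are illustrative, not taken from the original post:

```javascript
// Match-all query plus a filter query and a sort-by-function clause.
function buildSortByFunctionQuery(fq) {
  const params = new URLSearchParams({
    q: '*:*',                        // match every document...
    fq: fq,                          // ...restricted by the filter query
    sort: 'product(0.88,rank) desc', // illustrative sort function
  });
  return '/select?' + params.toString();
}
```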
Re: IndexSearcher and Caches
Ahh, now I understand. No, you need no second IndexSearcher as long as the server is alive. You can reuse your searcher for every user. The only commands you are executing per user are those that create a search query. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/IndexSearcher-and-Caches-tp833567p840228.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: IndexSearcher and Caches
Good question. Well, I never worked productively with SolrJ. But two things: First: as the documentation says, you *should* get your IndexSearcher from your SolrQueryRequest object. Second: as a developer of SolrJ I would do as much as I can automatically behind the scenes. That means that if you do a commit, the index searcher should be renewed automatically. But that's a guess. I can't answer this question for you, sorry. Maybe this link helps? http://lucene.472066.n3.nabble.com/Solr-commit-issue-td770315.html#a770453 (searched with the following keywords: solrj commit searcher) I am new to Java, and the concept of Java Enterprise Edition's servlets is not yet fully clear to me. Please, let me ask a question. Let me give you an example: if I use a SolrServer inside my application (it's a servlet), I should create it when I start the servlet. Should I cache the instantiated SolrServer object with the help of the servlet's cache? And should my cache implementation provide a getSolrServer() method? Maybe this is a question more related to the JavaEE concept. Thank you. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/IndexSearcher-and-Caches-tp833567p840479.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: IndexSearcher and Caches
"In my case, I have an index which will not be modified after creation. Does this mean that in a multi-user scenario, I can have a static IndexSearcher object that can be shared by multiple users?" I am not sure what you mean by a multi-user scenario. Can you tell me what you have in mind? If your index never changes, your IndexSearcher won't change. "If the IndexSearcher object is threadsafe, then only issues related to concurrency are addressed. What about the case where the IndexSearcher is static? User 1 logs in to the system, queries with the static IndexSearcher, logs out; and then User 2 logs in to the system, queries with the same static IndexSearcher, logs out. In this case, users 1 and 2 are not querying concurrently but one after another. Will the query information (filters or any other data) of user 1 be retained when user 2 uses this?" I am not sure about the benefit of a static IndexSearcher. What do you expect? If user 1 uses a filter like fq=name:Samuel&q=somethingIWantToKnow and user 2 queries for fq=name:Samuel&q=whatIReallyWantToKnow, then they use the same cached filter object, retrieved from Solr's internal cache (of course you need a cache size that allows caching). "The Solr wiki states that the caches are per IndexSearcher object, i.e. if I set my filterCache size to 1000 it means that 1000 entries can be assigned for every IndexSearcher object." Yes. If a new searcher is created, its new cache is built from the old one. "Is this true for queryResultCache, filterCache and documentCache?" For the filterCache it's true. For the queryResultCache (if I understand the wiki right), too. Please note that the documentCache's behaviour is different from the already mentioned ones. The wiki says: "Note: This cache cannot be used as a source for autowarming because document IDs will change when anything in the index changes so they can't be used by a new searcher." 
The wiki says that the size of the document cache should be at least the number of _results_ * the number of _concurrent_ queries. I never worked with the document cache, so maybe someone else can throw some light into the dark. But from what I have understood, it means the following: if you show 10 results per request and you expect up to 500 concurrent queries: 10 * 500 = 5000. But I want to emphasize that this is only a guess; I don't actually know more about this topic. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/IndexSearcher-and-Caches-tp833567p838367.html Sent from the Solr - User mailing list archive at Nabble.com.
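The rule of thumb above is plain multiplication; as a sketch:

```javascript
// documentCache sizing rule of thumb from the wiki, as described above:
// results shown per request times the expected concurrent queries.
function documentCacheSize(resultsPerRequest, concurrentQueries) {
  return resultsPerRequest * concurrentQueries;
}
```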
Re: sort by function
The score isn't computed when you try to access it. Furthermore, your function query needs to become part of the score. So what can you do? The keyword is boosting. Do: {!func}product(0.88,rank)^x where x is a boosting factor based on your experience. Keep in mind that the result of your product function query will be added to the score. That means if the result is e.g. 12, and the normal score would be 5.6, then the final score for the document is 17.6. If your rank value or your x value is too large, this will lead to unexpected results. Hope this helps. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/sort-by-function-tp814380p836471.html Sent from the Solr - User mailing list archive at Nabble.com.
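As a sketch of the same idea with the dismax handler (the field name rank and the boost factor 0.5 are illustrative), the bf parameter adds the function result to the score:

```
q=somethingIWantToKnow&qt=dismax&bf=product(0.88,rank)^0.5
```

As discussed above, if product(0.88,rank) produces values much larger than typical relevancy scores, it will dominate the ranking, so the factor needs tuning against your data.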
Re: IndexSearcher and Caches
Rahul, the IndexSearcher in Solr is shared by every request between two commits. That means one IndexSearcher plus its caches has a lifetime of one commit cycle. After every commit, a new one is created. Caching does not mean that filters are applied automatically. It means that a filter from a query will be cached, and whenever a user query requires the same filtering criteria, the cached filter is used instead of creating a new one on the fly. E.g.: fq=inStock:true The result of this filtering criteria gets cached once. If another user issues a query with fq=inStock:true again, Solr reuses the already existing filter. Since such filters are cached as bit vectors, they are not large. It does not matter what the user is querying for in the q param. BTW: the IndexSearcher is threadsafe, so there is no problem with concurrent usage. Hope this helps. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/IndexSearcher-and-Caches-tp833567p833841.html Sent from the Solr - User mailing list archive at Nabble.com.
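To illustrate the filter reuse described above with two hypothetical requests (query terms are made up):

```
q=ipod&fq=inStock:true    first request: the inStock:true filter is computed and cached
q=zune&fq=inStock:true    second request: the cached filter is reused, only q differs
```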
Re: Personalized Search
Hi dc, - at query time, specify boosts for 'my items' items Do you mean something like a document boost, or do you want to include something like OR myItemId:100^100? Can you tell us how you would specify document boosts at query time? Or are you querying something like a boolean field (e.g. isFavorite:true^10) or a numeric field? Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Personalized-Search-tp831070p832062.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: sort by function
Can you please show some math to illustrate the principle? Do you want to do something like this: finalScore = score * rank, or rather: finalScore = rank? If the first is the case, then it is done by default (have a look at the wiki example for making more recent documents more relevant). If the second is the case, then I would say you need a new sort function (I have never implemented something like that). Hope this helps - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/sort-by-function-tp814380p821239.html Sent from the Solr - User mailing list archive at Nabble.com.
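For the first case, the recency example from the FunctionQuery wiki page looks roughly like this (the date field name is illustrative; the reciprocal decays the boost as documents age):

```
bf=recip(ms(NOW,mydatefield),3.16e-11,1,1)
```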
Re: Short DismaxRequestHandler Question
Okay, I will do so in the future if another problem like this occurs. At the moment, everything is fine after I followed your suggestions. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p820355.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: sort by function
Can you provide us some more information on what you really want to do? As the examples in the wiki show, the returned value of the function query is multiplied with the score - and you can boost the value returned by the function query, if you like. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/sort-by-function-tp814380p820359.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Short DismaxRequestHandler Question
Okay, let me be more specific: I have a custom StopWordFilter and a WordMarkingFilter. The WordMarkingFilter is a simple implementation that determines which type a word is. The StopWordFilter (my implementation) removes specific types of words *and* all markers from all words. This leads to the deletion of some parts of sentences. In my disMax query I specified some fields with such filters and some without. a) what docs should *not* match the query you listed In this case: docs where only Solr OR development occurs should not match. It is not important whether both words occur in different fields. b) what queries should *not* match the doc you listed For example, Solr Development Lucidworks should not match (assuming that lucidworks does not occur in a field like content). In this case, the user searches for development work with Solr in relation to LucidWorks. Solr does not know about the relation, but with the 100% mm definition I can tell Solr something like this in an easier way. c) what types of URLs you've already tried Those I have shown here. No more. Let me make sure that I have understood your explanation of how the DisMaxRequestHandler works. Say I have 4 fields: title, colour, category, manufacturer and an example doc like this: title: iPhone colour: black category: smartphone manufacturer: apple And I have a dismax query like this: q=apple iPhone qf=title^5 manufacturer mm=100% Then the whole thing will match (assuming that iPhone and/or apple were not stopwords)? If yes, then the problem is my filter definition. There were some threads with discussions about such problems with the standard StopWordFilter. Another example: title: Solr in a production environment cat: tutorial At index time, title is reduced to: Solr production environment. A query like using Solr in a production environment will be reduced to Solr production environment. This will work, as I have understood, because the indexed terms and the query are the same. 
However, if I have a content field that indexes the content of the text without my markerFilter, this won't work, because the parsed query strings are different??? I don't understand the problem. Example: title: Solr in a production environment cat: tutorial content: here is some text about using Solr in production. This fieldType consists of a lowerCaseFilter and a standard StopWordFilter that deletes all words like 'the, and, in' etc. Please note that environment does not occur in the content field. So a parsed query string would look like: using Solr in a production environment → using Solr production environment (stopwords are removed). This won't match, because the word environment does not occur in the content field? And according to that, the whole doc does not match? If you are confused about my examples and questions - I was trying to understand the explanations described here: http://lucene.472066.n3.nabble.com/DisMax-request-handler-doesn-t-work-with-stopwords-td478128.html#a478128 Thank you for your help. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p783063.html Sent from the Solr - User mailing list archive at Nabble.com.
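The kind of analysis chain discussed above could be sketched in schema.xml like this (names are illustrative, not the poster's actual config; the factory classes are the standard Solr ones):

```xml
<!-- sketch: a text fieldType with lowercasing and stopword removal at index and query time -->
<fieldType name="text_stopped" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Mixing fields that use such a fieldType with fields that do not is exactly what makes mm=100% behave surprisingly, since the parsed per-field clauses differ.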
Re: Short DismaxRequestHandler Question
Btw: This thread helps a lot to understand the difference between qf and pf :-) http://lucene.472066.n3.nabble.com/Dismax-query-phrases-td489994.html#a489995 -- View this message in context: http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p783379.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: increase(change) relevancy
Hi Ramzesua, take a look at the example of a function query that influences relevancy via the popularity field in the example directory. http://wiki.apache.org/solr/FunctionQuery#Using_FunctionQuery Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/increase-change-relevancy-tp783497p783750.html Sent from the Solr - User mailing list archive at Nabble.com.
Short DismaxRequestHandler Question
Hello community, I need a minimum should match that applies only to some fields, not to all. Let me give you an example: title: Breaking News: New information about Solr 1.5 category: development tag: Solr News If I search for Solr development, I want this doc to be returned, although I defined a minimum should match of 100%, because 100% of the query matches the *whole* document. At the moment, 100% applies only if 100% of the query matches a single field. Is this possible at the moment? If not, are there any suggestions or practices to make this work? Thank you. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p775913.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Custom SearchComponent to reset facet value counts after collapse
When is the returned facet info the expected info for your multiValued fields? Before or after your collapse? It could be that you need to facet on your multiValued fields only before collapsing to retrieve the right values. If this is the case, you need to integrate the before-collapsing feature of the collapsing patch in your own component; the rest is done by the patch itself. Hope this helps. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Custom-SearchComponent-to-reset-facet-value-counts-after-collapse-tp770826p776067.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Short DismaxRequestHandler Question
Thank you for responding. That would be possible. However, I would not like to do so, because a match in title should be boosted higher than a match in category. -- View this message in context: http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p776238.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Short DismaxRequestHandler Question
I got an idea: if I concatenated all relevant fields into one large multiValued field, I could query like this: {!dismax qf='myLargeField^5'}solr development //mm is 1 (100%) if not set In addition to that, I could add a phrase query: {!dismax qf='myLargeField^5'}solr development AND title:(solr development)^10 OR category:(solr development)^2 Any other ideas are welcome. Thank you for the discussion. -- View this message in context: http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p776446.html Sent from the Solr - User mailing list archive at Nabble.com.
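The catch-all field sketched above is typically built with copyField directives in schema.xml. A rough sketch, assuming the field names from the earlier example (myLargeField and the text type are illustrative):

```xml
<!-- sketch: concatenate relevant fields into one catch-all field via copyField -->
<field name="myLargeField" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="title"    dest="myLargeField"/>
<copyField source="category" dest="myLargeField"/>
<copyField source="tag"      dest="myLargeField"/>
```

One caveat of this design: mm then applies across the merged content, but per-field boosts are lost inside the catch-all field, hence the extra boosted clauses in the query above.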
Re: Custom SearchComponent to reset facet value counts after collapse
I would prefer extending the given CollapseComponent, for performance reasons. What you want to do sounds a bit like making things too complicated. There are two options I would prefer: 1. Get the schema information for every field you want to query against and define whether you want to facet before or after collapsing. As far as I have understood: for multiValued fields you want to facet before collapsing, because if you facet after collapsing, the returned counts are wrong. 2. As a developer, you know which of the queried fields is multiValued. Knowing this, you create a new param that contains those fields you always want to facet on BEFORE collapsing. I want to emphasize that I never had a look at the source code of the patch. However, I really think that you do not need to reimplement that many things. You only need to implement the logic of when to facet on which field. That's everything. And since the component seems to implement both, faceting before *and* after collapsing, you can use the provided methods to make your logic work. Just some thoughts. :) Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Custom-SearchComponent-to-reset-facet-value-counts-after-collapse-tp770826p776896.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How do I return all the results in an index?
Did you clear the browser cache? Maybe you need to restart (I am currently not sure whether Solr caches HTTP requests even after you did a commit). Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/How-do-I-return-all-the-results-in-an-index-tp777214p777353.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: synonym filter problem for string or phrase
Just for clear terminology: you mean field, not fieldType. A fieldType is the definition of tokenizers, filters etc. You apply a fieldType to a field, and you query against a field, not against a whole fieldType. :-) Kind regards - Mitch Marco Martinez-2 wrote: Hi Ranveer, If you don't specify a field type in the q parameter, the search will be done in your default search field defined in solrconfig.xml. Is your default field a text_sync field? Regards, Marco Martínez Bautista http://www.paradigmatecnologico.com Avenida de Europa, 26. Ática 5. 3ª Planta 28224 Pozuelo de Alarcón Tel.: 91 352 59 42 -- View this message in context: http://lucene.472066.n3.nabble.com/synonym-filter-problem-for-string-or-phrase-tp765242p773083.html Sent from the Solr - User mailing list archive at Nabble.com.
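The field vs. fieldType distinction above can be sketched in schema.xml terms (names here are illustrative; the factory classes are standard Solr ones):

```xml
<!-- sketch: the fieldType defines the analysis chain... -->
<fieldType name="text_syn" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

<!-- ...and the field applies it; queries target the field, e.g. q=description:foo -->
<field name="description" type="text_syn" indexed="true" stored="true"/>
```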
Re: Custom SearchComponent to reset facet value counts after collapse
Kelly, did you have a look at the FacetComponent and SimpleFacets classes? Why do you want to reset the counts? What is your use case? What is the difference between the FacetComponent's return value and your component's? Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Custom-SearchComponent-to-reset-facet-value-counts-after-collapse-tp770826p771260.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Custom SearchComponent to reset facet value counts after collapse
Unfortunately this patch does not support multiValued fields (as stated by the author and some others who worked with that patch). I had a look at others, but they seem to have the same problem. What would I suggest, hmm... Off the top of my head, and at this time (it's late here in Germany), I have only one simple idea: send a second request using the standard FacetComponent with the same query, and facet on those fields that seem to have unexpected results. If I understand you correctly, this would be the fastest solution. However, I am not sure whether you really have a problem, since the SimpleFacets implementation also sends several queries to get the count per facet value. Does it really kill your performance? Or do you have performance issues even if you don't do so? How long does it take to compute a response? Maybe you can provide the full code of your own implementation, so that we can have a look at your source code together. Hope this helps. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Custom-SearchComponent-to-reset-facet-value-counts-after-collapse-tp770826p772012.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Custom SearchComponent to reset facet value counts after collapse
Good morning, I do not have the time to read your full code very carefully at the moment. I will do so later on, however: have a look at SimpleFacets. Consider the method that creates the facet counts. If I remember correctly, the author uses the IndexSearcher's numDocs(arg1, arg2) method. That's what you need here, I *think* (I never created such a feature). There is one thing that may be tricky: which field to query against (in a universal way - at the moment you need to decide this yourself when we are talking about multiValued fields). If I use param (CollapseParams.COLLAPSE_FACET, after) I get accurate counts for some facet values, while other facet values (from multi-value fields) are completely missing. Is what you described shown in your example? I just want to verify that we see the same problem. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Custom-SearchComponent-to-reset-facet-value-counts-after-collapse-tp770826p772544.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: thresholding results by percentage drop from maxScore in lucene/solr
I am curious: what is your use case, and what type of data is this? Web pages? Blog posts? Product items? Can you provide some real examples so that we can discuss ideas other than doing it by score? I think this is not possible, or really difficult to achieve, since you don't know what the highest score will be until every document that matches the query has been found. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/thresholding-results-by-percentage-drop-from-maxScore-in-lucene-solr-tp768872p770063.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Any way to get top 'n' queries searched from Solr?
The simplest way is to send the query string to your Solr client *and* to your custom query fetcher, which could be any database you like. Doing so, you can count how often each query was sent, etc. *And* you can make those queries searchable by exporting the datasets to another Solr core. Why an extra DB? Because if a crash occurs, you get no guarantees from Solr. Keep in mind that Solr is an index/search server, not a real database. This is the easiest way to implement such a feature, I think. Good luck. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Any-way-to-get-top-n-queries-searched-from-Solr-tp767165p767489.html Sent from the Solr - User mailing list archive at Nabble.com.
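A minimal sketch of the counting side of this idea. The class and method names are made up for illustration, and the counts live only in memory; a real setup would persist each query to a database, as suggested above, so the data survives a crash:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical query logger: record each query string as it is sent to Solr,
// then ask for the top n most frequent queries.
public class QueryLogger {
    private final Map<String, Integer> counts = new HashMap<>();

    // Call this alongside every request you forward to Solr.
    public void log(String query) {
        counts.merge(query, 1, Integer::sum);
    }

    // Returns up to n query strings, most frequent first.
    public List<String> topQueries(int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

Exporting the accumulated counts to a separate Solr core, as the reply suggests, would then make the query log itself searchable.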
Re: Elevation of of part match
Gert, could you provide the solrconfig and schema specifications you have made? If the wiki really means what it says, the behaviour you want should be possible. But that's only a guess. Btw: in the example directory, the default field type used for the elevation component is string. That means there is no tokenization, and accordingly a partial match is not possible. Hope that helps - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Elevation-of-of-part-match-tp767139p767877.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Elevation of of part match
The elevate.xml example says: <!-- If this file is found in the config directory, it will only be loaded once at startup. If it is found in Solr's data directory, it will be re-loaded every commit. --> Did you restart Solr? -- View this message in context: http://lucene.472066.n3.nabble.com/Elevation-of-of-part-match-tp767139p768120.html Sent from the Solr - User mailing list archive at Nabble.com.
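For reference, an elevate.xml entry in the stock example looks roughly like this (query text and document ids here follow the shipped example data; treat them as illustrative):

```xml
<!-- sketch: pin one doc to the top for the query "ipod" and exclude another -->
<elevate>
  <query text="ipod">
    <doc id="MA147LL/A"/>
    <doc id="IW-02" exclude="true"/>
  </query>
</elevate>
```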