Re: Setting up MiniSolrCloudCluster to use pre-built index
Hi Mark, I’ll have a completely new, rebuilt index that’s (a) large, and (b) already sharded appropriately. In that case, using the merge API isn’t great, in that it would take significant time and temporarily use double (or more) disk space. E.g. I’ve got an index with 250M+ records, and about 200GB. There are other indexes, still big but not quite as large as this one. So I’m still wondering if there’s any robust way to swap in a fresh set of shards, especially without relying on legacy cloud mode. I think I can figure out where the data is being stored for an existing (empty) collection, shut that down, swap in the new files, and reload. But I’m wondering if that’s really the best (or even sane) approach. Thanks, — Ken > On May 19, 2018, at 6:24 PM, Mark Miller wrote: > > You create MiniSolrCloudCluster with a base directory and then each Jetty > instance created gets a SolrHome in a subfolder called node{i}. So if > legacyCloud=true you can just preconfigure a core and index under the right > node{i} subfolder. legacyCloud=true should not even exist anymore though, > so the long term way to do this would be to create a collection and then > use the merge API or something to merge your index into the empty > collection. > > - Mark > > On Sat, May 19, 2018 at 5:25 PM Ken Krugler > wrote: > >> Hi all, >> >> Wondering if anyone has experience (this is with Solr 6.6) in setting up >> MiniSolrCloudCluster for unit testing, where we want to use an existing >> index. >> >> Note that this index wasn’t built with SolrCloud, as it’s generated by a >> distributed (Hadoop) workflow. >> >> So there’s no “restore from backup” option, or swapping collection >> aliases, etc. >> >> We can push our configset to Zookeeper and create the collection as per >> other unit tests in Solr, but what’s the right way to set up data dirs for >> the cores such that Solr is running with this existing index (or indexes, >> for our sharded test case)? >> >> Thanks! 
>> >> — Ken >> >> PS - yes, we’re aware of the routing issue with generating our own shards…. >> >> -- >> Ken Krugler >> +1 530-210-6378 <(530)%20210-6378> >> http://www.scaleunlimited.com >> Custom big data solutions & training >> Flink, Solr, Hadoop, Cascading & Cassandra >> >> -- > - Mark > about.me/markrmiller -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra
Re: Storing & using feature vectors
Hi Doug, Many thanks for the tons of useful information! Some comments/questions inline below. — Ken > On Oct 19, 2018, at 10:46 AM, Doug Turnbull > wrote: > > This is a pretty big hole in Lucene-based search right now that many > practitioners have struggled with > > I know a couple of people who have worked on solutions. And I've used a > couple of hacks: > > - You can hack together something that does cosine similarity using the > term frequency & query boosts DelimitedTermFreqFilterFactory. Basically the > term frequency becomes a feature weight on the document. Boosts become the > query weight. If you massage things correctly with the similarity, the > resulting boolean similarity is a dot product… I’ve done a quick test of that approach, though not as elegantly. I just constructed a string of “terms” (feature indices) that generated an approximation to the target vector. DelimitedTermFreqFilterFactory is much better :) The problem I ran into was that some features have negative weights, and it wasn’t obvious whether it would work to have a second field (with only the negative weights) that I used for (not really supported in Solr?) negative boosting. Is there some hack to work around that? > - Erik Hatcher has done some great work with payloads which you might want > to check out. See the delimited payload filter factory, and payload score > function queries Thanks, I’d poked at payloads a bit. From what I could tell, there isn't a way to use payloads with negative feature values, or to filter results, but maybe I didn’t dig deep enough. > - Simon Hughes Activate Talk (slides/video not yet posted) covers this > topic in some depth OK, that looks great - https://activate2018.sched.com/event/FkM3 and https://github.com/DiceTechJobs/VectorsInSearch Seems like the planets are aligning for this kind of thing. 
> - Rene Kriegler's Haystack Talk discusses encoding Inception model > vectorizations of images: > https://opensourceconnections.com/events/haystack-single/haystack-relevance-scoring/ Good stuff, thanks! I’d be curious what his querqy <https://github.com/renekrie/querqy> configuration looked like for the “summing up fieldweights only (ignore df; use cross-field tf)” row in his results table on slide 36. The use of LSHs (what he describes in this talk as “random projection forest") is something I’d suggested to the client, to mitigate the need for true feature vector support. Using an initial LSH-based query to get candidates, and then re-ranking based on the actual feature vector, is something I was expecting Rene to discuss but he didn’t seem to mention it in his talk. > If this is a huge importance to you, I might also suggest looking at vespa, > which makes tensors a first-class citizen and makes matrix-math pretty > seamless: http://vespa.ai Interesting, though my client is pretty much locked into using Solr. > On Fri, Oct 19, 2018 at 12:50 PM Ken Krugler > wrote: > >> Hi all, >> >> [I posted on the Lucene list two days ago, but didn’t see any response - >> checking here for completeness] >> >> I’ve been looking at directly storing feature vectors and providing >> scoring/filtering support. >> >> This is for vectors consisting of (typically 300 - 2048) floats or doubles. >> >> It’s following the same pattern as geospatial support - so a new field >> type and query/parser, plus plumbing to hook it into Solr. >> >> Before I go much further, is there anything like this already done, or in >> the works? >> >> Thanks, >> >> — Ken >> > CTO, OpenSource Connections > Author, Relevant Search > http://o19s.com/doug -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra
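The term-frequency encoding hack discussed in this thread, including the two-field workaround for negative weights that Ken asks about, can be sketched outside Solr as plain arithmetic (the `f{i}` term names and the scale factor are made up for illustration; this is not Solr API code):

```python
# Sketch of encoding a float feature vector as integer "term frequencies",
# splitting by sign so negative weights live in a second field, then scoring
# as a dot product (query weights play the role of query boosts).

def encode_vector(vec, scale=100):
    """Quantize a float vector into integer term freqs, split by sign."""
    pos, neg = {}, {}
    for i, w in enumerate(vec):
        tf = round(abs(w) * scale)
        if tf == 0:
            continue
        (pos if w > 0 else neg)[f"f{i}"] = tf
    return pos, neg

def dot_product(query_vec, pos, neg, scale=100):
    """Score = sum(query_weight * tf/scale); the negative field is subtracted,
    which is the effect a (hypothetical) negative boost would need to have."""
    score = 0.0
    for i, qw in enumerate(query_vec):
        term = f"f{i}"
        score += qw * pos.get(term, 0) / scale
        score -= qw * neg.get(term, 0) / scale
    return score
```

Whether Solr's similarity plumbing can actually be massaged into the subtraction step is exactly the open question in the thread; the sketch only shows the intended math.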
Storing & using feature vectors
Hi all, [I posted on the Lucene list two days ago, but didn’t see any response - checking here for completeness] I’ve been looking at directly storing feature vectors and providing scoring/filtering support. This is for vectors consisting of (typically 300 - 2048) floats or doubles. It’s following the same pattern as geospatial support - so a new field type and query/parser, plus plumbing to hook it into Solr. Before I go much further, is there anything like this already done, or in the works? Thanks, — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra
Is router.field an explicit shard name, or hashed?
Hi all, From https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html#ShardsandIndexingDatainSolrCloud-DocumentRouting <https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html#ShardsandIndexingDatainSolrCloud-DocumentRouting>, and various posts on the mailing list, the implication is that the content of the “router.field” field is used as the shard name. But on https://lucene.apache.org/solr/guide/6_6/collections-api.html#CollectionsAPI-create <https://lucene.apache.org/solr/guide/6_6/collections-api.html#CollectionsAPI-create>, the description of “router.field” says: > If this field is specified, the router will look at the value of the field in > an input document to compute the hash and identify a shard instead of looking > at the uniqueKey field I’m wondering which is correct. Thanks, — Ken ---------- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra
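The Collections API description is the accurate one for the default compositeId router: the router.field value is hashed to pick a shard, not used as a literal shard name (literal targeting is what the "implicit" router does). The mechanics can be sketched as follows; note this uses CRC32 as a stand-in hash, not Solr's actual MurmurHash3, and the shard count is arbitrary:

```python
import zlib

NUM_SHARDS = 4
HASH_RING = 1 << 32  # full 32-bit hash space, split evenly across shards

def shard_for(route_value, num_shards=NUM_SHARDS):
    """Hash the routing value and map it onto one shard's hash range.
    Illustrative only: Solr's compositeId router uses MurmurHash3, not CRC32,
    but the value-is-hashed (not value-is-shard-name) behavior is the same."""
    h = zlib.crc32(route_value.encode("utf-8"))  # unsigned 32-bit hash
    return h // (HASH_RING // num_shards)

# The same field value always routes to the same shard:
assert shard_for("california") == shard_for("california")
```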
Setting up MiniSolrCloudCluster to use pre-built index
Hi all, Wondering if anyone has experience (this is with Solr 6.6) in setting up MiniSolrCloudCluster for unit testing, where we want to use an existing index. Note that this index wasn’t built with SolrCloud, as it’s generated by a distributed (Hadoop) workflow. So there’s no “restore from backup” option, or swapping collection aliases, etc. We can push our configset to Zookeeper and create the collection as per other unit tests in Solr, but what’s the right way to set up data dirs for the cores such that Solr is running with this existing index (or indexes, for our sharded test case)? Thanks! — Ken PS - yes, we’re aware of the routing issue with generating our own shards…. -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra
Handling of local params in QParserPlugin.createParser
Hi all, As part of some interesting work creating a custom query parser, I was writing unit tests that exercised ExtendedDismaxQParser. So I first created the ExtendedDismaxQParserPlugin, and then used that to create the QParser via: QParser parser = plugin.createParser(query, localParams, params, req); If query is something like {!complexphrase}fieldname:”A * query”, I was expecting the complex phrase query parser to get used, but that’s not happening - the local param is being treated as regular text. Which makes me think my conceptual model of local params processing is wrong, and there’s higher level code that does a pre-processing step first. But I was hoping that I’d get out a DisjunctionMaxQueries where one of the queries was a ComplexPhraseQuery, which would mean the processing has to happen inside of the ExtendedDismaxQParser code. Any pointers for where to poke around? Thanks, — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
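The "higher level code that does a pre-processing step" Ken suspects does exist in spirit: the `{!...}` local-params prefix is normally peeled off before an individual parser like edismax sees the query string. A minimal sketch of that splitting step (not Solr's real `QueryParsing.parseLocalParams`, which also handles quoting, `$` parameter refs, etc.):

```python
import re

def split_local_params(qstr):
    """Split a query like '{!complexphrase df=body}foo' into
    (params_dict, remaining_query). A bare first token inside {!...}
    is treated as the parser type, mirroring Solr's convention."""
    m = re.match(r"\{!([^}]*)\}", qstr)
    if not m:
        return {}, qstr
    params = {}
    for i, part in enumerate(m.group(1).split()):
        if "=" in part:
            k, v = part.split("=", 1)
            params[k] = v
        elif i == 0:
            params["type"] = part
    return params, qstr[m.end():]
```

Calling a plugin's `createParser` directly bypasses this dispatch, which would explain why the `{!complexphrase}` prefix ends up treated as plain text.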
Shingles from WDFF
Hi all, I’ve got some ancient Lucene tokenizer code from 2006 that I’m trying to avoid forward-porting, but I don’t think there’s an equivalent in Solr 5/6. Specifically it’s applying shingles to the output of something like the WordDelimiterFilter - e.g. MySuperSink gets split into “My” “Super” “Sink”, and then shingled (if we’re using a shingle size of 2) to be “My”, “MySuper”, “Super”, “SuperSink”, “Sink”. I can’t just follow the WDF with a shingle filter because shingles shouldn’t be created across terms coming into the WDF - they’re only for the pieces generated by the WDF. Or is there actually a way to make this work with Solr 5/6? Thanks, — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
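The behavior described above (shingle only within the pieces produced by word-delimiter splitting, never across incoming tokens) can be sketched like this; the splitter is a rough stand-in for WordDelimiterFilter, which has many more rules:

```python
import re

def wdf_split(token):
    """Rough stand-in for WordDelimiterFilter: split on case transitions.
    (Illustrative only; the real WDF also splits on digits, delimiters, etc.)"""
    return re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", token)

def shingle_within(parts, size=2):
    """Emit unigrams plus shingles up to `size`, within one token only -
    a plain ShingleFilter after the WDF would instead shingle across tokens."""
    out = []
    for i in range(len(parts)):
        for n in range(1, size + 1):
            if i + n <= len(parts):
                out.append("".join(parts[i:i + n]))
    return out
```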
RE: how to update billions of docs
As others noted, currently updating a field means deleting and inserting the entire document. Depending on how you use the field, you might be able to create another core/container with that one field (plus the key field), and use join support. Note that https://issues.apache.org/jira/browse/LUCENE-6352 is an improvement, which looks like it's in the 5.x code line, though I don't see a fix version. -- Ken > From: Mohsin Beg Beg > Sent: March 16, 2016 3:52:47pm PDT > To: solr-user@lucene.apache.org > Subject: how to update billions of docs > > Hi, > > I have a requirement to replace a value of a field in 100B's of docs in 100's > of cores. > The field is multiValued=false docValues=true type=StrField stored=true > indexed=true. > > Atomic Updates performance is on the order of 5K docs per sec per core in > solr 5.3 (other fields are quite big). > > Any suggestions ? > > -Mohsin -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
RE: Embedded Solr now deprecated?
Hi Shawn, We have a different use case than the ones you covered in your response to Robert (below), which I wanted to call out. We currently use the embedded server when building indexes as part of a Hadoop workflow. The results get copied to a production analytics server and swapped in on a daily basis. Writing to multiple embedded servers (one per reduce task) gives us maximum performance, and has proven to be a very reliable method for the daily rebuild of pre-aggregations we need for our analytics use case. Regards, -- Ken PS - I'm also currently looking at using embedded Solr as a state storage engine for Samza. > From: Shawn Heisey > Sent: August 5, 2015 7:54:07am PDT > To: solr-user@lucene.apache.org > Subject: Re: Embedded Solr now deprecated? > > On 8/5/2015 7:09 AM, Robert Krüger wrote: >> I tried to upgrade my application from solr 4 to 5 and just now realized >> that embedded use of solr seems to be on the way out. Is that correct or is >> there a just new API to use for that? > > Building on Erick's reply: > > I doubt that the embedded server is going away, and I do not recall > seeing *anything* marking the entire class deprecated. The class still > receives attention from devs -- this feature was released with 5.1.0: > > https://issues.apache.org/jira/browse/SOLR-7307 > > That said, we have discouraged users from deploying it in production for > quite some time, even though it continues to exist and receive developer > attention. Some of the reasons that I think users should avoid the > embedded server: It doesn't support SolrCloud, you cannot make it > fault-tolerant (redundant), and troubleshooting is harder because you > cannot connect to it from outside of the source code where it is embedded. > > Deploying Solr as a network service offers much more capability than you > can get when you embed it in your application. 
Chances are that you can > easily replace EmbeddedSolrServer with one of the SolrClient classes and > use a separate Solr deployment from your application. > > Thanks, > Shawn > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
Re: When not to use NRTCachingDirectory and what to use instead.
On Jul 10, 2013, at 9:16am, Shawn Heisey wrote: > On 7/10/2013 9:59 AM, Tom Burton-West wrote: >> The Javadoc for NRTCachingDirectory ( >> http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/store/NRTCachingDirectory.html?is-external=true) >> says: >> >> "This class is likely only useful in a near real-time context, where >> indexing rate is lowish but reopen rate is highish, resulting in many tiny >> files being written..." >> >> It seems like we have exactly the opposite use case, so we would like >> advice on what directory implementation to use instead. >> >> We are doing offline batch indexing, so no searches are being done. So we >> don't need NRT. We also have a high indexing rate as we are trying to >> index 3 billion pages as quickly as possible. >> >> I am not clear what determines the reopen rate. Is it only related to >> searching or is it involved in indexing as well? >> >> Does the NRTCachingDirectory have any benefit for indexing under the use >> case noted above? >> >> I'm guessing we should just use the solr.StandardDirectoryFactory instead. >> Is this correct? > > The NRT directory object in Solr uses the MMap implementation as its default > delegate. The code I see seems to be using an FSDirectory, or is there another layer of wrapping going on here? return new NRTCachingDirectory(FSDirectory.open(new File(path)), maxMergeSizeMB, maxCachedMB); > I would use MMapDirectoryFactory (the default for most of the 3.x releases) > for testing whether you can get any improvement from moving away from the > default. The advantages of memory mapping are not something you'd want to > give up. Tom - did you ever get any useful results from testing here? I'm also curious about the impact of various xxxDirectoryFactory implementations for batch indexing. Thanks, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
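For anyone wanting to run the test Shawn suggests, the directory implementation is selected in solrconfig.xml; a sketch with the standard Solr 4.x factory names:

```xml
<!-- solrconfig.xml: pick an explicit directory implementation for testing.
     Other options include solr.StandardDirectoryFactory and
     solr.NIOFSDirectoryFactory; solr.NRTCachingDirectoryFactory is the default. -->
<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>
```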
Re: Enabling other SimpleText formats besides postings
Hi Erik (& Shawn), On Mar 31, 2014, at 1:48pm, Shawn Heisey wrote: > On 3/31/2014 2:36 PM, Erik Hatcher wrote: >> Not currently possible. Solr’s SchemaCodecFactory only has a hook for >> postings format (and doc values format). OK, thanks for confirming. > Would it be a reasonable thing to develop a config structure (probably in > schema.xml) that starts with something like and has ways > to specify the class and related configuration for each of the components in > the codec? Then you could specify codec="foo" on an individual field > definition. The codec definition could allow one of them to have > default="true". > > I will admit that my understanding of these Lucene-level details is low, so I > could be thinking about this wrong. The absolute easiest approach would be to support a new init value for codecFactory, which SchemaCodecFactory would use to select a different base codec class to use (versus always using LuceneCodec). That would switch everything to a different codec. Or you could extend the SchemaCodecFactory to support additional per-field settings for stored fields format, etc beyond what's currently available. For my quick & dirty hack I've specified a different codecFactory in solrconfig.xml, and have my own factory that hard-codes the SimpleTextCodec. This works - all files are in the SimpleTextXXX format, other than the segments.gen and segments_XX files; what, those aren't pluggable?!?! :) -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
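The "quick & dirty hack" Ken describes - a custom factory that hard-codes SimpleTextCodec - would be wired into solrconfig.xml roughly like this (the class name is made up; only the `codecFactory` element itself is real Solr config):

```xml
<!-- solrconfig.xml: replace the default codec factory with a custom one
     that always returns SimpleTextCodec (com.example.* is a hypothetical name) -->
<codecFactory class="com.example.SimpleTextCodecFactory"/>
```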
Re: Enabling other SimpleText formats besides postings
Hi all (and particularly Uwe and Robert), On Mar 28, 2014, at 7:24am, Michael McCandless wrote: > You told the fieldType to use SimpleText only for the postings, not > all other parts of the codec (doc values, live docs, stored fields, > etc...), and so it used the default codec for those components. > > If instead you used the SimpleTextCodec (not sure how to specify this > in Solr's schema.xml) then all components would be SimpleText. Yes, that's the gist of my question - how do you specify use of SimpleTextXXX (e.g. SimpleTextStoredFieldsFormat) in Solr? Or is this currently not possible? Thanks, -- Ken > On Fri, Mar 28, 2014 at 8:53 AM, Ken Krugler > wrote: >> Hi all, >> >> I've been using the SimpleTextCodec in the past, but I just noticed >> something odd... >> >> I'm running Solr 4.3, and enable the SimpleText posting format via something >> like: >> >>> /> >> >> The resulting index does have the expected _0_SimpleText_0.pst text output, >> but I just noticed that the other files are all the standard binary format >> (e.g. .fdt for field data) >> >> Based on SimpleTextCodec.java, I was assuming that I'd get the >> SimpleTextStoredFieldsFormat for stored data. >> >> This same holds true for most (all?) of the other files, e.g. >> https://issues.apache.org/jira/browse/LUCENE-3074 is about adding a simple >> text format for DocValues. >> >> I can walk the code to figure out what's up, but I'm hoping I just need to >> change some configuration setting. >> >> Thanks! >> >> -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
Enabling other SimpleText formats besides postings
Hi all, I've been using the SimpleTextCodec in the past, but I just noticed something odd… I'm running Solr 4.3, and enable the SimpleText posting format via something like: The resulting index does have the expected _0_SimpleText_0.pst text output, but I just noticed that the other files are all the standard binary format (e.g. .fdt for field data) Based on SimpleTextCodec.java, I was assuming that I'd get the SimpleTextStoredFieldsFormat for stored data. This same holds true for most (all?) of the other files, e.g. https://issues.apache.org/jira/browse/LUCENE-3074 is about adding a simple text format for DocValues. I can walk the code to figure out what's up, but I'm hoping I just need to change some configuration setting. Thanks! -- Ken ---------- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
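The config snippet got stripped from this message; enabling a per-field postings format in schema.xml generally looks like the following (a reconstruction with illustrative names, not the original):

```xml
<!-- schema.xml: per-field postings format; only affects the postings,
     not stored fields, doc values, etc. - which is the behavior observed here -->
<fieldType name="text_st" class="solr.TextField" postingsFormat="SimpleText">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```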
RegexReplaceProcessorFactory replacement string support for match groups
Hi Hoss, In RegexReplaceProcessorFactory, this line means that you can't use match groups in the replacement string: replacement = Matcher.quoteReplacement(replacementParam.toString()); What's the reasoning behind this? Or am I missing something here, and groups can be used? It's making it hard for me to write up a simple solution to a training exercise, where students need to clean up incorrectly formatted dates :) Thanks, -- Ken ---------- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
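The effect of `Matcher.quoteReplacement` is to escape `\` and `$` in the replacement string, so group references become literal text instead of backreferences. A Python analogue of the difference (the date pattern is just the training-exercise scenario mentioned above):

```python
import re

pattern, repl = r"(\d{4})/(\d{2})/(\d{2})", r"\1-\2-\3"

# With group references honored, the regex rewrites the date:
assert re.sub(pattern, repl, "2013/07/21") == "2013-07-21"

# Quoting the replacement (what Matcher.quoteReplacement effectively does on
# the Java side) makes the backrefs come through as literal text:
quoted = repl.replace("\\", "\\\\")
assert re.sub(pattern, quoted, "2013/07/21") == r"\1-\2-\3"
```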
WikipediaTokenizer documentation - never mind
Hi all, Sorry for the noise - I finally realized that the script I was running was using some Java code (EnwikiContentSource, from Lucene benchmark) to explicitly set up fields and then push the results to Solr. -- Ken == Where's the documentation on the WikipediaTokenizer? Specifically I'm wondering how pieces from the source XML get mapped to field names in the Solr schema. For example, seems to be going into the "date" field for an example schema I've got. And goes into "body". But is there any way to get , for example? Thanks, -- Ken ------ Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
WikipediaTokenizer documentation
Hi all, Where's the documentation on the WikipediaTokenizer? Specifically I'm wondering how pieces from the source XML get mapped to field names in the Solr schema. For example, seems to be going into the "date" field for an example schema I've got. And goes into "body". But is there any way to get , for example? Thanks, -- Ken ------ Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
Re: Grouping by field substring?
Hi Jack, On Sep 11, 2013, at 5:34pm, Jack Krupansky wrote: > Do a copyField to another field, with a limit of 8 characters, and then use > that other field. Thanks - I should have included a few more details in my original question. The issue is that I've got an index with 200M records, of which about 50M have a unique value for this prefix (which is 32 characters long) So adding another indexed field would be significant, which is why I was hoping there was a way to do it via grouping/collapsing at query time. Or is that just not possible? Thanks, -- Ken > -Original Message----- From: Ken Krugler > Sent: Wednesday, September 11, 2013 8:24 PM > To: solr-user@lucene.apache.org > Subject: Grouping by field substring? > > Hi all, > > Assuming I want to use the first N characters of a specific field for > grouping results, is such a thing possible out-of-the-box? > > If not, then what would the next best option be? E.g. a custom function query? > > Thanks, > > -- Ken > > -- > Ken Krugler > +1 530-210-6378 > http://www.scaleunlimited.com > custom big data solutions & training > Hadoop, Cascading, Cassandra & Solr > > > > > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
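For reference, Jack's suggestion maps to the `maxChars` attribute on copyField, sketched here with illustrative field names (though, as Ken notes, this costs an extra indexed field):

```xml
<!-- schema.xml: copy just the first 8 characters into a separate field,
     then group/collapse on it at query time -->
<field name="group_prefix" type="string" indexed="true" stored="false"/>
<copyField source="full_key" dest="group_prefix" maxChars="8"/>
```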
Grouping by field substring?
Hi all, Assuming I want to use the first N characters of a specific field for grouping results, is such a thing possible out-of-the-box? If not, then what would the next best option be? E.g. a custom function query? Thanks, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
Re: Filter cache pollution during sharded edismax queries
Hi Otis, Sorry I missed your reply, and thanks for trying to find a similar report. Wondering if I should file a Jira issue? That might get more attention :) -- Ken On Jul 5, 2013, at 1:05pm, Otis Gospodnetic wrote: > Hi Ken, > > Uh, I left this email until now hoping I could find you a reference to > similar reports, but I can't find them now. I am quite sure I saw > somebody with a similar report within the last month. Plus, several > people have reported issues with performance dropping when they went > from 3.x to 4.x and maybe this is why. > > Otis > -- > Solr & ElasticSearch Support -- http://sematext.com/ > Performance Monitoring -- http://sematext.com/spm > > > > On Tue, Jul 2, 2013 at 3:01 PM, Ken Krugler > wrote: >> Hi all, >> >> After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit ratio >> had dropped significantly. >> >> Previously it was at 95+%, but now it's < 50%. >> >> I enabled recording 100 entries for debugging, and in looking at them it >> seems that edismax (and faceting) is creating entries for me. >> >> This is in a sharded setup, so it's a distributed search. >> >> If I do a search for the string "bogus text" using edismax on two fields, I >> get an entry in each of the shard's filter caches that looks like: >> >> item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2): >> >> Is this expected? >> >> I have a similar situation happening during faceted search, even though my >> fields are single-value/untokenized strings, and I'm not using the enum >> facet method. >> >> But I'll get many, many entries in the filterCache for facet values, and >> they all look like "item_::" >> >> The net result of the above is that even with a very big filterCache size of >> 2K, the hit ratio is still only 60%. >> >> Thanks for any insights, >> >> -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
Blog posts on extracting text features using Solr
Hi all, I recently posted parts 1 & 2 of a series on extracting text features for machine learning… http://www.scaleunlimited.com/2013/07/10/text-feature-selection-for-machine-learning-part-1/ http://www.scaleunlimited.com/2013/07/21/text-feature-selection-for-machine-learning-part-2/ It uses Solr to generate terms from mailing list text, and then does analysis to extract good features for things like classification, similarity and clustering. The last part will cover using Solr to implement a real-time similarity engine, and maybe a recommendation engine as well. It undoubtedly has some things that are unclear or even incorrect, so please comment :) Regards, -- Ken ------ Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
Filter cache pollution during sharded edismax queries
Hi all, After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit ratio had dropped significantly. Previously it was at 95+%, but now it's < 50%. I enabled recording 100 entries for debugging, and in looking at them it seems that edismax (and faceting) is creating entries for me. This is in a sharded setup, so it's a distributed search. If I do a search for the string "bogus text" using edismax on two fields, I get an entry in each of the shard's filter caches that looks like: item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2): Is this expected? I have a similar situation happening during faceted search, even though my fields are single-value/untokenized strings, and I'm not using the enum facet method. But I'll get many, many entries in the filterCache for facet values, and they all look like "item_::" The net result of the above is that even with a very big filterCache size of 2K, the hit ratio is still only 60%. Thanks for any insights, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
Solr 4.2.1 behavior with field names that use "|" character
Hi all, We have a fieldname that uses the "|" character to separate elements (e.g. state|city) Up until Solr 4.x this has worked fine. Now, when doing a query that gets distributed across shards, we get a SolrException: SEVERE: org.apache.solr.common.SolrException: can not use FieldCache on a field which is neither indexed nor has doc values: state at org.apache.solr.schema.SchemaField.checkFieldCacheSource(SchemaField.java:186) at org.apache.solr.schema.StrField.getValueSource(StrField.java:72) at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:362) at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:68) at org.apache.solr.search.QParser.getQuery(QParser.java:142) at org.apache.solr.search.SolrReturnFields.add(SolrReturnFields.java:285) at org.apache.solr.search.SolrReturnFields.parseFieldList(SolrReturnFields.java:112) at org.apache.solr.search.SolrReturnFields.&lt;init&gt;(SolrReturnFields.java:98) at org.apache.solr.search.SolrReturnFields.&lt;init&gt;(SolrReturnFields.java:74) at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:96) The problem appears to be that the fl=state|city parameter is getting split up by the FunctionQParser, and it tries to use "state" as a field name. This actually exists, but as an ignored field (since we can just do a q=state|city:ca|* to find all entries in California). Is this a known issue? Is there any way to disable the parsing of field names in a field list? Thanks, -- Ken ---------- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
Javadocs issue on Solr web site
Currently all Javadoc links seem to wind up pointing at the api-4_0_0-ALPHA versions - is that expected? E.g. do a Google search on StreamingUpdateSolrServer. First hit is for "StreamingUpdateSolrServer (Solr 3.6.0 API)" Follow that link, and you get a 404 for page http://lucene.apache.org/solr/api-4_0_0-ALPHA/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html -- Ken ------ Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
Re: Error with distributed search and Suggester component (Solr 3.4)
Hi Robert, On May 1, 2012, at 7:07pm, Robert Muir wrote: > On Tue, May 1, 2012 at 6:48 PM, Ken Krugler > wrote: >> Hi list, >> >> Does anybody know if the Suggester component is designed to work with shards? > > I'm not really sure it is? They would probably have to override the > default merge implementation specified by SpellChecker. What confuses me is that Suggester says it's based on SpellChecker, which supposedly does work with shards. > But, all of the current suggesters pump out over 100,000 QPS on my > machine, so I'm wondering what the usefulness of this is? > > And if it was useful, merging results from different machines is > pretty inefficient, for suggest you would shard by term instead so > that you need only contact a single host? The issue is that I've got a configuration with 8 shards already that I'm trying to leverage for auto-complete. My quick & dirty work-around would be to add a custom response handler that wraps the suggester, and returns results with the fields that the SearchHandler needs to do the merge. -- Ken -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
Re: Error with distributed search and Suggester component (Solr 3.4)
I should have also included one more bit of information. If I configure the top-level (sharding) request handler to use just the suggest component as such: explicit suggest-core localhost:8080/solr/core0/,localhost:8080/solr/core1/ suggest Then I don't get a NPE, but I also get a response with no results. 0 0 r For completeness, here are the other pieces to the solrconfig.xml puzzle: true suggest-one 10 suggest suggest-one org.apache.solr.spelling.suggest.Suggester org.apache.solr.spelling.suggest.fst.FSTLookup name 0.05 true suggest-two org.apache.solr.spelling.suggest.Suggester org.apache.solr.spelling.suggest.fst.FSTLookup content 0.0 true Thanks, -- Ken On May 1, 2012, at 3:48pm, Ken Krugler wrote: > Hi list, > > Does anybody know if the Suggester component is designed to work with shards? > > I'm asking because the documentation implies that it should (since > ...Suggester reuses much of the SpellCheckComponent infrastructure…, and the > SpellCheckComponent is documented as supporting a distributed setup). 
> > But when I make a request, I get an exception: > > java.lang.NullPointerException > at > org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:493) > at > org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:390) > at > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:289) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368) > at > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157) > at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81) > at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:132) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157) > at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388) > at > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) > at > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418) > at > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) > at > org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) > at > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) > at org.mortbay.jetty.Server.handle(Server.java:326) > at > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) > at > org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923) > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547) > at 
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) > at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) > at > org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) > at > org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) > Looking at the QueryComponent.java:493 code, I see: > >SolrDocumentList docs = > (SolrDocumentList)srsp.getSolrResponse().getResponse().get("response"); > >// calculate global maxScore and numDocsFound >if (docs.getMaxScore() != null) { <<<< This is line 493 > > So I'm assuming the "docs" variable is null, which would happen if there is > no "response" element in the Solr response. > > If I make a direct request to the request handler in one core (e.g. > http://hostname:8080/solr/core0/select?qt=suggest-core&q=rad), the query > works. > > But I see that there's no element named "response", unlike a regular query. > > > >0 >1 > > > > >10 >0 >3 > > radair > radar > > > > > > > So I'm wondering if my configuration is just borked and this should work, or the fact that the Suggester doesn't return a response field means that it just doesn't work with shards.
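Since the mailing-list archive stripped the XML markup out of the solrconfig.xml snippets quoted above, here is a hedged reconstruction of what a sharded suggest setup of that era typically looked like. This is a sketch only: the handler name, shard URLs, and dictionary details are taken from the plain-text remnants, while the element structure follows standard Solr 3.x conventions and may not match the original exactly.

```xml
<!-- Sketch: reconstructed from the de-tagged text above; Solr 3.x conventions. -->
<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest-one</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup</str>
    <str name="field">name</str>
    <float name="threshold">0.05</float>
    <str name="buildOnCommit">true</str>
  </lst>
  <!-- a second <lst> for "suggest-two" (field "content", threshold 0.0) would mirror the one above -->
</searchComponent>

<requestHandler name="/suggest-core" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest-one</str>
    <str name="spellcheck.count">10</str>
    <str name="shards">localhost:8080/solr/core0/,localhost:8080/solr/core1/</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
```

A client would then hit the handler as in the URL quoted above (select?qt=suggest-core&q=rad), and the SearchHandler would fan the request out to both cores.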
Error with distributed search and Suggester component (Solr 3.4)
Hi list, Does anybody know if the Suggester component is designed to work with shards? I'm asking because the documentation implies that it should (since ...Suggester reuses much of the SpellCheckComponent infrastructure…, and the SpellCheckComponent is documented as supporting a distributed setup). But when I make a request, I get an exception: java.lang.NullPointerException at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:493) at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:390) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:289) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157) at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81) at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:132) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) Looking at the QueryComponent.java:493 code, I see: SolrDocumentList docs = (SolrDocumentList)srsp.getSolrResponse().getResponse().get("response"); // calculate global maxScore and numDocsFound if (docs.getMaxScore() != null) { <<<< This is line 493 So I'm assuming the "docs" variable is null, which would happen if there is no "response" element in the Solr response. If I make a direct request to the request handler in one core (e.g. http://hostname:8080/solr/core0/select?qt=suggest-core&q=rad), the query works. But I see that there's no element named "response", unlike a regular query. 0 1 10 0 3 radair radar So I'm wondering if my configuration is just borked and this should work, or the fact that the Suggester doesn't return a response field means that it just doesn't work with shards. Thanks, -- Ken -------- http://about.me/kkrugler +1 530-210-6378 -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
Re: JSON & XML response writer issues with short & binary fields
On Jan 13, 2012, at 1:39pm, Yonik Seeley wrote: > -Yonik > http://www.lucidimagination.com > > > > On Fri, Jan 13, 2012 at 4:22 PM, Yonik Seeley > wrote: >> On Fri, Jan 13, 2012 at 4:04 PM, Ken Krugler >> wrote: >>> I finally got around to looking at why short field values are returned as >>> "java.lang.Short:". >>> >>> Both XMLWriter.writeVal() and TextResponseWriter.writeVal() are missing the >>> check for (val instanceof Short), and thus this bit of code is used: >>> >>> // default... for debugging only >>> writeStr(name, val.getClass().getName() + ':' + val.toString(), true); >>> >>> The same thing happens when you have a binary field, since val in that case >>> is byte[], so you get "[B:[B@" >>> >>> Has anybody else run into this? Seems odd that it's not a known issue, so >>> I'm wondering if there's something odd about my schema. >>> >>> This is especially true since BinaryField has write methods for both XML >>> and JSON (via TextResponseWriter) that handle Base64-encoding the data. So >>> I'm wondering how normally the BinaryField.write() methods would get used, >>> and whether the actual problem lies elsewhere. >> >> Hmmm, Ryan recently restructured some of the writer code to support >> the pseudo-field feature. A quick look at the code seems like >> FieldType.write() methods are not used anymore (the Document is >> transformed into a SolrDocument and writeVal is used for each value). > > Double hmmm... I see this in writeVal() > >} else if (val instanceof IndexableField) { > IndexableField f = (IndexableField)val; > SchemaField sf = schema.getFieldOrNull( f.name() ); > if( sf != null ) { >sf.getType().write(this, name, f); > } > > So my initial quick analysis of FieldType.write() not being used > anymore doesn't look correct. > Anyway, please do open an issue and we'll get to the bottom of it. 
Thanks for the fast response - I was beginning to worry that nobody read my posts :) See https://issues.apache.org/jira/browse/SOLR-3035 I've attached some test code to the issue, plus a simple fix for the JSON case. Regards, -- Ken -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
JSON & XML response writer issues with short & binary fields
I finally got around to looking at why short field values are returned as "java.lang.Short:". Both XMLWriter.writeVal() and TextResponseWriter.writeVal() are missing the check for (val instanceof Short), and thus this bit of code is used: // default... for debugging only writeStr(name, val.getClass().getName() + ':' + val.toString(), true); The same thing happens when you have a binary field, since val in that case is byte[], so you get "[B:[B@". Has anybody else run into this? Seems odd that it's not a known issue, so I'm wondering if there's something odd about my schema. This is especially true since BinaryField has write methods for both XML and JSON (via TextResponseWriter) that handle Base64-encoding the data. So I'm wondering how the BinaryField.write() methods would normally get used, and whether the actual problem lies elsewhere. -- Ken PS - any good reason why XMLWriter is a final class? I created my own fixed version of JSONResponseWriter w/o much effort because I could subclass it, but XMLWriter being final makes it hard (impossible?) to do the same, since there are numerous internal methods that take an explicit XMLWriter object as a parameter. -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
Re: Solr core as a dispatcher
Hi Hector, On Jan 9, 2012, at 4:15pm, Hector Castro wrote: > Hi, > > Has anyone had success with multicore single node Solr configurations that > have one core acting solely as a dispatcher for the other cores? For > example, say you had 4 populated Solr cores – configure a 5th to be the > definitive endpoint with `shards` containing cores 1-4. > > Is there any advantage to this setup over simply having requests distributed > randomly across the 4 populated cores (all with `shards` equal to cores 1-4)? > Is it even worth distributing requests across the cores over always hitting > the same one? If you have low query rates, then using a shards approach can improve performance on a multi-core (CPUs here, not Solr cores) setup. By distributing the requests, you effectively use all CPU cores in parallel on one request. And if you spread your shards across spindles, then you're also maximizing I/O throughput. But there are a few issues with this approach: - binary fields don't work. The results come back as "@B[]", versus the actual data. - short fields get "java.lang.Short" text prefixed on every value. - deep queries result in lots of extra load. E.g. if you want the 5000th hit then you'll get (5000 * # of shards) hits being collected/returned to the dispatcher. Though only the unique id & score are returned in this case, followed by a second request to get the actual top N hits from the shards. And there's something wonky with the way that distributed HTTP requests are queued up & processed - under load, I see IOExceptions where it's always N-1 shards that succeed, and one shard request fails. But I don't have a good reproducible case yet to debug. -- Ken -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
SearchComponents and ShardResponse
Hi all, I feel like I must be missing something here... I'm working on a customized version of the SearchHandler, which supports distributed searching in multiple *local* cores. Assuming you want to support SearchComponents, then my handler needs to create/maintain a ResponseBuilder, which is passed to various SearchComponent methods. The ResponseBuilder has a "finished" list of ShardRequest objects, for requests that have received responses from shards. Inside the ShardRequest is a "responses" list of ShardResponse objects, which contain things like the SolrResponse. The SolrResponse field in ShardResponse is private, and the method to set it is package private. So it doesn't appear like there's any easy way to create the ShardResponse objects that the SearchComponents expect to receive inside of the ResponseBuilder. If I put my custom SearchHandler class into the same package as the ShardResponse class, then I can call setSolrResponse(). It builds, and I can run locally. But if I deploy a jar with this code, then at runtime I get an illegal access exception when running under Jetty. I can make it work by re-building the solr.war with my custom SearchHandler, but that's pretty painful. Any other ideas/input? Thanks, -- Ken -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
Distributed search and binary fields w/Solr 3.4
Hi there, I'm running into a problem where queries that are distributed among multiple shards don't return binary field data properly. If I hit a single core, the XML response to my HTTP request contains the expected data. If I hit the request handler that's configured to distribute the request to my shards, the XML contains "B[B". It looks like I wind up getting the .toString() data, not the actual data itself. Has anybody else run into this? I've done a fair amount of searching, but no hits yet. Next step is to create a unit test in Solr, if nobody raises their hand, and then walk through it. Thanks, -- Ken ------ Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
Re: overwrite=false support with SolrJ client
On Nov 7, 2011, at 12:06pm, Chris Hostetter wrote: > > : I see that https://issues.apache.org/jira/browse/SOLR-653 removed this > : support from SolrJ, because it was deemed too dangerous for mere > : mortals. > > I believe the concern was that the "novice level" API was very in your > face about asking if you wanted to "overwrite" and made it too easy to > hurt yourself. > > It should still be fairly trivial to specify overwrite=false in a SolrJ > request -- just not using the convenience methods. something like... > > UpdateRequest req = new UpdateRequest(); > req.add(myBigCollectionOfDocuments); > req.setParam(UpdateParams.OVERWRITE, true); > req.process(mySolrServer); That seemed to work, thanks for the suggestion - though using (in case anybody else reads this) req.setParam(UpdateParams.OVERWRITE, Boolean.toString(false)); I'll need to run some tests to check performance improvements. > : For Hadoop-based workflows, it's straightforward to ensure that the > : unique key field is really unique, thus if the performance gain is > : significant, I might look into figuring out some way (with a trigger > : lock) of re-enabling this support in SolrJ. > > it's not just an issue of knowing that the key is unique -- it's an issue > of being certain that your index does not contain any documents with the > same key as a document you are about to add. If you are generating a > completely new solr index from data that you are certain is unique -- then > you will probably see some perf gains. but if you are adding to an > existing index, i would avoid it. For Hadoop workflows, the output is always fresh (unless you do some interesting helicopter stunts). So yes, by default the index is always being rebuilt from scratch. And thus as long as the primary key is being used as the reduce-phase key, it's easy to ensure uniqueness in the index. Thanks again, -- Ken -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
overwrite=false support with SolrJ client
Hi list, I'm working on improving the performance of the Solr scheme for Cascading. This supports generating a Solr index as the output of a Hadoop job. We use SolrJ to write the index locally (via EmbeddedSolrServer). There are mentions of using overwrite=false with the CSV request handler, as a way of improving performance. I see that https://issues.apache.org/jira/browse/SOLR-653 removed this support from SolrJ, because it was deemed too dangerous for mere mortals. My question is whether anyone knows just how much performance boost this really provides. For Hadoop-based workflows, it's straightforward to ensure that the unique key field is really unique, thus if the performance gain is significant, I might look into figuring out some way (with a trigger lock) of re-enabling this support in SolrJ. Thanks, -- Ken ------ Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
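For anyone following along: disabling overwrite can also be expressed directly in Solr's XML update message, independent of SolrJ. A minimal sketch (standard update syntax; the document ids and field names here are invented for illustration):

```xml
<!-- Sketch: XML update message with per-add overwrite disabled.
     The "id"/"title" fields are placeholders, not from the original thread. -->
<add overwrite="false">
  <doc>
    <field name="id">doc-1</field>
    <field name="title">First document</field>
  </doc>
  <doc>
    <field name="id">doc-2</field>
    <field name="title">Second document</field>
  </doc>
</add>
```

As the thread notes, this skips the delete-by-unique-key step that normally precedes each add, so it is only safe when you can guarantee the index contains no document with the same key — e.g. a fresh index built from de-duplicated Hadoop output.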
Re: indexing key value pair into lucene solr index
On Oct 24, 2011, at 1:41pm, jame vaalet wrote: > hi, > in my use case i have a list of key value pairs in each document object, if i > index them as separate index fields then in the result doc object i will get > two arrays corresponding to my keys and values. The problem i face here is > that there won't be any mapping between those keys and values. > > do we have any easy way to index this data in Solr? thanks in advance ... As Karsten said, providing more detail re what you're actually trying to do usually makes for better and more helpful/accurate answers. But I'm guessing you only want to search on the key, not the value, right? If so, then: 1. Create a multi-value field with a custom type, indexed, stored. 2. During indexing, add each entry as a single key + separator + value token. 3. In the custom type, set the analyzer to strip off the separator and value, so you only index the key. -- Ken ------ Ken Krugler +1 530-210-6378 http://bixolabs.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
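To make the steps above concrete, here is a sketch of a schema.xml field type for step 3. The separator choice (a tab), the type/field names, and the use of PatternReplaceFilterFactory are all assumptions for illustration — the example from the original reply was lost in archiving:

```xml
<!-- Sketch: index "key<TAB>value" tokens but keep only the key at index time.
     Names and the tab separator are assumptions, not from the original post. -->
<fieldType name="keyOfPair" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- drop everything from the first tab onward, leaving just the key -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="\t.*$" replacement="" replace="all"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="kvPairs" type="keyOfPair" indexed="true" stored="true" multiValued="true"/>
```

Because the field is stored, each returned value still contains the full key/value token, preserving the mapping; only the indexed terms are reduced to the key.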
Re: Want to support "did you mean xxx" but is Chinese
Hi Floyd, Typically you'd do this by creating a custom analyzer that - segments Chinese text into words - converts from words to pinyin or zhuyin Your index would have both the actual Hanzi characters, plus (via copyField) this phonetic version. During search, you'd use dismax to search against both fields, with a higher weighting to the Hanzi field. But segmentation can be error-prone, and requires embedding specialized code that you typically license (for high quality results) from a commercial vendor. So my first-cut approach would be to use the current synonym support to map each Hanzi to all possible pronunciations. There are numerous open source datasets that contain this information. Note that there might be performance issues with having such a huge set of synonyms. Then, by weighting phrase matches sufficiently high (again using dismax) I think you could get reasonable results. -- Ken On Oct 21, 2011, at 7:33am, Floyd Wu wrote: > Does anybody know how to implement this idea in SOLR? Please kindly > point me in a direction. > > For example, a user enters a keyword in Chinese, "貝多芬" (this is > Beethoven in Chinese), > but keys in a wrong combination of characters, "背多分" (this is > pronounced the same as the intended keyword "貝多芬"). > > The token "貝多芬" actually exists in the Solr index. How to hit documents > where "貝多芬" exists when "背多分" is entered? > > This is a basic function of commercial search engines, especially in > Chinese processing. I wonder how to implement this in SOLR and where > the starting point is. > > Floyd -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
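A rough sketch of the synonym-based fallback described above, as schema.xml wiring. The file name, field names, and the sample pinyin mappings are illustrative assumptions, not a tested configuration:

```xml
<!-- Sketch: map each Hanzi to its pronunciation(s) via synonyms (Solr 3.x era).
     hanzi-pinyin.txt (hypothetical) would contain lines like:
       貝 => bei4
       背 => bei4
       多 => duo1
     so that 貝多芬 and the mistyped 背多分 share phonetic tokens. -->
<fieldType name="text_zh_phonetic" class="solr.TextField">
  <analyzer>
    <!-- one token per CJK character; a real word segmenter would slot in here instead -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="hanzi-pinyin.txt"
            ignoreCase="false" expand="true"/>
  </analyzer>
</fieldType>

<field name="title_phonetic" type="text_zh_phonetic" indexed="true" stored="false"/>
<copyField source="title" dest="title_phonetic"/>
```

Per the weighting suggestion above, searches would then use dismax with something like qf=title^2 title_phonetic, so exact Hanzi matches outrank phonetic ones.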
Re: Does anybody has experience in Chinese soundex(sounds like) of SOLR?
> Wow, interesting question. Can soundex even be applied to a language like > Chinese, which is tonal and doesn't have individual letters, but whole > characters? I'm no expert, but intuitively speaking it sounds hard or maybe > even impossible... The only two cases I can think of are: - Cases where you have two (or more) characters that are variant forms. Unicode tried to unify all of these, but some still exist. And in GB 18030 there are tons. - If you wanted to support phonetic (pinyin or zhuyin) search, then you might want to collapse syllables that are commonly confused. But then of course you'd have to be storing the phonetic forms for all of the words. -- Ken >> From: Floyd Wu >> To: solr-user@lucene.apache.org >> Sent: Thursday, October 20, 2011 5:43 AM >> Subject: Does anybody has experience in Chinese soundex(sounds like) of SOLR? >> >> Hi there, >> >> There are many English soundex implementation can be referenced, but I >> wonder how to do Chinese soundex(sounds like) filter (maybe). >> >> any idea? >> >> Floyd >> >> >> -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
Re: Multi CPU Cores
On Oct 16, 2011, at 1:44pm, Rob Brown wrote: > Looks like I checked the load during a quiet period, ab -n 1 -c 1000 > saw a decent 40% load on each core. > > Still a little confused as to why 1 core stays at 100% constantly - even > during the quiet periods? Could be background GC, depending on what you've got your JVM configured to use. Though that shouldn't stay at 100% for very long. -- Ken > -Original Message- > From: Johannes Goll > Reply-to: solr-user@lucene.apache.org > To: solr-user@lucene.apache.org > Subject: Re: Multi CPU Cores > Date: Sat, 15 Oct 2011 21:30:11 -0400 > > Did you try to submit multiple search requests in parallel? The apache ab > tool is great tool to simulate simultaneous load using (-n and -c). > Johannes > > On Oct 15, 2011, at 7:32 PM, Rob Brown wrote: > >> Hi, >> >> I'm running Solr on a machine with 16 CPU cores, yet watching "top" >> shows that java is only apparently using 1 and maxing it out. >> >> Is there anything that can be done to take advantage of more CPU cores? 
>> >> Solr 3.4 under Tomcat >> >> [root@solr01 ~]# java -version >> java version "1.6.0_20" >> OpenJDK Runtime Environment (IcedTea6 1.9.8) >> (rhel-1.22.1.9.8.el5_6-x86_64) >> OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode) >> >> >> top - 14:36:18 up 22 days, 21:54, 4 users, load average: 1.89, 1.24, >> 1.08 >> Tasks: 317 total, 1 running, 315 sleeping, 0 stopped, 1 zombie >> Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 99.6%id, 0.4%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Cpu6 : 99.6%us, 0.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Cpu8 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Cpu9 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Cpu10 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Cpu11 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Cpu12 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Cpu13 : 0.7%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Cpu14 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Cpu15 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, >> 0.0%st >> Mem: 132088928k total, 23760584k used, 108328344k free, 318228k >> buffers >> Swap: 25920868k total,0k used, 25920868k free, 18371128k cached >> >> PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ >> COMMAND >> >> >> 4466 tomcat20 0 31.2g 4.0g 171m S 101.0 3.2 2909:38 >> java >> >> >> 6495 root 15 0 42416 3892 1740 S 0.4 0.0 9:34.71 >> openvpn >> >> >> 11456 root 16 0 12892 
1312 836 R 0.4 0.0 0:00.08 >> top >> >> >> 1 root 15 0 10368 632 536 S 0.0 0.0 0:04.69 >> init >> >> >> > -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
Re: strange performance issue with many shards on one server
>>>>> special characters. >>>>> if you don't know it: >>>> http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance >>>>> Regards >>>>> Vadim >>>>> >>>>> >>>>> 2011/9/28 Frederik Kraus >>>> (mailto:frederik.kr...@gmail.com) (mailto: >>>> frederik.kr...@gmail.com (mailto:frederik.kr...@gmail.com))> >>>>> >>>>>> Hi, >>>>>> >>>>>> >>>>>> I am experiencing a strange issue doing some load tests. Our setup: >>>>>> >>>>>> - 2 server with each 24 cpu cores, 130GB of RAM >>>>>> - 10 shards per server (needed for response times) running in a single >>>>>> tomcat instance >>>>>> - each query queries all 20 shards (distributed search) >>>>>> >>>>>> - each shard holds about 1.5 mio documents (small shards are needed due >>>> to >>>>>> rather complex queries) >>>>>> - all caches are warmed / high cache hit rates (99%) etc. >>>>>> >>>>>> >>>>>> Now for some reason we cannot seem to fully utilize all CPU power (no >>>> disk >>>>>> IO), ie. increasing concurrent users doesn't increase CPU-Load at a >>>> point, >>>>>> decreases throughput and increases the response times of the individual >>>>>> queries. >>>>>> >>>>>> Also 1-2% of the queries take significantly longer: avg somewhere at >>>> 100ms >>>>>> while 1-2% take 1.5s or longer. >>>>>> >>>>>> Any ideas are greatly appreciated :) >>>>>> >>>>>> Fred. > -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
Re: two cores but have single result set in solr
On Sep 23, 2011, at 2:03pm, hadi wrote: > I have two cores with separate schemas and indexes, but I want to have a single > result set in solr/browse. If they have different schemas, how would you combine results from the two? If they have the same schemas, then you can define a third core with a different conf dir, and in that separate conf/solrconfig.xml you can set up a request handler that just dispatches to the two real cores. -- Ken ------ Ken Krugler +1 530-210-6378 http://bixolabs.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
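A sketch of what that dispatcher core's solrconfig.xml entry might look like. The core names, port, and handler path are placeholders, and — as noted above — this presumes the two real cores share the same schema:

```xml
<!-- Sketch: third "dispatcher" core that fans queries out to two sibling cores.
     Host, port, and core names are placeholders. -->
<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">localhost:8983/solr/coreA,localhost:8983/solr/coreB</str>
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>
```

Queries against the dispatcher core then return a single merged result set, with the usual distributed-search caveats (consistent unique keys across cores, and the merge overhead described elsewhere in this archive).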
Re: Distinct elements in a field
On Sep 15, 2011, at 3:43am, swiss knife wrote: > Simple question: I want to know how many distinct elements I have in a field > that also match a query. Do you know if there's a way to do it today in 3.4? > > I saw SOLR-1814 and SOLR-2242. > > SOLR-1814 seems fairly easy to use. What do you think? Thank you If you turn on facets in your query (facet=true&facet.field=<field>) then you'll get back all of the distinct values, though you might have to play with other settings (e.g. facet.limit=-1) to get the results you need. -- Ken -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
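As a concrete illustration of the faceting suggestion, the parameters can be baked into a request handler. The handler path and "myField" are placeholders; the facet parameter names are standard:

```xml
<!-- Sketch: handler that returns every distinct value of a field as facet buckets.
     "myField" is a placeholder for the field of interest. -->
<requestHandler name="/distinct" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="rows">0</str>              <!-- only the facet counts are needed -->
    <str name="facet">true</str>
    <str name="facet.field">myField</str>
    <str name="facet.limit">-1</str>      <!-- no cap on returned values -->
    <str name="facet.mincount">1</str>    <!-- skip values with zero matches -->
  </lst>
</requestHandler>
```

Passing the restricting query as q (rather than *:*) gives the distinct values among only the matching documents; the number of facet buckets returned is the distinct-value count the poster asked for.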
Re: Will Solr/Lucene crawl multi websites (aka a mini google with faceted search)?
On Sep 11, 2011, at 7:04pm, dpt9876 wrote: > Hi thanks for the reply. > > How does nutch/solr handle the scenario where 1 website calls price, "price" > and another website calls it "cost". Same thing different name, yet I would > want the facet to handle that and not create a different facet. > > Is this combo of nutch and Solr that intelligent and or intuitive? What you're describing here is web mining, not web crawling. You want to extract price data from web pages, and put that into a specific field in Solr. To do that using Nutch, you'd need to write custom plug-ins that know how to extract the price from a page, and add that as a custom field to the crawl results. The above is a topic for the Nutch mailing list, since Solr is just a downstream consumer of whatever Nutch provides. -- Ken > On Sep 12, 2011 9:06 AM, "Erick Erickson [via Lucene]" < > ml-node+s472066n3328340...@n3.nabble.com> wrote: >> >> >> Nope, there's nothing in Solr that crawls anything, you have to feed >> documents in yourself from the websites. >> >> Or, look at the Nutch project, see: http://nutch.apache.org/about.html >> >> which is designed for this kind of problem. >> >> Best >> Erick >> >> On Sun, Sep 11, 2011 at 8:53 PM, dpt9876 > wrote: >>> Hi all, >>> I am wondering if Solr will do the following for a project I am working > on. >>> I want to create a search engine with facets for potentially hundreds of >>> websites. >>> Similar to say crawling amazon + buy.com + ebay and someone can search > these >>> 3 sites from my 1 website. >>> (I realise there are better ways of doing the above example, its for >>> illustrative purposes). >>> Eventually I would build that search crawl to index say 200 or 1000 >>> merchants. >>> Someone would come to my site and search for "digital camera". >>> >>> They would get results from all 3 indexes and hopefully dynamic facets eg >>> Price $100-200 >>> Price 200-300 >>> Resolution 1mp-2mp >>> >>> etc etc >>> >>> Can this be done on the fly? 
>>> >>> I ask this because I am currently developing webscrapers to crawl these >>> websites, dump that data into a db, then was thinking of tacking on a > solr >>> server to crawl my db. >>> >>> Problem with that approach is that crawling the worlds ecommerce sites > will >>> take forever, when it seems solr might do that for me? (I have read about >>> multiple indexes etc). >>> >>> Many thanks >>> >>> -- >>> View this message in context: > http://lucene.472066.n3.nabble.com/Will-Solr-Lucene-crawl-multi-websites-aka-a-mini-google-with-faceted-search-tp3328314p3328314.html >>> Sent from the Solr - User mailing list archive at Nabble.com. -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
Re: performance crossover between single index and sharding
With low qps and multi-core servers, I believe one reason to have multiple shards on one server is to provide better parallelism for a request, and thus reduce your response time. -- Ken On Aug 2, 2011, at 11:06am, Jonathan Rochkind wrote: > What's the reasoning behind having three shards on one machine, instead of > just combining those into one shard? Just curious. I had been thinking the > point of shards was to get them on different machines, and there'd be no > reason to have multiple shards on one machine. > > On 8/2/2011 1:59 PM, Burton-West, Tom wrote: >> Hi Markus, >> >> Just as a data point for a very large sharded index, we have the full text >> of 9.3 million books with an index size of about 6+ TB spread over 12 shards >> on 4 machines. Each machine has 3 shards. The size of each shard ranges >> between 475GB and 550GB. We are definitely I/O bound. Our machines have >> 144GB of memory with about 16GB dedicated to the tomcat instance running the >> 3 Solr instances, which leaves about 120 GB (or 40GB per shard) for the OS >> disk cache. We release a new index every morning and then warm the caches >> with several thousand queries. I probably should add that our disk storage >> is a very high performance Isilon appliance that has over 500 drives and >> every block of every file is striped over no less than 14 different drives. >> (See blog for details *) >> >> We have a very low number of queries per second (0.3-2 qps) and our modest >> response time goal is to keep 99th percentile response time for our >> application (i.e. Solr + application) under 10 seconds. 
>> >> Our current performance statistics are: >> >> average response time 300 ms >> median response time 113 ms >> 90th percentile 663 ms >> 95th percentile 1,691 ms >> >> We had plans to do some performance testing to determine the optimum shard >> size and optimum number of shards per machine, but that has remained on the >> back burner for a long time as other higher priority items keep pushing it >> down on the todo list. >> >> We would be really interested to hear about the experiences of people who >> have so many shards that the overhead of distributing the queries, and >> consolidating/merging the responses becomes a serious issue. >> >> >> Tom Burton-West >> >> http://www.hathitrust.org/blogs/large-scale-search >> >> * >> http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5-million-volumes-and-beyond >> >> -Original Message- >> From: Markus Jelsma [mailto:markus.jel...@openindex.io] >> Sent: Tuesday, August 02, 2011 12:33 PM >> To: solr-user@lucene.apache.org >> Subject: Re: performance crossover between single index and sharding >> >> Actually, i do worry about it. Would be marvelous if someone could provide >> some metrics for an index of many terabytes. >> >>> [..] At some extreme point there will be diminishing >>> returns and a performance decrease, but I wouldn't worry about that at all >>> until you've got many terabytes -- I don't know how many but don't worry >>> about it. >>> >>> ~ David >>> >>> - >>> Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book >>> -- >>> View this message in context: >>> http://lucene.472066.n3.nabble.com/performance-crossover-between-single-index-and-sharding-tp3218561p3219397.html Sent from the Solr - User mailing list archive at Nabble.com. -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom data mining solutions
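As background for readers new to distributed search: a single Solr node fans a query out to every shard listed in the shards request parameter, whether those shards live on one machine or several, which is why multiple shards per box can still improve per-request parallelism. A minimal sketch of how that parameter is assembled (host names and core names are hypothetical):

```java
import java.util.List;

// Sketch: naming the shards of a distributed query. Hosts/ports/cores are
// hypothetical; each entry receives a sub-query, and one node merges results.
public class ShardsParamDemo {
    public static void main(String[] args) {
        List<String> shards = List.of(
                "solr1:8983/solr/books",   // shard 1
                "solr1:8984/solr/books",   // shard 2, same machine, second instance
                "solr2:8983/solr/books");  // shard 3
        String url = "http://solr1:8983/solr/books/select?q=dog&shards="
                + String.join(",", shards);
        System.out.println(url);
    }
}
```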
Re: Processing/Indexing CSV
On Jun 9, 2011, at 2:21pm, Helmut Hoffer von Ankershoffen wrote: > Hi, > > btw: there seems to somewhat of a non-match regarding efforts to Enhance DIH > regarding the CSV format (James Dyer) and the effort to maintain the > CSVLoader (Ken Krugler). How about merging your efforts and migrating the > CSVLoader to a CSVEntityProcessor (cp. my initial email)? :-) While I'm a CSVLoader user (and I've found/fixed one bug in it), I'm not involved in any active development/maintenance of that piece of code. If James or you can make progress on merging support for CSV into DIH, that's great. -- Ken > On Thu, Jun 9, 2011 at 11:17 PM, Helmut Hoffer von Ankershoffen < > helmut...@googlemail.com> wrote: > >> >> >> On Thu, Jun 9, 2011 at 11:05 PM, Ken Krugler >> wrote: >> >>> >>> On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote: >>> >>>> Hi, >>>> >>>> ... that would be an option if there is a defined set of field names and >>> a >>>> single column/CSV layout. The scenario however is different csv files >>> (from >>>> different shops) with individual column layouts (separators, encodings >>>> etc.). The idea is to map known field names to defined field names in >>> the >>>> solr schema. If I understand the capabilities of the CSVLoader correctly >>>> (sorry, I am completely new to Solr, started work on it today) this is >>> not >>>> possible - is it? >>> >>> As per the documentation on >>> http://wiki.apache.org/solr/UpdateCSV#fieldnames, you can specify the >>> names/positions of fields in the CSV file, and ignore fieldnames. >>> >>> So this seems like it would solve your requirement, as each different >>> layout could specify its own such mapping during import. >>> >>> Sure, but the requirement (to keep the process of integrating new shops >> efficient) is not to have one mapping per import (cp. 
the Email regarding >> "more or less schema free") but to enhance one mapping that maps common >> field names to defined fields disregarding order of known fields/columns. As >> far as I understand that is not a problem at all with DIH, however DIH and >> CSV are not a perfect match ,-) >> >> >>> It could be handy to provide a fieldname map (versus the value map that >>> UpdateCSV supports). >> >> Definitely. Either a fieldname map in CSVLoader or a robust CSVLoader in >> DIH ... >> >> >>> Then you could use the header, and just provide a mapping from header >>> fieldnames to schema fieldnames. >>> >> That's the idea -) >> >> => what's the best way to progress. Either someone enhances the CSVLoader >> by a field mapper (with multipel input field names mapping to one field name >> in the Solr schema) or someone enhances the DIH with a robust CSV loader >> ,-). As I am completely new to this Community, please give me the direction >> to go (or wait :-). >> >> best regards >> >> >>> -- Ken >>> >>>> On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley < >>> yo...@lucidimagination.com>wrote: >>>> >>>>> On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen >>>>> wrote: >>>>>> Hi, >>>>>> yes, it's about CSV files loaded via HTTP from shops to be fed into a >>>>>> shopping search engine. >>>>>> The CSV Loader cannot map fields (only field values) etc. >>>>> >>>>> You can provide your own list of fieldnames and optionally ignore the >>>>> first line of the CSV file (assuming it contains the field names). >>>>> http://wiki.apache.org/solr/UpdateCSV#fieldnames >>>>> >>>>> -Yonik >>>>> http://www.lucidimagination.com >>>>> >>> >>> -- >>> Ken Krugler >>> +1 530-210-6378 >>> http://bixolabs.com >>> custom data mining solutions >>> >>> >>> >>> >>> >>> >>> >> -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom data mining solutions
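The "fieldname map" idea being discussed (as opposed to the value map UpdateCSV already supports) is easy to picture outside of Solr. A toy sketch, with invented header and schema field names, showing several shop-specific headers collapsing onto one canonical field:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch of a fieldname map: known header spellings from different
// shops all map onto one canonical schema field. All names are hypothetical.
public class FieldnameMap {
    private static final Map<String, String> HEADER_TO_SCHEMA = new LinkedHashMap<>();
    static {
        HEADER_TO_SCHEMA.put("price", "price");
        HEADER_TO_SCHEMA.put("cost", "price");      // shop B calls it "cost"
        HEADER_TO_SCHEMA.put("item_name", "name");
        HEADER_TO_SCHEMA.put("title", "name");
    }

    // Map a raw CSV header to the schema field, or null if unknown.
    static String schemaField(String header) {
        return HEADER_TO_SCHEMA.get(header.trim().toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(schemaField("Cost"));      // price
        System.out.println(schemaField("item_name")); // name
    }
}
```

With one shared map like this, a new shop's feed only requires adding its header spellings, rather than writing a fresh per-import mapping.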
Re: Processing/Indexing CSV
On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote: > Hi, > > ... that would be an option if there is a defined set of field names and a > single column/CSV layout. The scenario however is different csv files (from > different shops) with individual column layouts (separators, encodings > etc.). The idea is to map known field names to defined field names in the > solr schema. If I understand the capabilities of the CSVLoader correctly > (sorry, I am completely new to Solr, started work on it today) this is not > possible - is it? As per the documentation on http://wiki.apache.org/solr/UpdateCSV#fieldnames, you can specify the names/positions of fields in the CSV file, and ignore fieldnames. So this seems like it would solve your requirement, as each different layout could specify its own such mapping during import. It could be handy to provide a fieldname map (versus the value map that UpdateCSV supports). Then you could use the header, and just provide a mapping from header fieldnames to schema fieldnames. -- Ken > On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley > wrote: > >> On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen >> wrote: >>> Hi, >>> yes, it's about CSV files loaded via HTTP from shops to be fed into a >>> shopping search engine. >>> The CSV Loader cannot map fields (only field values) etc. >> >> You can provide your own list of fieldnames and optionally ignore the >> first line of the CSV file (assuming it contains the field names). >> http://wiki.apache.org/solr/UpdateCSV#fieldnames >> >> -Yonik >> http://www.lucidimagination.com >> -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom data mining solutions
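For completeness, the per-import mapping Ken describes is expressed through UpdateCSV request parameters. A hedged sketch, with hypothetical host, core, and column names; fieldnames assigns the file's columns positionally to schema fields, and header=true tells Solr to skip the file's own header line:

```shell
# Hypothetical import: shop-b.csv's columns are (in order) id, cost, title.
# fieldnames maps them positionally onto the schema's id, price, and name
# fields; header=true makes Solr skip the file's own header row.
curl 'http://localhost:8983/solr/update/csv?fieldnames=id,price,name&header=true&commit=true' \
  --data-binary @shop-b.csv -H 'Content-type:text/plain; charset=utf-8'
```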
Re: Solr monitoring: Newrelic
It sounds like "roySolr" is running embedded Jetty, launching Solr using start.jar. If so, then there's no app container where Newrelic can be installed. -- Ken On Jun 9, 2011, at 2:28am, Sujatha Arun wrote: > Try the RPM support accessed from the account support page, giving all > details; they are very helpful. > > Regards > Sujatha > > On Thu, Jun 9, 2011 at 2:33 PM, roySolr wrote: > >> Yes, that's the problem. There is no jetty folder. >> I have tried the example/lib directory, it's not working. There is no jetty >> war file, only >> jetty-***.jar files >> >> Same error, could not locate a jetty instance. >> >> -- >> View this message in context: >> http://lucene.472066.n3.nabble.com/Solr-monitoring-Newrelic-tp3042889p3043080.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom data mining solutions
Re: Hitting the URI limit, how to get around this?
It sounds like you're hitting the max URL length (8K is a common default) for the HTTP web server that you're using to run Solr. All of the web servers I know about let you bump this limit up via configuration settings. -- Ken On Jun 3, 2011, at 9:27am, JohnRodey wrote: > So here's what I'm seeing: I'm running Solr 3.1 > I'm running a java client that executes a Httpget (I tried HttpPost) with a > large shard list. If I remove a few shards from my current list it returns > fine, when I use my full shard list I get a "HTTP/1.1 400 Bad Request". If > I execute it in firefox with a few shards removed it returns fine, with the > full shard list I get a blank screen returned immediately. > > My URI works at around 7800 characters but adding one more shard to it blows > up. > > Any ideas? > > I've tried using SolrJ rather than httpget before but ran into similar > issues but with even less shards. > See > http://lucene.472066.n3.nabble.com/Long-list-of-shards-breaks-solrj-query-td2748556.html > http://lucene.472066.n3.nabble.com/Long-list-of-shards-breaks-solrj-query-td2748556.html > > > My shards are added dynamically, every few hours I am adding new shards or > cores into the cluster. so I cannot have a shard list in the config files > unless I can somehow update them while the system is running. > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Hitting-the-URI-limit-how-to-get-around-this-tp3017837p3020185.html > Sent from the Solr - User mailing list archive at Nabble.com. -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom data mining solutions
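For the Jetty that ships with the Solr example (Jetty 6), the relevant setting is the connector's header buffer size, since the request line plus all headers must fit inside it. A sketch of the change in etc/jetty.xml; 65536 is just an illustrative value, and your connector class may differ:

```xml
<!-- etc/jetty.xml (Jetty 6): allow ~64K for the request line and headers.
     The value is illustrative; size it to fit your longest shards list.
     If your jetty.xml defines a different connector class, set it there. -->
<New class="org.mortbay.jetty.nio.SelectChannelConnector">
  <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
  <Set name="headerBufferSize">65536</Set>
</New>
```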
Re: Difference between Solr and Lucidworks distribution
On Apr 3, 2011, at 6:56am, yehosef wrote: > How can they require payment for something that was developed under the > apache license? It's the difference between free speech and free beer :) See http://en.wikipedia.org/wiki/Gratis_versus_libre -- Ken ------ Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: boilerpipe solr tika howto please
Hi Arno, On Jan 14, 2011, at 3:57am, arnaud gaudinat wrote: Hello, I would like to use BoilerPipe (a very good program which cleans the HTML content from surplus "clutter"). I saw that BoilerPipe is inside Tika 0.8 and so should be accessible from Solr, am I right? How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml (with org.apache.solr.handler.extraction.ExtractingRequestHandler)? Or do I need to modify some code inside Solr? I saw something like TikaCLI -F in the Tika forum (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration) - is that the right way? You need to add the BoilerpipeContentHandler into Tika's content handler chain. Which I'm pretty sure means you'd need to modify Solr, e.g. (in trunk) the TikaEntityProcessor.getHtmlHandler() method. I'd try something like: return new BoilerpipeContentHandler(new ContentHandlerDecorator( Though from a quick look at that code, I'm curious why it doesn't use BodyContentHandler, versus the current ContentHandlerDecorator. -- Ken -- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: How to let crawlers in, but prevent their damage?
On Jan 10, 2011, at 7:02am, Otis Gospodnetic wrote: Hi Ken, thanks Ken. :) The problem with this approach is that it exposes very limited content to bots/web search engines. Take http://search-lucene.com/ for example. People enter all kinds of queries in web search engines and end up on that site. People who visit the site directly don't necessarily search for those same things. Plus, new terms are entered to get to search-lucene.com every day, so keeping up with that would mean constantly generating more and more of those static pages. Basically, the tail is super long. To clarify - the issue of using actual user search traffic is one of SEO, not what content you expose. If, for example, people commonly do a search for "java " then that's a hint that the URL to the static content, and the page title, should have the language as part of it. So you shouldn't be generating static pages based on search traffic. Though you might want to decide what content to "favor" (see below) based on popularity. On top of that, new content is constantly being generated, so one would have to also constantly both add and update those static pages. Yes, but that's why you need to automate that content generation, and do it on a regular (e.g. weekly) basis. The big challenges we ran into were: 1. Dealing with badly behaved bots that would hammer the site. We wound up putting this content on a separate system, so it wouldn't impact users on the main system. And generating a regular report by user agent & IP address, so that we could block by robots.txt and IP when necessary. 2. Figuring out how to structure the static content so that it didn't look like spam to Google/Yahoo/Bing. You don't want to have too many links per page, or too much depth, but that constrains how many pages you can reasonably expose. We had project scores based on code, activity, usage - so we used that to rank the content and focus on exposing early (low depth) the "good stuff". 
You could do the same based on popularity, from search logs. Anyway, there's a lot to this topic, but it doesn't feel very Solr specific. So apologies for reducing the signal-to-noise ratio with talk about SEO :) -- Ken I have a feeling there is not a good solution for this because on one hand people don't like the negative bot side effect, on the other hand people want as much of their sites indexed by the big guys. The only half-solution that comes to mind involves looking at who's actually crawling you and who's bringing you visitors, then blocking those with a bad ratio of those two - bots that crawl a lot but don't bring a lot of value. Any other ideas? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Ken Krugler To: solr-user@lucene.apache.org Sent: Mon, January 10, 2011 9:43:49 AM Subject: Re: How to let crawlers in, but prevent their damage? Hi Otis, From what I learned at Krugle, the approach that worked for us was: 1. Block all bots on the search page. 2. Expose the target content via statically linked pages that are separately generated from the same backing store, and optimized for target search terms (extracted from your own search logs). -- Ken On Jan 10, 2011, at 5:41am, Otis Gospodnetic wrote: Hi, How do people with public search services deal with bots/crawlers? And I don't mean to ask how one bans them (robots.txt) or slow them down (Delay stuff in robots.txt) or prevent them from digging too deep in search results... What I mean is that when you have publicly exposed search that bots crawl, they issue all kinds of crazy "queries" that result in errors, that add noise to Solr caches, increase Solr cache evictions, etc. etc. Are there some known recipes for dealing with them, minimizing their negative side-effects, while still letting them crawl you? 
Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ -- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g -- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: How to let crawlers in, but prevent their damage?
Hi Otis, From what I learned at Krugle, the approach that worked for us was: 1. Block all bots on the search page. 2. Expose the target content via statically linked pages that are separately generated from the same backing store, and optimized for target search terms (extracted from your own search logs). -- Ken On Jan 10, 2011, at 5:41am, Otis Gospodnetic wrote: Hi, How do people with public search services deal with bots/crawlers? And I don't mean to ask how one bans them (robots.txt) or slow them down (Delay stuff in robots.txt) or prevent them from digging too deep in search results... What I mean is that when you have publicly exposed search that bots crawl, they issue all kinds of crazy "queries" that result in errors, that add noise to Solr caches, increase Solr cache evictions, etc. etc. Are there some known recipes for dealing with them, minimizing their negative side-effects, while still letting them crawl you? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ ---------- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: entire farm fails at the same time with OOM issues
On Nov 30, 2010, at 5:16pm, Robert Petersen wrote: What would I do with the heap dump though? Run one of those java heap analyzers looking for memory leaks or something? I have no experience with those. I saw there was a bug fix in solr 1.4.1 for a 100 byte memory leak occurring on each commit, but it would take thousands of commits to make that add up to anything right? Typically when I run out of memory in Solr, it's during an index update, when the new index searcher is getting warmed up. Looking at the heap often shows ways to reduce memory requirements, e.g. you'll see a really big chunk used for a sorted field. See http://wiki.apache.org/solr/SolrCaching and http://wiki.apache.org/solr/SolrPerformanceFactors for more details. -- Ken -Original Message----- From: Ken Krugler [mailto:kkrugler_li...@transpac.com] Sent: Tuesday, November 30, 2010 3:12 PM To: solr-user@lucene.apache.org Subject: Re: entire farm fails at the same time with OOM issues Hi Robert, I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=, so then you have something to look at versus a Gedankenexperiment :) -- Ken On Nov 30, 2010, at 3:04pm, Robert Petersen wrote: Greetings, we are running one master and four slaves of our multicore solr setup. We just served searches for our catalog of 8 million products with this farm during black Friday and cyber Monday, our busiest days of the year, and the servers did not break a sweat! Index size is about 28GB. However, twice now recently during a time of low load we have had a fire drill where I have seen tomcat/solr fail and become unresponsive after some OOM heap errors. Solr wouldn't even serve up its admin pages. I've had to go in and manually knock tomcat out of memory and then restart it. These solr slaves are load balanced and the load balancers always probe the solr slaves so if they stop serving up searches they are automatically removed from the load balancer. 
When all four fail at the same time we have an issue! My question is this. Why in the world would all of my slaves, after running fine for some days, suddenly all at the exact same minute experience OOM heap errors and go dead? The load balancer kicks them all out at the same time each time. Each slave only talks to the master and not to each other, but the master show no errors in the logs at all. Something must be triggering this though. The only other odd thing I saw in the logs was after the first OOM errors were recorded, the slaves started occasionally not being able to get to the master. This behavior makes me a little nervous...=:-o eek! Environment: Lucid Imagination distro of Solr 1.4 on Tomcat Platform: RHEL with Sun JRE 1.6.0_18 on dual quad xeon machines with 64GB memory etc etc <http://ken-blog.krugler.org> +1 530-265-2225 ------ Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g ------ Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: entire farm fails at the same time with OOM issues
Hi Robert, I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=, so then you have something to look at versus a Gedankenexperiment :) -- Ken On Nov 30, 2010, at 3:04pm, Robert Petersen wrote: Greetings, we are running one master and four slaves of our multicore solr setup. We just served searches for our catalog of 8 million products with this farm during black Friday and cyber Monday, our busiest days of the year, and the servers did not break a sweat! Index size is about 28GB. However, twice now recently during a time of low load we have had a fire drill where I have seen tomcat/solr fail and become unresponsive after some OOM heap errors. Solr wouldn't even serve up its admin pages. I've had to go in and manually knock tomcat out of memory and then restart it. These solr slaves are load balanced and the load balancers always probe the solr slaves so if they stop serving up searches they are automatically removed from the load balancer. When all four fail at the same time we have an issue! My question is this. Why in the world would all of my slaves, after running fine for some days, suddenly all at the exact same minute experience OOM heap errors and go dead? The load balancer kicks them all out at the same time each time. Each slave only talks to the master and not to each other, but the master show no errors in the logs at all. Something must be triggering this though. The only other odd thing I saw in the logs was after the first OOM errors were recorded, the slaves started occasionally not being able to get to the master. This behavior makes me a little nervous...=:-o eek! Environment: Lucid Imagination distro of Solr 1.4 on Tomcat Platform: RHEL with Sun JRE 1.6.0_18 on dual quad xeon machines with 64GB memory etc etc <http://ken-blog.krugler.org> +1 530-265-2225 ---------- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
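For Tomcat specifically, those flags usually go into CATALINA_OPTS (or JAVA_OPTS). A sketch, with an arbitrary example path; note that a heap dump can be as large as the heap itself, so point it at a roomy disk:

```shell
# bin/setenv.sh (or wherever your Tomcat picks up JVM options).
# The dump path is illustrative; dumps can run to many GB.
export CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/tmp/solr-dumps"
```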
Re: Dinamically change master
Hi Tommaso, On Nov 30, 2010, at 7:41am, Tommaso Teofili wrote: Hi all, in a replication environment if the host where the master is running goes down for some reason, is there a way to communicate to the slaves to point to a different (backup) master without manually changing configuration (and restarting the slaves or their cores)? Basically I'd like to be able to change the replication master dinamically inside the slaves. Do you have any idea of how this could be achieved? One common approach is to use VIP (virtual IP) support provided by load balancers. Your slaves are configured to use a VIP to talk to the master, so that it's easy to dynamically change which master they use, via updates to the load balancer config. -- Ken ------ Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
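Concretely, with Solr 1.4's Java-based replication, each slave's solrconfig.xml would point masterUrl at the VIP rather than at a physical host. A sketch (host name and poll interval are illustrative):

```xml
<!-- Slave-side replication config. "solr-master-vip" is a virtual
     hostname/IP managed by the load balancer, so the physical master
     can change without touching or restarting the slaves. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://solr-master-vip:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```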
Re: A Newbie Question
Solaris with hundreds of thousands of text files (there are other files, as well, but my target is these text files). The directories on the Solaris boxes are exported and are available as NFS mounts. I have installed Solr 1.4 on a Linux box and have tested the installation, using curl to post documents. However, the manual says that curl is not the recommended way of posting documents to Solr. Could someone please tell me what is the preferred approach in such an environment? I am not a programmer and would appreciate some hand-holding here :o) Thanks in advance, Sesh -- Lance Norskog goks...@gmail.com -- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: Dynamic creating of cores in solr
lrInputDocumentList();
UpdateResponse rsp;
try {
    rsp = indexCore.add(docList);
    rsp = indexCore.commit();
} catch (IOException e) {
    LOG.warn("Error commiting documents", e);
} catch (SolrServerException e) {
    LOG.warn("Error commiting documents", e);
}

[snip]

3) optimize, then swap cores:

private void optimizeCore() {
    try {
        indexCore.optimize();
    } catch (SolrServerException e) {
        LOG.warn("Error while optimizing core", e);
    } catch (IOException e) {
        LOG.warn("Error while optimizing core", e);
    }
}

private void swapCores() {
    String liveCore = indexName;
    String indexCore = indexName + SolrConstants.SUFFIX_INDEX; // SUFFIX_INDEX = "_INDEX"
    LOG.info("Swapping Solr cores: " + indexCore + ", " + liveCore);
    CoreAdminRequest request = new CoreAdminRequest();
    request.setAction(CoreAdminAction.SWAP);
    request.setCoreName(indexCore);
    request.setOtherCoreName(liveCore);
    try {
        request.process(solr);
    } catch (SolrServerException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

And that's about it. You could adjust the above so there's only one core per index that you want - if you don't do complete reindexes, and don't need the index to always be searchable. Hope that helps... Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com -Original Message- From: Nizan Grauer [mailto:niz...@yahoo-inc.com] Sent: Tuesday, November 09, 2010 3:36 AM To: solr-user@lucene.apache.org Subject: Dynamic creating of cores in solr Hi, I'm not sure this is the right mail to write to, hopefully you can help or direct me to the right person I'm using solr - one master with 17 slaves in the server and using solrj as the java client Currently there's only one core in all of them (master and slaves) - only the cpaCore. I thought about using multi-cores solr, but I have some problems with that. 
I don't know in advance which cores I'd need - When my java program runs, I call for documents to be index to a certain url, which contains the core name, and I might create a url based on core that is not yet created. For example: Calling to index - http://localhost:8080/cpaCore - existing core, everything as usual Calling to index - http://localhost:8080/newCore - server realizes there's no core "newCore", creates it and indexes to it. After that - also creates the new core in the slaves Calling to index - http://localhost:8080/newCore - existing core, everything as usual What I'd like to have on the server side to do is realize by itself if the cores exists or not, and if not - create it One other restriction - I can't change anything in the client side - calling to the server can only make the calls it's doing now - for index and search, and cannot make calls for cores creation via the CoreAdminHandler. All I can do is something in the server itself What can I do to get it done? Write some RequestHandler? REquestProcessor? Any other option? Thanks, nizan -- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: Inconsistent slave performance after optimize
. After the 4 hour break, we re-moved the 3rd and last slave server from our load-balancing pool, then re-enabled replication. This time we saw a tiny blip. The average performance went up to 1 second briefly then went back to the (normal for us) 0.25 to 0.5 second range. We then added this server back to the load-balancing pool and observed no degradation in performance. While we were happy to avoid a repeat of the poor performance we saw on the previous slaves, we are at a loss to explain why this slave did not also have such poor performance. At this point we're scratching our heads trying to understand: (a) Why the performance of the first two slaves was so terrible after the optimize. We think its cache-warming related, but we're not sure. 10 hours seems like a long time to wait for the cache to warm up (b) Why the performance of the third slave was barely impacted. It should have hit the same cold-cache issues as the other servers, if that is indeed the root cause. (c) Why performance of the first 2 slaves is still much worse after the optimize than it was before the optimize, where the performance of the 3rd slave is pretty much unchanged. We expected the optimize to *improve* performance. All 3 slave servers are identically configured, and the procedure for re-enabling replication was identical for the 2nd and 3rd slaves, with the exception of a 4-hour wait period. We have confirmed that the 3rd slave did replicate, the number of documents and total index size matches the master and other slave servers. I'm writing to fish for an explanation or ideas that might explain this inconsistent performance. Obviously, we'd like to be able to reproduce the performance of the 3rd slave, and avoid the poor performance of the first two slaves the next time we decide it's time to optimize our index. thanks in advance, Mason -- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
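As an aside, the warming Tom describes ("warm the caches with several thousand queries") can also be wired into Solr itself via searcher event listeners in solrconfig.xml, so the queries run before a newly replicated (or freshly optimized) index starts serving traffic. A sketch with made-up queries:

```xml
<!-- solrconfig.xml: run warming queries whenever a new searcher opens,
     e.g. right after an optimized index is replicated in. The queries
     are made-up examples; a firstSearcher listener covers startup. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">history of science</str></lst>
    <lst><str name="q">civil war</str></lst>
  </arr>
</listener>
```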
Re: Multiple Word Facets
On Oct 27, 2010, at 6:29am, Adam Estrada wrote: Ahhh...I see! I am doing my testing crawling a couple websites using Nutch and in doing so I am assigning my facets to the title field which is type=text. Are you saying that I will need to manually generate the content for my facet field? I can see the reason and need for doing it that way but I really need for my faceting to happen dynamically based on the content in the field which in this case is the title of a URL. You would use copyfield to copy the contents of the title into a new field that uses the string type, and is the one you use for faceting. -- Ken On Wed, Oct 27, 2010 at 9:19 AM, Jayendra Patil wrote: The Shingle Filter Breaks the words in a sentence into a combination of 2/3 words. For faceting field you should use :- stored="true" multiValued="true"/> The type of the field should be *string *so that it is not tokenised at all. On Wed, Oct 27, 2010 at 9:12 AM, Adam Estrada wrote: Thanks guys, the solr.ShingleFilterFactory did work to get me multiple terms per facet but now I am seeing some redundancy in the facets numbers. See below... Highway (62) Highway System (59) National (59) National Highway (59) National Highway System (59) System (59) See what's going on here? How can I make my multi token facets smarter so that the tokens aren't duplicated? Thanks in advance, Adam On Tue, Oct 26, 2010 at 10:32 PM, Ahmet Arslan wrote: Facets are generated from indexed terms. Depending on your need/use-case: You can use a additional separate String field (which is not tokenized) for facets, populate it via copyField. Search on tokenized field facet on non-tokenized field. Or You can add solr.ShingleFilterFactory to your index analyzer to form multiple word terms. 
--- On Wed, 10/27/10, Adam Estrada wrote: From: Adam Estrada Subject: Multiple Word Facets To: solr-user@lucene.apache.org Date: Wednesday, October 27, 2010, 4:43 AM All, I am a new to Solr faceting and stuck on how to get multiple-word facets returned from a standard Solr query. See below for what is currently being returned. 89 87 87 87 84 60 32 22 19 15 15 14 12 11 10 9 7 7 7 6 6 6 6 ...etc... There are many terms in there that are 2 or 3 word phrases. For example, Eastern Federal Lands Highway Division all gets broken down in to the individual words that make up the total group of words. I've seen quite a few websites that do what it is I am trying to do here so any suggestions at this point would be great. See my schema below (copied from the example schema). Similar for type="query". Please advise on how to group or cluster document terms so that they can be used as facets. Many thanks in advance, Adam Estrada -- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
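Ken's copyField suggestion, as a schema.xml sketch (field names are invented for the example): search against the tokenized field, facet on the untokenized string copy so phrases like "National Highway System" survive as single facet values:

```xml
<!-- schema.xml sketch; names are illustrative. "title" stays tokenized
     for search; "title_facet" is a raw string type, so faceting keeps
     whole phrases instead of individual words. -->
<field name="title" type="text" indexed="true" stored="true"/>
<field name="title_facet" type="string" indexed="true" stored="false"/>
<copyField source="title" dest="title_facet"/>
```

You would then facet with facet.field=title_facet while still querying the title field.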
Re: using HTTPClient sending solr ping request wont timeout as specified
Hi Renee, Mike is right, this is a question to post on the HttpClient users list (httpclient-us...@hc.apache.org). And yes, there is a separate setConnectionTimeout() that can be used. Though I'm most familiar with HttpClient 4.0, not 3.1. One possibility is that the ping response handler is responding (the connection is established), but you're not getting any data back. -- Ken On Oct 13, 2010, at 4:41am, Michael Sokolov wrote: This does seem more like an HTTPClient question than a solr question - you might get more traction on their lists? Still, from what I remember HTTPClient has a number of timeouts you can set. Perhaps it's the read timeout you need? -Mike -Original Message- From: Renee Sun [mailto:renee_...@mcafee.com] Sent: Tuesday, October 12, 2010 7:47 PM To: solr-user@lucene.apache.org Subject: Re: using HTTPClient sending solr ping request wont timeout as specified I also added the following timeout for the connection, still not working: client.getParams().setSoTimeout(httpClientPingTimeout); client.getParams().setConnectionManagerTimeout(httpClientPingTimeout); -- View this message in context: http://lucene.472066.n3.nabble.com/using-HTTPClient-sending-solr-ping-request-wont-timeout-as-specified-tp1691292p1691355.html Sent from the Solr - User mailing list archive at Nabble.com. ---------- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: PatternReplaceFilterFactory creating empty string as a term
On Oct 5, 2010, at 6:24pm, Shawn Heisey wrote: I am developing a new schema. It has a pattern filter that trims leading and trailing punctuation from terms. It is resulting in empty terms, because there are situations in the analyzer stream where a term happens to be composed of nothing but punctuation. This problem is not happening in production. I want those terms removed. This blank term makes the top of the list as far as term frequency. Out of 7.6 million documents, 4.8 million of them have it. From TermsComponent [the XML response was stripped by the archive; the empty term shows a frequency of 4830648]: [snip] Is there any existing way to remove empty terms during analysis? I tried TrimFilterFactory but that made no difference. You could use LengthFilterFactory to restrict terms to being at least one character long. Is this a bug in PatternReplaceFilterFactory? No, I don't believe so. PatternReplaceFilterFactory creates a PatternReplaceFilter, and the JavaDoc for that says: Note: Depending on the input and the pattern used and the input TokenStream, this TokenFilter may produce Tokens whose text is the empty string. -- Ken <http://ken-blog.krugler.org> +1 530-265-2225 ------ Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
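A sketch of how the LengthFilterFactory suggestion would slot into the analyzer chain, directly after the pattern filter. The tokenizer, pattern, field type name, and max length here are invented for illustration, not taken from Shawn's actual schema:

```xml
<fieldType name="text_trimmed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Hypothetical pattern: strip leading/trailing punctuation -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="(^\p{Punct}+|\p{Punct}+$)" replacement="" replace="all"/>
    <!-- Drop the zero-length terms the pattern filter can produce -->
    <filter class="solr.LengthFilterFactory" min="1" max="512"/>
  </analyzer>
</fieldType>
```

Because LengthFilterFactory runs last, any token the pattern filter reduces to the empty string never reaches the index.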
Re: getting a list of top page-ranked webpages
Hi Ian, On Sep 16, 2010, at 2:44pm, Ian Upright wrote: Hi, this question is a little off topic, but I thought since so many people on this list are probably experts in this field, someone may know. I'm experimenting with my own semantic-based search engine, but I want to test it with a large corpus of web pages. Ideally I would like to have a list of the top 10M or top 100M page-ranked URLs in the world. Short of using Nutch to crawl the entire web and build this page-rank, are there any other ways? What other ways or resources might be available for me to get this (smaller) corpus of top webpages? The public terabyte dataset project would be a good match for what you need. http://bixolabs.com/datasets/public-terabyte-dataset-project/ Of course, that means we have to actually finish the crawl & finalize the Avro format we use for the data :) There are other free collections of data around, though none that I know of which target top-ranked pages. -- Ken ---------- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: Color search for images
On Sep 15, 2010, at 7:59am, Shawn Heisey wrote: My index consists of metadata for a collection of 45 million objects, most of which are digital images. The executives have fallen in love with Google's color image search. Here's a search for "flower" with a red color filter: http://www.google.com/images?q=flower&tbs=isch:1,ic:specific,isc:red I am interested in duplicating this. Can this group of fine people point me in the right direction? I don't want anyone to do it for me, just help me find software and/or algorithms that can extract the color information, then find a way to get Solr to index and search it. When I took a look at the search results, it seems like the word "red" shows up in the image name, or description, or tag for every found image. Are you sure Google is extracting color information? Or just being smart about color-specific keywords found in associated text? -- Ken -- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: Is semicolon a character that needs escaping?
Hi Michael, But in general escaping characters in a query gets tricky - if you can directly build queries versus pre-processing text sent to the query parser, you'll save yourself some pain and suffering. What do you mean by these two alternatives? That is, what exactly could I do better? By "can build...", I meant if you can come up with a GUI whereby the user doesn't have to use special characters (other than say quoting) then you can take a collection of clauses and programmatically build your query, without using the query parser. The code I wound up having to write for what seemed like simple escaping quickly got complex and convoluted - e.g. if you want to allow "AND" as a term, and don't want it to get processed specially by the query parser. Also, since I did the above code the DisMaxRequestHandler has been added to Solr, and it (IIRC) tries to be smart about handling this type of escaping for you. Dismax is not (yet) an option because we need the full lucene syntax within the query. OK - in that case sounds like you're stuck with escaping. -- Ken -- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: Is semicolon a character that needs escaping?
On Sep 2, 2010, at 12:35pm, Michael Lackhoff wrote: According to http://lucene.apache.org/java/2_9_1/queryparsersyntax.html only these characters need escaping: + - && || ! ( ) { } [ ] ^ " ~ * ? : \ but with this simple query: TI:stroke; AND TI:journal I got the error message: HTTP ERROR: 400 Unknown sort order: TI:journal My first guess was that it was a URL encoding issue but everything looks fine: http://localhost:8983/solr/select/?q=TI%3Astroke%3B+AND+TI%3Ajournal&version=2.2&start=0&rows=10&indent=on as you can see, the semicolon is encoded as %3B There is no problem when the query ends with the semicolon: TI:stroke; gives no error. The first query also works if I escape the semicolon: TI:stroke\; AND TI:journal From this I conclude that there is a bug either in the docs or in the query parser or I missed something. What is wrong here? The docs need to be updated, I believe. From some code I wrote back in 2006... // Also note that we escape ';', as Solr uses this to support embedding // commands into the query string (yikes), and the code base we're using // has a bug where if the ';' doesn't have two tokens after it // (white-space separated) then you get an array index out of bounds error. I also had this note, no idea if it's still an issue: // Before we do regular escaping, work around a bug in the Lucene query // parser. If the last character is a '\', we can escape it as '\\', but // if we build an expression that looks like xxx AND () then // the Lucene query parser will treat the final '\' before the ')' as // a signal to escape the ')' character. That's just wrong, but for now // we'll just strip off any trailing '\' characters in the clause. But in general escaping characters in a query gets tricky - if you can directly build queries versus pre-processing text sent to the query parser, you'll save yourself some pain and suffering. 
Also, since I did the above code the DisMaxRequestHandler has been added to Solr, and it (IIRC) tries to be smart about handling this type of escaping for you. -- Ken -- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
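The escaping Ken describes can be done mechanically; SolrJ's ClientUtils.escapeQueryChars() does something very similar. A standalone sketch — the exact set of special characters is an assumption based on the query parser docs plus ';':

```java
public class QueryEscaper {
    // Characters the Lucene/Solr query parser treats specially,
    // including ';' (which older Solr used for embedded sort commands).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    // Backslash-escape every special or whitespace character in s.
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("stroke;"));  // prints stroke\;
    }
}
```

Apply this to user-entered terms only, not to the field names and operators you add around them, or the ':' in TI:stroke would get escaped too.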
Re: Restricting HTML search?
Actually TagSoup's reason for existence is to clean up all of the messy HTML that's out in the wild. Tika's HTML parser wraps this, and uses it to generate the stream of SAX events that it then consumes and turns into a normalized XHTML 1.0-compliant data stream. -- Ken On Aug 25, 2010, at 7:22pm, Lance Norskog wrote: This assumes that the HTML is good quality. I don't know exactly what your use case is. If you're crawling the web you will find some very screwed-up HTML. On Wed, Aug 25, 2010 at 6:45 AM, Ken Krugler wrote: On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote: Wouldn't the usage of NekoHTML (as an XML parser) and XPath be safer? I guess it all depends on the "quality" of the source document. If you're processing HTML then you definitely want to use something like NekoHTML or TagSoup. Note that Tika uses TagSoup and makes it easy to do special processing of specific elements - you give it a content handler that gets fed a stream of cleaned-up HTML elements. -- Ken On 25-Aug-10 at 02:09, Lance Norskog wrote: I would do this with regular expressions. There is a Pattern Analyzer and a Tokenizer which do regular expression-based text chopping. (I'm not sure how to make them do what you want). A more precise tool is the RegexTransformer in the DataImportHandler. Lance On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan wrote: I'm quite new to SOLR and wondering if the following is possible: in addition to normal full text search, my users want to have the option to search only HTML heading innertext, i.e. content inside of , , or tags. ---- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g -- Lance Norskog goks...@gmail.com ---- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: Restricting HTML search?
On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote: Wouldn't the usage of NekoHTML (as an XML parser) and XPath be safer? I guess it all depends on the "quality" of the source document. If you're processing HTML then you definitely want to use something like NekoHTML or TagSoup. Note that Tika uses TagSoup and makes it easy to do special processing of specific elements - you give it a content handler that gets fed a stream of cleaned-up HTML elements. -- Ken On 25-Aug-10 at 02:09, Lance Norskog wrote: I would do this with regular expressions. There is a Pattern Analyzer and a Tokenizer which do regular expression-based text chopping. (I'm not sure how to make them do what you want). A more precise tool is the RegexTransformer in the DataImportHandler. Lance On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan wrote: I'm quite new to SOLR and wondering if the following is possible: in addition to normal full text search, my users want to have the option to search only HTML heading innertext, i.e. content inside of , , or tags. ------------ Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
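Tying this back to Andrew's heading-only search: once TagSoup/NekoHTML has produced well-formed XHTML, a small SAX handler can pull out just the heading text for indexing into a separate field. A minimal sketch using the JDK's parser — the class is invented for illustration, it assumes already-cleaned XHTML input, and in a Tika pipeline you would hand a handler like this to the parser directly:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HeadingExtractor extends DefaultHandler {
    private final StringBuilder headings = new StringBuilder();
    private int headingDepth = 0;   // > 0 while inside an h1..h6 element

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (qName.matches("h[1-6]")) headingDepth++;
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (qName.matches("h[1-6]")) headingDepth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (headingDepth > 0) headings.append(ch, start, length);
    }

    // Returns the concatenated heading text, suitable for indexing
    // into a separate "headings" field alongside the full-text field.
    public static String extract(String xhtml) throws Exception {
        HeadingExtractor handler = new HeadingExtractor();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.headings.toString();
    }
}
```

Index the extracted text into its own field, and heading-only search becomes an ordinary fielded query.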
Re: indexing???
On Aug 16, 2010, at 10:38pm, satya swaroop wrote: hi all, the error i got is ""Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@8210fc"" when i indexed a file similar to the one in https://issues.apache.org/jira/browse/PDFBOX-709/samplerequestform.pdf 1. This URL doesn't work for me. 2. Please include the full stack trace from the RuntimeException. 3. What version of Tika are you using? Thanks, -- Ken -------- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: Best solution to avoiding multiple query requests
Hi Geert-jan, On Aug 4, 2010, at 12:04pm, Geert-Jan Brits wrote: If I understand correctly: you want to sort your collapsed results by 'nr of collapsed results' / hits. It seems this can't be done out-of-the-box using this patch (I'm not entirely sure, at least it doesn't follow from the wiki-page. Perhaps best is to check the jira-issues to make sure this isn't already available now, but just not updated on the wiki) Also I found a blogpost (from the patch creator afaik) with, in the comments, someone with the same issue + some pointers. http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/ Yup, that's the one - http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/comment-page-1/#comment-1249 So with some modifications to that patch, it could work...thanks for the info! -- Ken 2010/8/4 Ken Krugler Hi Geert-Jan, On Aug 4, 2010, at 5:30am, Geert-Jan Brits wrote: Field Collapsing (currently as patch) is exactly what you're looking for imo. http://wiki.apache.org/solr/FieldCollapsing Thanks for the ref, good stuff. I think it's close, but if I understand this correctly, then I could get (using just top two, versus top 10 for simplicity) results that looked like "dog training" (faceted field value A) "super dog" (faceted field value B) but if the actual faceted field value/hit counts were: C (10) D (8) A (2) B (1) Then what I'd want is the top hit for "dog AND facet field:C", followed by "dog AND facet field:D". Using field collapsing would improve the probability that if I asked for the top 100 hits, I'd find entries for each of my top N faceted field values. Thanks again, -- Ken I've got a situation where the key result from an initial search request (let's say for "dog") is the list of values from a faceted field, sorted by hit count. For the top 10 of these faceted field values, I need to get the top hit for the target request ("dog") restricted to that value for the faceted field. 
Currently this is 11 total requests, of which the 10 requests following the initial query can be made in parallel. But that's still a lot of requests. So my questions are: 1. Is there any magic query to handle this with Solr as-is? 2. if not, is the best solution to create my own request handler? 3. And in that case, any input/tips on developing this type of custom request handler? Thanks, -- Ken Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: Best solution to avoiding multiple query requests
Hi Geert-Jan, On Aug 4, 2010, at 5:30am, Geert-Jan Brits wrote: Field Collapsing (currently as patch) is exactly what you're looking for imo. http://wiki.apache.org/solr/FieldCollapsing Thanks for the ref, good stuff. I think it's close, but if I understand this correctly, then I could get (using just top two, versus top 10 for simplicity) results that looked like "dog training" (faceted field value A) "super dog" (faceted field value B) but if the actual faceted field value/hit counts were: C (10) D (8) A (2) B (1) Then what I'd want is the top hit for "dog AND facet field:C", followed by "dog AND facet field:D". Using field collapsing would improve the probability that if I asked for the top 100 hits, I'd find entries for each of my top N faceted field values. Thanks again, -- Ken I've got a situation where the key result from an initial search request (let's say for "dog") is the list of values from a faceted field, sorted by hit count. For the top 10 of these faceted field values, I need to get the top hit for the target request ("dog") restricted to that value for the faceted field. Currently this is 11 total requests, of which the 10 requests following the initial query can be made in parallel. But that's still a lot of requests. So my questions are: 1. Is there any magic query to handle this with Solr as-is? 2. if not, is the best solution to create my own request handler? 3. And in that case, any input/tips on developing this type of custom request handler? Thanks, -- Ken Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Best solution to avoiding multiple query requests
Hi all, I've got a situation where the key result from an initial search request (let's say for "dog") is the list of values from a faceted field, sorted by hit count. For the top 10 of these faceted field values, I need to get the top hit for the target request ("dog") restricted to that value for the faceted field. Currently this is 11 total requests, of which the 10 requests following the initial query can be made in parallel. But that's still a lot of requests. So my questions are: 1. Is there any magic query to handle this with Solr as-is? 2. if not, is the best solution to create my own request handler? 3. And in that case, any input/tips on developing this type of custom request handler? Thanks, -- Ken ------------ Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
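Absent a single magic query, the 11-request scheme above reduces to building one restricted query per top facet value. A toy sketch of the query-string construction — the class and the "category" field name are hypothetical, and real facet values would need query escaping and URL encoding:

```java
import java.util.ArrayList;
import java.util.List;

public class FacetTopHitQueries {
    // Given the facet values from the initial query (already sorted by
    // hit count), build the follow-up queries: the original query
    // restricted to one facet value each, keeping only the top hit.
    public static List<String> build(String query, String facetField,
                                     List<String> topFacetValues) {
        List<String> queries = new ArrayList<>();
        for (String value : topFacetValues) {
            queries.add("q=" + query + "&fq=" + facetField + ":\"" + value + "\"&rows=1");
        }
        return queries;
    }
}
```

The follow-up queries are independent, so they can all be issued in parallel, as the original message notes.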
Re: SolrCore has a large number of SolrIndexSearchers retained in "infoRegistry"
On Jul 27, 2010, at 12:21pm, Chris Hostetter wrote: : : I was wondering if anyone has found any resolution to this email thread? As Grant asked in his reply when this thread was first started (December 2009)... It sounds like you are either using embedded mode or you have some custom code. Are you sure you are releasing your resources correctly? ...there was no response to his question for clarification. the problem, given the info we have to work with, definitely seems to be that the custom code utilizing the SolrCore directly is not releasing the resources that it is using in every case. if you are calling the execute method, that means you have a SolrQueryRequest object -- which means you somehow got an instance of a SolrIndexSearcher (every SolrQueryRequest has one associated with it) and you are somehow not releasing that SolrIndexSearcher (probably because you are not calling close() on your SolrQueryRequest) One thing that bit me previously with using APIs in this area of Solr is that if you call CoreContainer.getCore(), this increments the open count, so you have to balance each getCore() call with a close() call. The naming here could be better - I think it's common to have an expectation that calls to get something don't change any state. Maybe openCore()? -- Ken But it really all depends on how you got hold of that SolrQueryRequest/SolrIndexSearcher pair in the first place ... every method in SolrCore that gives you access to a SolrIndexSearcher is documented very clearly on how to "release" it when you are done with it so the ref count can be decremented. -Hoss -------- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
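The getCore()/close() pairing Hoss and Ken describe is plain reference counting; Solr's RefCounted wrapper behaves roughly like this stripped-down sketch (not Solr's actual class, just the contract: every "get" must be balanced by a "release"):

```java
public class RefCounted {
    private int refCount = 1;   // the creator holds the first reference
    private boolean closed = false;

    // Analogous to CoreContainer.getCore(): take another reference.
    public synchronized RefCounted incref() {
        if (closed) throw new IllegalStateException("already closed");
        refCount++;
        return this;
    }

    // Analogous to close(): release one reference. The underlying
    // resource is freed only when the count reaches zero.
    public synchronized void decref() {
        if (--refCount == 0) {
            closed = true;   // free the real resource here
        }
    }

    public synchronized boolean isClosed() {
        return closed;
    }
}
```

The leak described in the thread is exactly an incref() (or getCore()) with no matching decref()/close(), which keeps the count above zero forever.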
Re: faceted search with job title
Hi Savannah, A few comments below, scattered in-line... -- Ken On Jul 21, 2010, at 3:08pm, Savannah Beckett wrote: And I will have to recompile the dom or sax code each time I add a job board for crawling. A regex pattern is only a string, which can be stored in a text file or db, and retrieved based on the job board. What do you think? You can store the XPath expressions in a text file as strings, and load/compile them as needed. From: "Nagelberg, Kallin" To: "solr-user@lucene.apache.org" Sent: Wed, July 21, 2010 10:39:32 AM Subject: RE: faceted search with job title Yeah you should definitely just setup a custom parser for each site.. should be easy to extract title using groovy's xml parsing along with tagsoup for sloppy html. Definitely yes re using TagSoup to clean up bad HTML. And definitely yes to needing per-site "rules" (typically XPath + optional regex as needed) to extract specific details. For a common class of sites powered by the same back-end, you can often re-use the same general rules, as the markup that you care about is consistent. If you can't find the pattern for each site leading to the job title how can you expect solr to? Humans have the advantage here :P -Kallin Nagelberg -Original Message- From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] Sent: Wednesday, July 21, 2010 12:20 PM To: solr-user@lucene.apache.org Cc: dave.sea...@magicalia.com Subject: Re: faceted search with job title mmm...there must be a better way...each job board has a different format. If there are constantly new job boards being crawled, I don't think I can manually look for a specific sequence of tags that leads to the job title. Most of them don't even have a class or id. There is no guarantee that the job title will be in the title tag, or header tag. Something else can be in the title. Should I do this in a class that extends IndexFilter in Nutch? 
When I do this kind of thing I use Bixo (http://openbixo.org), but that requires knowledge of Cascading (& some Hadoop) in order to construct web mining workflows. From: Dave Searle To: "solr-user@lucene.apache.org" Sent: Wed, July 21, 2010 8:42:55 AM Subject: RE: faceted search with job title You'd probably need to do some post processing on the pages and set up rules for each website to grab that specific bit of data. You could load the html into an xml parser, then use xpath to grab content from a particular tag with a class or id, based on the particular website -Original Message- From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] Sent: 21 July 2010 16:38 To: solr-user@lucene.apache.org Subject: faceted search with job title Hi, I am currently using nutch to crawl some job pages from job boards. They are in my solr index now. I want to do faceted search with the job titles. How? The job titles can be in any locations of the page, e.g. title, header, content... If I use indexfilter in Nutch to search the content for job title, there are hundred of thousands of job titles, I can't hard code them all. Do you have a better idea? I think I need the job title in a separate field in the index to make it work with solr faceted search, am I right? Yes, you'd want a separate "job title" field in the index. Though often the job titles are slight variants on each other, so this would probably work much better if you automatically found common phrases and used those, otherwise you get "Senior Bottlewasher" and "Sr. Bottlewasher" and "Sr Bottlewasher" as separate facet values. Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
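Savannah's plan — store the per-job-board rule as a plain string, then compile and run it — maps directly onto the JDK's XPath API. A sketch; the class name, the //h1 rule, and the markup below are invented examples, and raw crawled pages would need TagSoup/NekoHTML cleanup before this step:

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class TitleRule {
    // An XPath rule stored as a string (e.g. in a text file or DB,
    // keyed by job board) is compiled here; in real use you'd compile
    // once at load time and reuse the expression per page.
    public static String extract(String xpathRule, String xhtml) throws Exception {
        XPathExpression expr = XPathFactory.newInstance().newXPath().compile(xpathRule);
        return expr.evaluate(new InputSource(new StringReader(xhtml)));
    }
}
```

The extracted string would then be written into the separate "job title" field mentioned below, so it can drive faceting.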
Re: Problem building Nightly Solr
On Jul 6, 2010, at 3:44pm, Chris Hostetter wrote: : Can you try "ant compile example"? : After Lucene/Solr merge, solr ant build needs to compile before example : target. the "compile" target is already in the dependency tree for the "example" target, so that won't change anything. At the moment, the "nightly" snapshots produced by hudson only include the "solr" section of the "dev" tree -- not modules or the lucene-java sections. The compiled versions of that code are included, so you can *run* solr from the hudson artifacts, but apparently you can't compile it. (this is particularly odd since the nightlies include all the compiled lucene code as jars in a "lucene-libs/" directory, but the build system doesn't seem to use that directory ... at least not when compiling solrj). This is all a side effect of trunk still being somewhat in transition -- there are kinks in dealing with the artifacts of the nightly build process that still need to be worked out -- but if your goal is to compile things yourself, then you might as well just check out the entire trunk and compile from that anyway. Note that you'll need to "ant compile" from the top of the lucene directory first, before trying any of the solr-specific builds from inside of the /solr sub-dir. Or at least that's what I ran into when trying to build a solr dist recently. -- Ken Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: document level security: indexing/searching techniques
On Jul 6, 2010, at 8:27am, osocurious2 wrote: Someone else was recently asking a similar question (or maybe it was you but worded differently :) ). Putting user-level security at a document level seems like a recipe for pain. Solr/Lucene don't do frequent updates well...and being highly optimized for query, I don't blame them. Is there any way to create a series of roles that you can apply to your documents? If the security level of the documents isn't changing, just the user access to them, give the docs a role in the index, put your user/usergroup stuff in a DB or some other system and resolve your user into valid roles, then FilterQuery on role. You're right, baking in too fine-grained a level of security information is a bad idea. As one example that worked pretty well for code search with Krugle, we set access control on a per-project level using LDAP groups - i.e. each project had some number of groups that were granted access rights. Each file in the project would inherit the same list of groups. Then, when a user logs in they get authenticated via LDAP, and the LDAP server returns the set of groups they belong to. This then becomes a fairly well-bounded list of "terms" for an OR query against the "acl-groups" field in each file/project document. Just don't forget to set the boost to 0 for that portion of the query :) -- Ken -------- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
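The group-based ACL query Ken describes boils down to OR-ing the user's LDAP groups against the acl-groups field. A toy clause builder — the class name is invented, group names would need escaping in practice, and the ^0 boost (per Ken's note) keeps the ACL clause from influencing relevance when it is part of the main query:

```java
import java.util.List;

public class AclQuery {
    // Build the access-control clause from the groups the LDAP server
    // returned at login. "acl-groups" is the field name from the email.
    public static String aclClause(List<String> groups) {
        return "acl-groups:(" + String.join(" OR ", groups) + ")^0";
    }
}
```

Alternatively, attaching the clause as a filter query (fq) sidesteps scoring entirely, which also lets Solr cache the per-group filter.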
Re: IOException: read past EOF when opening index built directly w/Lucene
On Jul 1, 2010, at 1:03pm, Ken Krugler wrote: I've got a version 2.3 index that appears to be valid - I can open it with Luke 1.0.1, and CheckIndex reports no problem. [snip] and Luke overview says: This time as text: Index version: 12984d2211c Index format: -4 (Lucene 2.3) Index functionality: lock-less, single norms file, shared doc store Currently opened commit point: segments_2 -- Ken ---- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
IOException: read past EOF when opening index built directly w/Lucene
I've got a version 2.3 index that appears to be valid - I can open it with Luke 1.0.1, and CheckIndex reports no problem. Just for grins, I crafted a matching schema, and tried to use the index with Solr 1.4 (and also Solr-trunk). In either case, I get this exception during startup: SEVERE: java.lang.RuntimeException: java.io.IOException: read past EOF at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1067) at org.apache.solr.core.SolrCore.<init>(SolrCore.java:582) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:431) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:286) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:125) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:86) ... Caused by: java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39) at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:40) at org.apache.lucene.store.DataInput.readInt(DataInput.java:76) at org.apache.lucene.index.SegmentInfo.<init>(SegmentInfo.java:171) at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:230) at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:91) at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:649) at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:87) at org.apache.lucene.index.IndexReader.open(IndexReader.java:415) at org.apache.lucene.index.IndexReader.open(IndexReader.java:294) at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1056) ... 
30 more and at the end of the startup logging, it says: Jul 1, 2010 12:51:25 PM org.apache.solr.core.SolrCore finalize SEVERE: REFCOUNT ERROR: unreferenced org.apache.solr.core.solrc...@4513e9fd () has a reference count of 1 Is what I'm trying to do something that's destined to fail? I would have expected schema/index miss-matches to show up later, not right when the index is being opened. I'd seen various posts about this type of error due to corrupt indexes, or having a buggy version of Java 1.6, or an obscure Lucene bug (https://issues.apache.org/jira/browse/SOLR-1778 and https://issues.apache.org/jira/browse/LUCENE-2270) , but none of those seem to apply to my situation. Thanks, -- Ken PS - index dir looks like: 249K Jun 29 13:47 _0.fdt 12K Jun 29 13:47 _0.fdx 159B Jun 29 13:47 _0.fnm 3.6M Jun 29 13:47 _0.frq 23K Jun 29 13:47 _0.nrm 10M Jun 29 13:47 _0.prx 51K Jun 29 13:47 _0.tii 2.9M Jun 29 13:47 _0.tis 20B Jun 29 13:47 segments.gen 45B Jun 29 13:47 segments_2 and Luke overview says: -------- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: SolrJ/EmbeddedSolrServer
Sort of answering my own question here too... It seems like I need to get the current core, and use that to instantiate a new SolrCore with the same exact config, other than the dataDir. The documentation for SolrCore()'s constructor says: "If a core with the same name already exists, it will be stopped and replaced by this one" But it's unclear to me whether this will do a graceful swap (like what I want) or a hard shutdown of the old core. Thanks, -- Ken On May 22, 2010, at 11:25am, Ryan McKinley wrote: accidentally hit send... Each core can have the dataDir set explicitly. If you want to do this with solrj, you would need to manipulate the CoreDescriptor objects. I'm hoping somebody can clarify what's up with the CoreDescriptor class, since there's not much documentation. As far as I can tell, when you create a new SolrCore, it saves off the CoreDescriptor you pass in, and does nothing with it. The constructor for SolrCore also takes a datadir param, so I don't see how the CoreDescriptor's dataDir gets used during construction. And changing the CoreDescriptor's dataDir has no effect, since it's essentially a POJO. So how would one go about changing the dataDir for a core, in a multi-core setup? Thanks, -- Ken On Sat, May 22, 2010 at 2:24 PM, Ryan McKinley wrote: Check: http://wiki.apache.org/solr/CoreAdmin Unless I'm missing something, I think you should be able to sort what you need On Fri, May 21, 2010 at 7:55 PM, Ken Krugler wrote: I've got a situation where my data directory (a) needs to live elsewhere besides inside of Solr home, (b) moves to a different location when updating indexes, and (c) setting up a symlink from /data isn't a great option. So what's the best approach to making this work with SolrJ? 
The low-level solution seems to be - create my own SolrCore instance, where I specify the data directory - use that to update the CoreContainer - create a new EmbeddedSolrServer But recreating the EmbeddedSolrServer with each index update feels wrong, and I'd like to avoid mucking around with low-level SolrCore instantiation. Any other approaches? Thanks, -- Ken Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: SolrJ/EmbeddedSolrServer
Answering my own question... I can use CoreContainer.reload("core name") to force a reload. I assume that if I've got an EmbeddedSolrServer running at the time I do this reload, everything will happen correctly under the covers. So now I just need to find out how to programmatically change settings for a core. -- Ken On May 22, 2010, at 11:24am, Ryan McKinley wrote: Check: http://wiki.apache.org/solr/CoreAdmin Unless I'm missing something, I think you should be able to sort what you need If I'm using SolrJ, is there a programmatic way to force a reload of a core? This, of course, assumes that I'm able to programmatically change the location of the dataDir, which is another issue. Thanks, -- Ken On Fri, May 21, 2010 at 7:55 PM, Ken Krugler wrote: I've got a situation where my data directory (a) needs to live elsewhere besides inside of Solr home, (b) moves to a different location when updating indexes, and (c) setting up a symlink from /data isn't a great option. So what's the best approach to making this work with SolrJ? The low-level solution seems to be - create my own SolrCore instance, where I specify the data directory - use that to update the CoreContainer - create a new EmbeddedSolrServer But recreating the EmbeddedSolrServer with each index update feels wrong, and I'd like to avoid mucking around with low-level SolrCore instantiation. Any other approaches? Thanks, -- Ken ---- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g ---- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: SolrJ/EmbeddedSolrServer
On May 22, 2010, at 11:24am, Ryan McKinley wrote:

> Check: http://wiki.apache.org/solr/CoreAdmin
> Unless I'm missing something, I think you should be able to sort what you need.

>> If I'm using SolrJ, is there a programmatic way to force a reload of a core? This, of course, assumes that I'm able to programmatically change the location of the dataDir, which is another issue.
>>
>> Thanks,
>> -- Ken
>>
>> On Fri, May 21, 2010 at 7:55 PM, Ken Krugler wrote:
>>
>>> I've got a situation where my data directory (a) needs to live elsewhere besides inside of Solr home, (b) moves to a different location when updating indexes, and (c) setting up a symlink from /data isn't a great option.
>>>
>>> So what's the best approach to making this work with SolrJ? The low-level solution seems to be:
>>>
>>> - create my own SolrCore instance, where I specify the data directory
>>> - use that to update the CoreContainer
>>> - create a new EmbeddedSolrServer
>>>
>>> But recreating the EmbeddedSolrServer with each index update feels wrong, and I'd like to avoid mucking around with low-level SolrCore instantiation. Any other approaches?
>>>
>>> Thanks,
>>> -- Ken

------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Re: SolrJ/EmbeddedSolrServer
On May 22, 2010, at 11:25am, Ryan McKinley wrote:

> accidentally hit send...
>
> Each core can have the dataDir set explicitly. If you want to do this with solrj, you would need to manipulate the CoreDescriptor objects.

I'm hoping somebody can clarify what's up with the CoreDescriptor class, since there's not much documentation. As far as I can tell, when you create a new SolrCore, it saves off the CoreDescriptor you pass in, and does nothing with it. The constructor for SolrCore also takes a dataDir param, so I don't see how the CoreDescriptor's dataDir gets used during construction. And changing the CoreDescriptor's dataDir has no effect, since it's essentially a POJO.

So how would one go about changing the dataDir for a core, in a multi-core setup?

Thanks,

-- Ken

> On Sat, May 22, 2010 at 2:24 PM, Ryan McKinley wrote:
>
>> Check: http://wiki.apache.org/solr/CoreAdmin
>> Unless I'm missing something, I think you should be able to sort what you need.
>>
>> On Fri, May 21, 2010 at 7:55 PM, Ken Krugler wrote:
>>
>>> I've got a situation where my data directory (a) needs to live elsewhere besides inside of Solr home, (b) moves to a different location when updating indexes, and (c) setting up a symlink from /data isn't a great option.
>>>
>>> So what's the best approach to making this work with SolrJ? The low-level solution seems to be:
>>>
>>> - create my own SolrCore instance, where I specify the data directory
>>> - use that to update the CoreContainer
>>> - create a new EmbeddedSolrServer
>>>
>>> But recreating the EmbeddedSolrServer with each index update feels wrong, and I'd like to avoid mucking around with low-level SolrCore instantiation. Any other approaches?
>>>
>>> Thanks,
>>> -- Ken

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
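One static alternative, if the dataDir doesn't need to change at runtime: the multi-core solr.xml format lets each core declare an explicit dataDir (in Solr 1.4 and later, I believe), overriding the default <instanceDir>/data location. A sketch - core names and paths here are made up:

```xml
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- dataDir points this core at an index outside its instanceDir -->
    <core name="core0" instanceDir="core0" dataDir="/mnt/indexes/core0/data"/>
  </cores>
</solr>
```

This doesn't solve the "moves on every index update" part, but it does cover the "lives outside Solr home" part without symlinks.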
Re: [ANN] Solr 1.4.1 Released
On Jun 26, 2010, at 5:18pm, Jason Chaffee wrote:

> It appears the 1.4.1 version was deployed with a new maven groupId. For example, if you are trying to download solr-core, here are the differences between 1.4.0 and 1.4.1.
>
> 1.4.0 - groupId: org.apache.solr, artifactId: solr-core
> 1.4.1 - groupId: org.apache.solr.solr, artifactId: solr-core
>
> Was this change intentional or a mistake? If it was a mistake, can someone please fix it in maven's central repository?

I believe it was a mistake. From a recent email thread on this list, Mark Miller said:

> Can a solr/maven dude look at this? I simply used the copy command on the release to-do wiki (sounds like it should be updated). If no one steps up, I'll try and straighten it out later.
>
> On 6/25/10 10:28 AM, Stevo Slavić wrote:
>
>> Congrats on the release! Something seems to be wrong with solr 1.4.1 maven artifacts, there is an extra "solr" in the path. E.g. solr-parent-1.4.1.pom is at http://repo1.maven.org/maven2/org/apache/solr/solr/solr-parent/1.4.1/solr-parent-1.4.1.pom while it should be at http://repo1.maven.org/maven2/org/apache/solr/solr-parent/1.4.1/solr-parent-1.4.1.pom . Pom's seem to contain correct maven artifact coordinates.
>>
>> Regards, Stevo.

-- Ken

------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
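For reference, the intended coordinates (which, per Stevo's note, the 1.4.1 POMs themselves declare - only the repository path is wrong) would look like this in a consuming pom.xml:

```xml
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>1.4.1</version>
</dependency>
```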
Some minor Solritas layout tweaks
I grabbed the latest & greatest from trunk, and then had to make a few minor layout tweaks.

1. In main.css, the ".query-box input" height isn't tall enough (at least on my Mac 10.5/FF 3.6 config), so character descenders get clipped. I bumped it from 40px to 50px, and that fixed the issue for me.

2. The constraint text (for removing facet constraints) overlaps with the Solr logo. It looks like the div that contains this anchor text is missing a class="constraints", as I see a .constraints in the CSS. I added this class name, and also (to main.css): .constraints { margin-top: 10px; } But IANAWD, so this is probably not the best way to fix the issue.

3. And then I see a .constraints-title in the CSS, but it's not used. Was the intent of this to set the '>' character to gray?

4. It seems silly to open JIRA issues for these types of things, but I also don't want to add to noise on the list. Which approach is preferred?

Thanks,

-- Ken

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Re: Minor bug with Solritas and price data
Hi Hoss,

You're the man. I'd copied/pasted the 1.3 schema fields into my testbed schema, which was based on the version of Solr we were using back in the Dark Ages, when the version was 1.0 (and there was no such handy comment warning about changing the version :)). So fields were multiValued by default. I'd checked the docs for 1.3, but didn't realize that this behavior had changed from the 1.0 days.

Mystery solved. If I'd used http://localhost:8983/solr/admin/schema.jsp to examine the field type, I would have seen this.

Thanks again,

-- Ken

On Jun 21, 2010, at 11:36am, Chris Hostetter wrote:

> : Here's what's in my schema:
> : [field definitions stripped by the mail archive]
> :
> : Which is exactly what was in the original example schema.
>
> but what does the "version" property of your schema say (at the top)? this is what's in the example...
>
> -Hoss

----
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
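For anyone hitting the same thing: the behavior is governed by the version attribute on the <schema> element in schema.xml. With version="1.0", fields were multiValued by default; from version 1.1 on, the multiValued attribute defaults to false and must be requested explicitly. A fragment to illustrate (field names are just examples):

```xml
<!-- schema.xml: declare the schema version explicitly so field
     defaults (like multiValued) are the ones you expect. -->
<schema name="example" version="1.2">
  <fields>
    <field name="price" type="float" indexed="true" stored="true"/>
    <field name="features" type="text" indexed="true" stored="true" multiValued="true"/>
  </fields>
</schema>
```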
Re: Minor bug with Solritas and price data
[XML response snippet, tags stripped by the archive:]

  26
  2.29
  KAS Rugs Indira Black Circles Rug
  ...

Any other ideas what might be going on?

Thanks,

-- Ken

On Jun 19, 2010, at 9:12 PM, Ken Krugler wrote:

> I noticed that my prices weren't showing up, even though I've got a price field. I think the issue is with this line from hit.vm:
>
> #field('name') $!number.currency($doc.getFieldValue('price'))
>
> The number.currency() function needs to get passed something that looks like a number, but $doc.getFieldValue() will return "[2.96]", because it could be a list of values. The square brackets confuse number.currency, so you get no price. I think this line needs to be:
>
> #field('name') $!number.currency($doc.getFirstValue('price'))
>
> ...since getFirstValue() returns a single value without brackets.
>
> -- Ken

------------
<http://ken-blog.krugler.org>
+1 530-265-2225

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Minor bug with Solritas and price data
I noticed that my prices weren't showing up, even though I've got a price field. I think the issue is with this line from hit.vm:

#field('name') $!number.currency($doc.getFieldValue('price'))

The number.currency() function needs to get passed something that looks like a number, but $doc.getFieldValue() will return "[2.96]", because it could be a list of values. The square brackets confuse number.currency, so you get no price. I think this line needs to be:

#field('name') $!number.currency($doc.getFirstValue('price'))

...since getFirstValue() returns a single value without brackets.

-- Ken

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Minor bug in Solritas with post-facet search
I ran into one minor problem, where if I clicked a facet and then tried a search, I'd get a 404 error. I think the problem is with the fqs Velocity macro in VM_global_library.vm, where it's missing the #else to insert a '?' into the URL:

#macro(fqs $p)#foreach($fq in $p)#if($velocityCount>1)&#{else}?#{end}fq=$esc.url($fq)#end#end

Without this, the URL becomes /solr/browsefq= instead of /solr/browse?fq=. But I'm completely new to the world of Velocity templating, so I've got low confidence that this is the right way to fix it.

-- Ken

--------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
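For readers not fluent in Velocity: the macro's job is just "first fq gets '?', each subsequent one gets '&'". The same logic in plain Python, as a sanity check of the fix (using the stdlib in place of $esc.url):

```python
from urllib.parse import quote

def fq_query_string(fqs):
    # '?' before the first fq parameter, '&' before each subsequent one --
    # this is the #else branch that was missing from the original macro.
    parts = []
    for i, fq in enumerate(fqs):
        parts.append(("&" if i > 0 else "?") + "fq=" + quote(fq))
    return "".join(parts)
```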
Re: Autocompletion with Solritas
Hi Erik,

On Jun 18, 2010, at 6:58pm, Erik Hatcher wrote:

> Have a look at suggest.vm - the "name" field is used in there too. Just those two places, layout.vm and suggest.vm.

That was the missing change I needed. Thanks much!

-- Ken

> And I had already added a ## TODO in my local suggest.vm:
>
>   ## TODO: make this more generic, maybe look at the request terms.fl? or just take the first terms field in the response?
>
> And also, ideally, there'd be a /suggest handler mapped with the field name specified there. I simply used what was already available to put suggest in there easily.
>
> Erik
>
> On Jun 18, 2010, at 7:54 PM, Ken Krugler wrote:
>
>> Hi Erik,
>>
>> On Jun 17, 2010, at 8:34pm, Erik Hatcher wrote:
>>
>>> Your wish is my command. Check out trunk, fire up Solr (ant run-example), index example data, hit http://localhost:8983/solr/browse - type in search box. Just used jQuery's autocomplete plugin and the terms component for now, on the name field. Quite simple to plug in, actually. Check the commit diff. The main magic is doing this:
>>>
>>> <http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=i&terms.sort=count&wt=velocity&v.template=suggest>
>>>
>>> Stupidly, though, jQuery's autocomplete seems to be hardcoded to send a q parameter, but I coded it to also send the same value as terms.prefix - but this could be an issue if hitting a different request handler where q is used for the actual query for filtering terms on.
>>
>> Let's say, just for grins, that a different field (besides "name") is being used for autocompletion. What would be all the places I'd need to hit to change the field, besides the terms.fl value in layout.vm? For example, what about browse.vm:
>>
>>   $("input[type=text]").autoSuggest("/solr/suggest", {selectedItemProp: "name", searchObjProps: "name"}});
>>
>> I'm asking because I'm trying to use this latest support with an index that uses "product_name" for the auto-complete field, and I'm not getting any auto-completes happening.
>>
>> I see from the Solr logs that requests being made to /solr/terms during auto-complete look like:
>>
>>   INFO: [] webapp=/solr path=/terms params={limit=10&timestamp=1276903135595&terms.fl=product_name&q=rug&wt=velocity&terms.sort=count&v.template=suggest&terms.prefix=rug} status=0 QTime=0
>>
>> Which I'd expect to work, but doesn't seem to generate any results. What's odd is that if I try curling the same thing:
>>
>>   curl -v "http://localhost:8983/solr/terms?limit=10&timestamp=1276903135595&terms.fl=product_name&q=rug&wt=velocity&terms.sort=count&v.template=suggest&terms.prefix=rug"
>>
>> I get an empty HTML response:
>>
>>   < Content-Type: text/html; charset=utf-8
>>   < Content-Length: 0
>>   < Server: Jetty(6.1.22)
>>
>> If I just use what I'd consider to be the minimum set of parameters:
>>
>>   curl -v "http://localhost:8983/solr/terms?limit=10&terms.fl=product_name&q=rug&terms.sort=count&terms.prefix=rug"
>>
>> Then I get the expected XML response:
>>
>>   < Content-Type: text/xml; charset=utf-8
>>   < Content-Length: 225
>>   < Server: Jetty(6.1.22)
>>
>>   [response body mangled by the archive: status 0, QTime 0, and a terms list for product_name showing a count of 7]
>>
>> Any ideas what I'm doing wrong?
>>
>> Thanks,
>> -- Ken
>>
>> On Jun 17, 2010, at 8:03 PM, Ken Krugler wrote:
>>
>>> I don't believe Solritas supports autocompletion out of the box. So I'm wondering if anybody has experience using the LucidWorks distro & Solritas, plus the AJAX Solr auto-complete widget. I realize that AJAX Solr's autocomplete support is mostly just leveraging the jQuery Autocomplete plugin, and hooking it up to Solr facets, but I was curious if there were any tricks or traps in getting it all to work.
>>>
>>> Thanks,
>>> -- Ken

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Re: Autocompletion with Solritas
Hi Erik,

On Jun 17, 2010, at 8:34pm, Erik Hatcher wrote:

> Your wish is my command. Check out trunk, fire up Solr (ant run-example), index example data, hit http://localhost:8983/solr/browse - type in search box. Just used jQuery's autocomplete plugin and the terms component for now, on the name field. Quite simple to plug in, actually. Check the commit diff. The main magic is doing this:
>
> <http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=i&terms.sort=count&wt=velocity&v.template=suggest>
>
> Stupidly, though, jQuery's autocomplete seems to be hardcoded to send a q parameter, but I coded it to also send the same value as terms.prefix - but this could be an issue if hitting a different request handler where q is used for the actual query for filtering terms on.

Let's say, just for grins, that a different field (besides "name") is being used for autocompletion. What would be all the places I'd need to hit to change the field, besides the terms.fl value in layout.vm? For example, what about browse.vm:

  $("input[type=text]").autoSuggest("/solr/suggest", {selectedItemProp: "name", searchObjProps: "name"}});

I'm asking because I'm trying to use this latest support with an index that uses "product_name" for the auto-complete field, and I'm not getting any auto-completes happening.

I see from the Solr logs that requests being made to /solr/terms during auto-complete look like:

  INFO: [] webapp=/solr path=/terms params={limit=10&timestamp=1276903135595&terms.fl=product_name&q=rug&wt=velocity&terms.sort=count&v.template=suggest&terms.prefix=rug} status=0 QTime=0

Which I'd expect to work, but doesn't seem to generate any results. What's odd is that if I try curling the same thing:

  curl -v "http://localhost:8983/solr/terms?limit=10&timestamp=1276903135595&terms.fl=product_name&q=rug&wt=velocity&terms.sort=count&v.template=suggest&terms.prefix=rug"

I get an empty HTML response:

  < Content-Type: text/html; charset=utf-8
  < Content-Length: 0
  < Server: Jetty(6.1.22)

If I just use what I'd consider to be the minimum set of parameters:

  curl -v "http://localhost:8983/solr/terms?limit=10&terms.fl=product_name&q=rug&terms.sort=count&terms.prefix=rug"

Then I get the expected XML response:

  < Content-Type: text/xml; charset=utf-8
  < Content-Length: 225
  < Server: Jetty(6.1.22)

  [response body mangled by the archive: status 0, QTime 0, and a terms list for product_name showing a count of 7]

Any ideas what I'm doing wrong?

Thanks,

-- Ken

On Jun 17, 2010, at 8:03 PM, Ken Krugler wrote:

> I don't believe Solritas supports autocompletion out of the box. So I'm wondering if anybody has experience using the LucidWorks distro & Solritas, plus the AJAX Solr auto-complete widget. I realize that AJAX Solr's autocomplete support is mostly just leveraging the jQuery Autocomplete plugin, and hooking it up to Solr facets, but I was curious if there were any tricks or traps in getting it all to work.
>
> Thanks,
> -- Ken

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Re: Autocompletion with Solritas
On Jun 17, 2010, at 8:34pm, Erik Hatcher wrote:

> Your wish is my command. Check out trunk, fire up Solr (ant run-example), index example data, hit http://localhost:8983/solr/browse - type in search box.

That works - excellent! Now I'm trying to build a distribution from trunk that I can use for prototyping, and noticed a few things...

1. From a fresh check-out, you can't build from the trunk/solr sub-dir due to dependencies on Lucene classes. Once you've done a top-level "ant compile", you can cd into /solr and do ant builds.

2. I noticed the run-example target in trunk/solr/build.xml doesn't have a description, so it doesn't show up with ant -p.

3. I tried "ant create-package" from trunk/solr, and got this error near the end:

   /Users/kenkrugler/svn/lucene/lucene-trunk/solr/common-build.xml:252: /Users/kenkrugler/svn/lucene/lucene-trunk/solr/contrib/velocity/src not found.

   I don't see contrib/velocity anywhere in the Lucene trunk tree.

What's the recommended way to build a Solr distribution from trunk? In the meantime I'll just use example/start.jar with the solr.solr.home and solr.data.dir system properties.

Thanks,

-- Ken

> Just used jQuery's autocomplete plugin and the terms component for now, on the name field. Quite simple to plug in, actually. Check the commit diff. The main magic is doing this:
>
> <http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=i&terms.sort=count&wt=velocity&v.template=suggest>
>
> Stupidly, though, jQuery's autocomplete seems to be hardcoded to send a q parameter, but I coded it to also send the same value as terms.prefix - but this could be an issue if hitting a different request handler where q is used for the actual query for filtering terms on.
>
> Cool?! I think so! :)
>
> Erik
>
> On Jun 17, 2010, at 8:03 PM, Ken Krugler wrote:
>
>> I don't believe Solritas supports autocompletion out of the box. So I'm wondering if anybody has experience using the LucidWorks distro & Solritas, plus the AJAX Solr auto-complete widget. I realize that AJAX Solr's autocomplete support is mostly just leveraging the jQuery Autocomplete plugin, and hooking it up to Solr facets, but I was curious if there were any tricks or traps in getting it all to work.
>>
>> Thanks,
>> -- Ken

--------
<http://ken-blog.krugler.org>
+1 530-265-2225

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Re: Autocompletion with Solritas
You, sir, are on my Christmas card list. I'll fire it up tomorrow morning & let you know how it goes.

-- Ken

On Jun 17, 2010, at 8:34pm, Erik Hatcher wrote:

> Your wish is my command. Check out trunk, fire up Solr (ant run-example), index example data, hit http://localhost:8983/solr/browse - type in search box. Just used jQuery's autocomplete plugin and the terms component for now, on the name field. Quite simple to plug in, actually. Check the commit diff. The main magic is doing this:
>
> <http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=i&terms.sort=count&wt=velocity&v.template=suggest>
>
> Stupidly, though, jQuery's autocomplete seems to be hardcoded to send a q parameter, but I coded it to also send the same value as terms.prefix - but this could be an issue if hitting a different request handler where q is used for the actual query for filtering terms on.
>
> Cool?! I think so! :)
>
> Erik
>
> On Jun 17, 2010, at 8:03 PM, Ken Krugler wrote:
>
>> I don't believe Solritas supports autocompletion out of the box. So I'm wondering if anybody has experience using the LucidWorks distro & Solritas, plus the AJAX Solr auto-complete widget. I realize that AJAX Solr's autocomplete support is mostly just leveraging the jQuery Autocomplete plugin, and hooking it up to Solr facets, but I was curious if there were any tricks or traps in getting it all to work.
>>
>> Thanks,
>> -- Ken

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Autocompletion with Solritas
I don't believe Solritas supports autocompletion out of the box. So I'm wondering if anybody has experience using the LucidWorks distro & Solritas, plus the AJAX Solr auto-complete widget.

I realize that AJAX Solr's autocomplete support is mostly just leveraging the jQuery Autocomplete plugin, and hooking it up to Solr facets, but I was curious if there were any tricks or traps in getting it all to work.

Thanks,

-- Ken

------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Re: Need help on Solr Cell usage with specific Tika parser
Hi Olivier,

Are you setting the mime type explicitly via the stream.type parameter?

-- Ken

On Jun 14, 2010, at 9:14am, olivier sallou wrote:

> Hi,
> I use Solr Cell to send specific content files. I developed a dedicated parser for specific mime types. However I cannot get Solr accepting my new mime types.
>
> In solrconfig.xml, in the update/extract requestHandler, I specified <str name="tika.config">./tika-config.xml</str>, where tika-config.xml is in the conf directory (same as solrconfig). In tika-config.xml I added my mime types: biosequence/document, biosequence/embl, biosequence/genbank.
>
> I do not know whether the path to the tika mimetypes file should be absolute or relative... and even whether this file needs to be redefined if "magic" is not used.
>
> When I run my update/extract request, I get an error that "biosequence/document" does not match any known parser.
>
> Thanks,
> Olivier

--------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
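If it helps: in the Tika config format of that era, a custom parser registration looked roughly like the fragment below - the parser name and class here are made up, and the exact element layout may differ by Tika version, so treat this as a sketch rather than a reference:

```xml
<properties>
  <parsers>
    <!-- Hypothetical custom parser mapped to the custom mime types -->
    <parser name="parse-biosequence" class="com.example.tika.BioSequenceParser">
      <mime>biosequence/document</mime>
      <mime>biosequence/embl</mime>
      <mime>biosequence/genbank</mime>
    </parser>
  </parsers>
</properties>
```

Note that registering the parser is separate from mime-type detection: if Solr Cell never detects (or is never told, via stream.type) that the stream is biosequence/embl, the custom parser won't be selected.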
Re: Tika language extraction
Hi Sandhya,

> It is observed that TIKA does not extract the "Content-Language" for documents encoded in UTF-8. For natively encoded documents, it works fine. Any idea on how we can resolve this?

I would post this question to the u...@tika.apache.org mailing list, and include more details on what type of document. The Tika language detection is fairly weak, and when the encoding is universal (language independent), such as UTF-8, the resulting confidence level is often low enough that Tika doesn't assume it has a good match, and thus doesn't report a language.

-- Ken

--------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Re: Indexing HTML
On Jun 9, 2010, at 8:38pm, Blargy wrote:

> What is the preferred way to index html using DIH (my html is stored in a blob field in our database)? I know there is the built-in HTMLStripTransformer but that doesn't seem to work well with malformed/incomplete HTML. I've created a custom transformer to first tidy up the html using JTidy, then I pass it to the HTMLStripTransformer like so: [transformer configuration stripped by the archive]. However this method isn't fool-proof, as you can see by my ignoreErrors option.
>
> I quickly took a peek at Tika and I noticed that it has its own HtmlParser. Is this something I should look into? Are there any alternatives that deal with malformed/incomplete html?
>
> Thanks

Actually the Tika HtmlParser just wraps TagSoup - that's a good option for cleaning up busted HTML.

-- Ken

<http://ken-blog.krugler.org>
+1 530-265-2225

------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
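To make concrete what "extract only text" means before content hits the index, here is a bare-bones version using only Python's stdlib HTML parser, skipping script/style content. This is only a sketch of the idea - the real value of TagSoup (and JTidy) is that they also repair malformed markup, which this does not:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collect character data, ignoring anything inside <script> or <style>.
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def strip_html(html):
    extractor = TextExtractor()
    extractor.feed(html)
    # Normalize whitespace so the index sees clean tokens.
    return " ".join(" ".join(extractor.parts).split())
```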
Re: Build query programmatically with lucene, but issue to solr?
On May 28, 2010, at 9:23am, Phillip Rhodes wrote:

> Hi. I am building up a query with quite a bit of logic such as parentheses, plus signs, etc... and it's a little tedious dealing with it all at a string level. I was wondering if anyone has any thoughts on constructing the query in lucene and using the string representation of the query to send to solr.

Depending on complexity, SolrJ could be a solution. See the section that talks about "SolrJ provides a APIs to create queries instead of hand coding the query..." on http://wiki.apache.org/solr/Solrj

-- Ken

------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
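If the query does end up being assembled at the string level, one way to make it less tedious is to centralize the parentheses and plus signs in a single helper. A toy sketch of the idea - field names here are made up, and SolrJ's SolrQuery or Lucene's BooleanQuery.toString() are the heavier-duty equivalents:

```python
def to_query_string(clauses):
    # clauses: (field, value, required) tuples -> Lucene-syntax query string.
    # '+' marks a required clause; parentheses group multi-term values.
    parts = []
    for field, value, required in clauses:
        parts.append(("+" if required else "") + field + ":(" + value + ")")
    return " ".join(parts)
```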
SolrJ/EmbeddedSolrServer
I've got a situation where my data directory (a) needs to live elsewhere besides inside of Solr home, (b) moves to a different location when updating indexes, and (c) setting up a symlink from /data isn't a great option.

So what's the best approach to making this work with SolrJ? The low-level solution seems to be:

- create my own SolrCore instance, where I specify the data directory
- use that to update the CoreContainer
- create a new EmbeddedSolrServer

But recreating the EmbeddedSolrServer with each index update feels wrong, and I'd like to avoid mucking around with low-level SolrCore instantiation. Any other approaches?

Thanks,

-- Ken

--------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
"Special Circumstances" for embedded Solr
Hi all,

We'd started using embedded Solr back in 2007, via a patched version of the in-progress 1.3 code base. I recently was reading http://wiki.apache.org/solr/EmbeddedSolr, and wondered about the paragraph that said:

> The simplest, safest, way to use Solr is via Solr's standard HTTP interfaces. Embedding Solr is less flexible, harder to support, not as well tested, and should be reserved for special circumstances.

Given the current state of SolrJ, and the expected roadmap for Solr in general, what would be some guidelines for "special circumstances" that warrant the use of SolrJ? I know what ours were back in 2007 - namely:

- we had multiple indexes, but didn't want to run multiple webapps (now handled by multi-core)
- we needed efficient generation of updated indexes, without generating lots of HTTP traffic (now handled by DIH, maybe with specific extensions?)
- we wanted tighter coupling of the front-end API with the back-end Solr search system, since this was an integrated system in the hands of customers - no "just restart the webapp container" option if anything got wedged (might still be an issue?)

Any other commonly compelling reasons to use SolrJ?

Thanks,

-- Ken

------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Re: Personalized Search
On May 19, 2010, at 11:43pm, Rih wrote:

> Has anybody done personalized search with Solr? I'm thinking of including fields such as "bought" or "like" per member/visitor via dynamic fields to a product search schema. Another option is to have a multi-value field that can contain user IDs. What are the possible performance issues with this setup?

Mitch is right - what you're looking for here is a recommendation engine, if I understand your question properly. And yes, Mahout should work, though the Taste recommendation engine it supports is pretty new. But Sean Owen & Robin Anil have a "Mahout in Action" book that's in early release via Manning, and it has lots of good information about Mahout & recommender systems.

Assuming you have a list of recommendations for a given user, based on their past behavior and the recommendation engine, then you could use this to adjust search results. I'm waiting for Hoss to jump in here on how best to handle that :)

-- Ken

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Re: How to query for similar documents before indexing
Hi all (especially Yonik),

At the http://wiki.apache.org/solr/Deduplication page, it mentions "duplicate field collapsing" and later "Allow for both duplicate collapsing in search results...". But I don't see any mention of how deduplication happens during search time. Normally this requires that the field be stored (not just indexed), and for efficiency it might need to be in a FieldCache. I'm wondering about both the status of this support, and thoughts on potential impact to index/memory size.

Thanks,

-- Ken

On May 10, 2010, at 3:07pm, Markus Jelsma wrote:

> Hi Matthieu,
>
> On the top of the wiki page you can see it's in 1.4 already. As far as I know the API doesn't return information on found duplicates in its response header; the wiki isn't clear on that subject. I, at least, never saw any other response than an error or the usual status code and QTime. Perhaps it would be a nice feature. On the other hand, you can also have a manual process that finds duplicates based on that signature and gather that information yourself, as long as such a feature isn't there.
>
> Cheers
>
> -----Original message-----
> From: Matthieu Labour
> Sent: Mon 10-05-2010 23:30
> To: solr-user@lucene.apache.org
> Subject: RE: How to query for similar documents before indexing
>
> Markus,
> Thank you for your response. That would be great if the index has the option to prevent duplicates from entering the index. But is it going to be a silent action? Or will the add method return that it failed indexing because it detected a duplicate? Is it committed to 1.4 already?
> Cheers,
> matt
>
> --- On Mon, 5/10/10, Markus Jelsma wrote:
>
>> Hi,
>> Deduplication [1] is what you're looking for. It can utilize different analyzers that will add one or more signatures or hashes to your document, depending on exact or partial matches for configurable fields. Based on that, it should be able to prevent new documents from entering the index. The first part works very well, but I have some issues with removing those documents, on which I also need to check with the community tomorrow back at work ;-)
>>
>> [1]: http://wiki.apache.org/solr/Deduplication
>>
>> Cheers
>>
>> -----Original message-----
>> From: Matthieu Labour
>> Sent: Mon 10-05-2010 22:41
>> To: solr-user@lucene.apache.org
>> Subject: How to query for similar documents before indexing
>>
>> Hi,
>> I want to implement the following logic: before I index a new document into the index, I want to check if there are already documents in the index with similar content to the content of the document about to be inserted. If the request returns 1 or more documents, then I don't want to insert the document.
>>
>> What is the best way to achieve the above functionality? I read about fuzzy searches in Lucene. But can I really build a request such as mydoc.title:wordexample~ AND mydoc.content:(all the content words)~0.9?
>>
>> Thank you for your help

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
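For reference, the configuration side of the Deduplication feature discussed above is an update processor chain in solrconfig.xml; this follows the shape shown on the wiki page Markus cites, with example field names:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- the computed hash is stored in this field -->
    <str name="signatureField">signature</str>
    <!-- overwriteDupes=true replaces an existing doc with the same signature -->
    <bool name="overwriteDupes">true</bool>
    <!-- fields that feed the signature; TextProfileSignature gives fuzzy matching -->
    <str name="fields">title,content</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain then has to be referenced from the update handler so it runs on every add; note this addresses index-time dedup only, not the search-time collapsing asked about above.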
Re: MoreLikeThis: How to get quality terms from html from content stream?
On Aug 7, 2009, at 5:23pm, Jay Hill wrote:

> I'm using the MoreLikeThisHandler with a content stream to get documents from my index that match content from an html page like this:
>
> http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true
>
> But, not surprisingly, the query generated is meaningless because a lot of the markup is picked out as terms:
>
> body:li body:href body:div body:class body:a body:script body:type body:js body:ul body:text body:javascript body:style body:css body:h body:img body:var body:articl body:ad body:http body:span body:prop
>
> Does anyone know a way to transform the html so that the content can be parsed out of the content stream and processed w/o the markup? Or do I need to write my own HTMLParsingMoreLikeThisHandler?

You'd want to parse the HTML to extract only text first, and use that for your index data. Both the Nutch and Tika OSS projects have examples of using HTML parsers (based on TagSoup or CyberNeko) to generate content suitable for indexing.

-- Ken

> If I parse the content out to a plain text file and point the stream.url param to file:///parsedfile.txt it works great.
>
> -Jay

----------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378