Re: Setting up MiniSolrCloudCluster to use pre-built index

2018-10-23 Thread Ken Krugler
Hi Mark,

I’ll have a completely new, rebuilt index that’s (a) large, and (b) already 
sharded appropriately.

In that case, using the merge API isn’t great, in that it would take 
significant time and temporarily use double (or more) disk space.

E.g. I’ve got an index with 250M+ records that’s about 200GB on disk. There are 
other indexes, still big but not quite as large as this one.

So I’m still wondering if there’s any robust way to swap in a fresh set of 
shards, especially without relying on legacy cloud mode.

I think I can figure out where the data is being stored for an existing (empty) 
collection, shut that down, swap in the new files, and reload.
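
For concreteness, here’s a minimal sketch of what I mean - all paths, core and 
collection names below are made up, and I haven’t verified this plays nicely 
with index locks or replica state:

import java.nio.file.*;

// Sketch: copy a pre-built shard index into an (empty) core's data dir, then
// reload the collection so the new segments get picked up. Everything named
// here is a placeholder.
public class SwapInPrebuiltIndex {
    public static void main(String[] args) throws Exception {
        Path prebuiltShard = Paths.get("/data/prebuilt/shard1/index");
        Path coreIndexDir =
            Paths.get("/opt/solr/server/solr/mycoll_shard1_replica1/data/index");

        // The core should be down (or at least idle) while this runs; clearing
        // out the old (empty) segments first would be safer than copying over them.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(prebuiltShard)) {
            for (Path f : files) {
                Files.copy(f, coreIndexDir.resolve(f.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
            }
        }

        // Then e.g. hit the Collections API:
        //   http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycoll
    }
}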

But I’m wondering if that’s really the best (or even sane) approach.

Thanks,

— Ken

> On May 19, 2018, at 6:24 PM, Mark Miller  wrote:
> 
> You create MiniSolrCloudCluster with a base directory and then each Jetty
> instance created gets a SolrHome in a subfolder called node{i}. So if
> legacyCloud=true you can just preconfigure a core and index under the right
> node{i} subfolder. legacyCloud=true should not even exist anymore though,
> so the long term way to do this would be to create a collection and then
> use the merge API or something to merge your index into the empty
> collection.
> 
> - Mark
> 
> On Sat, May 19, 2018 at 5:25 PM Ken Krugler 
> wrote:
> 
>> Hi all,
>> 
>> Wondering if anyone has experience (this is with Solr 6.6) in setting up
>> MiniSolrCloudCluster for unit testing, where we want to use an existing
>> index.
>> 
>> Note that this index wasn’t built with SolrCloud, as it’s generated by a
>> distributed (Hadoop) workflow.
>> 
>> So there’s no “restore from backup” option, or swapping collection
>> aliases, etc.
>> 
>> We can push our configset to Zookeeper and create the collection as per
>> other unit tests in Solr, but what’s the right way to set up data dirs for
>> the cores such that Solr is running with this existing index (or indexes,
>> for our sharded test case)?
>> 
>> Thanks!
>> 
>> — Ken
>> 
>> PS - yes, we’re aware of the routing issue with generating our own shards….
>> 
>> --
>> Ken Krugler
>> +1 530-210-6378 <(530)%20210-6378>
>> http://www.scaleunlimited.com
>> Custom big data solutions & training
>> Flink, Solr, Hadoop, Cascading & Cassandra
>> 
>> --
> - Mark
> about.me/markrmiller

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra



Re: Storing & using feature vectors

2018-10-22 Thread Ken Krugler
Hi Doug,

Many thanks for the tons of useful information!

Some comments/questions inline below.

— Ken

> On Oct 19, 2018, at 10:46 AM, Doug Turnbull 
>  wrote:
> 
> This is a pretty big hole in Lucene-based search right now that many
> practitioners have struggled with
> 
> I know a couple of people who have worked on solutions. And I've used a
> couple of hacks:
> 
> - You can hack together something that does cosine similarity using the
> term frequency & query boosts with DelimitedTermFreqFilterFactory. Basically the
> term frequency becomes a feature weight on the document. Boosts become the
> query weight. If you massage things correctly with the similarity, the
> resulting boolean similarity is a dot product…

I’ve done a quick test of that approach, though not as elegantly. I just 
constructed a string of “terms” (feature indices) that generated an 
approximation to the target vector. DelimitedTermFreqFilterFactory is much 
better :)
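
In case it helps anyone following along, the shape of that encoding is roughly 
this (a toy sketch - field name, feature ids and weights are all made up):

// Document side: a field whose analyzer applies a delimited term-frequency
// filter, so "f12|3" indexes term "f12" with tf=3 (weights get quantized to
// integer term freqs). Query side: the same feature ids, with the query
// vector's weights as boosts. With a similarity cut down to tf * boost (no
// idf/norms), the score over matching terms approximates the dot product.
public class DotProductEncodingSketch {
    public static void main(String[] args) {
        String docFieldValue = "f12|3 f47|2 f103|5";
        String query = "features:(f12^0.8 f47^1.5 f103^0.2)";
        System.out.println(docFieldValue);
        System.out.println(query);
    }
}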

The problem I ran into was that some features have negative weights, and it 
wasn’t obvious whether it would work to have a second field (with only the 
negative weights) that I used for (not really supported in Solr?) negative 
boosting.

Is there some hack to work around that?

> - Erik Hatcher has done some great work with payloads which you might want
> to check out. See the delimited payload filter factory, and payload score
> function queries

Thanks, I’d poked at payloads a bit. From what I could tell, there isn't a way 
to use payloads with negative feature values, or to filter results, but maybe I 
didn’t dig deep enough.

> - Simon Hughes Activate Talk (slides/video not yet posted) covers this
> topic in some depth

OK, that looks great - https://activate2018.sched.com/event/FkM3 and 
https://github.com/DiceTechJobs/VectorsInSearch

Seems like the planets are aligning for this kind of thing.

> - Rene Kriegler's Haystack Talk discusses encoding Inception model
> vectorizations of images:
> https://opensourceconnections.com/events/haystack-single/haystack-relevance-scoring/

Good stuff, thanks!

I’d be curious what his querqy <https://github.com/renekrie/querqy> 
configuration looked like for the “summing up fieldweights only (ignore df; use 
cross-field tf)” row in his results table on slide 36.

The use of LSHs (what he describes in this talk as “random projection forest”) 
is something I’d suggested to the client, to mitigate the need for true feature 
vector support.

Using an initial LSH-based query to get candidates, and then re-ranking based 
on the actual feature vector, is something I was expecting Rene to discuss but 
he didn’t seem to mention it in his talk.
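
For reference, the shape of what I’d suggested looks something like this (a 
sketch using plain random hyperplanes rather than his projection forest; token 
and method names are mine):

import java.util.Random;

// Sketch: random-hyperplane LSH signature for a dense vector. Each token
// ("h0_1", "h1_0", ...) gets indexed as a term; the query uses the same tokens
// to pull candidates, which are then re-ranked by exact dot product against
// the stored vector.
public class LshSketch {

    public static String[] signatureTokens(float[] vec, int numBits, long seed) {
        Random rnd = new Random(seed);  // same seed at index and query time
        String[] tokens = new String[numBits];
        for (int b = 0; b < numBits; b++) {
            double dot = 0;
            for (float v : vec) {
                dot += v * rnd.nextGaussian();  // component of hyperplane b
            }
            tokens[b] = "h" + b + "_" + (dot >= 0 ? 1 : 0);
        }
        return tokens;
    }

    // Used for the re-ranking pass over the candidate set.
    public static double dotProduct(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}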

> If this is a huge importance to you, I might also suggest looking at vespa,
> which makes tensors a first-class citizen and makes matrix-math pretty
> seamless: http://vespa.ai

Interesting, though my client is pretty much locked into using Solr.



> On Fri, Oct 19, 2018 at 12:50 PM Ken Krugler 
> wrote:
> 
>> Hi all,
>> 
>> [I posted on the Lucene list two days ago, but didn’t see any response -
>> checking here for completeness]
>> 
>> I’ve been looking at directly storing feature vectors and providing
>> scoring/filtering support.
>> 
>> This is for vectors consisting of (typically 300 - 2048) floats or doubles.
>> 
>> It’s following the same pattern as geospatial support - so a new field
>> type and query/parser, plus plumbing to hook it into Solr.
>> 
>> Before I go much further, is there anything like this already done, or in
>> the works?
>> 
>> Thanks,
>> 
>> — Ken
>> 
> CTO, OpenSource Connections
> Author, Relevant Search
> http://o19s.com/doug

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra



Storing & using feature vectors

2018-10-19 Thread Ken Krugler
Hi all,

[I posted on the Lucene list two days ago, but didn’t see any response - 
checking here for completeness]
 
I’ve been looking at directly storing feature vectors and providing 
scoring/filtering support.

This is for vectors consisting of (typically 300 - 2048) floats or doubles.

It’s following the same pattern as geospatial support - so a new field type and 
query/parser, plus plumbing to hook it into Solr.

Before I go much further, is there anything like this already done, or in the 
works?

Thanks,

— Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra



Is router.field an explicit shard name, or hashed?

2018-07-13 Thread Ken Krugler
Hi all,

From 
https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html#ShardsandIndexingDatainSolrCloud-DocumentRouting, 
and various posts on the mailing list, the implication is that the content of 
the “router.field” field is used as the shard name.

But on 
https://lucene.apache.org/solr/guide/6_6/collections-api.html#CollectionsAPI-create, 
the description of “router.field” says:

> If this field is specified, the router will look at the value of the field in 
> an input document to compute the hash and identify a shard instead of looking 
> at the uniqueKey field


I’m wondering which is correct.

Thanks,

— Ken

----------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra



Setting up MiniSolrCloudCluster to use pre-built index

2018-05-19 Thread Ken Krugler
Hi all,

Wondering if anyone has experience (this is with Solr 6.6) in setting up 
MiniSolrCloudCluster for unit testing, where we want to use an existing index.

Note that this index wasn’t built with SolrCloud, as it’s generated by a 
distributed (Hadoop) workflow.

So there’s no “restore from backup” option, or swapping collection aliases, etc.

We can push our configset to Zookeeper and create the collection as per other 
unit tests in Solr, but what’s the right way to set up data dirs for the cores 
such that Solr is running with this existing index (or indexes, for our sharded 
test case)?

Thanks!

— Ken

PS - yes, we’re aware of the routing issue with generating our own shards….

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra



Handling of local params in QParserPlugin.createParser

2017-04-03 Thread Ken Krugler
Hi all,

As part of some interesting work creating a custom query parser, I was writing 
unit tests that exercised ExtendedDismaxQParser.

So I first created the ExtendedDismaxQParserPlugin, and then used that to 
create the QParser via:

QParser parser = plugin.createParser(query, localParams, params, req);

If query is something like {!complexphrase}fieldname:”A * query”, I was 
expecting the complex phrase query parser to get used, but that’s not happening 
- the local param is being treated as regular text.

Which makes me think my conceptual model of local params processing is wrong, 
and there’s higher level code that does a pre-processing step first.
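
(For anyone who hits the same thing - this is my reading of the higher-level 
path, as a sketch; the helper method below is mine, not Solr API:)

import org.apache.lucene.search.Query;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;

// QParser.getParser() is what looks at a leading {!type ...} prefix and switches
// parsers, so {!complexphrase} gets honored here. Calling plugin.createParser()
// directly skips that step, which matches the behavior I'm seeing.
public static Query parseWithLocalParams(String qstr, SolrQueryRequest req) throws Exception {
    QParser parser = QParser.getParser(qstr, "edismax", req);
    return parser.getQuery();
}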

But I was hoping to get out a DisjunctionMaxQuery where one of the sub-queries 
was a ComplexPhraseQuery, which would mean the processing has to happen inside 
of the ExtendedDismaxQParser code.

Any pointers for where to poke around?

Thanks,

— Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Shingles from WDFF

2017-03-24 Thread Ken Krugler
Hi all,

I’ve got some ancient Lucene tokenizer code from 2006 that I’m trying to avoid 
forward-porting, but I don’t think there’s an equivalent in Solr 5/6.

Specifically it’s applying shingles to the output of something like the 
WordDelimiterFilter - e.g. MySuperSink gets split into “My” “Super” “Sink”, and 
then shingled (if we’re using shingle size of 2) to be “My”, “MySuper”, 
“Super”, “SuperSink”, “Sink”.

I can’t just follow the WDF with a shingle filter, because I don’t want shingles 
created across separate terms coming into the WDF - only across the pieces 
generated by the WDF from a single term.
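
For context, the chain I’d naively reach for looks like this (a Lucene 5/6 
analyzer sketch, untested - and per the above, the ShingleFilter here will also 
happily shingle pieces from two different original tokens, which is exactly what 
I need to avoid):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;

// "MySuperSink" -> "My" "Super" "Sink" via the WDF, then the shingle filter
// adds "MySuper" and "SuperSink" - but also shingles across separate incoming
// terms.
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream wdf = new WordDelimiterFilter(source,
            WordDelimiterFilter.GENERATE_WORD_PARTS
                | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE, null);
        ShingleFilter shingles = new ShingleFilter(wdf, 2, 2);
        shingles.setTokenSeparator("");  // "My" + "Super" -> "MySuper"
        return new TokenStreamComponents(source, shingles);
    }
};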

Or is there actually a way to make this work with Solr 5/6?

Thanks,

— Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





RE: how to update billions of docs

2016-03-19 Thread Ken Krugler
As others noted, currently updating a field means deleting and inserting the 
entire document.

Depending on how you use the field, you might be able to create another 
core/container with that one field (plus the key field), and use join support.
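
E.g. something like this with SolrJ (a sketch - the auxiliary core and field 
names are made up):

import org.apache.solr.client.solrj.SolrQuery;

// Keep the frequently-updated field in a small auxiliary core, then filter the
// big core through Solr's join query parser.
SolrQuery q = new SolrQuery("*:*");
q.addFilterQuery("{!join fromIndex=status_core from=doc_id to=id}status:active");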

Note that https://issues.apache.org/jira/browse/LUCENE-6352 is an improvement, 
which looks like it's in the 5.x code line, though I don't see a fix version.

-- Ken

> From: Mohsin Beg Beg
> Sent: March 16, 2016 3:52:47pm PDT
> To: solr-user@lucene.apache.org
> Subject: how to update billions of docs
> 
> Hi,
> 
> I have a requirement to replace a value of a field in 100B's of docs in 100's 
> of cores.
> The field is multiValued=false docValues=true type=StrField stored=true 
> indexed=true.
> 
> Atomic Updates performance is on the order of 5K docs per sec per core in 
> solr 5.3 (other fields are quite big).
> 
> Any suggestions ?
> 
> -Mohsin


--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







RE: Embedded Solr now deprecated?

2015-08-05 Thread Ken Krugler
Hi Shawn,

We have a different use case than the ones you covered in your response to 
Robert (below), which I wanted to call out.

We currently use the embedded server when building indexes as part of a Hadoop 
workflow. The results get copied to a production analytics server and swapped 
in on a daily basis.

Writing to multiple embedded servers (one per reduce task) gives us maximum 
performance, and has proven to be a very reliable method for the daily rebuild 
of pre-aggregations we need for our analytics use case.
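
Roughly, each reduce task does something like the following (a sketch - paths 
and core names are placeholders, batching and error handling omitted):

import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

// Build a task-local index with the embedded server, then ship the result.
CoreContainer container = new CoreContainer("/tmp/task-local-solr-home");
container.load();
EmbeddedSolrServer solr = new EmbeddedSolrServer(container, "build-core");

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-1");
doc.addField("title", "example");
solr.add(doc);          // once per reduce value, batched in practice

solr.commit();
container.shutdown();   // the finished index then gets copied to production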

Regards,

-- Ken

PS - I'm also currently looking at using embedded Solr as a state storage 
engine for Samza.

> From: Shawn Heisey
> Sent: August 5, 2015 7:54:07am PDT
> To: solr-user@lucene.apache.org
> Subject: Re: Embedded Solr now deprecated?
> 
> On 8/5/2015 7:09 AM, Robert Krüger wrote:
>> I tried to upgrade my application from solr 4 to 5 and just now realized
>> that embedded use of solr seems to be on the way out. Is that correct or is
>> there a just new API to use for that?
> 
> Building on Erick's reply:
> 
> I doubt that the embedded server is going away, and I do not recall
> seeing *anything* marking the entire class deprecated.  The class still
> receives attention from devs -- this feature was released with 5.1.0:
> 
> https://issues.apache.org/jira/browse/SOLR-7307
> 
> That said, we have discouraged users from deploying it in production for
> quite some time, even though it continues to exist and receive developer
> attention.  Some of the reasons that I think users should avoid the
> embedded server:  It doesn't support SolrCloud, you cannot make it
> fault-tolerant (redundant), and troubleshooting is harder because you
> cannot connect to it from outside of the source code where it is embedded.
> 
> Deploying Solr as a network service offers much more capability than you
> can get when you embed it in your application.  Chances are that you can
> easily replace EmbeddedSolrServer with one of the SolrClient classes and
> use a separate Solr deployment from your application.
> 
> Thanks,
> Shawn
> 


--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: When not to use NRTCachingDirectory and what to use instead.

2014-04-19 Thread Ken Krugler

On Jul 10, 2013, at 9:16am, Shawn Heisey  wrote:

> On 7/10/2013 9:59 AM, Tom Burton-West wrote:
>> The Javadoc for NRTCachingDirectoy (
>> http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/store/NRTCachingDirectory.html?is-external=true)
>>  says:
>> 
>>  "This class is likely only useful in a near real-time context, where
>> indexing rate is lowish but reopen rate is highish, resulting in many tiny
>> files being written..."
>> 
>> It seems like we have exactly the opposite use case, so we would like
>> advice on what directory implementation to use instead.
>> 
>> We are doing offline batch indexing, so no searches are being done.  So we
>> don't need NRT.  We also have a high indexing rate as we are trying to
>> index 3 billion pages as quickly as possible.
>> 
>> I am not clear what determines the reopen rate.   Is it only related to
>> searching or is it involved in indexing as well?
>> 
>>  Does the NRTCachingDirectory have any benefit for indexing under the use
>> case noted above?
>> 
>> I'm guessing we should just use the solrStandardDirectoryFactory instead.
>>  Is this correct?
> 
> The NRT directory object in Solr uses the MMap implementation as its default 
> delegate.  

The code I see seems to be using an FSDirectory, or is there another layer of 
wrapping going on here?

return new NRTCachingDirectory(FSDirectory.open(new File(path)), maxMergeSizeMB, maxCachedMB);
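
(If I’m reading it right, the wrinkle is that FSDirectory.open() is itself a 
factory - something like:)

import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// On 64-bit JREs that support it, FSDirectory.open() hands back an MMapDirectory
// (otherwise NIOFS/SimpleFS), so the NRTCachingDirectory delegate above is
// usually memory-mapped after all.
Directory delegate = FSDirectory.open(new File("/path/to/index"));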

> I would use MMapDirectoryFactory (the default for most of the 3.x releases) 
> for testing whether you can get any improvement from moving away from the 
> default.  The advantages of memory mapping are not something you'd want to 
> give up.

Tom - did you ever get any useful results from testing here? I'm also curious 
about the impact of various xxxDirectoryFactory implementations for batch 
indexing.

Thanks,

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: Enabling other SimpleText formats besides postings

2014-03-31 Thread Ken Krugler
Hi Erik (& Shawn),

On Mar 31, 2014, at 1:48pm, Shawn Heisey  wrote:

> On 3/31/2014 2:36 PM, Erik Hatcher wrote:
>> Not currently possible.  Solr’s SchemaCodecFactory only has a hook for 
>> postings format (and doc values format).

OK, thanks for confirming.

> Would it be a reasonable thing to develop a config structure (probably in 
> schema.xml) that starts with something like  and has ways 
> to specify the class and related configuration for each of the components in 
> the codec? Then you could specify codec="foo" on an individual field 
> definition.  The codec definition could allow one of them to have 
> default="true".
> 
> I will admit that my understanding of these Lucene-level details is low, so I 
> could be thinking about this wrong.

The absolute easiest approach would be to support a new init value for 
codecFactory, which SchemaCodecFactory would use to select a different base 
codec class to use (versus always using LuceneCodec). That would 
switch everything to a different codec.

Or you could extend the SchemaCodecFactory to support additional per-field 
settings for stored fields format, etc beyond what's currently available.

For my quick & dirty hack I've specified a different codecFactory in 
solrconfig.xml, and have my own factory that hard-codes the SimpleTextCodec.
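
i.e. something along these lines (class and package name are mine):

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.CodecFactory;

// The quick & dirty hack: a CodecFactory that always hands back SimpleTextCodec,
// wired up via <codecFactory class="com.scaleunlimited.SimpleTextCodecFactory"/>
// in solrconfig.xml.
public class SimpleTextCodecFactory extends CodecFactory {
    private final Codec codec = new SimpleTextCodec();

    public void init(NamedList args) {
        // nothing to configure
    }

    public Codec getCodec() {
        return codec;
    }
}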

This works - all files are in the SimpleTextXXX format, other than the 
segments.gen and segments_XX files; what, those aren't pluggable?!?! :)

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: Enabling other SimpleText formats besides postings

2014-03-31 Thread Ken Krugler
Hi all (and particularly Uwe and Robert),

On Mar 28, 2014, at 7:24am, Michael McCandless  
wrote:

> You told the fieldType to use SimpleText only for the postings, not
> all other parts of the codec (doc values, live docs, stored fields,
> etc...), and so it used the default codec for those components.
> 
> If instead you used the SimpleTextCodec (not sure how to specify this
> in Solr's schema.xml) then all components would be SimpleText.

Yes, that's the gist of my question - how do you specify use of SimpleTextXXX 
(e.g. SimpleTextStoredFieldsFormat) in Solr?

Or is this currently not possible?

Thanks,

-- Ken



> On Fri, Mar 28, 2014 at 8:53 AM, Ken Krugler
>  wrote:
>> Hi all,
>> 
>> I've been using the SimpleTextCodec in the past, but I just noticed 
>> something odd...
>> 
>> I'm running Solr 4.3, and enable the SimpleText posting format via something 
>> like:
>> 
>> <fieldType ... postingsFormat="SimpleText" />
>> 
>> The resulting index does have the expected _0_SimpleText_0.pst text output, 
>> but I just noticed that the other files are all the standard binary format 
>> (e.g. .fdt for field data)
>> 
>> Based on SimpleTextCodec.java, I was assuming that I'd get the 
>> SimpleTextStoredFieldsFormat for stored data.
>> 
>> This same holds true for most (all?) of the other files, e.g. 
>> https://issues.apache.org/jira/browse/LUCENE-3074 is about adding a simple 
>> text format for DocValues.
>> 
>> I can walk the code to figure out what's up, but I'm hoping I just need to 
>> change some configuration setting.
>> 
>> Thanks!
>> 
>> -- Ken


--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Enabling other SimpleText formats besides postings

2014-03-28 Thread Ken Krugler
Hi all,

I've been using the SimpleTextCodec in the past, but I just noticed something 
odd…

I'm running Solr 4.3, and enable the SimpleText posting format via something 
like:

<fieldType ... postingsFormat="SimpleText" />

The resulting index does have the expected _0_SimpleText_0.pst text output, but 
I just noticed that the other files are all the standard binary format (e.g. 
.fdt for field data)

Based on SimpleTextCodec.java, I was assuming that I'd get the 
SimpleTextStoredFieldsFormat for stored data.

This same holds true for most (all?) of the other files, e.g. 
https://issues.apache.org/jira/browse/LUCENE-3074 is about adding a simple text 
format for DocValues.

I can walk the code to figure out what's up, but I'm hoping I just need to 
change some configuration setting.

Thanks!

-- Ken

----------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr



RegexReplaceProcessorFactory replacement string support for match groups

2013-10-15 Thread Ken Krugler
Hi Hoss,

In RegexReplaceProcessorFactory, this line means that you can't use match 
groups in the replacement string:

replacement = Matcher.quoteReplacement(replacementParam.toString());
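
A quick standalone illustration of the effect (the date pattern is just made up 
for the example):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// quoteReplacement() escapes the "$", so the replacement is taken literally
// instead of expanding the match groups.
public class QuoteReplacementDemo {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(\\d{2})/(\\d{2})/(\\d{4})");
        String input = "15/10/2013";

        // What RegexReplaceProcessorFactory effectively does today:
        System.out.println(p.matcher(input).replaceAll(Matcher.quoteReplacement("$3-$2-$1")));
        // prints: $3-$2-$1

        // What I'd like to be able to do in the update chain:
        System.out.println(p.matcher(input).replaceAll("$3-$2-$1"));
        // prints: 2013-10-15
    }
}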

What's the reasoning behind this? Or am I missing something here, and groups 
can be used?

It's making it hard for me to write up a simple solution to a training 
exercise, where students need to clean up incorrectly formatted dates :)

Thanks,

-- Ken

----------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







WikipediaTokenizer documentation - never mind

2013-10-03 Thread Ken Krugler
Hi all,

Sorry for the noise - I finally realized that the script I was running was 
using some Java code (EnwikiContentSource, from Lucene benchmark) to explicitly 
set up fields and then push the results to Solr.

-- Ken

==
Where's the documentation on the WikipediaTokenizer?

Specifically I'm wondering how pieces from the source XML get mapped to field 
names in the Solr schema.

For example,  seems to be going into the "date" field for 
an example schema I've got.

And  goes into "body".

But is there any way to get , for example?

Thanks,

-- Ken

------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







WikipediaTokenizer documentation

2013-10-03 Thread Ken Krugler
Hi all,

Where's the documentation on the WikipediaTokenizer?

Specifically I'm wondering how pieces from the source XML get mapped to field 
names in the Solr schema.

For example,  seems to be going into the "date" field for 
an example schema I've got.

And  goes into "body".

But is there any way to get , for example?

Thanks,

-- Ken

------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: Grouping by field substring?

2013-09-12 Thread Ken Krugler
Hi Jack,

On Sep 11, 2013, at 5:34pm, Jack Krupansky wrote:

> Do a copyField to another field, with a limit of 8 characters, and then use 
> that other field.

Thanks - I should have included a few more details in my original question.

The issue is that I've got an index with 200M records, of which about 50M have 
a unique value for this prefix (which is 32 characters long)

So adding another indexed field would be significant, which is why I was hoping 
there was a way to do it via grouping/collapsing at query time.

Or is that just not possible?

Thanks,

-- Ken

> -----Original Message----- From: Ken Krugler
> Sent: Wednesday, September 11, 2013 8:24 PM
> To: solr-user@lucene.apache.org
> Subject: Grouping by field substring?
> 
> Hi all,
> 
> Assuming I want to use the first N characters of a specific field for 
> grouping results, is such a thing possible out-of-the-box?
> 
> If not, then what would the next best option be? E.g. a custom function query?
> 
> Thanks,
> 
> -- Ken
> 
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
> 
> 
> 
> 
> 

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Grouping by field substring?

2013-09-11 Thread Ken Krugler
Hi all,

Assuming I want to use the first N characters of a specific field for grouping 
results, is such a thing possible out-of-the-box?

If not, then what would the next best option be? E.g. a custom function query?

Thanks,

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: Filter cache pollution during sharded edismax queries

2013-08-26 Thread Ken Krugler
Hi Otis,

Sorry I missed your reply, and thanks for trying to find a similar report.

Wondering if I should file a Jira issue? That might get more attention :)

-- Ken

On Jul 5, 2013, at 1:05pm, Otis Gospodnetic wrote:

> Hi Ken,
> 
> Uh, I left this email until now hoping I could find you a reference to
> similar reports, but I can't find them now.  I am quite sure I saw
> somebody with a similar report within the last month.  Plus, several
> people have reported issues with performance dropping when they went
> from 3.x to 4.x and maybe this is why.
> 
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> Performance Monitoring -- http://sematext.com/spm
> 
> 
> 
> On Tue, Jul 2, 2013 at 3:01 PM, Ken Krugler  
> wrote:
>> Hi all,
>> 
>> After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit ratio 
>> had dropped significantly.
>> 
>> Previously it was at 95+%, but now it's < 50%.
>> 
>> I enabled recording 100 entries for debugging, and in looking at them it 
>> seems that edismax (and faceting) is creating entries for me.
>> 
>> This is in a sharded setup, so it's a distributed search.
>> 
>> If I do a search for the string "bogus text" using edismax on two fields, I 
>> get an entry in each of the shard's filter caches that looks like:
>> 
>> item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2):
>> 
>> Is this expected?
>> 
>> I have a similar situation happening during faceted search, even though my 
>> fields are single-value/untokenized strings, and I'm not using the enum 
>> facet method.
>> 
>> But I'll get many, many entries in the filterCache for facet values, and 
>> they all look like "item_::"
>> 
>> The net result of the above is that even with a very big filterCache size of 
>> 2K, the hit ratio is still only 60%.
>> 
>> Thanks for any insights,
>> 
>> -- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Blog posts on extracting text features using Solr

2013-07-21 Thread Ken Krugler
Hi all,

I recently posted parts 1 & 2 of a series on extracting text features for 
machine learning…

http://www.scaleunlimited.com/2013/07/10/text-feature-selection-for-machine-learning-part-1/

http://www.scaleunlimited.com/2013/07/21/text-feature-selection-for-machine-learning-part-2/

It uses Solr to generate terms from mailing list text, and then does analysis 
to extract good features for things like classification, similarity and 
clustering.

The last part will cover using Solr to implement a real-time similarity engine, 
and maybe a recommendation engine as well.

It undoubtedly has some things that are unclear or even incorrect, so please 
comment :)

Regards,

-- Ken

------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Filter cache pollution during sharded edismax queries

2013-07-02 Thread Ken Krugler
Hi all,

After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit ratio had 
dropped significantly.

Previously it was at 95+%, but now it's < 50%.

I enabled recording 100 entries for debugging, and in looking at them it seems 
that edismax (and faceting) is creating entries for me.

This is in a sharded setup, so it's a distributed search.

If I do a search for the string "bogus text" using edismax on two fields, I get 
an entry in each of the shard's filter caches that looks like:

item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2):

Is this expected?

I have a similar situation happening during faceted search, even though my 
fields are single-value/untokenized strings, and I'm not using the enum facet 
method.

But I'll get many, many entries in the filterCache for facet values, and they 
all look like "item_::"

The net result of the above is that even with a very big filterCache size of 
2K, the hit ratio is still only 60%.

Thanks for any insights,

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Solr 4.2.1 behavior with field names that use "|" character

2013-05-11 Thread Ken Krugler
Hi all,

We have a fieldname that uses the "|" character to separate elements (e.g. 
state|city)

Up until Solr 4.x this has worked fine.

Now, when doing a query that gets distributed across shards, we get a 
SolrException:

SEVERE: org.apache.solr.common.SolrException: can not use FieldCache on a field 
which is neither indexed nor has doc values: state
at 
org.apache.solr.schema.SchemaField.checkFieldCacheSource(SchemaField.java:186)
at org.apache.solr.schema.StrField.getValueSource(StrField.java:72)
at 
org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:362)
at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:68)
at org.apache.solr.search.QParser.getQuery(QParser.java:142)
at 
org.apache.solr.search.SolrReturnFields.add(SolrReturnFields.java:285)
at 
org.apache.solr.search.SolrReturnFields.parseFieldList(SolrReturnFields.java:112)
at 
org.apache.solr.search.SolrReturnFields.<init>(SolrReturnFields.java:98)
at 
org.apache.solr.search.SolrReturnFields.<init>(SolrReturnFields.java:74)
at 
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:96)

The problem appears to be that the fl=state|city parameter is getting split up 
by the FunctionQParser, and it tries to use "state" as a field name. This 
actually exists, but as an ignored field (since we can just do a 
q=state|city:ca|* to find all entries in California).

Is this a known issue? Is there any way to disable the parsing of field names 
in a field list?

Thanks,

-- Ken

----------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Javadocs issue on Solr web site

2012-07-04 Thread Ken Krugler
Currently all Javadoc links seem to wind up pointing at the api-4_0_0-ALPHA 
versions - is that expected?

E.g. do a Google search on StreamingUpdateSolrServer. First hit is for 
"StreamingUpdateSolrServer (Solr 3.6.0 API)"

Follow that link, and you get a 404 for page 
http://lucene.apache.org/solr/api-4_0_0-ALPHA/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html

-- Ken

------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






Re: Error with distributed search and Suggester component (Solr 3.4)

2012-05-02 Thread Ken Krugler
Hi Robert,

On May 1, 2012, at 7:07pm, Robert Muir wrote:

> On Tue, May 1, 2012 at 6:48 PM, Ken Krugler  
> wrote:
>> Hi list,
>> 
>> Does anybody know if the Suggester component is designed to work with shards?
> 
> I'm not really sure it is? They would probably have to override the
> default merge implementation specified by SpellChecker.

What confuses me is that Suggester says it's based on SpellChecker, which 
supposedly does work with shards.

> But, all of the current suggesters pump out over 100,000 QPS on my
> machine, so I'm wondering what the usefulness of this is?
> 
> And if it was useful, merging results from different machines is
> pretty inefficient, for suggest you would shard by term instead so
> that you need only contact a single host?

The issue is that I've got a configuration with 8 shards already that I'm 
trying to leverage for auto-complete.

My quick & dirty work-around would be to add a custom response handler that 
wraps the suggester, and returns results with the fields that the SearchHandler 
needs to do the merge.

-- Ken

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






Re: Error with distributed search and Suggester component (Solr 3.4)

2012-05-01 Thread Ken Krugler
I should have also included one more bit of information.

If I configure the top-level (sharding) request handler to use just the suggest 
component as such:

  


  explicit
  suggest-core
  localhost:8080/solr/core0/,localhost:8080/solr/core1/



  suggest

  

Then I don't get a NPE, but I also get a response with no results.


  
0
0

  r

  


For completeness, here are the other pieces to the solrconfig.xml puzzle:

  

  true
  suggest-one
  10



  suggest

  
  
  

  suggest-one
  org.apache.solr.spelling.suggest.Suggester
  org.apache.solr.spelling.suggest.fst.FSTLookup
  name  
  0.05
  true


  suggest-two
  org.apache.solr.spelling.suggest.Suggester
  org.apache.solr.spelling.suggest.fst.FSTLookup
  content  
  0.0
  true

  

Thanks,

-- Ken

On May 1, 2012, at 3:48pm, Ken Krugler wrote:

> Hi list,
> 
> Does anybody know if the Suggester component is designed to work with shards?
> 
> I'm asking because the documentation implies that it should (since 
> ...Suggester reuses much of the SpellCheckComponent infrastructure…, and the 
> SpellCheckComponent is documented as supporting a distributed setup).
> 
> But when I make a request, I get an exception:
> 
> java.lang.NullPointerException
>   at 
> org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:493)
>   at 
> org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:390)
>   at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:289)
>   at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
>   at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81)
>   at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:132)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
>   at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
>   at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>   at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>   at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
>   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
>   at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>   at 
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>   at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>   at org.mortbay.jetty.Server.handle(Server.java:326)
>   at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>   at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
>   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547)
>   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>   at 
> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
>   at 
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> Looking at the QueryComponent.java:493 code, I see:
> 
>SolrDocumentList docs = 
> (SolrDocumentList)srsp.getSolrResponse().getResponse().get("response");
> 
>// calculate global maxScore and numDocsFound
>if (docs.getMaxScore() != null) { <<<<  This is line 493
> 
> So I'm assuming the "docs" variable is null, which would happen if there is 
> no "response" element in the Solr response.
> 
> If I make a direct request to the request handler in one core (e.g. 
> http://hostname:8080/solr/core0/select?qt=suggest-core&q=rad), the query 
> works.
> 
> But I see that there's no element named "response", unlike a regular query.
> 
> 
>  
>0
>1
>  
>  
>
>  
>10
>0
>3
>
>  radair
>  radar
>
>  
>
>  
> 
> 
> So I'm wondering if my co

Error with distributed search and Suggester component (Solr 3.4)

2012-05-01 Thread Ken Krugler
Hi list,

Does anybody know if the Suggester component is designed to work with shards?

I'm asking because the documentation implies that it should (since ...Suggester 
reuses much of the SpellCheckComponent infrastructure…, and the 
SpellCheckComponent is documented as supporting a distributed setup).

But when I make a request, I get an exception:

java.lang.NullPointerException
at 
org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:493)
at 
org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:390)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:289)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81)
at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:132)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Looking at the QueryComponent.java:493 code, I see:

SolrDocumentList docs = 
(SolrDocumentList)srsp.getSolrResponse().getResponse().get("response");

// calculate global maxScore and numDocsFound
if (docs.getMaxScore() != null) { <<<<  This is line 493

So I'm assuming the "docs" variable is null, which would happen if there is no 
"response" element in the Solr response.

If I make a direct request to the request handler in one core (e.g. 
http://hostname:8080/solr/core0/select?qt=suggest-core&q=rad), the query works.

But I see that there's no element named "response", unlike a regular query.


  
0
1
  
  

  
10
0
3

  radair
  radar

  

  


So I'm wondering if my configuration is just borked and this should work, or 
the fact that the Suggester doesn't return a response field means that it just 
doesn't work with shards.
Thanks,
-- Ken
--------
http://about.me/kkrugler
+1 530-210-6378






--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






Re: JSON & XML response writer issues with short & binary fields

2012-01-13 Thread Ken Krugler

On Jan 13, 2012, at 1:39pm, Yonik Seeley wrote:

> -Yonik
> http://www.lucidimagination.com
> 
> 
> 
> On Fri, Jan 13, 2012 at 4:22 PM, Yonik Seeley
>  wrote:
>> On Fri, Jan 13, 2012 at 4:04 PM, Ken Krugler
>>  wrote:
>>> I finally got around to looking at why short field values are returned as 
>>> "java.lang.Short:".
>>> 
>>> Both XMLWriter.writeVal() and TextResponseWriter.writeVal() are missing the 
>>> check for (val instanceof Short), and thus this bit of code is used:
>>> 
>>>  // default... for debugging only
>>>  writeStr(name, val.getClass().getName() + ':' + val.toString(), true);
>>> 
>>> The same thing happens when you have a binary field, since val in that case 
>>> is byte[], so you get "[B:[B@"
>>> 
>>> Has anybody else run into this? Seems odd that it's not a known issue, so 
>>> I'm wondering if there's something odd about my schema.
>>> 
>>> This is especially true since BinaryField has write methods for both XML 
>>> and JSON (via TextResponseWriter) that handle Base64-encoding the data. So 
>>> I'm wondering how normally the BinaryField.write() methods would get used, 
>>> and whether the actual problem lies elsewhere.
>> 
>> Hmmm, Ryan recently restructured some of the writer code to support
>> the pseudo-field feature.  A quick look at the code seems like
>> FieldType.write() methods are not used anymore (the Document is
>> transformed into a SolrDocument and writeVal is used for each value).
> 
> Double hmmm... I see this in writeVal()
> 
>} else if (val instanceof IndexableField) {
>  IndexableField f = (IndexableField)val;
>  SchemaField sf = schema.getFieldOrNull( f.name() );
>  if( sf != null ) {
>sf.getType().write(this, name, f);
>  }
> 
> So my initial quick analysis of FieldType.write() not being used
> anymore doesn't look correct.
> Anyway, please do open an issue and we'll get to the bottom of it.

Thanks for the fast response - I was beginning to worry that nobody read my 
posts :)

See https://issues.apache.org/jira/browse/SOLR-3035

I've attached some test code to the issue, plus a simple fix for the JSON case.

Regards,

-- Ken

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






JSON & XML response writer issues with short & binary fields

2012-01-13 Thread Ken Krugler
I finally got around to looking at why short field values are returned as 
"java.lang.Short:".

Both XMLWriter.writeVal() and TextResponseWriter.writeVal() are missing the 
check for (val instanceof Short), and thus this bit of code is used:

  // default... for debugging only
  writeStr(name, val.getClass().getName() + ':' + val.toString(), true);

The same thing happens when you have a binary field, since val in that case is 
byte[], so you get "[B:[B@"

Has anybody else run into this? Seems odd that it's not a known issue, so I'm 
wondering if there's something odd about my schema.

This is especially true since BinaryField has write methods for both XML and 
JSON (via TextResponseWriter) that handle Base64-encoding the data. So I'm 
wondering how normally the BinaryField.write() methods would get used, and 
whether the actual problem lies elsewhere.

-- Ken

PS - any good reason why XMLWriter is a final class? I created my own fixed 
version of JSONResponseWriter w/o much effort because I could subclass it, but 
XMLWriter being final makes it hard (impossible?) to do the same, since there 
are numerous internal methods that take an explicit XMLWriter object as a 
parameter.

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






Re: Solr core as a dispatcher

2012-01-11 Thread Ken Krugler
Hi Hector,

On Jan 9, 2012, at 4:15pm, Hector Castro wrote:

> Hi,
> 
> Has anyone had success with multicore single node Solr configurations that 
> have one core acting solely as a dispatcher for the other cores?  For 
> example, say you had 4 populated Solr cores – configure a 5th to be the 
> definitive endpoint with `shards` containing cores 1-4.  
> 
> Is there any advantage to this setup over simply having requests distributed 
> randomly across the 4 populated cores (all with `shards` equal to cores 1-4)? 
>  Is it even worth distributing requests across the cores over always hitting 
> the same one?

If you have low query rates, then using a shards approach can improve 
performance on a multi-core (CPUs here, not Solr cores) setup.

By distributing the requests, you effectively use all CPU cores in parallel on 
one request.

And if you spread your shards across spindles, then you're also maximizing I/O 
throughput.
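
E.g. with SolrJ, pointing at the dispatcher core (host and core names are 
placeholders):

import org.apache.solr.client.solrj.SolrQuery;

// A query against the dispatcher core, fanned out to the populated cores via
// the shards parameter.
SolrQuery q = new SolrQuery("some query");
q.set("shards",
      "localhost:8983/solr/core1,localhost:8983/solr/core2,"
    + "localhost:8983/solr/core3,localhost:8983/solr/core4");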

But there are a few issues with this approach:

- binary fields don't work. The results come back as the byte[] toString() 
(something like "[B@..."), versus the actual data.
- short fields get "java.lang.Short" text prefixed on every value.
- deep queries result in lots of extra load. E.g. if you want the 5000th hit 
then you'll get (5000 * # of shards) hits being collected/returned to the 
dispatcher. Though only the unique id & score is returned in this case, 
followed by the second request to get the actual top N hits from the shards.

And there's something wonky with the way that distributed HTTP requests are 
queued up & processed - under load, I see IOExceptions where it's always N-1 
shards that succeed, and one shard request fails. But I don't have a good 
reproducible case yet to debug.

-- Ken

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






Re: strange performance issue with many shards on one server

2011-12-29 Thread Ken Krugler
>>>>> Vadim
>>>>> 
>>>>> 
>>>>> 2011/9/28 Frederik Kraus <frederik.kr...@gmail.com>
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> 
>>>>>> I am experiencing a strange issue doing some load tests. Our setup:
>>>>>> 
>>>>>> - 2 server with each 24 cpu cores, 130GB of RAM
>>>>>> - 10 shards per server (needed for response times) running in a single
>>>>>> tomcat instance
>>>>>> - each query queries all 20 shards (distributed search)
>>>>>> 
>>>>>> - each shard holds about 1.5 mio documents (small shards are needed due
>>>> to
>>>>>> rather complex queries)
>>>>>> - all caches are warmed / high cache hit rates (99%) etc.
>>>>>> 
>>>>>> 
>>>>>> Now for some reason we cannot seem to fully utilize all CPU power (no
>>>> disk
>>>>>> IO), ie. increasing concurrent users doesn't increase CPU-Load at a
>>>> point,
>>>>>> decreases throughput and increases the response times of the individual
>>>>>> queries.
>>>>>> 
>>>>>> Also 1-2% of the queries take significantly longer: avg somewhere at
>>>> 100ms
>>>>>> while 1-2% take 1.5s or longer.
>>>>>> 
>>>>>> Any ideas are greatly appreciated :)
>>>>>> 
>>>>>> Fred.
> 

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






SearchComponents and ShardResponse

2011-12-15 Thread Ken Krugler
Hi all,

I feel like I must be missing something here...

I'm working on a customized version of the SearchHandler, which supports 
distributed searching in multiple *local* cores.

Assuming you want to support SearchComponents, then my handler needs to 
create/maintain a ResponseBuilder, which is passed to various SearchComponent 
methods.

The ResponseBuilder has a "finished" list of ShardRequest objects, for requests 
that have received responses from shards.

Inside the ShardRequest is a "responses" list of ShardResponse objects, which 
contain things like the SolrResponse.

The SolrResponse field in ShardResponse is private, and the method to set it is 
package private.

So it doesn't appear like there's any easy way to create the ShardResponse 
objects that the SearchComponents expect to receive inside of the 
ResponseBuilder.

If I put my custom SearchHandler class into the same package as the 
ShardResponse class, then I can call setSolrResponse().

It builds, and I can run locally. But if I deploy a jar with this code, then at 
runtime I get an illegal access exception when running under Jetty.

I can make it work by re-building the solr.war with my custom SearchHandler, 
but that's pretty painful.

Any other ideas/input?

Thanks,

-- Ken

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






Distributed search and binary fields w/Solr 3.4

2011-11-13 Thread Ken Krugler
Hi there,

I'm running into a problem, where queries that are distributed among multiple 
shards don't return binary field data properly.

If I hit a single core, the XML response to my HTTP request contains the 
expected data.

If I hit the request handler that's configured to distribute the request to my 
shards, the XML contains strings like "[B@..." instead.

It looks like I wind up getting the .toString() data, not the 
actual data itself.

Has anybody else run into this? I've done a fair amount of searching, but no 
hits yet.

Next step is to create a unit test in Solr, if nobody raises their hand, and 
then walk it.

Thanks,

-- Ken

------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






Re: overwrite=false support with SolrJ client

2011-11-10 Thread Ken Krugler

On Nov 7, 2011, at 12:06pm, Chris Hostetter wrote:

> 
> : I see that https://issues.apache.org/jira/browse/SOLR-653 removed this 
> : support from SolrJ, because it was deemed too dangerous for mere 
> : mortals.
> 
> I believe the concern was that the "novice level" API was very in your 
> face about asking if you wanted to "overwrite" and made it too easy to 
> hurt yourself.
> 
> It should still be fairly trivial to specify overwrite=false in a SolrJ 
> request -- just not using hte convenience methods.  something like...
> 
>   UpdateRequest req = new UpdateRequest();
>   req.add(myBigCollectionOfDocuments);
>   req.setParam(UpdateParams.OVERWRITE, true);
>   req.process(mySolrServer);

That seemed to work, thanks for the suggestion - though using (in case anybody 
else reads this)

   req.setParam(UpdateParams.OVERWRITE, Boolean.toString(false));

I'll need to run some tests to check performance improvements.

> : For Hadoop-based workflows, it's straightforward to ensure that the 
> : unique key field is really unique, thus if the performance gain is 
> : significant, I might look into figuring out some way (with a trigger 
> : lock) of re-enabling this support in SolrJ.
> 
> it's not just an issue of knowing that the key is unique -- it's an issue 
> of being certain that your index does not contain any documents with the 
> same key as a document you are about to add.  If you are generating a 
> completley new solr index from data that you are certain is unique -- then 
> you will probably see some perf gains.  but if you are adding to an 
> existing index, i would avoid it. 

For Hadoop workflows, the output is always fresh (unless you do some 
interesting helicopter stunts).

So yes, by default the index is always being rebuilt from scratch.

And thus as long as the primary key is being used as the reduce-phase key, it's 
easy to ensure uniqueness in the index.

Thanks again,

-- Ken

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






overwrite=false support with SolrJ client

2011-11-04 Thread Ken Krugler
Hi list,

I'm working on improving the performance of the Solr scheme for Cascading.

This supports generating a Solr index as the output of a Hadoop job. We use 
SolrJ to write the index locally (via EmbeddedSolrServer).

There are mentions of using overwrite=false with the CSV request handler, as a 
way of improving performance.

I see that https://issues.apache.org/jira/browse/SOLR-653 removed this support 
from SolrJ, because it was deemed too dangerous for mere mortals.

My question is whether anyone knows just how much performance boost this really 
provides.

For Hadoop-based workflows, it's straightforward to ensure that the unique key 
field is really unique, thus if the performance gain is significant, I might 
look into figuring out some way (with a trigger lock) of re-enabling this 
support in SolrJ.

Thanks,

-- Ken

------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






Re: indexing key value pair into lucene solr index

2011-10-24 Thread Ken Krugler

On Oct 24, 2011, at 1:41pm, jame vaalet wrote:

> hi,
> in my use case i have list of key value pairs in each document object, if i
> index them as separate index fields then in the result doc object i will get
> two arrays corresponding to my keys and values. The problem i face here is
> that there wont be any mapping between those keys and values.
> 
> do we have any easy to index these data in solr ? thanks in advance ...

As Karsten said, providing more detail re what you're actually trying to do 
usually makes for better and more helpful/accurate answers.

But I'm guessing you only want to search on the key, not the value, right?

If so, then:

1. Create a multi-value field with a custom type, indexed, stored.
2. During indexing, add entries as combined key/value tokens (e.g. key=value).
3. In the custom type, set the analyzer to strip off the value part so you 
only index the key. E.g.


  





  
  



  


-- Ken

------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: Want to support "did you mean xxx" but is Chinese

2011-10-21 Thread Ken Krugler
Hi Floyd,

Typically you'd do this by creating a custom analyzer that

 - segments Chinese text into words
 - Converts from words to pinyin or zhuyin

Your index would have both the actual Hanzi characters, plus (via copyfield) 
this phonetic version.

During search, you'd use dismax to search against both fields, with a higher 
weighting to the Hanzi field.

But segmentation can be error prone, and requires embedding specialized code 
that you typically license (for high quality results) from a commercial vendor.

So my first cut approach would be to use the current synonym support to map 
each Hanzi to all possible pronunciations. There are numerous open source 
datasets that contain this information. Note that there might be performance 
issues with having such a huge set of synonyms.

Then, by weighting phrase matches sufficiently high (again using dismax) I 
think you could get reasonable results.

-- Ken
 
On Oct 21, 2011, at 7:33am, Floyd Wu wrote:

> Does anybody know how to implement this idea in SOLR. Please kindly
> point me a direction.
> 
> For example, when user enter a keyword in Chinese "貝多芬" (this is
> Beethoven in Chinese)
> but key in a wrong combination of characters  "背多分" (this is
> pronouncation the same with previous keyword "貝多芬").
> 
> There in solr index exist token "貝多芬" actually. How to hit documents
> where "貝多芬" exist when "背多分" is enter.
> 
> This is basic function of commercial search engine especially in
> Chinese processing. I wonder how to implements in SOLR and where is
> the start point.
> 
> Floyd

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: Does anybody has experience in Chinese soundex(sounds like) of SOLR?

2011-10-20 Thread Ken Krugler
> Wow, interesting question.  Can soundex even be applied to a language like 
> Chinese, which is tonal and doesn't have individual letters, but whole 
> characters?  I'm no expert, but intuitively speaking it sounds hard or maybe 
> even impossible...  

The only two cases I can think of are:

 - Cases where you have two (or more) characters that are variant forms. 
Unicode tried to unify all of these, but some still exist. And in GB 18030 
there are tons.

 - If you wanted to support phonetic (pinyin or zhuyin) search, then you might 
want to collapse syllables that are commonly confused. But then of course you'd 
have to be storing the phonetic forms for all of the words.

-- Ken


>> From: Floyd Wu 
>> To: solr-user@lucene.apache.org
>> Sent: Thursday, October 20, 2011 5:43 AM
>> Subject: Does anybody has experience in Chinese soundex(sounds like) of SOLR?
>> 
>> Hi  there,
>> 
>> There are many English soundex implementation can be referenced, but I
>> wonder how to do Chinese soundex(sounds like) filter (maybe).
>> 
>> any idea?
>> 
>> Floyd
>> 
>> 
>> 

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: Multi CPU Cores

2011-10-16 Thread Ken Krugler

On Oct 16, 2011, at 1:44pm, Rob Brown wrote:

> Looks like I checked the load during a quiet period, ab -n 1 -c 1000
> saw a decent 40% load on each core.
> 
> Still a little confused as to why 1 core stays at 100% constantly - even
> during the quiet periods?

Could be background GC, depending on what you've got your JVM configured to use.

Though that shouldn't stay at 100% for very long.

-- Ken


> -Original Message-
> From: Johannes Goll 
> Reply-to: solr-user@lucene.apache.org
> To: solr-user@lucene.apache.org 
> Subject: Re: Multi CPU Cores
> Date: Sat, 15 Oct 2011 21:30:11 -0400
> 
> Did you try to submit multiple search requests in parallel? The apache ab 
> tool is a great tool to simulate simultaneous load using -n and -c.
> Johannes
> 
> On Oct 15, 2011, at 7:32 PM, Rob Brown  wrote:
> 
>> Hi,
>> 
>> I'm running Solr on a machine with 16 CPU cores, yet watching "top"
>> shows that java is only apparently using 1 and maxing it out.
>> 
>> Is there anything that can be done to take advantage of more CPU cores?
>> 
>> Solr 3.4 under Tomcat
>> 
>> [root@solr01 ~]# java -version
>> java version "1.6.0_20"
>> OpenJDK Runtime Environment (IcedTea6 1.9.8)
>> (rhel-1.22.1.9.8.el5_6-x86_64)
>> OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
>> 
>> 
>> top - 14:36:18 up 22 days, 21:54,  4 users,  load average: 1.89, 1.24,
>> 1.08
>> Tasks: 317 total,   1 running, 315 sleeping,   0 stopped,   1 zombie
>> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.6%id,  0.4%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Cpu6  : 99.6%us,  0.4%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Cpu13 :  0.7%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Mem:  132088928k total, 23760584k used, 108328344k free,   318228k
>> buffers
>> Swap: 25920868k total,0k used, 25920868k free, 18371128k cached
>> 
>> PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+
>> COMMAND  
>>  
>>  
>> 4466 tomcat20   0 31.2g 4.0g 171m S 101.0  3.2   2909:38
>> java 
>>  
>> 
>> 6495 root  15   0 42416 3892 1740 S  0.4  0.0   9:34.71
>> openvpn  
>>          
>>  
>> 11456 root  16   0 12892 1312  836 R  0.4  0.0   0:00.08
>> top  
>>  
>>  
>>   1 root  15   0 10368  632  536 S  0.0  0.0   0:04.69
>> init 
>> 
>> 
>> 
> 

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: strange performance issue with many shards on one server

2011-09-28 Thread Ken Krugler
>>>>> special characters.
>>>>> if you don't know it:
>>>> http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance
>>>>> Regards
>>>>> Vadim
>>>>> 
>>>>> 
>>>>> 2011/9/28 Frederik Kraus <frederik.kr...@gmail.com>
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> 
>>>>>> I am experiencing a strange issue doing some load tests. Our setup:
>>>>>> 
>>>>>> - 2 server with each 24 cpu cores, 130GB of RAM
>>>>>> - 10 shards per server (needed for response times) running in a single
>>>>>> tomcat instance
>>>>>> - each query queries all 20 shards (distributed search)
>>>>>> 
>>>>>> - each shard holds about 1.5 mio documents (small shards are needed due
>>>> to
>>>>>> rather complex queries)
>>>>>> - all caches are warmed / high cache hit rates (99%) etc.
>>>>>> 
>>>>>> 
>>>>>> Now for some reason we cannot seem to fully utilize all CPU power (no
>>>> disk
>>>>>> IO), ie. increasing concurrent users doesn't increase CPU-Load at a
>>>> point,
>>>>>> decreases throughput and increases the response times of the individual
>>>>>> queries.
>>>>>> 
>>>>>> Also 1-2% of the queries take significantly longer: avg somewhere at
>>>> 100ms
>>>>>> while 1-2% take 1.5s or longer.
>>>>>> 
>>>>>> Any ideas are greatly appreciated :)
>>>>>> 
>>>>>> Fred.
> 

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: two cores but have single result set in solr

2011-09-23 Thread Ken Krugler

On Sep 23, 2011, at 2:03pm, hadi wrote:

> I have two cores with separate schema and index but i want to have a single
> result set in solr/browse,

If they have different schemas, how would you combine results from the two?

If they have the same schemas, then you can define a third core with a 
different conf dir, and in that separate conf/solrconfig.xml you can set up a 
request handler that just dispatches to the two real cores.
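
Something along these lines in that third core's solrconfig.xml (host, port, and core names are placeholders):

<requestHandler name="/combined" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- distributed search across the two real cores -->
    <str name="shards">localhost:8983/solr/core1,localhost:8983/solr/core2</str>
  </lst>
</requestHandler>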

-- Ken

------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: Distinct elements in a field

2011-09-17 Thread Ken Krugler

On Sep 15, 2011, at 3:43am, swiss knife wrote:

> Simple question: I want to know how many distinct elements I have in a field 
> and that also match a query. Do you know if there's a way to do it today in 3.4?
> 
> I saw SOLR-1814 and SOLR-2242.
> 
> SOLR-1814 seems fairly easy to use. What do you think ? Thank you

If you turn on facets in your query (facet=true&facet.field=<your field>) then 
you'll get back all of the distinct values, though might have to play with 
other settings (e.g. facet.limit=-1) to get the results you need.
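
For example (host, query, and field name are placeholders) - rows=0 skips the documents themselves, and the number of facet entries returned is your distinct-value count:

http://localhost:8983/solr/select?q=yourQuery&rows=0&facet=true&facet.field=yourField&facet.limit=-1&facet.mincount=1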

-- Ken

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: Will Solr/Lucene crawl multi websites (aka a mini google with faceted search)?

2011-09-11 Thread Ken Krugler

On Sep 11, 2011, at 7:04pm, dpt9876 wrote:

> Hi thanks for the reply.
> 
> How does nutch/solr handle the scenario where 1 website calls price, "price"
> and another website calls it "cost". Same thing different name, yet I would
> want the facet to handle that and not create a different facet.
> 
> Is this combo of nutch and Solr that intelligent and or intuitive?

What you're describing here is web mining, not web crawling.

You want to extract price data from web pages, and put that into a specific 
field in Solr.

To do that using Nutch, you'd need to write custom plug-ins that know how to 
extract the price from a page, and add that as a custom field to the crawl 
results.

The above is a topic for the Nutch mailing list, since Solr is just a 
downstream consumer of whatever Nutch provides.

-- Ken

> On Sep 12, 2011 9:06 AM, "Erick Erickson [via Lucene]" <
> ml-node+s472066n3328340...@n3.nabble.com> wrote:
>> 
>> 
>> Nope, there's nothing in Solr that crawls anything, you have to feed
>> documents in yourself from the websites.
>> 
>> Or, look at the Nutch project, see: http://nutch.apache.org/about.html
>> 
>> which is designed for this kind of problem.
>> 
>> Best
>> Erick
>> 
>> On Sun, Sep 11, 2011 at 8:53 PM, dpt9876 
> wrote:
>>> Hi all,
>>> I am wondering if Solr will do the following for a project I am working
> on.
>>> I want to create a search engine with facets for potentially hundreds of
>>> websites.
>>> Similar to say crawling amazon + buy.com + ebay and someone can search
> these
>>> 3 sites from my 1 website.
>>> (I realise there are better ways of doing the above example, its for
>>> illustrative purposes).
>>> Eventually I would build that search crawl to index say 200 or 1000
>>> merchants.
>>> Someone would come to my site and search for "digital camera".
>>> 
>>> They would get results from all 3 indexes and hopefully dynamic facets eg
>>> Price $100-200
>>> Price 200-300
>>> Resolution 1mp-2mp
>>> 
>>> etc etc
>>> 
>>> Can this be done on the fly?
>>> 
>>> I ask this because I am currently developing webscrapers to crawl these
>>> websites, dump that data into a db, then was thinking of tacking on a
> solr
>>> server to crawl my db.
>>> 
>>> Problem with that approach is that crawling the worlds ecommerce sites
> will
>>> take forever, when it seems solr might do that for me? (I have read about
>>> multiple indexes etc).
>>> 
>>> Many thanks
>>> 
>>> --
>>> View this message in context:
> http://lucene.472066.n3.nabble.com/Will-Solr-Lucene-crawl-multi-websites-aka-a-mini-google-with-faceted-search-tp3328314p3328314.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>> 
>> 
>> 
>> ___
>> If you reply to this email, your message will be added to the discussion
> below:
>> 
> http://lucene.472066.n3.nabble.com/Will-Solr-Lucene-crawl-multi-websites-aka-a-mini-google-with-faceted-search-tp3328314p3328340.html
>> 
>> To unsubscribe from Will Solr/Lucene crawl multi websites (aka a mini
> google with faceted search)?, visit
> http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=3328314&code=ZGFuaW50aGV0cm9waWNzQGdtYWlsLmNvbXwzMzI4MzE0fC04MDk0NTc1ODg=
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Will-Solr-Lucene-crawl-multi-websites-aka-a-mini-google-with-faceted-search-tp3328314p3328449.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: performance crossover between single index and sharding

2011-08-02 Thread Ken Krugler
With low qps and multi-core servers, I believe one reason to have multiple 
shards on one server is to provide better parallelism for a request, and thus 
reduce your response time.

-- Ken

On Aug 2, 2011, at 11:06am, Jonathan Rochkind wrote:

> What's the reasoning  behind having three shards on one machine, instead of 
> just combining those into one shard? Just curious.  I had been thinking the 
> point of shards was to get them on different machines, and there'd be no 
> reason to have multiple shards on one machine.
> 
> On 8/2/2011 1:59 PM, Burton-West, Tom wrote:
>> Hi Markus,
>> 
>> Just as a data point for a very large sharded index, we have the full text 
>> of 9.3 million books with an index size of about 6+ TB spread over 12 shards 
>> on 4 machines. Each machine has 3 shards. The size of each shard ranges 
>> between 475GB and 550GB.  We are definitely I/O bound. Our machines have 
>> 144GB of memory with about 16GB dedicated to the tomcat instance running the 
>> 3 Solr instances, which leaves about 120 GB (or 40GB per shard) for the OS 
>> disk cache.  We release a new index every morning and then warm the caches 
>> with several thousand queries.  I probably should add that our disk storage 
>> is a very high performance Isilon appliance that has over 500 drives and 
>> every block of every file is striped over no less than 14 different drives. 
>> (See blog for details *)
>> 
>> We have a very low number of queries per second (0.3-2 qps) and our modest 
>> response time goal is to keep 99th percentile response time for our 
>> application (i.e. Solr + application) under 10 seconds.
>> 
>> Our current performance statistics are:
>> 
>> average response time  300 ms
>> median response time   113 ms
>> 90th percentile663 ms
>> 95th percentile1,691 ms
>> 
>> We had plans to do some performance testing to determine the optimum shard 
>> size and optimum number of shards per machine, but that has remained on the 
>> back burner for a long time as other higher priority items keep pushing it 
>> down on the todo list.
>> 
>> We would be really interested to hear about the experiences of people who 
>> have so many shards that the overhead of distributing the queries, and 
>> consolidating/merging the responses becomes a serious issue.
>> 
>> 
>> Tom Burton-West
>> 
>> http://www.hathitrust.org/blogs/large-scale-search
>> 
>> * 
>> http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5-million-volumes-and-beyond
>> 
>> -Original Message-
>> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
>> Sent: Tuesday, August 02, 2011 12:33 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: performance crossover between single index and sharding
>> 
>> Actually, i do worry about it. Would be marvelous if someone could provide
>> some metrics for an index of many terabytes.
>> 
>>> [..] At some extreme point there will be diminishing
>>> returns and a performance decrease, but I wouldn't worry about that at all
>>> until you've got many terabytes -- I don't know how many but don't worry
>>> about it.
>>> 
>>> ~ David
>>> 
>>> -
>>>  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/performance-crossover-between-single-in
>>> dex-and-sharding-tp3218561p3219397.html Sent from the Solr - User mailing
>>> list archive at Nabble.com.

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions








Re: Processing/Indexing CSV

2011-06-09 Thread Ken Krugler

On Jun 9, 2011, at 2:21pm, Helmut Hoffer von Ankershoffen wrote:

> Hi,
> 
> btw: there seems to be somewhat of a non-match regarding efforts to enhance DIH
> regarding the CSV format (James Dyer) and the effort to maintain the
> CSVLoader (Ken Krugler). How about merging your efforts and migrating the
> CSVLoader to a CSVEntityProcessor (cp. my initial email)? :-)

While I'm a CSVLoader user (and I've found/fixed one bug in it), I'm not 
involved in any active development/maintenance of that piece of code.

If James or you can make progress on merging support for CSV into DIH, that's 
great.

-- Ken


> On Thu, Jun 9, 2011 at 11:17 PM, Helmut Hoffer von Ankershoffen <
> helmut...@googlemail.com> wrote:
> 
>> 
>> 
>> On Thu, Jun 9, 2011 at 11:05 PM, Ken Krugler 
>> wrote:
>> 
>>> 
>>> On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote:
>>> 
>>>> Hi,
>>>> 
>>>> ... that would be an option if there is a defined set of field names and
>>> a
>>>> single column/CSV layout. The scenario however is different csv files
>>> (from
>>>> different shops) with individual column layouts (separators, encodings
>>>> etc.). The idea is to map known field names to defined field names in
>>> the
>>>> solr schema. If I understand the capabilities of the CSVLoader correctly
>>>> (sorry, I am completely new to Solr, started work on it today) this is
>>> not
>>>> possible - is it?
>>> 
>>> As per the documentation on
>>> http://wiki.apache.org/solr/UpdateCSV#fieldnames, you can specify the
>>> names/positions of fields in the CSV file, and ignore fieldnames.
>>> 
>>> So this seems like it would solve your requirement, as each different
>>> layout could specify its own such mapping during import.
>>> 
>>> Sure, but the requirement (to keep the process of integrating new shops
>> efficient) is not to have one mapping per import (cp. the Email regarding
>> "more or less schema free") but to enhance one mapping that maps common
>> field names to defined fields disregarding order of known fields/columns. As
>> far as I understand that is not a problem at all with DIH, however DIH and
>> CSV are not a perfect match ,-)
>> 
>> 
>>> It could be handy to provide a fieldname map (versus the value map that
>>> UpdateCSV supports).
>> 
>> Definitely. Either a fieldname map in CSVLoader or a robust CSVLoader in
>> DIH ...
>> 
>> 
>>> Then you could use the header, and just provide a mapping from header
>>> fieldnames to schema fieldnames.
>>> 
>> That's the idea -)
>> 
>> => what's the best way to progress. Either someone enhances the CSVLoader
>> by a field mapper (with multipel input field names mapping to one field name
>> in the Solr schema) or someone enhances the DIH with a robust CSV loader
>> ,-). As I am completely new to this Community, please give me the direction
>> to go (or wait :-).
>> 
>> best regards
>> 
>> 
>>> -- Ken
>>> 
>>>> On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley <
>>> yo...@lucidimagination.com>wrote:
>>>> 
>>>>> On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
>>>>>  wrote:
>>>>>> Hi,
>>>>>> yes, it's about CSV files loaded via HTTP from shops to be fed into a
>>>>>> shopping search engine.
>>>>>> The CSV Loader cannot map fields (only field values) etc.
>>>>> 
>>>>> You can provide your own list of fieldnames and optionally ignore the
>>>>> first line of the CSV file (assuming it contains the field names).
>>>>> http://wiki.apache.org/solr/UpdateCSV#fieldnames
>>>>> 
>>>>> -Yonik
>>>>> http://www.lucidimagination.com
>>>>> 
>>> 
>>> --
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://bixolabs.com
>>> custom data mining solutions
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions








Re: Processing/Indexing CSV

2011-06-09 Thread Ken Krugler

On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote:

> Hi,
> 
> ... that would be an option if there is a defined set of field names and a
> single column/CSV layout. The scenario however is different csv files (from
> different shops) with individual column layouts (separators, encodings
> etc.). The idea is to map known field names to defined field names in the
> solr schema. If I understand the capabilities of the CSVLoader correctly
> (sorry, I am completely new to Solr, started work on it today) this is not
> possible - is it?

As per the documentation on http://wiki.apache.org/solr/UpdateCSV#fieldnames, 
you can specify the names/positions of fields in the CSV file, and ignore 
fieldnames.

So this seems like it would solve your requirement, as each different layout 
could specify its own such mapping during import.
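
For example, a per-shop import might look like this (the URL, separator, and field names are placeholders) - with header=true plus an explicit fieldnames list, the file's own header line is read but ignored:

curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%3B&header=true&fieldnames=id,name,price,category' --data-binary @shop-a.csv -H 'Content-type:text/plain; charset=utf-8'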

It could be handy to provide a fieldname map (versus the value map that 
UpdateCSV supports). Then you could use the header, and just provide a mapping 
from header fieldnames to schema fieldnames.

-- Ken
 
> On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley 
> wrote:
> 
>> On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
>>  wrote:
>>> Hi,
>>> yes, it's about CSV files loaded via HTTP from shops to be fed into a
>>> shopping search engine.
>>> The CSV Loader cannot map fields (only field values) etc.
>> 
>> You can provide your own list of fieldnames and optionally ignore the
>> first line of the CSV file (assuming it contains the field names).
>> http://wiki.apache.org/solr/UpdateCSV#fieldnames
>> 
>> -Yonik
>> http://www.lucidimagination.com
>> 

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions








Re: Solr monitoring: Newrelic

2011-06-09 Thread Ken Krugler
It sounds like "roySolr" is running embedded Jetty, launching solr using the 
start.jar

If so, then there's no app container where Newrelic can be installed.

-- Ken

On Jun 9, 2011, at 2:28am, Sujatha Arun wrote:

> Try the RPM support, accessed from the account support page, giving all
> details - they are very helpful.
> 
> Regards
> Sujatha
> 
> On Thu, Jun 9, 2011 at 2:33 PM, roySolr  wrote:
> 
>> Yes, that's the problem. There is no jetty folder.
>> I have try the example/lib directory, it's not working. There is no jetty
>> war file, only
>> jetty-***.jar files
>> 
>> Same error, could not locate a jetty instance.
>> 
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Solr-monitoring-Newrelic-tp3042889p3043080.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>> 

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions








Re: Hitting the URI limit, how to get around this?

2011-06-03 Thread Ken Krugler
It sounds like you're hitting the max URL length (8K is a common default) for 
the HTTP web server that you're using to run Solr.

All of the web servers I know about let you bump this limit up via 
configuration settings.

-- Ken

On Jun 3, 2011, at 9:27am, JohnRodey wrote:

> So here's what I'm seeing: I'm running Solr 3.1
> I'm running a java client that executes a Httpget (I tried HttpPost) with a
> large shard list.  If I remove a few shards from my current list it returns
> fine, when I use my full shard list I get a "HTTP/1.1 400 Bad Request".  If
> I execute it in firefox with a few shards removed it returns fine, with the
> full shard list I get a blank screen returned immediately.
> 
> My URI works at around 7800 characters but adding one more shard to it blows
> up.
> 
> Any ideas? 
> 
> I've tried using SolrJ rather than httpget before but ran into similar
> issues but with even less shards.
> See 
> http://lucene.472066.n3.nabble.com/Long-list-of-shards-breaks-solrj-query-td2748556.html
> http://lucene.472066.n3.nabble.com/Long-list-of-shards-breaks-solrj-query-td2748556.html
>  
> 
> My shards are added dynamically, every few hours I am adding new shards or
> cores into the cluster.  so I cannot have a shard list in the config files
> unless I can somehow update them while the system is running.
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Hitting-the-URI-limit-how-to-get-around-this-tp3017837p3020185.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions








Re: Difference between Solr and Lucidworks distribution

2011-04-03 Thread Ken Krugler

On Apr 3, 2011, at 6:56am, yehosef wrote:

> How can they require payment for something that was developed under the
> apache license?

It's the difference between free speech and free beer :)

See http://en.wikipedia.org/wiki/Gratis_versus_libre

-- Ken

------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: boilerpipe solr tika howto please

2011-01-14 Thread Ken Krugler

Hi Arno,

On Jan 14, 2011, at 3:57am, arnaud gaudinat wrote:


Hello,

I would like to use BoilerPipe (a very good program which cleans the  
html content from surplus "clutter").
I saw that BoilerPipe is inside Tika 0.8 and so should be accessible  
from solr, am I right?


How can I activate BoilerPipe in Solr? Do I need to change
solrconfig.xml ( with  
org.apache.solr.handler.extraction.ExtractingRequestHandler)?


Or do I need to modify some code inside Solr?

I saw something like TikaCLI -F in the tika forum (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration
) is it the right way?


You need to add the BoilerpipeContentHandler into Tika's content  
handler chain.


Which I'm pretty sure means you'd need to modify Solr, e.g. (in trunk)  
the TikaEntityProcessor.getHtmlHandler() method. I'd try something like:


return new BoilerpipeContentHandler(new ContentHandlerDecorator(

Though from a quick look at that code, I'm curious why it doesn't use  
BodyContentHandler, versus the current ContentHandlerDecorator.
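
Outside of Solr, the basic wiring with Tika 0.8 looks something like this (an untested sketch; the class and method names are made up):

import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class BoilerpipeExtract {
    // Parse HTML with Tika's HtmlParser, stripping boilerplate via boilerpipe.
    public static String extractMainText(InputStream html) throws Exception {
        BodyContentHandler text = new BodyContentHandler();
        new HtmlParser().parse(html, new BoilerpipeContentHandler(text),
                new Metadata(), new ParseContext());
        return text.toString();
    }
}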


-- Ken

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: How to let crawlers in, but prevent their damage?

2011-01-10 Thread Ken Krugler


On Jan 10, 2011, at 7:02am, Otis Gospodnetic wrote:


Hi Ken, thanks Ken. :)

The problem with this approach is that it exposes very limited  
content to

bots/web search engines.

Take http://search-lucene.com/ for example.  People enter all kinds  
of queries
in web search engines and end up on that site.  People who visit the  
site
directly don't necessarily search for those same things.  Plus, new  
terms are
entered to get to search-lucene.com every day, so keeping up with  
that would
mean constantly generating more and more of those static pages.   
Basically, the

tail is super long.


To clarify - the issue of using actual user search traffic is one of  
SEO, not what content you expose.


If, for example, people commonly do a search for "java "  
then that's a hint that the URL to the static content, and the page  
title, should have the language as part of it.


So you shouldn't be generating static pages based on search traffic.  
Though you might want to decide what content to "favor" (see below)  
based on popularity.



On top of that, new content is constantly being generated,
so one would have to also constantly both add and update those  
static pages.


Yes, but that's why you need to automate that content generation, and  
do it on a regular (e.g. weekly) basis.


The big challenges we ran into were:

1. Dealing with badly behaved bots that would hammer the site.

We wound up putting this content on a separate system, so it wouldn't  
impact users on the main system.


And generating a regular report by user agent & IP address, so that we  
could block by robots.txt and IP when necessary.


2. Figuring out how to structure the static content so that it didn't  
look like spam to Google/Yahoo/Bing


You don't want to have too many links per page, or too much depth, but  
that constrains how many pages you can reasonably expose.


We had project scores based on code, activity, usage - so we used that  
to rank the content and focus on exposing early (low depth) the "good  
stuff". You could do the same based on popularity, from search logs.


Anyway, there's a lot to this topic, but it doesn't feel very Solr  
specific. So apologies for reducing the signal-to-noise ratio with  
talk about SEO :)


-- Ken

I have a feeling there is not a good solution for this because on  
one hand
people don't like the negative bot side effect, on the other hand  
people want as
much of their sites indexed by the big guys.  The only half-solution  
that comes
to mind involves looking at who's actually crawling you and who's  
bringing you
visitors, then blocking those with a bad ratio of those two - bots  
that crawl a

lot but don't bring a lot of value.

Any other ideas?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 

From: Ken Krugler 
To: solr-user@lucene.apache.org
Sent: Mon, January 10, 2011 9:43:49 AM
Subject: Re: How to let crawlers in, but prevent their damage?

Hi Otis,

From what I learned at Krugle, the approach that worked for us  was:

1. Block all bots on the search page.

2. Expose the target  content via statically linked pages that are  
separately
generated from the same  backing store, and optimized for target  
search terms

(extracted from your own  search logs).

-- Ken

On Jan 10, 2011, at 5:41am, Otis Gospodnetic  wrote:


Hi,

How do people with public search  services deal with bots/crawlers?
And I don't mean to ask how one bans  them (robots.txt) or slow  
them down

(Delay
stuff in robots.txt) or  prevent them from digging too deep in  
search

results...


What I  mean is that when you have publicly exposed search that  
bots crawl,

they
issue all kinds of crazy "queries" that result in errors, that add  
noise to

Solr

caches, increase Solr cache evictions, etc. etc.

Are there some known recipes for dealing with them, minimizing their

negative

side-effects, while still letting them crawl you?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



--
Ken  Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n  g








--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: How to let crawlers in, but prevent their damage?

2011-01-10 Thread Ken Krugler

Hi Otis,

From what I learned at Krugle, the approach that worked for us was:

1. Block all bots on the search page.

2. Expose the target content via statically linked pages that are  
separately generated from the same backing store, and optimized for  
target search terms (extracted from your own search logs).


-- Ken

On Jan 10, 2011, at 5:41am, Otis Gospodnetic wrote:


Hi,

How do people with public search services deal with bots/crawlers?
And I don't mean to ask how one bans them (robots.txt) or slow them  
down (Delay
stuff in robots.txt) or prevent them from digging too deep in search  
results...


What I mean is that when you have publicly exposed search that bots  
crawl, they
issue all kinds of crazy "queries" that result in errors, that add  
noise to Solr

caches, increase Solr cache evictions, etc. etc.

Are there some known recipes for dealing with them, minimizing their  
negative

side-effects, while still letting them crawl you?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: entire farm fails at the same time with OOM issues

2010-12-01 Thread Ken Krugler


On Nov 30, 2010, at 5:16pm, Robert Petersen wrote:


What would I do with the heap dump though?  Run one of those java heap
analyzers looking for memory leaks or something?  I have no experience
with thoseI saw there was a bug fix in solr 1.4.1 for a 100 byte  
memory
leak occurring on each commit, but it would take thousands of  
commits to

make that add up to anything right?


Typically when I run out of memory in Solr, it's during an index  
update, when the new index searcher is getting warmed up.


Looking at the heap often shows ways to reduce memory requirements,  
e.g. you'll see a really big chunk used for a sorted field.


See http://wiki.apache.org/solr/SolrCaching and http://wiki.apache.org/solr/SolrPerformanceFactors 
 for more details.


-- Ken




-Original Message-----
From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Tuesday, November 30, 2010 3:12 PM
To: solr-user@lucene.apache.org
Subject: Re: entire farm fails at the same time with OOM issues

Hi Robert,

I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError
and -XX:HeapDumpPath=<path for the dump>, so then
you have something to look at versus a Gedankenexperiment :)

-- Ken

On Nov 30, 2010, at 3:04pm, Robert Petersen wrote:


Greetings, we are running one master and four slaves of our multicore
solr setup.  We just served searches for our catalog of 8 million
products with this farm during black Friday and cyber Monday, our
busiest days of the year, and the servers did not break a sweat!
Index
size is about 28GB.

However, twice now recently during a time of low load we have had a
fire
drill where I have seen tomcat/solr fail and become unresponsive  
after

some OOM heap errors.  Solr wouldn't even serve up its admin pages.
I've had to go in and manually knock tomcat out of memory and then
restart it.  These solr slaves are load balanced and the load
balancers
always probe the solr slaves so if they stop serving up searches they
are automatically removed from the load balancer.  When all four
fail at
the same time we have an issue!

My question is this.  Why in the world would all of my slaves, after
running fine for some days, suddenly all at the exact same minute
experience OOM heap errors and go dead?  The load balancer kicks them
all out at the same time each time.  Each slave only talks to the
master
and not to each other, but the master show no errors in the logs at
all.
Something must be triggering this though.  The only other odd thing I
saw in the logs was after the first OOM errors were recorded, the
slaves
started occasionally not being able to get to the master.

This behavior makes me a little nervous...=:-o  eek!





Environment:  Lucid Imagination distro of Solr 1.4 on Tomcat



Platform: RHEL with Sun JRE 1.6.0_18 on dual quad xeon machines with
64GB memory etc etc










<http://ken-blog.krugler.org>
+1 530-265-2225






------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: entire farm fails at the same time with OOM issues

2010-11-30 Thread Ken Krugler

Hi Robert,

I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError  
and -XX:HeapDumpPath=<path for the dump>, so then
you have something to look at versus a Gedankenexperiment :)
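
E.g. for Tomcat, something like this in bin/setenv.sh (the dump path is a placeholder):

CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/solr-oom.hprof"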


-- Ken

On Nov 30, 2010, at 3:04pm, Robert Petersen wrote:


Greetings, we are running one master and four slaves of our multicore
solr setup.  We just served searches for our catalog of 8 million
products with this farm during black Friday and cyber Monday, our
busiest days of the year, and the servers did not break a sweat!   
Index

size is about 28GB.

However, twice now recently during a time of low load we have had a  
fire

drill where I have seen tomcat/solr fail and become unresponsive after
some OOM heap errors.  Solr wouldn't even serve up its admin pages.
I've had to go in and manually knock tomcat out of memory and then
restart it.  These solr slaves are load balanced and the load  
balancers

always probe the solr slaves so if they stop serving up searches they
are automatically removed from the load balancer.  When all four  
fail at

the same time we have an issue!

My question is this.  Why in the world would all of my slaves, after
running fine for some days, suddenly all at the exact same minute
experience OOM heap errors and go dead?  The load balancer kicks them
all out at the same time each time.  Each slave only talks to the  
master
and not to each other, but the master show no errors in the logs at  
all.

Something must be triggering this though.  The only other odd thing I
saw in the logs was after the first OOM errors were recorded, the  
slaves

started occasionally not being able to get to the master.

This behavior makes me a little nervous...=:-o  eek!





Environment:  Lucid Imagination distro of Solr 1.4 on Tomcat



Platform: RHEL with Sun JRE 1.6.0_18 on dual quad xeon machines with
64GB memory etc etc










<http://ken-blog.krugler.org>
+1 530-265-2225






----------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: Dinamically change master

2010-11-30 Thread Ken Krugler

Hi Tommaso,

On Nov 30, 2010, at 7:41am, Tommaso Teofili wrote:


Hi all,

in a replication environment if the host where the master is running  
goes
down for some reason, is there a way to communicate to the slaves to  
point
to a different (backup) master without manually changing  
configuration (and

restarting the slaves or their cores)?

Basically I'd like to be able to change the replication master  
dinamically

inside the slaves.

Do you have any idea of how this could be achieved?


One common approach is to use VIP (virtual IP) support provided by  
load balancers.


Your slaves are configured to use a VIP to talk to the master, so that  
it's easy to dynamically change which master they use, via updates to  
the load balancer config.
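
A sketch of the slave side (the VIP hostname and poll interval are placeholders):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- the VIP resolves to whichever box is currently the master -->
    <str name="masterUrl">http://solr-master-vip:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>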


-- Ken

------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: A Newbie Question

2010-11-14 Thread Ken Krugler
Solaris with hundreds of thousands of text files (there are other files, as well, but my target is these text files). The directories on the Solaris boxes are exported and are available as NFS mounts.

I have installed Solr 1.4 on a Linux box and have tested the installation, using curl to post documents. However, the manual says that curl is not the recommended way of posting documents to Solr. Could someone please tell me what is the preferred approach in such an environment? I am not a programmer and would appreciate some hand-holding here :o)

Thanks in advance,

Sesh


--
Lance Norskog
goks...@gmail.com


















--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: Dynamic creating of cores in solr

2010-11-10 Thread Ken Krugler
lrInputDocumentList();

  UpdateResponse rsp;
   try
   {
   rsp = indexCore.add(docList);
   rsp = indexCore.commit();
   }
   catch (IOException e)
   {
   LOG.warn("Error commiting documents", e);
   }
   catch (SolrServerException e)
   {
   LOG.warn("Error commiting documents", e);
   }
[snip]

3) optimize, then swap cores:

   private void optimizeCore()
   {
   try
   {
   indexCore.optimize();
   }
   catch(SolrServerException e)
   {
   LOG.warn("Error while optimizing core", e);
   }
   catch(IOException e)
   {
   LOG.warn("Error while optimizing core", e);
   }
   }

   private void swapCores()
   {
   String liveCore = indexName;
    String indexCore = indexName + SolrConstants.SUFFIX_INDEX; // SUFFIX_INDEX = "_INDEX"

    LOG.info("Swapping Solr cores: " + indexCore + ", " + liveCore);

   CoreAdminRequest request = new CoreAdminRequest();
   request.setAction(CoreAdminAction.SWAP);
   request.setCoreName(indexCore);
   request.setOtherCoreName(liveCore);
   try
   {
   request.process(solr);
   }
   catch (SolrServerException e)
   {
   e.printStackTrace();
   }
   catch (IOException e)
   {
   e.printStackTrace();
   }
   }


And that's about it.

You could adjust the above so there's only one core per index that
you want - if you don't do complete reindexes, and don't need the  
index

to always be searchable.


Hope that helps...


Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com




-Original Message-
From: Nizan Grauer [mailto:niz...@yahoo-inc.com]
Sent: Tuesday, November 09, 2010 3:36 AM
To: solr-user@lucene.apache.org
Subject: Dynamic creating of cores in solr

Hi,

I'm not sure this is the right mail to write to, hopefully you can

help

or direct me to the right person

I'm using solr - one master with 17 slaves in the server and using
solrj as the java client

Currently there's only one core in all of them (master and  
slaves) -

only the cpaCore.

I thought about using multi-cores solr, but I have some problems

with

that.

I don't know in advance which cores I'd need -

When my java program runs, I call for documents to be index to a
certain url, which contains the core name, and I might create a url
based on core that is not yet created. For example:

Calling to index - http://localhost:8080/cpaCore  - existing core,
everything as usual
Calling to index -  http://localhost:8080/newCore - server realizes
there's no core "newCore", creates it and indexes to it. After that

-

also creates the new core in the slaves
Calling to index - http://localhost:8080/newCore  - existing core,
everything as usual

What I'd like to have on the server side to do is realize by itself

if

the cores exists or not, and if not  - create it

One other restriction - I can't change anything in the client  
side -

calling to the server can only make the calls it's doing now - for
index and search, and cannot make calls for cores creation via the
CoreAdminHandler. All I can do is something in the server itself

What can I do to get it done? Write some RequestHandler?
REquestProcessor? Any other option?

Thanks, nizan









--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: Inconsistent slave performance after optimize

2010-10-27 Thread Ken Krugler
.

After the 4 hour break, we re-moved the 3rd and last slave server  
from our

load-balancing pool, then re-enabled replication.
This time we saw a tiny blip. The average performance went up to 1  
second

briefly then went back to the (normal for us)
0.25 to 0.5 second range. We then added this server back to the
load-balancing pool and observed no degradation in performance.

While we were happy to avoid a repeat of the poor performance we  
saw on

the
previous slaves, we are at a loss to explain
why this slave did not also have such poor performance.

At this point we're scratching our heads trying to understand:
  (a) Why the performance of the first two slaves was so terrible  
after

the
optimize. We think its cache-warming related, but we're not sure.

10 hours seems like a long time to wait for the cache to warm

up
  (b) Why the performance of the third slave was barely impacted. It
should
have hit the same cold-cache issues as the other servers, if that is
indeed
the root cause.
  (c) Why performance of the first 2 slaves is still much worse  
after the

optimize than it was before the optimize,
 where the performance of the 3rd slave is pretty much  
unchanged. We

expected the optimize to *improve* performance.

All 3 slave servers are identically configured, and the procedure  
for

re-enabling replication was identical for the 2nd and 3rd
slaves, with the exception of a 4-hour wait period.

We have confirmed that the 3rd slave did replicate, the number of
documents
and total index size matches the master and other slave servers.

I'm writing to fish for an explanation or ideas that might explain  
this
inconsistent performance. Obviously, we'd like to be able to  
reproduce the
performance of the 3rd slave, and avoid the poor performance of  
the first

two slaves the next time we decide it's time to optimize our index.

thanks in advance,

Mason







--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: Multiple Word Facets

2010-10-27 Thread Ken Krugler


On Oct 27, 2010, at 6:29am, Adam Estrada wrote:


Ahhh...I see! I am doing my testing crawling a couple websites using
Nutch and in doing so I am assigning my facets to the title field
which is type=text. Are you saying that I will need to manually
generate the content for my facet field? I can see the reason and need
for doing it that way but I really need for my faceting to happen
dynamically based on the content in the field which in this case is
the title of a URL.


You would use copyfield to copy the contents of the title into a new  
field that uses the string type, and is the one you use for faceting.
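
A sketch of the schema.xml side (the field names are placeholders):

<field name="title" type="text" indexed="true" stored="true"/>
<field name="title_facet" type="string" indexed="true" stored="true"/>
<copyField source="title" dest="title_facet"/>

Then facet on title_facet while still searching against title.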


-- Ken


On Wed, Oct 27, 2010 at 9:19 AM, Jayendra Patil
 wrote:
The Shingle Filter Breaks the words in a sentence into a  
combination of 2/3

words.

For faceting field you should use :-
<field name="..." type="string" indexed="true" stored="true" multiValued="true"/>

The type of the field should be *string *so that it is not  
tokenised at all.


On Wed, Oct 27, 2010 at 9:12 AM, Adam Estrada  
wrote:


Thanks guys, the solr.ShingleFilterFactory did work to get me  
multiple

terms per facet but now I am seeing some redundancy in the facets
numbers. See below...

Highway (62)
Highway System (59)
National (59)
National Highway (59)
National Highway System (59)
System (59)

See what's going on here? How can I make my multi token facets  
smarter

so that the tokens aren't duplicated?

Thanks in advance,
Adam

On Tue, Oct 26, 2010 at 10:32 PM, Ahmet Arslan   
wrote:

Facets are generated from indexed terms.

Depending on your need/use-case:

You can use a additional separate String field (which is not  
tokenized)
for facets, populate it via copyField. Search on tokenized field  
facet on

non-tokenized field.


Or

You can add solr.ShingleFilterFactory to your index analyzer to  
form

multiple word terms.


--- On Wed, 10/27/10, Adam Estrada  wrote:


From: Adam Estrada 
Subject: Multiple Word Facets
To: solr-user@lucene.apache.org
Date: Wednesday, October 27, 2010, 4:43 AM
All,
I am a new to Solr faceting and stuck on how to get
multiple-word
facets returned from a standard Solr query. See below for
what is
currently being returned.





(facet counts only - the term names are not shown: 89, 87, 87, 87, 84, 60, 32, 22, 19, 15, 15, 14, 12, 11, 10, 9, 7, 7, 7, 6, 6, 6, 6, ...etc...)

There are many terms in there that are 2 or 3 word phrases.
For
example, Eastern Federal Lands Highway Division all gets
broken down
in to the individual words that make up the total group of
words. I've
seen quite a few websites that do what it is I am trying to
do here so
any suggestions at this point would be great. See my schema
below
(copied from the example schema).


[analyzer definition for type="index", copied from the example schema]
Similar for type="query". Please advise on how to group or
cluster
document terms so that they can be used as facets.

Many thanks in advance,
Adam Estrada












--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: using HTTPClient sending solr ping request wont timeout as specified

2010-10-13 Thread Ken Krugler

Hi Renee,

Mike is right, this is a question to post on the HttpClient users list  
(httpclient-us...@hc.apache.org).


And yes, there is a separate setConnectionTimeout() that can be used.  
Though I'm most familiar with HttpClient 4.0, not 3.1.
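
With HttpClient 3.1, I believe the equivalent is along these lines (untested; the 5000ms values are placeholders):

import org.apache.commons.httpclient.HttpClient;

public class PingTimeouts {
    // The connect timeout is separate from the socket (read) timeout;
    // both are in milliseconds.
    public static void configure(HttpClient client) {
        client.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
        client.getHttpConnectionManager().getParams().setSoTimeout(5000);
    }
}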


One possibility is that the ping response handler is responding (the  
connection is established), but you're not getting any data back.


-- Ken


On Oct 13, 2010, at 4:41am, Michael Sokolov wrote:

This does seem more like an HTTPClient question than a solr question  
- you

might get more traction on their lists?  Still, from what I remember
HTTPClient has a number of timeouts you can set.  Perhaps it's the  
read

timeout you need?

-Mike



-Original Message-
From: Renee Sun [mailto:renee_...@mcafee.com]
Sent: Tuesday, October 12, 2010 7:47 PM
To: solr-user@lucene.apache.org
Subject: Re: using HTTPClient sending solr ping request wont
timeout as specified


I also added the following timeout for the connection, still
not working:


   client.getParams().setSoTimeout(httpClientPingTimeout);

client 
.getParams().setConnectionManagerTimeout(httpClientPingTimeout);


--
View this message in context:
http://lucene.472066.n3.nabble.com/using-HTTPClient-sending-so
lr-ping-request-wont-timeout-as-specified-tp1691292p1691355.html
Sent from the Solr - User mailing list archive at Nabble.com.





----------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: PatternReplaceFilterFactory creating empty string as a term

2010-10-05 Thread Ken Krugler


On Oct 5, 2010, at 6:24pm, Shawn Heisey wrote:

I am developing a new schema. It has a pattern filter that trims  
leading and trailing punctuation from terms.




It is resulting in empty terms, because there are situations in the  
analyzer stream where a term happens to be composed of nothing but  
punctuation. This problem is not happening in production. I want  
those terms removed.


This blank term makes the top of the list as far as term frequency.  
Out of 7.6 million documents, 4.8 million of them have it. From  
TermsComponent:



(responseHeader: status 0, QTime 19106; terms: the empty term "" has a count of 4830648)


[snip]

Is there any existing way to remove empty terms during analysis? I  
tried TrimFilterFactory but that made no difference.


You could use LengthFilterFactory to restrict terms to being at least  
one character long.
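
A sketch of the analyzer chain (the type name and the punctuation regex are placeholders for whatever you're using now):

<fieldType name="text_trimmed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- trim leading/trailing punctuation -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^\p{Punct}+|\p{Punct}+$" replacement="" replace="all"/>
    <!-- drop any tokens that end up empty -->
    <filter class="solr.LengthFilterFactory" min="1" max="512"/>
  </analyzer>
</fieldType>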



Is this a bug in PatternReplaceFilterFactory?


No, I don't believe so. PatternReplaceFilterFactory creates a  
PatternReplaceFilter, and the JavaDoc for that says:
Note: Depending on the input and the pattern used and the input  
TokenStream, this TokenFilter may produce Tokens whose text is the  
empty string.




-- Ken


<http://ken-blog.krugler.org>
+1 530-265-2225




------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: getting a list of top page-ranked webpages

2010-09-16 Thread Ken Krugler

Hi Ian,

On Sep 16, 2010, at 2:44pm, Ian Upright wrote:

Hi, this question is a little off topic, but I thought since so many  
people

on this are probably experts in this field, someone may know.

I'm experimenting with my own semantic-based search engine, but I  
want to
test it with a large corpus of web pages.  Ideally I would like to  
have a

list of the top 10M or top 100M page-ranked URL's in the world.

Short of using Nutch to crawl the entire web and build this page- 
rank, is
there any other ways?  What other ways or resources might be  
available for

me to get this (smaller) corpus of top webpages?


The public terabyte dataset project would be a good match for what you  
need.


http://bixolabs.com/datasets/public-terabyte-dataset-project/

Of course, that means we have to actually finish the crawl & finalize  
the Avro format we use for the data :)


There are other free collections of data around, though none that I  
know of which target top-ranked pages.


-- Ken

----------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: Color search for images

2010-09-15 Thread Ken Krugler


On Sep 15, 2010, at 7:59am, Shawn Heisey wrote:

My index consists of metadata for a collection of 45 million  
objects, most of which are digital images.  The executives have  
fallen in love with Google's color image search.  Here's a search  
for "flower" with a red color filter:


http://www.google.com/images?q=flower&tbs=isch:1,ic:specific,isc:red

I am interested in duplicating this.  Can this group of fine people  
point me in the right direction?  I don't want anyone to do it for  
me, just help me find software and/or algorithms that can extract  
the color information, then find a way to get Solr to index and  
search it.


When I took at look at the search results, it seems like the word  
"red" shows up in the image name, or description, or tag for every  
found image.


Are you sure Google is extracting color information? Or just being  
smart about color-specific keywords found in associated text?


-- Ken

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: Is semicolon a character that needs escaping?

2010-09-02 Thread Ken Krugler

Hi Michael,

But in general escaping characters in a query gets tricky - if you  
can

directly build queries versus pre-processing text sent to the query
parser, you'll save yourself some pain and suffering.


What do you mean by these two alternatives? That is, what exactly  
could

I do better?


By "can build...", I meant if you can come up with a GUI whereby the  
user doesn't have to use special characters (other than say quoting)  
then you can take a collection of clauses and programmatically build  
your query, without using the query parser.
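
At the Lucene level (or inside a custom Solr query parser) that kind of direct construction is just the following (a sketch; the field name is made up, and the terms are assumed to already be analyzed/lowercased to match the index):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class DirectQueryBuilder {
    // Build an AND of the user's clauses without going through the query parser.
    public static Query allOf(String field, String... terms) {
        BooleanQuery query = new BooleanQuery();
        for (String term : terms) {
            query.add(new TermQuery(new Term(field, term)), Occur.MUST);
        }
        return query;
    }
}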


The code I wound up having to write for what seemed like simple  
escaping quickly got complex and convoluted - e.g. if you want to  
allow "AND" as a term, and don't want it to get processed specially by  
the query parser.



Also, since I did the above code the DisMaxRequestHandler has been
added to Solr, and it (IIRC) tries to be smart about handling this
type of escaping for you.


Dismax is not (yet) an option because we need the full lucene syntax
within the query.


OK - in that case sounds like you're stuck with escaping.


-- Ken

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: Is semicolon a character that needs escaping?

2010-09-02 Thread Ken Krugler


On Sep 2, 2010, at 12:35pm, Michael Lackhoff wrote:

According to http://lucene.apache.org/java/2_9_1/queryparsersyntax.html

only these characters need escaping:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
but with this simple query:
TI:stroke; AND TI:journal
I got the error message:
HTTP ERROR: 400
Unknown sort order: TI:journal

My first guess was that it was a URL encoding issue but everything  
looks

fine:
http://localhost:8983/solr/select/?q=TI%3Astroke%3B+AND+TI%3Ajournal&version=2.2&start=0&rows=10&indent=on
as you can see, the semicolon is encoded as %3B
There is no problem when the query ends with the semicolon:
TI:stroke;
gives no error.
The first query also works if I escape the semicolon:
TI:stroke\; AND TI:journal

From this I conclude that there is a bug either in the docs or in the
query parser or I missed something. What is wrong here?


The docs need to be updated, I believe. From some code I wrote back in  
2006...


// Also note that we escape ';', as Solr uses this to support  
embedding
// commands into the query string (yikes), and the code base  
we're using
// has a bug where if the ';' doesn't have two tokens after  
it (white-
// space separated) then you get an array index out of bounds  
error.


I also had this note, no idea if it's still an issue:

// Before we do regular escaping, work around a bug in the  
Lucene query
// parser. If the last character is a '\', we can escape it  
as '\\', but
// if we build an expression that looks like xxx AND  
() then
// the Lucene query parser will treat the final '\' before  
the ')' as
// a signal to escape the ')' character. That's just wrong,  
but for now
// we'll just strip off any trailing '\' characters in the  
clause.


But in general escaping characters in a query gets tricky - if you can  
directly build queries versus pre-processing text sent to the query  
parser, you'll save yourself some pain and suffering.


Also, since I did the above code the DisMaxRequestHandler has been  
added to Solr, and it (IIRC) tries to be smart about handling this  
type of escaping for you.


-- Ken

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: Restricting HTML search?

2010-08-25 Thread Ken Krugler
Actually TagSoup's reason for existence is to clean up all of the  
messy HTML that's out in the wild.


Tika's HTML parser wraps this, and uses it to generate the stream of  
SAX events that it then consumes and turns into a normalized XHTML 1.0- 
compliant data stream.


-- Ken

On Aug 25, 2010, at 7:22pm, Lance Norskog wrote:


This assumes that the HTML is good quality. I don't know exactly what
your use case is. If you're crawling the web you will find some very
screwed-up HTML.

On Wed, Aug 25, 2010 at 6:45 AM, Ken Krugler
 wrote:


On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:

Wouldn't the usage of the NeckoHTML (as an XML-parser) and XPath  
be safer?

I guess it all depends on the "quality" of the source document.


If you're processing HTML then you definitely want to use something  
like

NekoHTML or TagSoup.

Note that Tika uses TagSoup and makes it easy to do special  
processing of
specific elements - you give it a content handler that gets fed a  
stream of

cleaned-up HTML elements.

-- Ken


Le 25-août-10 à 02:09, Lance Norskog a écrit :

I would do this with regular expressions. There is a Pattern  
Analyzer
and a Tokenizer which do regular expression-based text chopping.  
(I'm

not sure how to make them do what you want). A more precise tool is
the RegexTransformer in the DataImportHandler.

Lance

On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
 wrote:


I'm quite new to SOLR and wondering if the following is  
possible: in
addition to normal full text search, my users want to have the  
option to
search only HTML heading innertext, i.e. content inside of <h1>, <h2>, or <h3> tags.




----
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g









--
Lance Norskog
goks...@gmail.com


----
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Restricting HTML search?

2010-08-25 Thread Ken Krugler


On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:

Wouldn't the usage of the NeckoHTML (as an XML-parser) and XPath be  
safer?

I guess it all depends on the "quality" of the source document.


If you're processing HTML then you definitely want to use something  
like NekoHTML or TagSoup.


Note that Tika uses TagSoup and makes it easy to do special processing  
of specific elements - you give it a content handler that gets fed a  
stream of cleaned-up HTML elements.
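
For example, to pull out just the heading text (an untested sketch against Tika 0.8; the class and method names are made up):

import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class HeadingExtractor extends DefaultHandler {
    private final StringBuilder headings = new StringBuilder();
    private boolean inHeading = false;

    @Override
    public void startElement(String uri, String local, String name, Attributes atts) {
        if (local.matches("h[1-3]")) inHeading = true;
    }

    @Override
    public void endElement(String uri, String local, String name) {
        if (local.matches("h[1-3]")) { inHeading = false; headings.append(' '); }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inHeading) headings.append(ch, start, length);
    }

    // The collected heading text would then be fed into a separate Solr field.
    public static String extract(InputStream html) throws Exception {
        HeadingExtractor handler = new HeadingExtractor();
        new HtmlParser().parse(html, handler, new Metadata(), new ParseContext());
        return handler.headings.toString();
    }
}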


-- Ken


Le 25-août-10 à 02:09, Lance Norskog a écrit :


I would do this with regular expressions. There is a Pattern Analyzer
and a Tokenizer which do regular expression-based text chopping. (I'm
not sure how to make them do what you want). A more precise tool is
the RegexTransformer in the DataImportHandler.

Lance

On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
 wrote:

I'm quite new to SOLR and wondering if the following is possible: in
addition to normal full text search, my users want to have the  
option to
search only HTML heading innertext, i.e. content inside of <h1>, <h2>, or <h3> tags.




------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: indexing???

2010-08-17 Thread Ken Krugler


On Aug 16, 2010, at 10:38pm, satya swaroop wrote:


hi all,
  the error i got is ""Unexpected RuntimeException from
org.apache.tika.parser.pdf.pdfpar...@8210fc"" when i indexed a file  
similar

to the one in
  https://issues.apache.org/jira/browse/PDFBOX-709/samplerequestform.pdf


1. This URL doesn't work for me.

2. Please include the full stack trace from the RuntimeException.

3. What version of Tika are you using?

Thanks,

-- Ken

--------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Best solution to avoiding multiple query requests

2010-08-04 Thread Ken Krugler

Hi Geert-jan,

On Aug 4, 2010, at 12:04pm, Geert-Jan Brits wrote:

If I understand correctly: you want to sort your collapsed results  
by 'nr of

collapsed results'/ hits.

It seems this can't be done out-of-the-box using this patch (I'm not
entirely sure, at least it doesn't follow from the wiki-page.  
Perhaps best
is to check the jira-issues to make sure this isn't already  
available now,

but just not updated on the wiki)

Also I found a blogpost (from the patch creator afaik) with in the  
comments

someone with the same issue + some pointers.
http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/


Yup, that's the one - 
http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/comment-page-1/#comment-1249

So with some modifications to that patch, it could work...thanks for  
the info!


-- Ken


2010/8/4 Ken Krugler 


Hi Geert-Jan,


On Aug 4, 2010, at 5:30am, Geert-Jan Brits wrote:

Field Collapsing (currently as patch) is exactly what you're  
looking for

imo.

http://wiki.apache.org/solr/FieldCollapsing



Thanks for the ref, good stuff.

I think it's close, but if I understand this correctly, then I  
could get
(using just top two, versus top 10 for simplicity) results that  
looked like


"dog training" (faceted field value A)
"super dog" (faceted field value B)

but if the actual faceted field value/hit counts were:

C (10)
D (8)
A (2)
B (1)

Then what I'd want is the top hit for "dog AND facet field:C",  
followed by

"dog AND facet field:D".

Using field collapsing would improve the probability that if I asked for the top 100 hits, I'd find entries for each of my top N faceted field values.


Thanks again,

-- Ken


I've got a situation where the key result from an initial search  
request
(let's say for "dog") is the list of values from a faceted field,  
sorted

by
hit count.

For the top 10 of these faceted field values, I need to get the  
top hit

for
the target request ("dog") restricted to that value for the faceted
field.

Currently this is 11 total requests, of which the 10 requests  
following

the
initial query can be made in parallel. But that's still a lot of
requests.

So my questions are:

1. Is there any magic query to handle this with Solr as-is?

2. if not, is the best solution to create my own request handler?

3. And in that case, any input/tips on developing this type of  
custom

request handler?

Thanks,

-- Ken





Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g








Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Best solution to avoiding multiple query requests

2010-08-04 Thread Ken Krugler

Hi Geert-Jan,

On Aug 4, 2010, at 5:30am, Geert-Jan Brits wrote:

Field Collapsing (currently as patch) is exactly what you're looking  
for

imo.

http://wiki.apache.org/solr/FieldCollapsing


Thanks for the ref, good stuff.

I think it's close, but if I understand this correctly, then I could  
get (using just top two, versus top 10 for simplicity) results that  
looked like


"dog training" (faceted field value A)
"super dog" (faceted field value B)

but if the actual faceted field value/hit counts were:

C (10)
D (8)
A (2)
B (1)

Then what I'd want is the top hit for "dog AND facet field:C",  
followed by "dog AND facet field:D".


Using field collapsing would improve the probability that if I asked for the top 100 hits, I'd find entries for each of my top N faceted field values.


Thanks again,

-- Ken

I've got a situation where the key result from an initial search  
request
(let's say for "dog") is the list of values from a faceted field,  
sorted by

hit count.

For the top 10 of these faceted field values, I need to get the top  
hit for
the target request ("dog") restricted to that value for the faceted  
field.


Currently this is 11 total requests, of which the 10 requests  
following the
initial query can be made in parallel. But that's still a lot of  
requests.


So my questions are:

1. Is there any magic query to handle this with Solr as-is?

2. if not, is the best solution to create my own request handler?

3. And in that case, any input/tips on developing this type of custom
request handler?

Thanks,

-- Ken



Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Best solution to avoiding multiple query requests

2010-08-03 Thread Ken Krugler

Hi all,

I've got a situation where the key result from an initial search  
request (let's say for "dog") is the list of values from a faceted  
field, sorted by hit count.


For the top 10 of these faceted field values, I need to get the top  
hit for the target request ("dog") restricted to that value for the  
faceted field.


Currently this is 11 total requests, of which the 10 requests  
following the initial query can be made in parallel. But that's still  
a lot of requests.
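
To make the current approach concrete, here's roughly what those 11 requests look like in SolrJ (just a sketch - the "category" facet field and the server URL are made up, and it ignores empty result sets):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class TopHitPerFacetValue {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Request 1: just the top 10 facet values for "dog".
        SolrQuery facetQuery = new SolrQuery("dog");
        facetQuery.setRows(0);
        facetQuery.setFacet(true);
        facetQuery.addFacetField("category");
        facetQuery.setFacetLimit(10);
        QueryResponse rsp = server.query(facetQuery);

        // Requests 2-11: top hit for "dog" restricted to each facet value.
        for (FacetField.Count value : rsp.getFacetField("category").getValues()) {
            SolrQuery perValue = new SolrQuery("dog");
            perValue.setRows(1);
            perValue.addFilterQuery("category:\"" + value.getName() + "\"");
            SolrDocument topHit = server.query(perValue).getResults().get(0);
            // ... do something with topHit ...
        }
    }
}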


So my questions are:

1. Is there any magic query to handle this with Solr as-is?

2. if not, is the best solution to create my own request handler?

3. And in that case, any input/tips on developing this type of custom  
request handler?


Thanks,

-- Ken


------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: SolrCore has a large number of SolrIndexSearchers retained in "infoRegistry"

2010-07-27 Thread Ken Krugler


On Jul 27, 2010, at 12:21pm, Chris Hostetter wrote:


:
: I was wondering if anyone has found any resolution to this email  
thread?


As Grant asked in his reply when this thread was first started  
(December 2009)...



It sounds like you are either using embedded mode or you have some
custom code.  Are you sure you are releasing your resources  
correctly?


...there was no response to his question for clarification.

the problem, given the info we have to work with, definitely seems  
to be
that the custom code utilizing the SolrCore directly is not  
releasing the

resources that it is using in every case.

if you are calling the execute method, that means you have a SolrQueryRequest object -- which means you somehow got an instance of a SolrIndexSearcher (every SolrQueryRequest has one associated with it) and you are somehow not releasing that SolrIndexSearcher (probably because you are not calling close() on your SolrQueryRequest)
you are not calling close() on your SolrQueryRequest)


One thing that bit me previously with using APIs in this area of Solr  
is that if you call CoreContainer.getCore(), this increments the open  
count, so you have to balance each getCore() call with a close() call.
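
In code form, the balanced pattern looks roughly like this (a sketch only - package names are as of Solr 1.4, and the handler path is made up):

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.SolrCore;
import org.apache.solr.request.LocalSolrQueryRequest;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;

public class BalancedCoreAccess {
    public static void query(CoreContainer container, String coreName, SolrParams params) {
        // getCore() bumps the core's reference count, so it needs a matching close().
        SolrCore core = container.getCore(coreName);
        try {
            SolrQueryRequest req = new LocalSolrQueryRequest(core, params);
            try {
                SolrQueryResponse rsp = new SolrQueryResponse();
                core.execute(core.getRequestHandler("/select"), req, rsp);
            } finally {
                req.close();   // releases the SolrIndexSearcher the request holds
            }
        } finally {
            core.close();      // balances getCore()
        }
    }
}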


The naming here could be better - I think it's common to have an  
expectation that calls to get something don't change any state. Maybe  
openCore()?


-- Ken


But it really all depends on how you got ahold of that SolrQueryRequest/SolrIndexSearcher pair in the first place ... every method in SolrCore that gives you access to a SolrIndexSearcher is documented very clearly on how to "release" it when you are done with it, so the ref count can be decremented.


-Hoss


--------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: faceted search with job title

2010-07-22 Thread Ken Krugler

Hi Savannah,

A few comments below, scattered in-line...

-- Ken

On Jul 21, 2010, at 3:08pm, Savannah Beckett wrote:

And I will have to recompile the DOM or SAX code each time I add a job board for crawling.  A regex pattern is only a string, which can be stored in a text file or DB and retrieved based on the job board.  What do you think?
db, and retrieved based on the job board.  What do you think?


You can store the XPath expressions in a text file as strings, and  
load/compile them as needed.
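
For example, something along these lines (a sketch; where the rule strings come from - text file, DB column, etc. - is up to you):

import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

public class XPathRuleLoader {
    // Compile per-site extraction rules that were stored as plain strings.
    public static List<XPathExpression> compile(List<String> ruleStrings) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<XPathExpression> rules = new ArrayList<XPathExpression>();
        for (String rule : ruleStrings) {
            rules.add(xpath.compile(rule));
        }
        return rules;
    }
}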



From: "Nagelberg, Kallin" 
To: "solr-user@lucene.apache.org" 
Sent: Wed, July 21, 2010 10:39:32 AM
Subject: RE: faceted search with job title

Yeah, you should definitely just set up a custom parser for each site... it should be easy to extract the title using Groovy's XML parsing along with TagSoup for sloppy HTML.


Definitely yes re using TagSoup to clean up bad HTML.

And definitely yes to needing per-site "rules" (typically XPath +  
optional regex as needed) to extract specific details.


For a common class of sites powered by the same back-end, you can  
often re-use the same general rules as the markup that you care about  
is consistent.


If you can't find the pattern for each site leading to the job title  
how

can you expect solr to? Humans have the advantage here :P

-Kallin Nagelberg

-Original Message-
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
Sent: Wednesday, July 21, 2010 12:20 PM
To: solr-user@lucene.apache.org
Cc: dave.sea...@magicalia.com
Subject: Re: faceted search with job title

Mmm... there must be a better way... each job board has a different format.  If there are constantly new job boards being crawled, I don't think I can manually look for the specific sequence of tags that leads to the job title.  Most of them don't even have a class or id.  There is no guarantee that the job title will be in the title tag or a header tag.  Something else can be in the title.  Should I do this in a class that extends IndexFilter in Nutch?
class that extends IndexFilter in Nutch?


When I do this kind of thing I use Bixo (http://openbixo.org), but  
that requires knowledge of Cascading (& some Hadoop) in order to  
construct web mining workflows.




From: Dave Searle 
To: "solr-user@lucene.apache.org" 
Sent: Wed, July 21, 2010 8:42:55 AM
Subject: RE: faceted search with job title

You'd probably need to do some post-processing on the pages and set up rules for each website to grab that specific bit of data. You could load the HTML into an XML parser, then use XPath to grab content from a particular tag with a class or id, based on the particular website.



-Original Message-
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
Sent: 21 July 2010 16:38
To: solr-user@lucene.apache.org
Subject: faceted search with job title

Hi,
  I am currently using nutch to crawl some job pages from job  
boards.  They are
in my solr index now.  I want to do faceted search with the job  
titles.  How?
The job titles can be in any locations of the page, e.g. title,  
header,
content...   If I use indexfilter in Nutch to search the content for  
job title,
there are hundred of thousands of job titles, I can't hard code them  
all.  Do
you have a better idea?  I think I need the job title in a separate  
field in the

index to make it work with solr faceted search, am I right?


Yes, you'd want a separate "job title" field in the index. Though  
often the job titles are slight variants on each other, so this would  
probably work much better if you automatically found common phrases  
and used those, otherwise you get "Senior Bottlewasher" and "Sr.  
Bottlewasher" and "Sr Bottlewasher" as separate facet values.



Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Problem building Nightly Solr

2010-07-06 Thread Ken Krugler


On Jul 6, 2010, at 3:44pm, Chris Hostetter wrote:



: Can you try "ant compile example"?
: After Lucene/Solr merge, solr ant build needs to compile before  
example

: target.

the "compile" target is already in the dependency tree for the  
"example"

target, so that won't change anything.

At the moment, the "nightly" snapshots produced by hudson only include the "solr" section of the "dev" tree -- not modules or the lucene-java sections.  The compiled versions of that code are included, so you can *run* Solr from the hudson artifacts, but apparently you can't compile it.
compile it.

(this is particularly odd since the nightlies include all the compiled
lucene code as jars in a "lucene-libs/" directory, but the build  
system
doesn't seem to use that directory ... at least not when compiling  
solrj).


This is all a side effect of trunk still being somewhat in transition -- there are kinks in dealing with the artifacts of the nightly build process that still need to be worked out -- but if your goal is to compile things yourself, then you might as well just check out the entire trunk and compile from that anyway.


Note that you'll need to "ant compile" from the top of the lucene  
directory first, before trying any of the solr-specific builds from  
inside of the /solr sub-dir. Or at least that's what I ran into when  
trying to build a solr dist recently.


-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: document level security: indexing/searching techniques

2010-07-06 Thread Ken Krugler


On Jul 6, 2010, at 8:27am, osocurious2 wrote:



Someone else was recently asking a similar question (or maybe it was  
you but

worded differently :) ).

Putting user level security at a document level seems like a recipe  
for
pain. Solr/Lucene don't do frequent update well...and being highly  
optimized
for query, I don't blame them. Is there any way to create a series  
of roles
that you can apply to your documents? If the security level of the  
document
isn't changing, just the user access to them, give the docs a role  
in the

index, put your user/usergroup stuff in a DB or some other system and
resolve your user into valid roles, then FilterQuery on role.


You're right, baking in too fine-grained a level of security  
information is a bad idea.


As one example that worked pretty well for code search with Krugle, we  
set access control on a per project level using LDAP groups - ie each  
project had some number of groups that were granted access rights.  
Each file in the project would inherit the same list of groups.


Then, when a user logs in they get authenticated via LDAP, and we have  
the set of groups they belong to being returned by the LDAP server.  
This then becomes a fairly well-bounded list of "terms" for an OR  
query against the "acl-groups" field in each file/project document.  
Just don't forget to set the boost to 0 for that portion of the query :)
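
In Lucene terms, the query rewrite looks roughly like this (a sketch - the "acl-groups" field name is from that setup, and the groups are whatever LDAP hands back):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class AclQueryBuilder {
    public static Query restrictToGroups(Query userQuery, Iterable<String> ldapGroups) {
        // OR together the user's groups against the acl-groups field...
        BooleanQuery acl = new BooleanQuery();
        for (String group : ldapGroups) {
            acl.add(new TermQuery(new Term("acl-groups", group)), Occur.SHOULD);
        }
        // ...and zero out its boost so it doesn't skew relevance scoring.
        acl.setBoost(0f);

        BooleanQuery combined = new BooleanQuery();
        combined.add(userQuery, Occur.MUST);
        combined.add(acl, Occur.MUST);
        return combined;
    }
}

With Solr you'd typically hand the group clause to fq as a filter query instead, since filter queries are unscored (and cached) anyway.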


-- Ken

--------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: IOException: read past EOF when opening index built directly w/Lucene

2010-07-01 Thread Ken Krugler


On Jul 1, 2010, at 1:03pm, Ken Krugler wrote:

I've got a version 2.3 index that appears to be valid - I can open  
it with Luke 1.0.1, and CheckIndex reports no problem.


[snip]


and Luke overview says:


This time as text:

Index version: 12984d2211c
Index format: -4 (Lucene 2.3)
Index functionality: lock-less, single norms file, shared doc store
Currently opened commit point: segments_2

-- Ken

----
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






IOException: read past EOF when opening index built directly w/Lucene

2010-07-01 Thread Ken Krugler
I've got a version 2.3 index that appears to be valid - I can open it  
with Luke 1.0.1, and CheckIndex reports no problem.


Just for grins, I crafted a matching schema, and tried to use the  
index with Solr 1.4 (and also Solr-trunk).


In either case, I get this exception during startup:

SEVERE: java.lang.RuntimeException: java.io.IOException: read past EOF
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1067)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:582)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:431)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:286)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:125)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:86)
...
Caused by: java.io.IOException: read past EOF
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:40)
at org.apache.lucene.store.DataInput.readInt(DataInput.java:76)
at org.apache.lucene.index.SegmentInfo.<init>(SegmentInfo.java:171)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:230)
at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:91)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:649)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:87)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:415)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:294)
at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1056)
... 30 more

and at the end of the startup logging, it says:

Jul 1, 2010 12:51:25 PM org.apache.solr.core.SolrCore finalize
SEVERE: REFCOUNT ERROR: unreferenced  
org.apache.solr.core.solrc...@4513e9fd () has a reference count of 1


Is what I'm trying to do something that's destined to fail? I would  
have expected schema/index miss-matches to show up later, not right  
when the index is being opened.


I'd seen various posts about this type of error due to corrupt  
indexes, or having a buggy version of Java 1.6, or an obscure Lucene  
bug (https://issues.apache.org/jira/browse/SOLR-1778 and https://issues.apache.org/jira/browse/LUCENE-2270) 
, but none of those seem to apply to my situation.


Thanks,

-- Ken

PS - index dir looks like:

249K Jun 29 13:47 _0.fdt
 12K Jun 29 13:47 _0.fdx
159B Jun 29 13:47 _0.fnm
3.6M Jun 29 13:47 _0.frq
 23K Jun 29 13:47 _0.nrm
 10M Jun 29 13:47 _0.prx
 51K Jun 29 13:47 _0.tii
2.9M Jun 29 13:47 _0.tis
 20B Jun 29 13:47 segments.gen
 45B Jun 29 13:47 segments_2

and Luke overview says:





--------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: SolrJ/EmbeddedSolrServer

2010-06-27 Thread Ken Krugler

Sort of answering my own question here too...

It seems like I need to get the current core, and use that to  
instantiate a new SolrCore with the same exact config, other than the  
dataDir.


The documentation for SolrCore()'s constructor says:

"If a core with the same name already exists, it will be stopped and  
replaced by this one"


But it's unclear to me whether this will do a graceful swap (like what  
I want) or a hard shutdown of the old core.
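
For the record, here's roughly what I mean (a sketch against the Solr 1.4 APIs; whether the old core gets swapped out gracefully is exactly the open question):

import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.SolrCore;

public class DataDirSwapSketch {
    public static void swapDataDir(CoreContainer container, String coreName, String newDataDir) {
        SolrCore oldCore = container.getCore(coreName);
        try {
            // Same config/schema as the existing core, new dataDir.
            SolrCore newCore = new SolrCore(coreName, newDataDir,
                    oldCore.getSolrConfig(), oldCore.getSchema(),
                    oldCore.getCoreDescriptor());
            // Swap the new core in under the same name.
            container.register(coreName, newCore, false);
        } finally {
            oldCore.close();   // balance the getCore() above
        }
    }
}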


Thanks,

-- Ken

On May 22, 2010, at 11:25am, Ryan McKinley wrote:


accidentally hit send...

Each core can have the dataDir set explicitly.


  


  


If you want to do this with solrj, you would need to manipulate the
CoreDescriptor objects.


I'm hoping somebody can clarify what's up with the CoreDescriptor  
class, since there's not much documentation.


As far as I can tell, when you create a new SolrCore, it saves off the  
CoreDescriptor you pass in, and does nothing with it.


The constructor for SolrCore also takes a datadir param, so I don't  
see how the CoreDescriptor's dataDir gets used during construction.


And changing the CoreDescriptor's dataDir has no effect, since it's  
essentially a POJO.


So how would one go about changing the dataDir for a core, in a multi- 
core setup?


Thanks,

-- Ken

On Sat, May 22, 2010 at 2:24 PM, Ryan McKinley   
wrote:

Check:
http://wiki.apache.org/solr/CoreAdmin

Unless I'm missing something, I think you should be able to sort  
what you need



On Fri, May 21, 2010 at 7:55 PM, Ken Krugler
 wrote:
I've got a situation where my data directory (a) needs to live  
elsewhere
besides inside of Solr home, (b) moves to a different location  
when updating
indexes, and (c) setting up a symlink from /data isn't  
a great

option.

So what's the best approach to making this work with SolrJ? The  
low-level

solution seems to be

- create my own SolrCore instance, where I specify the data  
directory

- use that to update the CoreContainer
- create a new EmbeddedSolrServer

But recreating the EmbeddedSolrServer with each index update feels  
wrong,
and I'd like to avoid mucking around with low-level SolrCore  
instantiation.


Any other approaches?

Thanks,

-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g










Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: SolrJ/EmbeddedSolrServer

2010-06-27 Thread Ken Krugler

Answering my own question...

I can use CoreContainer.reload("core name") to force a reload.

I assume that if I've got an EmbeddedSolrServer running at the time I  
do this reload, everything will happen correctly under the covers.
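
So the per-update code would be something like this (a sketch; the core name is made up):

import org.apache.solr.core.CoreContainer;

public class ReloadSketch {
    public static void refreshIndex(CoreContainer container) throws Exception {
        // Reload the core in place; the assumption is that an EmbeddedSolrServer
        // created against this container picks up the reloaded core on its
        // next request.
        container.reload("core1");
    }
}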


So now I just need to find out how to programmatically change settings  
for a core.


-- Ken

On May 22, 2010, at 11:24am, Ryan McKinley wrote:


Check:
http://wiki.apache.org/solr/CoreAdmin

Unless I'm missing something, I think you should be able to sort  
what you need


If I'm using SolrJ, is there a programmatic way to force a reload of a  
core?


This, of course, assumes that I'm able to programmatically change the  
location of the dataDir, which is another issue.


Thanks,

-- Ken


On Fri, May 21, 2010 at 7:55 PM, Ken Krugler
 wrote:
I've got a situation where my data directory (a) needs to live  
elsewhere
besides inside of Solr home, (b) moves to a different location when  
updating
indexes, and (c) setting up a symlink from /data isn't a  
great

option.

So what's the best approach to making this work with SolrJ? The low- 
level

solution seems to be

- create my own SolrCore instance, where I specify the data directory
- use that to update the CoreContainer
- create a new EmbeddedSolrServer

But recreating the EmbeddedSolrServer with each index update feels  
wrong,
and I'd like to avoid mucking around with low-level SolrCore  
instantiation.


Any other approaches?

Thanks,

-- Ken

----
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







----
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: SolrJ/EmbeddedSolrServer

2010-06-27 Thread Ken Krugler


On May 22, 2010, at 11:24am, Ryan McKinley wrote:


Check:
http://wiki.apache.org/solr/CoreAdmin

Unless I'm missing something, I think you should be able to sort  
what you need


If I'm using SolrJ, is there a programmatic way to force a reload of a  
core?


This, of course, assumes that I'm able to programmatically change the  
location of the dataDir, which is another issue.


Thanks,

-- Ken


On Fri, May 21, 2010 at 7:55 PM, Ken Krugler
 wrote:
I've got a situation where my data directory (a) needs to live  
elsewhere
besides inside of Solr home, (b) moves to a different location when  
updating
indexes, and (c) setting up a symlink from /data isn't a  
great

option.

So what's the best approach to making this work with SolrJ? The low- 
level

solution seems to be

- create my own SolrCore instance, where I specify the data directory
- use that to update the CoreContainer
- create a new EmbeddedSolrServer

But recreating the EmbeddedSolrServer with each index update feels  
wrong,
and I'd like to avoid mucking around with low-level SolrCore  
instantiation.


Any other approaches?

Thanks,

-- Ken

------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: SolrJ/EmbeddedSolrServer

2010-06-27 Thread Ken Krugler


On May 22, 2010, at 11:25am, Ryan McKinley wrote:


accidentally hit send...

Each core can have the dataDir set explicitly.

 
   
 
 
   
 

If you want to do this with solrj, you would need to manipulate the
CoreDescriptor objects.


I'm hoping somebody can clarify what's up with the CoreDescriptor  
class, since there's not much documentation.


As far as I can tell, when you create a new SolrCore, it saves off the  
CoreDescriptor you pass in, and does nothing with it.


The constructor for SolrCore also takes a datadir param, so I don't  
see how the CoreDescriptor's dataDir gets used during construction.


And changing the CoreDescriptor's dataDir has no effect, since it's  
essentially a POJO.


So how would one go about changing the dataDir for a core, in a multi- 
core setup?


Thanks,

-- Ken

On Sat, May 22, 2010 at 2:24 PM, Ryan McKinley   
wrote:

Check:
http://wiki.apache.org/solr/CoreAdmin

Unless I'm missing something, I think you should be able to sort  
what you need



On Fri, May 21, 2010 at 7:55 PM, Ken Krugler
 wrote:
I've got a situation where my data directory (a) needs to live  
elsewhere
besides inside of Solr home, (b) moves to a different location  
when updating
indexes, and (c) setting up a symlink from /data isn't  
a great

option.

So what's the best approach to making this work with SolrJ? The  
low-level

solution seems to be

- create my own SolrCore instance, where I specify the data  
directory

- use that to update the CoreContainer
- create a new EmbeddedSolrServer

But recreating the EmbeddedSolrServer with each index update feels  
wrong,
and I'd like to avoid mucking around with low-level SolrCore  
instantiation.


Any other approaches?

Thanks,

-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g










Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: [ANN] Solr 1.4.1 Released

2010-06-26 Thread Ken Krugler


On Jun 26, 2010, at 5:18pm, Jason Chaffee wrote:


It appears the 1.4.1 version was deployed with a new maven groupId.

For example, if you are trying to download solr-core, here are the differences between 1.4.0 and 1.4.1.


1.4.0
groupId: org.apache.solr
artifactId: solr-core

1.4.1
groupId: org.apache.solr.solr
artifactId:solr-core

Was this change intentional or a mistake?  If it was a mistake, can  
someone please fix it in maven's central repository?


I believe it was a mistake. From a recent email thread on this list,  
Mark Miller said:



Can a solr/maven dude look at this? I simply used the copy command on
the release to-do wiki (sounds like it should be updated).

If no one steps up, I'll try and straighten it out later.

On 6/25/10 10:28 AM, Stevo Slavić wrote:

Congrats on the release!

Something seems to be wrong with the Solr 1.4.1 maven artifacts; there is an extra "solr" in the path. E.g. solr-parent-1.4.1.pom is at
http://repo1.maven.org/maven2/org/apache/solr/solr/solr-parent/1.4.1/solr-parent-1.4.1.pom
while it should be at
http://repo1.maven.org/maven2/org/apache/solr/solr-parent/1.4.1/solr-parent-1.4.1.pom
.

Pom's seem to contain correct maven artifact coordinates.

Regards,
Stevo.


-- Ken

------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Some minor Solritas layout tweaks

2010-06-23 Thread Ken Krugler
I grabbed the latest & greatest from trunk, and then had to make a few  
minor layout tweaks.


1. In main.css, the ".query-box input" { height} isn't tall enough (at  
least on my Mac 10.5/FF 3.6 config), so character descenders get  
clipped.


I bumped it from 40px to 50px, and that fixed the issue for me.

2. The constraint text (for removing facet constraints) overlaps with  
the Solr logo.


It looks like the div that contains this anchor text is missing a  
class="constraints", as I see a .constraints in the CSS.


I added this class name, and also (to main.css):

.constraints {
  margin-top: 10px;
}

But IANAWD, so this is probably not the best way to fix the issue.

3. And then I see a .constraints-title in the CSS, but it's not used.

Was the intent of this to set the '>' character to gray?

4. It seems silly to open JIRA issues for these types of things, but I  
also don't want to add to noise on the list.


Which approach is preferred?

Thanks,

-- Ken




Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Minor bug with Solritas and price data

2010-06-21 Thread Ken Krugler

Hi Hoss,

You're the man.

I'd copied/pasted the 1.3 schema fields into my testbed schema, which  
was based on the version of Solr we were using back in the Dark Ages,  
when the version was 1.0 (and there was no such handy comment warning  
about changing the version :)) So fields were multivalued by default.  
I'd checked the docs for 1.3, but didn't realize that this behavior  
had changed from 1.0 days.


Mystery solved.

If I'd used http://localhost:8983/solr/admin/schema.jsp to examine the  
field type, I would have seen this.


Thanks again.

-- Ken

On Jun 21, 2010, at 11:36am, Chris Hostetter wrote:



: Here's what's in my schema:
:
: 
:
: Which is exactly what was in the original example schema.

but what does the "version" property of your schema say (at the top)?  This is what's in the example...
is what's in the example...


 


-Hoss



----
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Minor bug with Solritas and price data

2010-06-21 Thread Ken Krugler
26


2.29


KAS Rugs Indira Black Circles Rug

...



Any other ideas what might be going on?

Thanks,

-- Ken



On Jun 19, 2010, at 9:12 PM, Ken Krugler wrote:

I noticed that my prices weren't showing up, even though I've got a  
price field.


I think the issue is with this line from hit.vm:

#field('name') $!number.currency($doc.getFieldValue('price'))


The number.currency() function needs to get passed something that  
looks like a number, but $doc.getFieldValue() will return "[2.96]",  
because it could be a list of values.


The square brackets confuse number.currency, so you get no price.

I think this line needs to be:

#field('name') $!number.currency($doc.getFirstValue('price'))


...since getFirstValue() returns a single value without brackets.

-- Ken



------------
<http://ken-blog.krugler.org>
+1 530-265-2225





Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Minor bug with Solritas and price data

2010-06-19 Thread Ken Krugler
I noticed that my prices weren't showing up, even though I've got a  
price field.


I think the issue is with this line from hit.vm:

  #field('name') $!number.currency($doc.getFieldValue('price'))


The number.currency() function needs to get passed something that  
looks like a number, but $doc.getFieldValue() will return "[2.96]",  
because it could be a list of values.


The square brackets confuse number.currency, so you get no price.

I think this line needs to be:

  #field('name') $!number.currency($doc.getFirstValue('price'))


...since getFirstValue() returns a single value without brackets.

-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Minor bug in Solritas with post-facet search

2010-06-19 Thread Ken Krugler
I ran into one minor problem, where if I clicked a facet, and then  
tried a search, I'd get a 404 error.


I think the problem is with the fqs Velocity macro in  
VM_global_library.vm, where it's missing the #else to insert a '?'  
into the URL:


#macro(fqs $p)#foreach($fq in $p)#if($velocityCount>1)&#{else}?#{end}fq=$esc.url($fq)#end#end


Without this, the URL becomes /solr/browsefq=, instead of /solr/browse?fq=


But I'm completely new to the world of Velocity templating, so I've  
got low confidence that this is the right way to fix it.


-- Ken

--------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Autocompletion with Solritas

2010-06-19 Thread Ken Krugler

Hi Erik,

On Jun 18, 2010, at 6:58pm, Erik Hatcher wrote:

Have a look at suggest.vm - the "name" field is used in there too.   
Just those two places, layout.vm and suggest.vm.


That was the missing change I needed.

Thanks much!

-- Ken



  And I had already added a ## TODO in my local suggest.vm:

## TODO: make this more generic, maybe look at the request  
terms.fl?  or just take the first terms field in the response?


And also, ideally, there'd be a /suggest handler mapped with the  
field name specified there.  I simply used what was already  
available to put suggest in there easily.


Erik

On Jun 18, 2010, at 7:54 PM, Ken Krugler wrote:


Hi Erik,

On Jun 17, 2010, at 8:34pm, Erik Hatcher wrote:

Your wish is my command.  Check out trunk, fire up Solr (ant run- 
example), index example data, hit http://localhost:8983/solr/ 
browse - type in search box.


Just used jQuery's autocomplete plugin and the terms component for  
now, on the name field.  Quite simple to plug in, actually.  Check  
the commit diff.  The main magic is doing this:


<http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=i&terms.sort=count&wt=velocity&v.template=suggest>


Stupidly, though, jQuery's autocomplete seems to be hardcoded to  
send a q parameter, but I coded it to also send the same value as  
terms.prefix - but this could be an issue if hitting a different  
request handler where q is used for the actual query for filtering  
terms on.


Let's say, just for grins, that a different field (besides "name")  
is being used for autocompletion.


What would be all the places I'd need to hit to change the field,  
besides the terms.fl value in layout.vm? For example, what about  
browse.vm:


  $("input[type=text]").autoSuggest("/solr/suggest",  
{selectedItemProp: "name", searchObjProps: "name"}});


I'm asking because I'm trying to use this latest support with an  
index that uses "product_name" for the auto-complete field, and I'm  
not getting any auto-completes happening.


I see from the Solr logs that requests being made to /solr/terms  
during auto-complete that look like:


INFO: [] webapp=/solr path=/terms params={limit=10&timestamp=1276903135595&terms.fl=product_name&q=rug&wt=velocity&terms.sort=count&v.template=suggest&terms.prefix=rug} status=0 QTime=0


Which I'd expect to work, but don't seem to be generating any  
results.


What's odd is that if I try curling the same thing:

curl -v "http://localhost:8983/solr/terms?limit=10×tamp=1276903135595&terms.fl=product_name&q=rug&wt=velocity&terms.sort=count&v.template=suggest&terms.prefix=rug 
"


I get an empty HTML response:

< Content-Type: text/html; charset=utf-8
< Content-Length: 0
< Server: Jetty(6.1.22)

If I just use what I'd consider to be the minimum set of parameters:

curl -v "http://localhost:8983/solr/terms?limit=10&terms.fl=product_name&q=rug&terms.sort=count&terms.prefix=rug 
"


Then I get the expected XML response:

< Content-Type: text/xml; charset=utf-8
< Content-Length: 225
< Server: Jetty(6.1.22)
<


0name="QTime">0name="product_name">7



Any ideas what I'm doing wrong?

Thanks,

-- Ken



On Jun 17, 2010, at 8:03 PM, Ken Krugler wrote:


I don't believe Solritas supports autocompletion out of the box.

So I'm wondering if anybody has experience using the LucidWorks  
distro & Solritas, plus the AJAX Solr auto-complete widget.


I realize that AJAX Solr's autocomplete support is mostly just  
leveraging the jQuery Autocomplete plugin, and hooking it up to  
Solr facets, but I was curious if there were any tricks or traps  
in getting it all to work.


Thanks,

-- Ken




Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Autocompletion with Solritas

2010-06-18 Thread Ken Krugler

Hi Erik,

On Jun 17, 2010, at 8:34pm, Erik Hatcher wrote:

Your wish is my command.  Check out trunk, fire up Solr (ant run- 
example), index example data, hit http://localhost:8983/solr/browse  
- type in search box.


Just used jQuery's autocomplete plugin and the terms component for  
now, on the name field.  Quite simple to plug in, actually.  Check  
the commit diff.  The main magic is doing this:


<http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=i&terms.sort=count&wt=velocity&v.template=suggest>


Stupidly, though, jQuery's autocomplete seems to be hardcoded to  
send a q parameter, but I coded it to also send the same value as  
terms.prefix - but this could be an issue if hitting a different  
request handler where q is used for the actual query for filtering  
terms on.


Let's say, just for grins, that a different field (besides "name") is  
being used for autocompletion.


What would be all the places I'd need to hit to change the field,  
besides the terms.fl value in layout.vm? For example, what about  
browse.vm:


$("input[type=text]").autoSuggest("/solr/suggest",  
{selectedItemProp: "name", searchObjProps: "name"}});


I'm asking because I'm trying to use this latest support with an index  
that uses "product_name" for the auto-complete field, and I'm not  
getting any auto-completes happening.


I see from the Solr logs that requests being made to /solr/terms  
during auto-complete that look like:


INFO: [] webapp=/solr path=/terms params={limit=10&timestamp=1276903135595&terms.fl=product_name&q=rug&wt=velocity&terms.sort=count&v.template=suggest&terms.prefix=rug} status=0 QTime=0


Which I'd expect to work, but don't seem to be generating any results.

What's odd is that if I try curling the same thing:

curl -v "http://localhost:8983/solr/terms?limit=10×tamp=1276903135595&terms.fl=product_name&q=rug&wt=velocity&terms.sort=count&v.template=suggest&terms.prefix=rug 
"


I get an empty HTML response:

< Content-Type: text/html; charset=utf-8
< Content-Length: 0
< Server: Jetty(6.1.22)

If I just use what I'd consider to be the minimum set of parameters:

curl -v "http://localhost:8983/solr/terms?limit=10&terms.fl=product_name&q=rug&terms.sort=count&terms.prefix=rug 
"


Then I get the expected XML response:

< Content-Type: text/xml; charset=utf-8
< Content-Length: 225
< Server: Jetty(6.1.22)
<


0name="QTime">0name="product_name">7



Any ideas what I'm doing wrong?

Thanks,

-- Ken



On Jun 17, 2010, at 8:03 PM, Ken Krugler wrote:


I don't believe Solritas supports autocompletion out of the box.

So I'm wondering if anybody has experience using the LucidWorks  
distro & Solritas, plus the AJAX Solr auto-complete widget.


I realize that AJAX Solr's autocomplete support is mostly just  
leveraging the jQuery Autocomplete plugin, and hooking it up to  
Solr facets, but I was curious if there were any tricks or traps in  
getting it all to work.


Thanks,

-- Ken




Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Autocompletion with Solritas

2010-06-18 Thread Ken Krugler


On Jun 17, 2010, at 8:34pm, Erik Hatcher wrote:

Your wish is my command.  Check out trunk, fire up Solr (ant run- 
example), index example data, hit http://localhost:8983/solr/browse  
- type in search box.


That works - excellent!

Now I'm trying to build a distribution from trunk that I can use for  
prototyping, and noticed a few things...


1. From a fresh check-out, you can't build from the trunk/solr sub-dir  
due to dependencies on Lucene classes. Once you've done a top-level  
"ant compile" then you can cd into /solr and do ant builds.


2. I noticed the run-example target in trunk/solr/build.xml doesn't  
have a description, so it doesn't show up with ant -p.


3. I tried "ant create-package" from trunk/solr, and got this error  
near the end:


/Users/kenkrugler/svn/lucene/lucene-trunk/solr/common-build.xml:252: /Users/kenkrugler/svn/lucene/lucene-trunk/solr/contrib/velocity/src not found.


I don't see contrib/velocity anywhere in the Lucene trunk tree.

What's the recommended way to build a Solr distribution from trunk?

In the meantime I'll just use example/start.jar with solr.solr.home  
and solr.data.dir system properties.


Thanks,

-- Ken



Just used jQuery's autocomplete plugin and the terms component for  
now, on the name field.  Quite simple to plug in, actually.  Check  
the commit diff.  The main magic is doing this:


<http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=i&terms.sort=count&wt=velocity&v.template=suggest>


Stupidly, though, jQuery's autocomplete seems to be hardcoded to  
send a q parameter, but I coded it to also send the same value as  
terms.prefix - but this could be an issue if hitting a different  
request handler where q is used for the actual query for filtering  
terms on.


Cool?!   I think so!  :)

Erik


On Jun 17, 2010, at 8:03 PM, Ken Krugler wrote:


I don't believe Solritas supports autocompletion out of the box.

So I'm wondering if anybody has experience using the LucidWorks  
distro & Solritas, plus the AJAX Solr auto-complete widget.


I realize that AJAX Solr's autocomplete support is mostly just  
leveraging the jQuery Autocomplete plugin, and hooking it up to  
Solr facets, but I was curious if there were any tricks or traps in  
getting it all to work.


Thanks,

-- Ken


--------
<http://ken-blog.krugler.org>
+1 530-265-2225





Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Autocompletion with Solritas

2010-06-17 Thread Ken Krugler

You, sir, are on my Christmas card list.

I'll fire it up tomorrow morning & let you know how it goes.

-- Ken


On Jun 17, 2010, at 8:34pm, Erik Hatcher wrote:

Your wish is my command.  Check out trunk, fire up Solr (ant run- 
example), index example data, hit http://localhost:8983/solr/browse  
- type in search box.


Just used jQuery's autocomplete plugin and the terms component for  
now, on the name field.  Quite simple to plug in, actually.  Check  
the commit diff.  The main magic is doing this:


<http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=i&terms.sort=count&wt=velocity&v.template=suggest>


Stupidly, though, jQuery's autocomplete seems to be hardcoded to  
send a q parameter, but I coded it to also send the same value as  
terms.prefix - but this could be an issue if hitting a different  
request handler where q is used for the actual query for filtering  
terms on.


Cool?!   I think so!  :)

        Erik


On Jun 17, 2010, at 8:03 PM, Ken Krugler wrote:


I don't believe Solritas supports autocompletion out of the box.

So I'm wondering if anybody has experience using the LucidWorks  
distro & Solritas, plus the AJAX Solr auto-complete widget.


I realize that AJAX Solr's autocomplete support is mostly just  
leveraging the jQuery Autocomplete plugin, and hooking it up to  
Solr facets, but I was curious if there were any tricks or traps in  
getting it all to work.


Thanks,

-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g









Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Autocompletion with Solritas

2010-06-17 Thread Ken Krugler

I don't believe Solritas supports autocompletion out of the box.

So I'm wondering if anybody has experience using the LucidWorks distro  
& Solritas, plus the AJAX Solr auto-complete widget.


I realize that AJAX Solr's autocomplete support is mostly just  
leveraging the jQuery Autocomplete plugin, and hooking it up to Solr  
facets, but I was curious if there were any tricks or traps in getting  
it all to work.


Thanks,

-- Ken

------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Need help on Solr Cell usage with specific Tika parser

2010-06-14 Thread Ken Krugler

Hi Olivier,

Are you setting the mime type explicitly via the stream.type parameter?

-- Ken

On Jun 14, 2010, at 9:14am, olivier sallou wrote:


Hi,
I use Solr Cell to send specific content files. I developped a  
dedicated

Parser for specific mime types.
However I cannot get Solr accepting my new mime types.

In solrconfig, in update/extract requesthandler I specified name="tika.config">./tika-config.xml , where tika-config.xml  
is in

conf directory (same as solrconfig).

In tika-config I added my mimetypes:


   biosequence/document
   biosequence/embl
   biosequence/genbank
   

I do not know for:
 

whereas path to tika mimetypes should be absolute or relative... and  
even if

this file needs to be redefined if "magic" is not used.


When I run my update/extract, I have an error that "biosequence/ 
document"

does not match any known parser.

Thanks

Olivier


--------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Tika language extraction

2010-06-10 Thread Ken Krugler

Hi Sandhya,

It is observed that TIKA does not extract the "Content-Language" for  
documents encoded in UTF-8. For natively encoded documents, it works  
fine. Any idea on how we can resolve this ?


I would post this question to the u...@tika.apache.org mailing list,  
and include more details on what type of document.


The Tika language detection is fairly weak, and when the encoding is  
universal (language independent) such as UTF-8, the resulting  
confidence level is often low enough that Tika doesn't assume it has a  
good match, and thus doesn't report a language.
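
If you want to see what Tika itself thinks, you can run the detector directly (a sketch, assuming a recent Tika; the sample text is made up):

import org.apache.tika.language.LanguageIdentifier;

public class LangDetectSketch {
    public static void main(String[] args) {
        String text = "Ceci est un exemple de texte en francais.";
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        // getLanguage() always returns its best guess; isReasonablyCertain()
        // reflects how confident the detector actually is about that guess.
        System.out.println(identifier.getLanguage() + " certain="
                + identifier.isReasonablyCertain());
    }
}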


-- Ken

--------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Indexing HTML

2010-06-09 Thread Ken Krugler


On Jun 9, 2010, at 8:38pm, Blargy wrote:



What is the preferred way to index html using DIH (my html is stored  
in a

blob field in our database)?

I know there is the built in HTMLStripTransformer but that doesn't  
seem to
work well with malformed/incomplete HTML. I've created a custom  
transformer

to first tidy up the html using JTidy then I pass it to the
HTMLStripTransformer like so:




However this method isn't fool-proof as you can see by my ignoreErrors
option.

I quickly took a peek at Tika and I noticed that it has its own  
HtmlParser.
Is this something I should look into? Are there any alternatives  
that deal

with malformed/incomplete  html? Thanks


Actually the Tika HtmlParser just wraps TagSoup - that's a good option  
for cleaning up busted HTML.


-- Ken


<http://ken-blog.krugler.org>
+1 530-265-2225




------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Build query programmatically with lucene, but issue to solr?

2010-05-28 Thread Ken Krugler


On May 28, 2010, at 9:23am, Phillip Rhodes wrote:


Hi.
I am building up a query with quite a bit of logic such as  
parentheses, plus
signs, etc... and it's a little tedious dealing with it all at a  
string
level.  I was wondering if anyone has any thoughts on constructing  
the query
in lucene and using the string representation of the query to send  
to solr.


Depending on complexity, SolrJ could be a solution.

See the section that talks about "SolrJ provides a APIs to create  
queries instead of hand coding the query..." on http://wiki.apache.org/solr/Solrj
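
E.g. rather than assembling parens and plus signs by hand, something like this (a sketch; the field names are made up):

import org.apache.solr.client.solrj.SolrQuery;

public class QueryBuildingSketch {
    public static void main(String[] args) {
        SolrQuery query = new SolrQuery();
        query.setQuery("dog training");
        query.addFilterQuery("category:pets");
        query.setFields("id", "title", "price");
        query.addSortField("price", SolrQuery.ORDER.asc);
        query.setRows(20);
        System.out.println(query);   // prints the encoded parameter string
    }
}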


-- Ken

------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






SolrJ/EmbeddedSolrServer

2010-05-21 Thread Ken Krugler
I've got a situation where my data directory (a) needs to live  
elsewhere besides inside of Solr home, (b) moves to a different  
location when updating indexes, and (c) setting up a symlink from  
/data isn't a great option.


So what's the best approach to making this work with SolrJ? The low- 
level solution seems to be


- create my own SolrCore instance, where I specify the data directory
- use that to update the CoreContainer
- create a new EmbeddedSolrServer

But recreating the EmbeddedSolrServer with each index update feels  
wrong, and I'd like to avoid mucking around with low-level SolrCore  
instantiation.


Any other approaches?

Thanks,

-- Ken

--------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






"Special Circumstances" for embedded Solr

2010-05-20 Thread Ken Krugler

Hi all,

We'd started using embedded Solr back in 2007, via a patched version  
of the in-progress 1.3 code base.


I recently was reading http://wiki.apache.org/solr/EmbeddedSolr, and  
wondered about the paragraph that said:
The simplest, safest, way to use Solr is via Solr's standard HTTP  
interfaces. Embedding Solr is less flexible, harder to support, not  
as well tested, and should be reserved for special circumstances.


Given the current state of SolrJ, and the expected roadmap for Solr in  
general, what would be some guidelines for "special circumstances"  
that warrant the use of SolrJ?


I know what ours were back in 2007 - namely:

- we had multiple indexes, but didn't want to run multiple webapps  
(now handled by multi-core)
- we needed efficient generation of updated indexes, without  
generating lots of HTTP traffic (now handled by DIH, maybe with  
specific extensions?)
- we wanted tighter coupling of the front-end API with the back-end  
Solr search system, since this was an integrated system in the hands  
of customers - no "just restart the webapp container" option if  
anything got wedged (might still be an issue?)


Any other commonly compelling reasons to use SolrJ?

Thanks,

-- Ken


------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Personalized Search

2010-05-20 Thread Ken Krugler


On May 19, 2010, at 11:43pm, Rih wrote:

Has anybody done personalized search with Solr? I'm thinking of  
including
fields such as "bought" or "like" per member/visitor via dynamic  
fields to a
product search schema. Another option is to have a multi-value field  
that
can contain user IDs. What are the possible performance issues with  
this

setup?


Mitch is right, what you're looking for here is a recommendation  
engine, if I understand your question properly.


And yes, Mahout should work though the Taste recommendation engine it  
supports is pretty new. But Sean Owen & Robin Anil have a "Mahout in  
Action" book that's in early release via Manning, and it has lots of  
good information about Mahout & recommender systems.


Assuming you have a list of recommendations for a given user, based on  
their past behavior and the recommendation engine, then you could use  
this to adjust search results. I'm waiting for Hoss to jump in here on  
how best to handle that :)
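
One common trick is to fold the recommendations in as a boost query (a sketch only - dismax params, and the IDs would come from the recommender):

import org.apache.solr.client.solrj.SolrQuery;

public class RecommendationBoostSketch {
    public static void main(String[] args) {
        SolrQuery query = new SolrQuery("dog food");
        query.set("defType", "dismax");
        query.set("qf", "name description");
        // Boost products the recommender scored highly for this user, so they
        // float up without excluding everything else.
        query.set("bq", "id:(1234^3 5678^2 9012)");
    }
}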


-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: How to query for similar documents before indexing

2010-05-10 Thread Ken Krugler

Hi all (especially Yonik),

At the http://wiki.apache.org/solr/Deduplication page, it mentions  
"duplicate field collapsing" and later "Allow for both duplicate  
collapsing in search results..."


But I don't see any mention of how deduplication happens during search  
time. Normally this requires that the field be stored (not just  
indexed), and for efficiency it might need to be in a FieldCache. I'm  
wondering about both status of this support, and thoughts on potential  
impact to index/memory size.


Thanks,

-- Ken


On May 10, 2010, at 3:07pm, Markus Jelsma wrote:


Hi Matthieu,

On the top of the wiki page you can see it's in 1.4 already. As far  
as i know the API doesn't return information on found duplicates in  
its response header, the wiki isn't clear on that subject. I, at  
least, never saw any other response than an error or the usual  
status code and QTime.


Perhaps it would be a nice feature. On the other hand, you can also  
have a manual process that finds duplicates based on that signature  
and gather that information yourself as long as such a feature isn't  
there.


Cheers,

-Original message-
From: Matthieu Labour 
Sent: Mon 10-05-2010 23:30
To: solr-user@lucene.apache.org;
Subject: RE: How to query for similar documents before indexing

Markus
Thank you for your response
That would be great if the index has the option to prevent duplicates from entering the index. But is it going to be a silent action? Or will the add method return that it failed indexing because it detected a duplicate?

Is it committed to 1.4 already?
Cheers
matt


--- On Mon, 5/10/10, Markus Jelsma  wrote:

From: Markus Jelsma 
Subject: RE: How to query for similar documents before indexing
To: solr-user@lucene.apache.org
Date: Monday, May 10, 2010, 4:11 PM

Hi,

Deduplication [1] is what you're looking for. It can utilize different analyzers that will add one or more signatures or hashes to your document, depending on exact or partial matches for configurable fields. Based on that, it should be able to prevent new documents from entering the index.


The first part works very well, but I have some issues with removing those documents, which I also need to check with the community tomorrow back at work ;-)



[1]: http://wiki.apache.org/solr/Deduplication

Cheers,



-Original message-
From: Matthieu Labour 
Sent: Mon 10-05-2010 22:41
To: solr-user@lucene.apache.org;
Subject: How to query for similar documents before indexing

Hi

I want to implement the following logic:

Before I index a new document into the index, I want to check if  
there are already documents in the index with similar content to the  
content of the document about to be inserted. If the request returns  
1 or more documents, then I don't want to insert the document.


What is the best way to achieve the above functionality ?

I read about Fuzzy searches in logic. But can I really build a  
request such as
mydoc.title:wordexample~ AND mydoc.content:( all the content  
words)~0.9 ?


Thank you for your help




Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: MoreLikeThis: How to get quality terms from html from content stream?

2009-08-08 Thread Ken Krugler


On Aug 7, 2009, at 5:23pm, Jay Hill wrote:

I'm using the MoreLikeThisHandler with a content stream to get  
documents

from my index that match content from an html page like this:
http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true


But, not surprisingly, the query generated is meaningless because a  
lot of

the markup is picked out as terms:

body:li body:href  body:div body:class body:a body:script body:type  
body:js

body:ul body:text body:javascript body:style body:css body:h body:img
body:var body:articl body:ad body:http body:span body:prop


Does anyone know a way to transform the html so that the content can  
be
parsed out of the content stream and processed w/o the markup? Or do  
I need

to write my own HTMLParsingMoreLikeThisHandler?


You'd want to parse the HTML to extract only text first, and use that  
for your index data.


Both the Nutch and Tika OSS projects have examples of using HTML  
parsers (based on TagSoup or CyberNeko) to generate content suitable  
for indexing.


-- Ken

If I parse the content out to a plain text file and point the  
stream.url

param to file:///parsedfile.txt it works great.

-Jay


----------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378


