Re: Faceting on a date field multiple times

2012-05-04 Thread Ian Holsman
Thanks Marc.
On May 4, 2012, at 8:52 PM, Marc Sturlese wrote:

 http://lucene.472066.n3.nabble.com/Multiple-Facet-Dates-td495480.html
 



Faceting on a date field multiple times

2012-05-03 Thread Ian Holsman
Hi.

I would like to be able to do a facet on a date field, but with different 
ranges (in a single query).

for example. I would like to show

#documents by day for the last week - 
#documents by week for the last couple of months
#documents by year for the last several years.

is there a way to do this without hitting solr 3 times?


thanks
Ian
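(for the archives: one way to do this in a single request is a facet.query 
per bucket -- a sketch, with publish_date standing in for the real field name:

facet=true
&facet.query={!key=last_day}publish_date:[NOW/DAY-1DAY TO NOW/DAY]
&facet.query={!key=last_week}publish_date:[NOW/DAY-7DAYS TO NOW/DAY]
&facet.query={!key=last_year}publish_date:[NOW/YEAR-1YEAR TO NOW/YEAR]

repeat the pattern for each day/week/year bucket you want; every facet.query 
comes back with its own count under facet_counts/facet_queries, so all the 
ranges are answered in one round trip.)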

how does Solr/Lucene index multi-value fields

2011-05-31 Thread Ian Holsman
Hi.

I want to store a list of documents (say each being 30-60k of text) into a 
single SolrDocument. (to speed up post-retrieval querying)

In order to do this, I need to know if lucene calculates the TF/IDF score over 
the entire field or does it treat each value in the list as a unique field? 

If I can't store it as a multi-value, I could create a schema where I put each 
document into a unique field, but I'm not sure how to create the query to 
search all the fields.


Regards
Ian
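(a sketch of the query-across-fields option, with doc_1..doc_3 as placeholder 
field names: either an explicit boolean query,

q=doc_1:(solr) OR doc_2:(solr) OR doc_3:(solr)

or the dismax parser's qf parameter,

q=solr&defType=dismax&qf=doc_1 doc_2 doc_3

which searches the listed fields and scores on the best-matching one.)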



Re: how does Solr/Lucene index multi-value fields

2011-05-31 Thread Ian Holsman

On May 31, 2011, at 12:11 PM, Erick Erickson wrote:

 Can you explain the use-case a bit more here? Especially the post-query
 processing and how you expect the multiple documents to help here.
 

we have a collection of related stories. when a user searches for something, we 
might not want to display the story that is most-relevant (according to SOLR), 
but according to other home-grown rules.  by combining all the possibilities in 
one SolrDocument, we can avoid a DB-hit to get related stories.


 But TF/IDF is calculated over all the values in the field. There's really no
 difference between a multi-valued field and storing all the data in a
 single field
 as far as relevance calculations are concerned.
 

so.. it will suck regardless.. I thought we had per-field relevance in the 
current trunk. :-(


 Best
 Erick
 
 On Tue, May 31, 2011 at 11:02 AM, Ian Holsman had...@holsman.net wrote:
 Hi.
 
 I want to store a list of documents (say each being 30-60k of text) into a 
 single SolrDocument. (to speed up post-retrieval querying)
 
 In order to do this, I need to know if lucene calculates the TF/IDF score 
 over the entire field or does it treat each value in the list as a unique 
 field?
 
 If I can't store it as a multi-value, I could create a schema where I put 
 each document into a unique field, but I'm not sure how to create the query 
 to search all the fields.
 
 
 Regards
 Ian
 
 



Re: how does Solr/Lucene index multi-value fields

2011-05-31 Thread Ian Holsman
Thanks Erick.

sadly, in my use-case that wouldn't work. I'll go back to storing them 
at the story level, and hitting a DB to get related stories, I think.

--I
On May 31, 2011, at 12:27 PM, Erick Erickson wrote:

 Hmmm, I may have misled you. Re-reading my text, it
 wasn't very well written.
 
 TF/IDF calculations are, indeed, per-field. I was trying
 to say that there was no difference between storing all
 the data for an individual field as a single long string of text
 in a single-valued field or as several shorter strings in
 a multi-valued field.
 
 Best
 Erick
 
 On Tue, May 31, 2011 at 12:16 PM, Ian Holsman had...@holsman.net wrote:
 
 On May 31, 2011, at 12:11 PM, Erick Erickson wrote:
 
 Can you explain the use-case a bit more here? Especially the post-query
 processing and how you expect the multiple documents to help here.
 
 
 we have a collection of related stories. when a user searches for something, 
 we might not want to display the story that is most-relevant (according to 
 SOLR), but according to other home-grown rules.  by combining all the 
 possibilities in one SolrDocument, we can avoid a DB-hit to get related 
 stories.
 
 
 But TF/IDF is calculated over all the values in the field. There's really no
 difference between a multi-valued field and storing all the data in a
 single field
 as far as relevance calculations are concerned.
 
 
 so.. it will suck regardless.. I thought we had per-field relevance in the 
 current trunk. :-(
 
 
 Best
 Erick
 
 On Tue, May 31, 2011 at 11:02 AM, Ian Holsman had...@holsman.net wrote:
 Hi.
 
 I want to store a list of documents (say each being 30-60k of text) into a 
 single SolrDocument. (to speed up post-retrieval querying)
 
 In order to do this, I need to know if lucene calculates the TF/IDF score 
 over the entire field or does it treat each value in the list as a unique 
 field?
 
 If I can't store it as a multi-value, I could create a schema where I put 
 each document into a unique field, but I'm not sure how to create the 
 query to search all the fields.
 
 
 Regards
 Ian
 
 
 
 



[ANN] Zoie Solr Plugin - Zoie Solr Plugin enables real-time update functionality for Apache Solr 1.4+

2010-03-07 Thread Ian Holsman


I just saw this on twitter, and thought you guys would be interested.. I 
haven't tried it, but it looks interesting.


http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Solr+Plugin

Thanks for the RT Shalin!


Re: If you could have one feature in Solr...

2010-02-28 Thread Ian Holsman

On 2/24/10 8:42 AM, Grant Ingersoll wrote:

What would it be?

   

most of this will be coming in 1.5,
but for me it's

- sharding.. it still seems a bit clunky

secondly.. this one isn't in 1.5.
I'd like to be able to find interesting terms that appear in my result 
set but don't appear (as often) in the global corpus.


it's kind of like doing a facet count on *:* and then on the search term, 
and discounting the terms that appear heavily in the global one.
(sorry.. there is a textbook definition of this.. XX distance.. but I 
haven't got the books in front of me).
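(roughly the comparison I mean -- a sketch, with topic as a placeholder 
facet field:

/solr/select?q=*:*&rows=0&facet=true&facet.field=topic
/solr/select?q=my+search&rows=0&facet=true&facet.field=topic

then score each term by count_in_results / count_in_corpus and keep the 
terms whose ratio is unusually high.)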








Re: Improvising solr queries

2010-01-04 Thread Ian Holsman

On 1/5/10 12:46 AM, Shalin Shekhar Mangar wrote:

 (sitename:XYZ OR sitename:"All Sites") AND (localeid:1237400589415) AND
  ((assettype:Gallery)) AND (rbcategory:"ABC XYZ") AND (startdate:[* TO
  2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO
  *])&rows=9&start=63&sort=date desc
  &facet=true&facet.field=assettype&facet.mincount=1

  Similar to this query we have several more complex queries supporting all
  major landing pages of our application.

  Just want to confirm whether anyone can identify any major flaws or
  issues in the sample query?


 
I'm not the expert Shalin is, but I seem to remember sorting by date was 
pretty rough on CPU. (this could have been resolved since I last looked 
at it)


the other thing I'd question is the facet. it looks like you're only 
retrieving a single assettype (Gallery),
so you will only get a single facet value back. if that's the case, wouldn't 
the rows returned (which is part of the response)

give you the same answer?


Most of those AND conditions can be separate filter queries. Filter queries
can be cached separately and can therefore be re-used. See
http://wiki.apache.org/solr/FilterQueryGuidance
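
For example, the query above could be restated as (a sketch):

q=rbcategory:"ABC XYZ"
&fq=sitename:(XYZ OR "All Sites")
&fq=localeid:1237400589415
&fq=assettype:Gallery
&fq=startdate:[* TO 2009-12-07T23:59:00Z]
&fq=enddate:[2009-12-07T00:00:00Z TO *]
&rows=9&start=63&sort=date desc
&facet=true&facet.field=assettype&facet.mincount=1

Each fq is cached independently, so the site/locale/type/date filters can be 
re-used across requests.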

   




Re: Adaptive search?

2009-12-21 Thread Ian Holsman

On 12/18/09 2:46 AM, Siddhant Goel wrote:

Let say we have a search engine (a simple front end - web app kind of a
thing - responsible for querying Solr and then displaying the results in a
human readable form) based on Solr. If a user searches for something, gets
quite a few search results, and then clicks on one such result - is there
any mechanism by which we can notify Solr to boost the score/relevance of
that particular result in future searches? If not, then any pointers on how
to go about doing that would be very helpful.
   


Hi Siddhant.
Solr can't do this out of the box.
you would need to use an external field and a custom scoring function to 
do something like this.
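a sketch of that approach (field and file names here are placeholders, not 
a recipe): declare an external file field in schema.xml,

<fieldType name="clickRank" class="solr.ExternalFileField"
    keyField="id" defVal="0" valType="float"/>
<field name="click_boost" type="clickRank"/>

regenerate a file named external_click_boost (lines of id=score, built from 
your click logs) in the index data directory, and fold it into ranking with 
a dismax boost function such as bf=click_boost.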


regards
Ian

Thanks,

On Thu, Dec 17, 2009 at 7:50 PM, Paul Libbrechtp...@activemath.org  wrote:

   

What can it mean to adapt to user clicks ? Quite many things in my head.
Do you have maybe a citation that inspires you here?

paul


On 17-Dec-09, at 13:52, Siddhant Goel wrote:


 Does Solr provide adaptive searching? Can it adapt to user clicks within
 the search results it provides? Or does that have to be done externally?



Re: Chrome Web Browser doesn't render properly

2009-07-16 Thread Ian Holsman

Brian Klippel wrote:

Nope, chrome treats xml as html.  Either view source or use another
browser.
  


I always thought the XML output should contain an XSLT reference in it by default.
that way I could debug with safari (and chrome).
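(in the meantime, Solr's XSLT response writer can do the transform 
server-side -- e.g. with the example.xsl that ships in conf/xslt/:

/solr/select/?q=solr&wt=xslt&tr=example.xsl )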

-Original Message-
From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] 
Sent: Wednesday, July 15, 2009 2:15 PM

To: solr-user@lucene.apache.org
Subject: Chrome Web Browser doesn't render properly

From the Solr admin page, solr/admin/file/?file=schema.xml and
/solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on
renders improperly (meaning the XML isn't formatted). Maybe
Chrome doesn't support XML?

  




Re: Facets with an IDF concept

2009-06-23 Thread Ian Holsman

Asif Rahman wrote:

Hi Grant,

I'll give a real life example of the problem that we are trying to solve.

We index a large number of current news articles on a continuing basis.  We
tag these articles with news topics (e.g. Barack Obama, Iran, etc.).  We
then use these tags to facet our queries.  For example, we might issue a
query for all articles in the last 24 hours.  The facets would then tell us
which news topics have been written about the most in that period.  The
problem is that Barack Obama, for example, is always written about in high
frequency, as opposed to Iran which is currently very hot in the news, but
which has not always been the case.  In this case, we'd like to see Iran
show up higher than Barack Obama in the facet results.

  


you're not looking for an IDF-based function.
you need to figure out what a 'normal' amount of news flow for a given 
topic is and then determine when an abnormal amount is happening.

note.. an abnormal amount can be positive or negative.
we use a similar method to this on http://love.com, so we know for 
example something is going on with Ed McMahon as I type.


I wouldn't be looking at using SOLR to do this kind of thing btw. try 
something like esper. I think it might hold some promise for this kind of 
thing (esper is an open source stream database).
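to sketch the facet-comparison idea in solrj anyway (illustrative only; the 
field names, URL, and the plain ratio are placeholders for whatever baseline 
model you end up with):

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;

public class HotTopics {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // topic counts over the whole corpus
        SolrQuery all = new SolrQuery("*:*");
        all.setRows(0).setFacet(true).addFacetField("topic");
        Map<String, Long> corpus = new HashMap<String, Long>();
        for (FacetField.Count c : solr.query(all).getFacetField("topic").getValues()) {
            corpus.put(c.getName(), c.getCount());
        }

        // topic counts over the last 24 hours; a topic is "hot" when its
        // recent share is far above its long-run share
        // (guarding against topics missing from the corpus map is omitted)
        SolrQuery recent = new SolrQuery("published:[NOW-1DAY TO NOW]");
        recent.setRows(0).setFacet(true).addFacetField("topic");
        for (FacetField.Count c : solr.query(recent).getFacetField("topic").getValues()) {
            double ratio = c.getCount() / (double) corpus.get(c.getName());
            System.out.println(c.getName() + "\t" + ratio);
        }
    }
}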


Regards


To me, this seems identical to the tf-idf scoring expression that is used in
normal search.  The facet count is analogous to the tf and I can access the
facet term idf's through the Similarity API.

Is my reasoning sound?  Can you provide any guidance as to the best way to
implement this?

Thanks for your help,

Asif


On Tue, Jun 23, 2009 at 1:19 PM, Grant Ingersoll gsing...@apache.orgwrote:

  

On Jun 23, 2009, at 3:58 AM, Asif Rahman wrote:

 Hi again,


I guess nobody has used facets in the way I described below before.  Do
any of the experts have any ideas as to how to do this efficiently and
correctly?  Any thoughts would be greatly appreciated.

Thanks,

Asif

On Wed, Jun 17, 2009 at 12:42 PM, Asif Rahman a...@newscred.com wrote:

 Hi all,
  

We have an index of news articles that are tagged with news topics.
Currently, we use solr facets to see which topics are popular for a given
query or time period.  I'd like to apply the concept of IDF to the facet
counts so as to penalize the topics that occur broadly through our index.
I've begun to write a custom facet component that applies the IDF to the
facet counts, but I also wanted to check if anyone has experience using
facets in this way.



I'm not sure I'm following.  Would you be faceting on one field, but using
the DF from some other field?  Faceting is already a count of all the
documents that contain the term on a given field for that search.  If I'm
understanding, you would still do the typical faceting, but then rerank by
the global DF values, right?

Backing up, what is the problem you are seeing that you are trying to
solve?

I think you could do this, but you'd have to hook it in yourself.  By
penalize, do you mean remove, or just have them in the sort?  Generally
speaking, looking up the DF value can be expensive, especially if you do a
lot of skipping around.  I don't know how pluggable the sort capabilities
are for faceting, but that might be the place to start if you are just
looking at the sorting options.



--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search






  




Auto suggest.. how to do mixed case

2009-06-19 Thread Ian Holsman
hi guys.

I've noticed that one of the new features in Solr 1.4 is the TermsComponent
which enables autosuggest.

but what puzzles me is how to actually use it in an application.

most autosuggests are case insensitive, so there is no difference if I type
in 'San Francisco' or 'san francisco'.

now I've tried with a 'text' field and a 'string' field with no joy, with
'string' providing the best results, but still with case sensitivity.
at the moment I'm using a custom field type

<fieldType name="string_lc" class="solr.TextField"
    sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire
         input string is preserved as a single token
      -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- The LowerCase TokenFilter does what you expect, which can be
         when you want your sorting to be case insensitive
      -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

which converts the whole field to lower case, and allows me to submit
the query as lower case and get good results.

so the point of the email is to find out how I can get the autosuggest to
return mixed-case results, without requiring me to lower-case the query before
I send it?
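one pattern that might work (a sketch, untested): keep the original in a 
stored field for display, copy it into the lowercased field for matching, 
and run the prefix query against the lowercased field:

<field name="name" type="string" indexed="false" stored="true"/>
<field name="name_lc" type="string_lc" indexed="true" stored="false"/>
<copyField source="name" dest="name_lc"/>

a request like q=name_lc:san*&fl=name then matches case-insensitively but 
returns the stored, mixed-case "San Francisco" for display.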


storing complex types in a multiValued field

2009-01-11 Thread Ian Holsman

hi.
I don't think this is a FAQ, but it's been bugging me for a while.

I want to store key/value pairs in a single field. for example
  <field name="tags" type="keyval" indexed="true" stored="true"
      multiValued="true" />


where keyval would be a ID# and the value.

I'm guessing it is as simple as creating my own field class, but I was 
wondering if there were any gotchas.

and more importantly why I've never seen the question asked before.

It would seem to me a common use case.


Re: storing complex types in a multiValued field

2009-01-11 Thread Ian Holsman

Shalin Shekhar Mangar wrote:

I guess most people store it as a simple string key(separator)value. Is
there something special that you want to do with the values that you need a
custom field implementation?
  
no..not really.. I guess I could achieve it via payloads as well.. the 
whole thing about stuffing 2 fields into the same field irks me, that's all.

I've got them set up as 2 separate MV fields at the moment.
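(for reference, the separator convention is just something like:

<field name="tags">1234|breaking-news</field>

one "id|value" string per tag value, split on the '|' client-side.)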

On Mon, Jan 12, 2009 at 5:36 AM, Ian Holsman li...@holsman.net wrote:

  

hi.
I don't think this is a FAQ, but it's been bugging me for a while.

I want to store key/value pairs in a single field. for example
 <field name="tags" type="keyval" indexed="true" stored="true"
     multiValued="true" />

where keyval would be a ID# and the value.

I'm guessing it is as simple as creating my own field class, but I was
wondering if there were any gotchas.
and more importantly why I've never seen the question asked before.

It would seem to me a common use case.






  




Re: Solr security

2008-11-17 Thread Ian Holsman

There was a patch by Sean Timm you should investigate as well.

It limited a query so it would take a maximum of X seconds to execute, 
and would just return the rows it had found in that time.



Feak, Todd wrote:

I see value in this in the form of protecting the client from itself.

For example, our Solr isn't accessible from the Internet. It's all
behind firewalls. But, the client applications can make programming
mistakes. I would love the ability to lock them down to a certain number
of rows, just in case someone typos and puts in 1000 instead of 100, or
the like.

Admittedly, testing and QA should catch these things, but sometimes it's
nice to put in a few safeguards to stop the obvious mistakes from
occurring.

-Todd Feak

-Original Message-
From: Matthias Epheser [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 17, 2008 9:07 AM

To: solr-user@lucene.apache.org
Subject: Re: Solr security

Ryan McKinley wrote:

  however I have found that in any site where stability/load and uptime
  are a serious concern, this is better handled in a tier in front of
  java -- typically the loadbalancer / haproxy / whatever -- and managed
  by people more cautious than me.


Full ack. What do you think about the only solr-related thing left, the
parameter filtering/blocking (e.g. rows>1000)? Is this suitable to do in a
Filter delivered by solr? Of course as an optional alternative.


  

ryan







  




Re: Solr security

2008-11-17 Thread Ian Holsman

if thats the case putting apache in front of it would be handy.

something like
<Limit POST>
  order deny,allow
  deny from all
  allow from 192.168.0.1
</Limit>

might be helpful.

Sean Timm wrote:
I believe the Solr replication scripts require POSTing a commit to 
read in the new index--so at least limited POST capability is required 
in most scenarios.


-Sean

Lance Norskog wrote:

About that read-only switch for Solr: one of the basic HTTP design
guidelines is that GET should only return values, and should never 
change
the state of the data. All changes to the data should be made with 
POST. (In

REST style guidelines, PUT, POST, and DELETE.) This prevents you from
passing around URLs in email that can destroy the index.  The first 
role of

security is to prevent accidents.

I would suggest two layers of read-only switch. 1) Open the Lucene 
index

in read-only mode. 2) Allow only search servers to accept GET requests.

Lance

  






Re: Solr security

2008-11-17 Thread Ian Holsman

Ryan McKinley wrote:


On Nov 17, 2008, at 4:20 PM, Erik Hatcher wrote:

trouble is, you can also GET /solr/update, even all on the URL, no 
request body...


  
http://localhost:8983/solr/update?stream.body=%3Cadd%3E%3Cdoc%3E%3Cfield%20name=%22id%22%3ESTREAMED%3C/field%3E%3C/doc%3E%3C/add%3E&commit=true 



Solr is a bad RESTafarian.



but with Ian's options in the apache config, this would not work...  
rather it would only work if stream.body was a POST


<Location /solr/update>
  order deny,allow
  deny from all
  allow from 192.168.0.1
</Location>
?
or perhaps <LocationMatch>.. but you get the picture.






Getting warmer!

Erik


On Nov 17, 2008, at 4:11 PM, Ian Holsman wrote:


if thats the case putting apache in front of it would be handy.

something like
<Limit POST>
  order deny,allow
  deny from all
  allow from 192.168.0.1
</Limit>

might be helpful.

Sean Timm wrote:
I believe the Solr replication scripts require POSTing a commit to 
read in the new index--so at least limited POST capability is 
required in most scenarios.


-Sean

Lance Norskog wrote:

About that read-only switch for Solr: one of the basic HTTP design
guidelines is that GET should only return values, and should never 
change
the state of the data. All changes to the data should be made with 
POST. (In

REST style guidelines, PUT, POST, and DELETE.) This prevents you from
passing around URLs in email that can destroy the index.  The 
first role of

security is to prevent accidents.

I would suggest two layers of read-only switch. 1) Open the 
Lucene index
in read-only mode. 2) Allow only search servers to accept GET 
requests.


Lance













Re: Solr security

2008-11-16 Thread Ian Holsman

Erik Hatcher wrote:
I'm pondering the viability of running Solr as effectively a UI 
server... what I mean by that is having a public facing browser-based 
application hitting a Solr backend directly for JSON, XML, etc data.


I know folks are doing this (I won't name names, in case this thread 
comes up with any vulnerabilities that would effect such existing 
environments).


Let's just assume a typical deployment environment... replicated 
Solr's behind a load balancer, maybe even a caching proxy.

What known vulnerabilities are there in Solr 1.3, for example?

What I think we can get out this is a Solr deployment configuration 
suitable for direct browser access, but we're not safely there yet are 
we?  Is this an absurd goal?  Must we always have a moving piece 
between browser and data/search servers?


Thanks,
Erik




First thing I would look at is disabling write access, or writing a 
servlet that sits on top of the write handler to filter your data.


Second thing I would be concerned about is people writing DoS queries 
that bypass the cache.


so you may need to write your own custom request handler to filter out 
that kind of thing.




Re: Solr security

2008-11-16 Thread Ian Holsman

Erik Hatcher wrote:


On Nov 16, 2008, at 5:41 PM, Ian Holsman wrote:
First thing I would look at is disabling write access, or writing a 
servlet that sits on top of the write handler to filter your data.


We can turn off all the update handlers, but how does that affect 
replication?  Can a Solr replicant be entirely read-only in the HTTP 
request sense?


Second thing I would be concerned about is people writing DoS queries 
that bypass the cache.



so you may need to write your own custom request handler to filter 
out that kind of thing.


Is this a concern that can be punted to what you'd naturally be 
putting in front of Solr anyway or a proxy tier that can have DoS 
blocking rules?  I mean, if you're deploying a Struts that hits Solr 
under the covers, how do you prevent against DoS on that?  A malicious 
user could keep sending queries indirectly to a Solr through a whole 
lot of public apps now.  In other words, another tier in front of Solr 
doesn't add (much) to DoS protection to an underlying Solr, no?


famous last words and all, but you shouldn't be just passing what a user 
types directly into an application, should you?


I'd be parsing out wildcards, boosts, and fuzzy searches (or at least 
thinking about their effects).
I mean, "jakarta apache"~1000 or roam~0.1 aren't as efficient as a 
regular query.
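(solrj/lucene clients can at least get the mechanical escaping for free -- 
a sketch using Lucene's QueryParser.escape():

String safe = org.apache.lucene.queryParser.QueryParser.escape(userInput);
// backslash-escapes + - && || ! ( ) { } [ ] ^ " ~ * ? : \ so they match literally

though deciding which operators to allow at all is still application policy.)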


but they don't let me into design meetings any more ;(

Erik






Re: solrj and CLOSE_WAIT's

2008-11-14 Thread Ian Holsman

Ryan McKinley wrote:

not sure if it is something we can do better or part of HttpClient...

From:
http://www.nabble.com/CLOSE_WAIT-td19959428.html

it seems to suggest you may want to call:
con.closeIdleConnections(0L);

But if you are creating a new MultiThreadedHttpConnectionManager for 
each request, it seems odd you would have to explicitly close the 
connection for each request.


What happens if you try using a SimpleHttpConnectionManager rather 
than a MultiThreadedHttpConnectionManager?  You can explicitly pass in:
 

I was thinking the same thing when i saw the other constructor.

I've modified the code to call the 'simple' version and will let it run
for an hour or three to make sure it works and doesn't exhibit the
behavior, so far it looks good and there are no CLOSE_WAITs (or
FIN_WAIT2's) showing up for longer than a couple of seconds. (according
to netstat -tn)

I'd petition we go back to the 'stupid' version by default that just
does what it is supposed to do, and leave the other one for 'experts'. I
can't even see how to tell the multi-threaded version to close itself
nicely ;(
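for reference, the 'simple' wiring is just (a sketch against the 
commons-httpclient 3.x API):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.SimpleHttpConnectionManager;

HttpClient client = new HttpClient(new SimpleHttpConnectionManager());
CommonsHttpSolrServer solrServer =
        new CommonsHttpSolrServer(new URL(solrURL), client);

and re-using that one CommonsHttpSolrServer across requests, rather than 
constructing a new one per call, avoids the connection churn in the 
first place.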




to:
public CommonsHttpSolrServer(URL baseURL, HttpClient client, 
ResponseParser parser, boolean useMultiPartPost) {


if that fixes things, it is a bit disturbing, but something we should 
look into.


ryan







solrj and CLOSE_WAIT's

2008-11-13 Thread Ian Holsman

Hi guys.

I'm running a little upload project that uploads documents into a solr 
index. there is also a 2nd thread that runs a delete-by-query and an 
optimize every once in a while.


in an effort to reduce the probability of things being held onto I've made 
everything local, but it is still collecting CLOSE_WAITs and FIN_WAIT2's 
on the server side until it eventually runs out of file handles in a day 
or two.


the following are the code snippets being used to call solr.

   protected void doArchiveSolr() throws IOException, SolrServerException {
       Calendar rightNow = Calendar.getInstance();
       rightNow.add(Calendar.DATE, 31 * -1);
       DateFormat f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
       java.util.Date d = rightNow.getTime();

       String s = "publish_date:[1976-03-06T23:59:59.999Z/YEAR TO " +
               f.format(d) + "]";

       logger.info("Archiver: " + s);
       CommonsHttpSolrServer solrServer;
       solrServer = new CommonsHttpSolrServer(solrURL);
       solrServer.deleteByQuery(s);
       solrServer.commit();
   }

and this runs every X minutes.
it also has other local parts like
{
  CommonsHttpSolrServer solrServer;
  solrServer = new CommonsHttpSolrServer(solrURL);
  solrServer.optimize();
}


and
{
   CommonsHttpSolrServer solrServer;
   UpdateResponse r;

   solrServer = new CommonsHttpSolrServer(solrUrl);
   solrServer.setSoTimeout(120000);            // socket read timeout - 2 minutes
   solrServer.setConnectionTimeout(100);
   solrServer.setDefaultMaxConnectionsPerHost(100);
   solrServer.setMaxTotalConnections(100);
   solrServer.setFollowRedirects(false);       // defaults to false
   solrServer.setAllowCompression(false);

   r = solrServer.add(docs);

   r = solrServer.commit();
   docs.clear();

}




Re: Release date of SOLR 1.3

2008-05-19 Thread Ian Holsman (Lists)

Noble Paul നോബിള്‍ नोब्ळ् wrote:

If you have an immediate need, I must
advise you against waiting for the solr1.3 release. The best strategy
would be to take a nightly and start using it. Test it thoroughly and
if bugs are found report them back. If everything is fine go into
production with that

--Noble


I'd be very hesitant to recommend ANYONE go into production with 
non-released software if they are unfamiliar with the codebase.
waiting on the list for someone to fix a bug which is causing an 
outage for your site is somewhat of a career-limiting move.


I'd recommend using the stable release, and learning the codebase ;-)

regards
Ian


On Thu, May 15, 2008 at 12:28 AM, Matthew Runo [EMAIL PROTECTED] wrote:

There isn't a specific date so far, but I'd like to say that only once in
the year or so I've been working with the SVN head build of Solr have I
noticed a bug get committed. And it was fixed very quickly once it was
found.. I think if you need to have development features you're probably
safe to use the SVN head, but remember that it is dev, and you should
*always* test new builds before actually using them =p

Thanks!

Matthew Runo
Software Developer
Zappos.com
702.943.7833

On May 14, 2008, at 9:08 AM, Umar Shah wrote:

Hi,

I'm using the latest trunk code from SOLR .
I am basically using function queries (sum, product, scale) for my project
which are not present in 1.2.
I wanted to know if there is some decided date for release of Solr1.3.
If the date is far/ not decide, what should be the best practice to adopt
the above mentioned feature while not compromising on stability of the
system.

thanks
-umar










Re: Solr replication by solr (for windows)

2008-04-29 Thread Ian Holsman
The current scripts use rsync to minimize the amount of data actually 
being copied.


I've had a brief look and found only 1 implementation which is GPL and 
abandoned

http://sourceforge.net/projects/jarsync.

Personally I still think the size of the transfer is important (as for 
most use cases not much actually changes every hour).. but that's just 
me.. your case may be different from mine.


regards
Ian


Noble Paul നോബിള്‍ नोब्ळ् wrote:

hi ,
The current replication strategy in solr involves shell scripts . The
following are the drawbacks
*  It does not work with windows
* Replication works as a separate piece not integrated with solr.
* Cannot control replication from solr admin/JMX
* Each operation requires manual telnet to the host

Doing the replication within java code has the following advantages
* Platform independence
* Manual steps can be completely eliminated. Everything can be driven
from solrconfig.xml .
** Just put the url of the master in the slaves; that should be good
enough to enable replication. Other things like the frequency of
snapshoot/snappull can also be configured
* Start/stop can be triggered from solr/admin or JMX
* Can get the status/progress while replication is going on
* No need to have a login into the machine

The implementation can be done as two components
* A SolrEventListener which does a snapshoot . Same as done by the script
* A ReplicationHandler which can act as a server to dish out the index
snapshots (in the master)
** In the slave the same handler can poll at regular intervals and if
there is a new snapshot fetch the index over http (it can use
solrj+BinaryReponseWriter)
* The same Handler can do a snap install
* The Handler may expose all the operations over a REST interface or JMX
* It may also show the current state of the master index through the console

What do you think?

  




Re: unique values from a field in a result

2008-04-29 Thread Ian Holsman

Hi Thijs.

If you are not concerned with an *EXACT* number there is a paper that was 
published in 1990 that discusses this problem.


http://dblab.kaist.ac.kr/Publication/pdf/ACM90_TODS_v15n2.pdf

from the paper (If I understand it correctly)

For 120,000,000 records you can sample 10,112,529 records  (10%) when 
the variance is low and get an answer with 95% confidence.



Regards
Ian

Thijs wrote:

It must be my english.
When I read your comment, I think you could compare it to the category 
example...


Maybe with an example I can explain my situation better:
The documents in the index contain variations of different products.
Say for example I have 10 different products. Every product is indexed 
1000 times (1000 different variations per product); the product is not 
unique, the variation is unique.
The first 10 results of a search only contain the best matching 
variations for all the products in the complete result. So let's say 
the result returns 1000 variations for 3 different products. What I 
need is some 'sidebar information' containing detailed information on 
all the 3 unique products in the complete result.


My example is just simple, in real life the numbers are a lot bigger. 
However, the amount of unique products vs variations is such that it 
seems a lot of work to iterate over all variations in a DocSet just to 
get the few unique products.
But, what I understand from your answer is that the best way to get the 
3 unique products is to iterate over the 1000 variations in the result 
DocSet? And if that is the case I'm happy with it.


Thanks
Thijs



But to get some extra information I need all the unique values for one 
of the fields in the index (being the pk of the product).


Chris Hostetter schreef:
: You are correct I'm looking for the unique values for one field in 
a DocSet.
: The field is not multivalued. and it contains only 1 long value, 
the pk of a

: database table
: But you said the counts are stored in the index, I don't see that. 
Because


there's something very confusing about your question ... if the value 
of the field is unique for every document (by pk you mean the 
primary key for these docs in your database, correct?) then why do you 
specifically need the unique terms? ... aren't they by definition 
unique?


usually when people ask questions like this, they are interested in 
the unique values for something like a category field, where lots 
of documents are in the same category, and they want to know what the 
full list of categories is for all of the documents that match their 
query.


if you want the list of all primary keys for all the documents that 
match your query, why not just make sure that field has stored=true 
in the schema.xml and get the values that way?


I'm extra confused because of this comment...

: when I debug simplefacet. It always iterates over all the documents 
in the

: result docset (SimpleFacet.getFieldCacheCounts line 259).

it doesn't *seem* like faceting is necessary, but why do you think 
iterating over all the documents in your result set seems like a 
waste here?  if you want to know what *all* the values are for every 
document in your doc set, then regardless of whether the values are 
distinct for each doc, how else could Solr get all the values than by 
looking at each matching doc?




-Hoss
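(for the archives: since the product field here is single-valued and 
indexed, a plain facet request is the usual way to pull the unique products 
out of a result -- a sketch, with product_pk as a placeholder name:

q=...&facet=true&facet.field=product_pk&facet.mincount=1&facet.limit=-1

which returns every distinct product_pk in the result set along with how 
many variations matched.)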

  







Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ian Holsman

Clay Webster wrote:

There seem to be a few other players in this space too.

Are you from Rackspace?  
(http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data)

AOL also has a Hadoop/Solr project going on.

CNET does not have much brewing there.  Although Yonik and I had 
talked about it a bunch -- but that was long ago. 
  


Hi.
AOL has a couple of projects going on in the lucene/hadoop/solr space, 
and we will be pushing more stuff out as we can. We don't have anything 
going with solr over hadoop at the moment.


I'm not sure if this would be better than what SOLR-303 does, but you 
should have a look at the work being done there.


One of the things you mentioned is that the data sets are disjoint. 
SOLR-303 doesn't require this, and allows us to have a document stored 
in multiple shards (with different caching/update characteristics).

--cw

Clay Webster   tel:1.908.541.3724
Associate VP, Platform Infrastructure http://www.cnet.com
CNET, Inc. (Nasdaq:CNET) mailto:[EMAIL PROTECTED]

  



  




Re: leading wildcards

2007-11-15 Thread Ian Holsman
the solution that works for me is to store the field in reverse order, 
and have your application reverse the field in the query.


so the field www.example.com would be stored as
moc.elpmaxe.www

so now I can do a search for *.example.com in my application.

Regards
Ian
(hat tip to erik for the idea)
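the reversal itself is one line (a sketch in java):

String stored = new StringBuilder("www.example.com").reverse().toString();
// -> "moc.elpmaxe.www"; the leading-wildcard search *.example.com then
// becomes the trailing-wildcard query moc.elpmaxe.* on the reversed field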

Michael Kimsal wrote:

Vote for that issue and perhaps it'll gain some more traction.  A former
colleague of mine was the one who contributed the patch in SOLR-218 and it
would be nice to have that configuration option 'standard' (if off by
default) in the next SOLR release.


On Nov 12, 2007 11:18 AM, Traut [EMAIL PROTECTED] wrote:

  

Seems like there is no way to enable leading wildcard queries except
code editing and files repacking. :(

On 11/12/07, Bill Au [EMAIL PROTECTED] wrote:


The related bug is still open:

http://issues.apache.org/jira/browse/SOLR-218

Bill

On Nov 12, 2007 10:25 AM, Traut [EMAIL PROTECTED] wrote:
  

Hi
 I found the thread about enabling leading wildcards in
Solr as additional option in config file. I've got nightly Solr build
and I can't find any options connected with leading wildcards in
config files.

 How I can enable leading wildcard queries in Solr? Thank


you


--
Best regards,
Traut



--
Best regards,
Traut






  




where did my foreign language go?

2007-10-24 Thread Ian Holsman

Hi.

I'm in the middle of bringing up a new solr server and am using the 
trunk. (where I was using an earlier nightly release of about 2-3 weeks 
ago on my old server)


now, when I do a search for 日本 (japan) it used to show the kanji in 
the q area, but now it shows gibberish instead of 日本



any hints on where I should start investigating on why this is happening?

regards
Ian

(server is here: 
http://pyro.holsman.net:8983/solr/select/?q=%E6%97%A5%E6%9C%ACversion=2.2start=0rows=10indent=on 
)




Re: where did my foreign language go?

2007-10-24 Thread Ian Holsman
Thanks.. I'll do that
sunrise1984 wrote:
 Maybe the following is useful for you.(It comes from 
 http://wiki.apache.org/solr/SolrTomcat)

 If you are going to query Solr using international characters (>127) using 
 HTTP-GET, you must configure Tomcat to conform to the URI standard by 
 accepting percent-encoded UTF-8. 
 Edit Tomcat's conf/server.xml and add the following attribute to the correct 
 Connector element: URIEncoding="UTF-8". 
 <Server ...>
   <Service ...>
     <Connector ... URIEncoding="UTF-8"/>
     ...
   </Service>
 </Server>

 This is only an issue when sending non-ascii characters in a query request... 
 no configuration is needed for Solr/Tomcat to return non-ascii chars in a 
 response, or accept non-ascii chars in an HTTP-POST body. 




 sunrise1984
 2007-10-25

   



Seeing if an entry exists in an index for a set of terms

2007-10-03 Thread Ian Holsman

Hi.

I was wondering if there was an easy way to give solr a list of things 
and find out which have entries.



ie I pass it a list

Bill Clinton
George Bush
Mary Papas
(and possibly 20 others)

to a solr index which contains news articles about presidents. I would 
like a response saying


bill Clinton was found in 20 records
George Bush was found in 15.

possibly with the links, but thats not too important.

I know I can do this by doing ~20 individual queries, but I thought 
there may be a more efficient way


Regards
Ian


Re: Seeing if an entry exists in an index for a set of terms

2007-10-03 Thread Ian Holsman

Yonik Seeley wrote:

On 10/3/07, Ian Holsman [EMAIL PROTECTED] wrote:
  

Hi.

I was wondering if there was an easy way to give solr a list of things
and find out which have entries.


ie I pass it a list

Bill Clinton
George Bush
Mary Papas
(and possibly 20 others)

to a solr index which contains news articles about presidents. I would
like a response saying

bill Clinton was found in 20 records
George Bush was found in 15.

possibly with the links, but thats not too important.

I know I can do this by doing ~20 individual queries, but I thought
there may be a more efficient way



How about
facet.query="Bill Clinton"&facet.query="George Bush", etc

Will give you counts, but not the links

-Yonik

  

That will work.
Thanks Yonik.
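(for the archives, the whole request looks something like this, with the 
phrases matched against the default search field:

/solr/select?q=*:*&rows=0&facet=true
    &facet.query="Bill Clinton"&facet.query="George Bush"&facet.query="Mary Papas"

rows=0 skips fetching documents, and each count comes back under 
facet_counts/facet_queries.)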



Re: Geographical distance searching

2007-09-26 Thread Ian Holsman

Have you guys seen Local Lucene?
http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene.htm

no need for mysql if you don't want too.

rgrds
Ian

Will Johnson wrote:

With the new/improved value source functions it should be pretty easy to
develop a new best practice.  You should be able to pull in the lat/lon
values from valuesource fields and then do your great-circle calculation.

- will

-Original Message-
From: Lance Norskog [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 26, 2007 3:15 PM

To: solr-user@lucene.apache.org
Subject: Geographical distance searching

It is a best practice to store the master copy of this data in a
relational database and use Solr/Lucene as a high-speed cache.
MySQL has a geographical database option, so maybe that is a better option
than Lucene indexing.

Lance

(P.s. please start new threads for new topics.)

-Original Message-
From: Sandeep Shetty [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 26, 2007 5:15 AM

To: 'solr-user@lucene.apache.org'
Subject: custom sorting

  

Hi Guys,

this question has been asked before but i was unable to find an answer 
that's good for me, so hope you guys can help again. i am working on a 
website where we need to sort the results by distance from the 
location entered by the user. I have indexed the lat and long info for 
each record in solr and also i can get the lat and long of the 
location input by the user.
Previously we were using lucene to do this. by using the 
SortComparatorSource we could sort the documents returned by distance 
nicely. we are now switching over to solr because of the features it 
provides, however i am not able to see a way to do this in Solr.


If someone can point me in the right direction i would be very grateful!

Thanks in advance,
Sandeep





  




Re: Nutch with SOLR

2007-09-25 Thread Ian Holsman
[moving this thread to solr-user, as it really has nothing to do with 
hadoop]


Daniel Clark wrote:

There's info on website
http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
but it's not clear.

  


Sami has a patch in there which used an older version of the solr client. 
with the current solr client in the SVN tree, his patch becomes much easier.
your job would be to upgrade the patch and mail it back to him so he can 
update his blog, or post it as a patch for inclusion in nutch/contrib 
(if sami is ok with that). If you have issues with how to use the solr 
client api, solr-user is here to help.


the nutch specific changes are:
1. configure nutch-site.xml to add a config option to point to your solr 
server (see the sketch after step 2).


2. instead of calling the nutch 'index' command, you would call it like so
bin/nutch org.apache.nutch.indexer.SolrIndexer $BASEDIR/crawldb 
$BASEDIR/linkdb $SEGMENT
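the nutch-site.xml entry would look something like this (a sketch -- check 
Sami's patch for the exact property name it reads; solr.server.url here is 
a guess):

<property>
  <name>solr.server.url</name>
  <value>http://localhost:8983/solr</value>
</property>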



regards
Ian



~
Daniel Clark, President
DAC Systems, Inc.
(703) 403-0340
~

-Original Message-
From: Dmitry [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 25, 2007 2:56 PM

To: [EMAIL PROTECTED]
Subject: Re: Nutch with SOLR

Daniel,

We just started to test/research the possibility of integrating Nutch and Solr

so it will be nice to hear any advice as well.

Thanks,
DT
www.ejizn.com

- Original Message - 
From: Daniel Clark [EMAIL PROTECTED]

To: [EMAIL PROTECTED]
Sent: Tuesday, September 25, 2007 1:23 PM
Subject: Nutch with SOLR


  
Has anyone been able to get Nutch 0.9 working with SOLR?  Any help would be
appreciated.



~

Daniel Clark, President

DAC Systems, Inc.

(703) 403-0340

~









  




Re: Nutch with SOLR

2007-09-25 Thread Ian Holsman

Thanks Brian.
I'm sure this will help lots of people.

Brian Whitman wrote:


But we still use a version of Sami's patch that works on both trunk 
nutch and trunk solr (solrj.) I sent my changes to sami when we did 
it, if you need it let me know...




I put my files up here: http://variogr.am/latest/?p=26

-b






Solr Injection

2007-07-02 Thread Ian Holsman

Hi.

I've been playing with Kettle (http://kettle.pentaho.org/ ) as a method 
to inject data into Solr (and other things at the same time), and it 
looks really promising.


I was wondering if anyone else had some experience using it with Solr 
and whether they set it up to add a document at a time, or wrote a single 
XML 'add' document and then added all of them in one batch.

Ideally I would like to have Solr accept a REST style URL without all 
the XML bs around and just pass the fields in as parameters (which is 
alluded to in http://issues.apache.org/jira/browse/SOLR-85 )
and just pound the Solr master with lots of little posts when I do 
incremental updates for < 1000 things and use the CSV uploader for 
larger things.


Thoughts?






RDF uploader -- has anyone built such a beast?

2007-06-19 Thread Ian Holsman

Hi.

For a project i'm working on, I'm getting a RDF formatted feed.

I was wondering if someone has built a RDF to solr upload function 
similar to the CSV and mysql ones sitting in Jira.


regards
Ian




Re: Requests per second/minute monitor?

2007-05-09 Thread Ian Holsman


Walter Underwood wrote:
 This is for monitoring -- what happened in the last 30 seconds.
 Log file analysis doesn't really do that.
 

I would respectfully disagree.
Log file analysis of each request can give you that, and a whole lot more.

you could either grab the stats via a regular cron job, or create a separate
filter to parse them in real time.
It would then let you grab more sophisticated stats if you choose to.

What I would like to know is (and excuse the newbieness of the question) how
to enable solr to log to a file with the following data.


- time spent (ms) in the request. 
- IP# of the incoming request
- what the request was (and what handler executed it)
- a status code to signal if the request failed for some reason 
- number of rows fetched
and 
- the number of rows actually returned

is this possible? (I'm using tomcat if that changes the answer).

regards
Ian
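(partial answer for the archives: the first four items come free from 
Tomcat's access-log valve -- a sketch for conf/server.xml:

<Valve className="org.apache.catalina.valves.AccessLogValve"
       directory="logs" prefix="solr_access." suffix=".log"
       pattern="%h %t &quot;%r&quot; %s %D"/>

%h is the client IP, %r the request line (which names the handler), %s the 
status, and %D the time taken in milliseconds. rows fetched/returned aren't 
visible to the container, though -- those would have to come from Solr's 
own request logging.)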



newbie Q regarding schema configuration

2006-06-19 Thread Ian Holsman

hi.

so I finally managed to find a bit of time to get a Solr instance 
going, and now have some questions about it ;-)


first, the application is tagging. ie.. to associate some keywords 
with a given item, and to show them on a particular object (you can 
see this in action here http://economy-chat.com/aggy/detail/andrew-leigh/ )


It's user-based (ie individuals can tag a particular object themselves, 
and that gets merged into a global summary for that object)
and it is also hierarchical, ie tagging a child implies you have also 
tagged the parent.

so.. my first question: in schema.xml, can you have a composite key as 
the 'uniqueKey' field, or do I need to do this on the client side?

2nd question.

can you have complex types which are multivalued?
I'd like to store something like
a tag-name with a corresponding tag-weighting.

can you do sum(*) type queries in lucene/solr? is it efficient? or 
are you better off having a 2nd index which has these sum(*) values in it 
and keeping it up to date instead.




Thanks


Re: SolPHP

2006-06-01 Thread Ian Holsman

I think I could get some python bindings off those as well.
and if people feel there is a need some C/APR ones as well.

On 02/06/2006, at 11:16 AM, Brian Lucas wrote:


Erik,

I'll get the PHP bindings out to see how they suit the needs of people and
use that feedback for the Rails bindings.  I'm looking forward to seeing how
they could be implemented as well.
Brian

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 01, 2006 6:59 PM
To: solr-user@lucene.apache.org
Subject: Re: SolPHP

Brian,

I'd love to give any RoR bindings a try if you're a point to share.
I can see all sorts of interesting fun that can be had with such
bindings, such as pulling schema.xml from the server and using its
field definitions to build mapping objects (like ActiveRecord),
support for all the parameters of the request handler(s), clever
iterators that would page through the hits by requesting bite-sized
chunks from Solr.  At the very least, of course, is having the
request and response abstracted so no XML or HTTP is seen by the
client code.

Erik



On Jun 1, 2006, at 8:49 PM, Brian Lucas wrote:


Yes, I have written bindings but hadn't abstracted them fully.  They're
pretty solid and since you're the second person that's asked, let me get
those out as soon as possible.  I'm also working on the Ruby/Rails bindings
as well.

Brian

-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 01, 2006 6:17 PM
To: solr-user@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: SolPHP

Nothing in SVN... It looks like Brian Lucas might have been  
working on

something:
http://www.mail-archive.com/solr-user%40lucene.apache.org/msg00325.html

-Yonik

On 6/1/06, Michael J. Giarlo [EMAIL PROTECTED] wrote:

Hey folks,

I noticed a stub on the wiki about two PHP classes for solr.  I've
tried
to track down the classes but have been unsuccessful so far.  Does
anyone know where, or if, these classes are available?

Thanks!

-Mike