TimeAllowed bug

2015-08-24 Thread Bill Bell
Weird fq caching bug when using timeAllowed

1. Find a pwid (in this case YLGVQ).
2. Run a query with an fq on the pwid and timeAllowed=1:
   http://hgsolr2devsl.healthgrades.com:8983/solr/providersearch/select/?q=*:*&wt=json&fl=pwid&fq=pwid:YLGVQ&timeAllowed=1
3. Ensure step 2 returns 0 results.
4. Rerun the query without the timeAllowed param:
   http://hgsolr2devsl.healthgrades.com:8983/solr/providersearch/select/?q=*:*&wt=json&fl=pwid&fq=pwid:YLGVQ
5. Note that after removing the timeAllowed parameter the query is still returning
   0 results.

 Solr seems to be caching the FQ when the timeAllowed parameter is present.
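A quick way to confirm the filterCache is the culprit (our own check, not part of the original report) is to re-run the filter with caching disabled via the cache=false local param:

http://hgsolr2devsl.healthgrades.com:8983/solr/providersearch/select/?q=*:*&wt=json&fl=pwid&fq={!cache=false}pwid:YLGVQ

If this returns the document while the plain fq form still returns 0 results, the partial result produced under timeAllowed=1 was cached as the filter's full result.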


Bill Bell
Sent from mobile



Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Bill Bell
We use 8 GB to 10 GB for indexes of that size all the time.


Bill Bell
Sent from mobile


> On Aug 23, 2015, at 8:52 AM, Shawn Heisey  wrote:
> 
>> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
>> Hi Shawn,
>> 
>> Yes, I've increased the heap size to 4GB already, and I'm using a machine
>> with 32GB RAM.
>> 
>> Is it recommended to further increase the heap size to like 8GB or 16GB?
> 
> Probably not, but I know nothing about your data.  How many Solr docs
> were created by indexing 1GB of data?  How much disk space is used by
> your Solr index(es)?
> 
> I know very little about clustering, but it looks like you've gotten a
> reply from Toke, who knows a lot more about that part of the code than I do.
> 
> Thanks,
> Shawn
> 


Re: solr multicore vs sharding vs 1 big collection

2015-08-03 Thread Bill Bell
Yeah, a separate collection per month or year is good and can really help in this case.

Bill Bell
Sent from mobile


> On Aug 2, 2015, at 5:29 PM, Jay Potharaju  wrote:
> 
> Shawn,
> Thanks for the feedback. I agree that increasing timeout might alleviate
> the timeout issue. The main problem with increasing timeout is the
> detrimental effect it will have on the user experience, therefore can't
> increase it.
> I have looked at the queries that threw errors, next time I try it
> everything seems to work fine. Not sure how to reproduce the error.
> My concern with increasing the memory to 32GB is what happens when the
> index size grows over the next few months.
> One of the other solutions I have been thinking about is to rebuild
> index(weekly) and create a new collection and use it. Are there any good
> references for doing that?
> Thanks
> Jay
> 
>> On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey  wrote:
>> 
>>> On 8/2/2015 8:29 AM, Jay Potharaju wrote:
>>> The document contains around 30 fields and have stored set to true for
>>> almost 15 of them. And these stored fields are queried and updated all
>> the
>>> time. You will notice that the deleted documents is almost 30% of the
>>> docs.  And it has stayed around that percent and has not come down.
>>> I did try optimize but that was disruptive as it caused search errors.
>>> I have been playing with merge factor to see if that helps with deleted
>>> documents or not. It is currently set to 5.
>>> 
>>> The server has 24 GB of memory out of which memory consumption is around
>> 23
>>> GB normally and the jvm is set to 6 GB. And have noticed that the
>> available
>>> memory on the server goes to 100 MB at times during a day.
>>> All the updates are run through DIH.
>> 
>> Using all available memory is completely normal operation for ANY
>> operating system.  If you hold up Windows as an example of one that
>> doesn't ... it lies to you about "available" memory.  All modern
>> operating systems will utilize memory that is not explicitly allocated
>> for the OS disk cache.
>> 
>> The disk cache will instantly give up any of the memory it is using for
>> programs that request it.  Linux doesn't try to hide the disk cache from
>> you, but older versions of Windows do.  In the newer versions of Windows
>> that have the Resource Monitor, you can go there to see the actual
>> memory usage including the cache.
>> 
>>> Every day at least once i see the following error, which result in search
>>> errors on the front end of the site.
>>> 
>>> ERROR org.apache.solr.servlet.SolrDispatchFilter -
>>> null:org.eclipse.jetty.io.EofException
>>> 
>>> From what I have read these are mainly due to timeout and my timeout is
>> set
>>> to 30 seconds and cant set it to a higher number. I was thinking maybe
>> due
>>> to high memory usage, sometimes it leads to bad performance/errors.
>> 
>> Although this error can be caused by timeouts, it has a specific
>> meaning.  It means that the client disconnected before Solr responded to
>> the request, so when Solr tried to respond (through jetty), it found a
>> closed TCP connection.
>> 
>> Client timeouts need to either be completely removed, or set to a value
>> much longer than any request will take.  Five minutes is a good starting
>> value.
>> 
>> If all your client timeout is set to 30 seconds and you are seeing
>> EofExceptions, that means that your requests are taking longer than 30
>> seconds, and you likely have some performance issues.  It's also
>> possible that some of your client timeouts are set a lot shorter than 30
>> seconds.
>> 
>>> My objective is to stop the errors, adding more memory to the server is
>> not
>>> a good scaling strategy. That is why i was thinking maybe there is a
>> issue
>>> with the way things are set up and need to be revisited.
>> 
>> You're right that adding more memory to the servers is not a good
>> scaling strategy for the general case ... but in this situation, I think
>> it might be prudent.  For your index and heap sizes, I would want the
>> company to pay for at least 32GB of RAM.
>> 
>> Having said that ... I've seen Solr installs work well with a LOT less
>> memory than the ideal.  I don't know that adding more memory is
>> necessary, unless your system (CPU, storage, and memory speeds) is
>> particularly slow.  Based on your document count and index size, your
>> documents are quite small

Re: Division with Stats Component when Grouping in Solr

2015-06-13 Thread Bill Bell
It would be cool to be able to do a two-field GROUP BY with facets.

>> GROUP BY
>>site_id, keyword
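
The closest thing available today is probably pivot faceting, which nests the counts for the two fields (no stats, though). Using the field names from the quote above and a made-up core URL:

http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.pivot=site_id,keyword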


Bill Bell
Sent from mobile


On Jun 13, 2015, at 2:28 PM, Yonik Seeley  wrote:

>> GROUP BY
>>site_id, keyword


Re: Facet

2015-04-05 Thread Bill Bell
Ok

Clarification

The limit is set to -1. But the average result is 300. 

The number of strings stored in the field increased a lot, like 250k to 350k. 
But the number coming back is limited by facet.prefix. 

Would creating 900 fields be better ? Then I could just put the prefix in the 
field name. Like this: proc_ps122

Thoughts ?

So far I have heard SolrCloud and docValues as viable solutions, and stay away from enum.
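
For the docValues route, a minimal sketch in schema.xml (field and type names are ours, not from the thread):

<field name="proc_prefix" type="string" indexed="true" stored="false" docValues="true"/>

Faceting stays the same (facet.field=proc_prefix&facet.prefix=ps1); with docValues enabled, facet.method=fc reads them automatically, as Toke notes below.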

Bill Bell
Sent from mobile


> On Apr 5, 2015, at 2:56 AM, Toke Eskildsen  wrote:
> 
> William Bell  wrote:
> Sent: 05 April 2015 06:20
> To: solr-user@lucene.apache.org
> Subject: Facet
> 
>> We increased our number of terms (String) in a facet by 50,000.
> 
> Do you mean facet.limit=50000?
> 
>> Now we are getting an error when we facet by this field - so we switched it 
>> to
>> facet.method=enum, and now the results come back. However, when we put
>> it into production we literally hit a wall (CPU went to 100% for 16 cores)
>> after about 30 minutes live.
> 
> It was strange that enum worked. Internally, the difference between 
> facet.limit=100 and facet.limit=50000 is quite small. The real hits are for 
> fine-counting within SolrCloud and serializing the result in order to deliver 
> it to the client. I thought enum behaved the same as fc with regard to those 
> two.
> 
>> We tried adding more machines to reduce the CPU, but it did not help.
> 
> Sounds like SolrCloud. More machines does not help here, it might even be 
> worse. What happens is that distributed faceting is two-phase, where the 
> second phase is fine-counting. The fine-counting essentially makes all shards 
> perform micro-searches for a large part of the terms returned: Your shards 
> are bogged down by tens of thousands of small searches.
> 
> If you are feeling adventurous, you can try putting
> http://tokee.github.io/lucene-solr/
> on a test-installation (I am the author). It changes the way the 
> fine-counting is done.
> 
> 
> Depending on your container, you might need to raise the internal limits for 
> GET-communication. Tomcat has a default of 2MB somewhere (sorry, don't 
> remember the details), which is not a lot for 50,000 values.
> 
>> What are some ideas? We are going to try docValues on the field. Does
>> anyone know if method=fc or method=enum works for docValue? I cannot find
>> any documentation on that.
> 
> If DocValues are enabled, fc will use them. It does not change anything for 
> enum. But I would argue against enum for anything in the thousands anyway.
> 
>> We are thinking of splitting the field into 2 fields (fielda, fieldb). At
>> least the number will be less, but not sure if it will help memory?
> 
> The killer is the number of terms requested/returned.
> 
>> The weird thing is for the first 30 minutes things are performing great.
>> Literally at like 10% CPU across 16 cores, not much memory and normal GC.
> 
> It might be because you have just been lucky. Take a look at
> https://twitter.com/anjacks0n/status/509284768035262464
> for how different performance can be for different result set sizes.
> 
>> Originally the facet was a method=fc. Is there an issue with enum? We have
>> facet.threads=20 set, and not sure this is wise for a enum ?
> 
> Facet threading does not thread within each field, it just means that 
> multiple fields are processed in parallel.
> 
> - Toke Eskildsen


Re: ZFS File System for SOLR 3.6 and SOLR 4

2015-03-28 Thread Bill Bell
Is there an advantage for XFS over ext4 for Solr? Anyone done testing?

Bill Bell
Sent from mobile


> On Mar 27, 2015, at 8:14 AM, Shawn Heisey  wrote:
> 
>> On 3/27/2015 12:30 AM, abhi Abhishek wrote:
>> i am trying to use ZFS as filesystem for my Linux Environment. are
>> there any performance implications of using any filesystem other than
>> ext-3/ext-4 with SOLR?
> 
> That should work with no problem.
> 
> The only time Solr tends to have problems is if you try to use a network
> filesystem.  As long as it's a local filesystem and it implements
> everything a program can typically expect from a local filesystem, Solr
> should work perfectly.
> 
> Because of the compatibility problems that the license for ZFS has with
> the GPL, ZFS on Linux is probably not as well tested as other
> filesystems like ext4, xfs, or btrfs, but I have not heard about any big
> problems, so it's probably safe.
> 
> Thanks,
> Shawn
> 


Re: How to boost documents at index time?

2015-03-28 Thread Bill Bell
Issue a Jira ticket?

Did you try debugQuery ?
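
For reference, index-time boosts in the XML update format look like this (values are made up):

<add>
  <doc boost="2.0">
    <field name="id">1</field>
    <field name="title" boost="3.0">some title</field>
  </doc>
</add>

Because index-time boosts are folded into the field norms, debugQuery only shows them inside the fieldNorm factor rather than as a separate number, which can make the effect hard to spot.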

Bill Bell
Sent from mobile


> On Mar 28, 2015, at 1:49 AM, CKReddy Bhimavarapu  wrote:
> 
> I am want to boost docs at index time, I am doing this using boost
> parameter in doc field .
> but I can't see direct impact on the  doc by using  debuQuery.
> 
> My question is that is there any other way to boost doc at index time and
> can see the reflected changes i.e direct impact.
> 
> -- 
> ckreddybh. 


Re: Sort on multivalued attributes

2015-02-09 Thread Bill Bell
Definitely needed !!

Bill Bell
Sent from mobile


> On Feb 9, 2015, at 5:51 AM, Jan Høydahl  wrote:
> 
> Sure, vote for it. Number of votes do not directly make prioritized sooner.
> So you better also add a comment to the JIRA, it will raise committer's 
> attention.
> Even better of course is if you are able to help bring the issue forward by 
> submitting patches.
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
>> 9. feb. 2015 kl. 12.15 skrev Flavio Pompermaier :
>> 
>> Do I have to vote for it..?
>> 
>>> On Mon, Feb 9, 2015 at 11:50 AM, Jan Høydahl  wrote:
>>> 
>>> See https://issues.apache.org/jira/browse/SOLR-2522
>>> 
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> 
>>>> 9. feb. 2015 kl. 10.30 skrev Flavio Pompermaier :
>>>> 
>>>> In my use case it could be very helpful because I use the SIREn plugin to
>>>> index arbitrary JSON-LD and this plugin automatically index also all
>>> nested
>>>> attributes as a Solr field.
>>>> Thus I need for example to gather all entries with a certain value of the
>>>> "type" attribute, ordered by "name" (but name could be a multivalued
>>>> attribute in my use case :( )
>>>> I'd like to avoid to switch to Elasticsearch just to have this single
>>>> feature.
>>>> 
>>>> Thanks for the support,
>>>> Flavio
>>>> 
>>>> On Mon, Feb 9, 2015 at 10:02 AM, Anshum Gupta 
>>>> wrote:
>>>> 
>>>>> Sure, that's correct and makes sense in some use cases. I'll need to
>>> check
>>>>> if Solr functions support such a thing.
>>>>> 
>>>>> On Mon, Feb 9, 2015 at 12:47 AM, Flavio Pompermaier <
>>> pomperma...@okkam.it>
>>>>> wrote:
>>>>> 
>>>>>> I saw that this is possible in Lucene (
>>>>>> https://issues.apache.org/jira/browse/LUCENE-5454) and also in
>>>>>> Elasticsearch. Or am I wrong?
>>>>>> 
>>>>>> On Mon, Feb 9, 2015 at 9:05 AM, Anshum Gupta 
>>>>>> wrote:
>>>>>> 
>>>>>>> Unless I'm missing something here, sorting on a multi-valued field
>>>>> would
>>>>>> be
>>>>>>> non-deterministic in nature.
>>>>>>> 
>>>>>>> On Sun, Feb 8, 2015 at 11:59 PM, Flavio Pompermaier <
>>>>>> pomperma...@okkam.it>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi to all,
>>>>>>>> 
>>>>>>>> Is there any possibility that in the near future Solr could support
>>>>>>> sorting
>>>>>>>> on multivalued fields?
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Flavio
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Anshum Gupta
>>>>>>> http://about.me/anshumgupta
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Anshum Gupta
>>>>> http://about.me/anshumgupta
> 


Re: Collations are not working fine.

2015-02-09 Thread Bill Bell
Can you order the collations by highest to lowest hits?

Bill Bell
Sent from mobile


> On Feb 9, 2015, at 6:47 AM, Nitin Solanki  wrote:
> 
> I am working on spell checking in Solr. I have implemented Suggestions and
> collations in my spell checker component.
> 
> Most of the time collations work fine but in few case it fails.
> 
> *Working*:
> I tried query:*gone wthh thes wnd*: In this "wnd" doesn't give suggestion
> "wind" but collation is coming right = "gone with the wind", hits = 117
> 
> 
> *Not working:*
> But when I tried query: *gone wthh thes wint*: In this "wint" does give
> suggestion "wind" but collation is not coming right. Instead of gone with
> the wind it gives gone with the west, hits = 1.
> 
> And I want to also know what is *hits* in collations.
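
For reference, the hits number comes from collation verification: with spellcheck.maxCollationTries > 0 and spellcheck.collateExtendedResults=true, Solr re-runs each candidate collation and reports how many documents it matches. A typical parameter set (values illustrative):

spellcheck=true&spellcheck.collate=true&spellcheck.maxCollations=5&spellcheck.maxCollationTries=10&spellcheck.collateExtendedResults=true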


Re: How large is your solr index?

2015-01-03 Thread Bill Bell
For Solr 5 why don't we switch doc IDs to 64-bit??

Bill Bell
Sent from mobile


> On Dec 29, 2014, at 1:53 PM, Jack Krupansky  wrote:
> 
> And that Lucene index document limit includes deleted and updated
> documents, so even if your actual document count stays under 2^31-1,
> deleting and updating documents can push the apparent document count over
> the limit unless you very aggressively merge segments to expunge deleted
> documents.
> 
> -- Jack Krupansky
> 
> -- Jack Krupansky
> 
> On Mon, Dec 29, 2014 at 12:54 PM, Erick Erickson 
> wrote:
> 
>> When you say 2B docs on a single Solr instance, are you talking only one
>> shard?
>> Because if you are, you're very close to the absolute upper limit of a
>> shard, internally
>> the doc id is an int or 2^31. 2^31 + 1 will cause all sorts of problems.
>> 
>> But yeah, your 100B documents are going to use up a lot of servers...
>> 
>> Best,
>> Erick
>> 
>> On Mon, Dec 29, 2014 at 7:24 AM, Bram Van Dam 
>> wrote:
>>> Hi folks,
>>> 
>>> I'm trying to get a feel of how large Solr can grow without slowing down
>> too
>>> much. We're looking into a use-case with up to 100 billion documents
>>> (SolrCloud), and we're a little afraid that we'll end up requiring 100
>>> servers to pull it off.
>>> 
>>> The largest index we currently have is ~2billion documents in a single
>> Solr
>>> instance. Documents are smallish (5k each) and we have ~50 fields in the
>>> schema, with an index size of about 2TB. Performance is mostly OK. Cold
>>> searchers take a while, but most queries are alright after warming up. I
>>> wish I could provide more statistics, but I only have very limited
>> access to
>>> the data (...banks...).
>>> 
>>> I'd very grateful to anyone sharing statistics, especially on the larger
>> end
>>> of the spectrum -- with or without SolrCloud.
>>> 
>>> Thanks,
>>> 
>>> - Bram
>> 


Re: Old facet value doesn't go away after index update

2014-12-19 Thread Bill Bell
Set mincount=1
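
i.e., add it to the facet query from the post below:

q=*:*&facet=true&facet.field=collection_facet&facet.mincount=1

facet.mincount defaults to 0, so a term that no longer matches any live documents keeps showing up with a count of 0 until the segments that contain it are merged away.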

Bill Bell
Sent from mobile


> On Dec 19, 2014, at 12:22 PM, Tang, Rebecca  wrote:
> 
> Hi there,
> 
> I have an index that has a field called collection_facet.
> 
> There was a value 'Ness Motley Law Firm Documents' that we wanted to update 
> to 'Ness Motley Law Firm'.  There were 36,132 records with this value.  So I 
> re-indexed just the 36,132 records.  After the update, I ran a facet query 
> (q=*:*&facet=true&facet.field=collection_facet) to see if the value got 
> updated and I saw
> Ness Motley Law Firm 36,132  -- as expected
> Ness Motley Law Firm Documents 0 — Why is this value still here even though 
> clearly there are no records with this value anymore?  I thought maybe it was 
> cached, so I restarted solr, but I still got the same results.
> 
> "facet_fields": { "collection_facet": [
> … "Ness Motley Law Firm", 36132,
> … "Ness Motley Law Firm Documents", 0 ]
> 
> 
> 
> Rebecca Tang
> Applications Developer, UCSF CKM
> Legacy Tobacco Document Library
> E: rebecca.t...@ucsf.edu


Re: Solr Dynamic Field Performance

2014-09-14 Thread Bill Bell
How about perf if you dynamically create 5000 fields ?

Bill Bell
Sent from mobile


> On Sep 14, 2014, at 10:06 AM, Erick Erickson  wrote:
> 
> Dynamic fields, once they are actually _in_ a document, aren't any
> different than statically defined fields. Literally, there's no place
> in the search code that I know of that _ever_ has to check
> whether a field was dynamically or statically defined.
> 
> AFAIK, the only additional cost would be figuring out which pattern
> matched at index time, which is such a tiny portion of the cost of
> indexing that I doubt you could measure it.
> 
> Best,
> Erick
> 
> On Sun, Sep 14, 2014 at 7:58 AM, Saumitra Srivastav
>  wrote:
>> I have a collection with 200 fields and >300M docs running in cloud mode.
>> Each doc have around 20 fields. I now have a use case where I need to
>> replace these explicit fields with 6 dynamic fields. Each of these 200
>> fields will match one of the 6 dynamic field.
>> 
>> I am evaluating performance implications of switching to dynamicFields. I
>> have tested with a smaller dataset(5M docs) but didn't noticed any indexing
>> or query performance degradation.
>> 
>> Query on dynamic fields will either be faceting, range query or full text
>> search.
>> 
>> Are there any known performance issues with using dynamicFields instead of
>> explicit ones?
>> 
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Solr-Dynamic-Field-Performance-tp4158737.html
>> Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to solve?

2014-09-06 Thread Bill Bell
Yeah we already use it. I will try to create a custom function... if I get it 
to work I will post.

The challenge for me is how to dynamically match and add them based on the 
faceting.

Here is a better example.

The doctor core has payloads stored as name:val pairs; the "name" values are doctor specialties. I 
need to pull back by the name since the user faceted on a specialty. So far 
payloads work. But the user now wants to facet on another specialty. For 
example they are looking for a cardiologist and an internal medicine doctor and 
if the doctor practices at the same hospital I need to take the values and add 
them. Else take the max value for the 2 specialties. 

Make sense now ?

Seems like I need to create a payload and my own custom function.

Bill Bell
Sent from mobile


> On Sep 6, 2014, at 12:57 PM, Erick Erickson  wrote:
> 
> Here's a blog with an end-to-end example. Jack's right, it takes some
> configuration and having first-class support in Solr would be a good
> thing...
> 
> http://searchhub.org/2014/06/13/end-to-end-payload-example-in-solr/
> 
> Best,
> Erick
> 
>> On Sat, Sep 6, 2014 at 10:24 AM, Jack Krupansky  
>> wrote:
>> Payload really don't have first class support in Solr. It's a solid feature
>> of Lucene, but never expressed well in Solr. Any thoughts or proposals are
>> welcome!
>> 
>> (Hmmm... I wonder what the good folks at Heliosearch have up their sleeves
>> in this area?!)
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: William Bell
>> Sent: Friday, September 5, 2014 10:03 PM
>> To: solr-user@lucene.apache.org
>> Subject: How to solve?
>> 
>> 
>> We have a core with each document as a person.
>> 
>> We want to boost based on the sweater color, but if the person has sweaters
>> in their closet which are the same manufactuer we want to boost even more
>> by adding them together.
>> 
>> Peter Smit - Sweater: Blue = 1 : Nike, Sweater: Red = 2: Nike, Sweater:
>> Blue=1 : Polo
>> Tony S - Sweater: Red =2: Nike
>> Bill O - Sweater:Red = 2: Polo, Blue=1: Polo
>> 
>> Scores:
>> 
>> Peter Smit - 1+2 = 3.
>> Tony S - 2
>> Bill O - 2 + 1
>> 
>> I thought about using payloads.
>> 
>> sweaters_payload
>> Blue: Nike: 1
>> Red: Nike: 2
>> Blue: Polo: 1
>> 
>> How do I query this?
>> 
>> http://localhost:8983/solr/persons?q=*:*&sort=??
>> 
>> Ideas?
>> 
>> 
>> 
>> 
>> --
>> Bill Bell
>> billnb...@gmail.com
>> cell 720-256-8076


Re: embedded documents

2014-08-24 Thread Bill Bell
See my Jira. It supports it via json.fsuffix=_json&wt=json

http://mail-archives.apache.org/mod_mbox/lucene-dev/201304.mbox/%3CJIRA.12641293.1365394604231.125944.1365397875874@arcas%3E

Bill Bell
Sent from mobile


> On Aug 24, 2014, at 6:43 AM, "Jack Krupansky"  wrote:
> 
> Indexing and query of raw JSON would be a valuable addition to Solr, so maybe 
> you could simply explain more precisely your data model and transformation 
> rules. For example, when multi-level nesting occurs, what does your loader do?
> 
> Maybe if the fielld names were derived by concatenating the full path of JSON 
> key names, like titles_json.FR, field_naming nesting could be handled in a 
> fully automated manner.
> 
> I had been thinking of filing a Jira proposing exactly that, so that even the 
> most deeply nested JSON maps could be supported, although combinations of 
> arrays and maps would be problematic.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Michael Pitsounis
> Sent: Wednesday, August 20, 2014 7:14 PM
> To: solr-user@lucene.apache.org
> Subject: embedded documents
> 
> Hello everybody,
> 
> I had a requirement to store complicated json documents in solr.
> 
> i have modified the JsonLoader to accept complicated json documents with
> arrays/objects as values.
> 
> It stores the object/array and then flatten it and  indexes the fields.
> 
> e.g  basic example document
> 
> {
>   "titles_json":{"FR":"This is the FR title" , "EN":"This is the EN
> title"} ,
>   "id": 103,
>   "guid": "3b2f2998-85ac-4a4e-8867-beb551c0b3c6"
>  }
> 
> It will store titles_json:{"FR":"This is the FR title" , "EN":"This is the
> EN title"}
> and then index fields
> 
> titles.FR:"This is the FR title"
> titles.EN:"This is the EN title"
> 
> 
> Do you see any problems with this approach?
> 
> 
> 
> Regards,
> Michael Pitsounis 


Re: SolrCloud Scale Struggle

2014-08-02 Thread Bill Bell
Auto correct not good

Corrected below 

Bill Bell
Sent from mobile


> On Aug 2, 2014, at 11:11 AM, Bill Bell  wrote:
> 
> Seems way overkill. Are you using /get at all ? If you need the docs avail 
> right away - why ? How about after 30 seconds ? How many docs do you get 
> added per second during peak ? Even Google has a delay when you do Adwords. 
> 
> One idea is to have an empty core that you insert into and then shard into 
> the queries. So one core would be called newdocs and then you would add this 
> core into your query. There are a couple issues with this with scoring but it 
> works nicely. I would not even use Solrcloud for that core.
> 
> Try to reduce number of Java instances running. Reduce memory and use one 
> java per machine. 
> 
> Then if you need faster avail of docs you really need to ask why. Why not 
> later? Do you need search or just showing the user the info ? If for showing 
> maybe query a indexed table for the few not yet indexed ?? Or just store in a 
> db to show the user the info and index later?
> 
> Bill Bell
> Sent from mobile
> 
> 
>> On Aug 1, 2014, at 4:19 AM, "anand.mahajan"  wrote:
>> 
>> Hello all,
>> 
>> Struggling to get this going with SolrCloud - 
>> 
>> Requirement in brief :
>> - Ingest about 4M Used Cars listings a day and track all unique cars for
>> changes
>> - 4M automated searches a day (during the ingestion phase to check if a doc
>> exists in the index (based on values of 4-5 key fields) or it is a new one
>> or an updated version)
>> - Of the 4 M - About 3M Updates to existing docs (for every non-key value
>> change)
>> - About 1M inserts a day (I'm assuming these many new listings come in
>> every day)
>> - Daily Bulk CSV exports of inserts / updates in last 24 hours of various
>> snapshots of the data to various clients
>> 
>> My current deployment : 
>> i) I'm using Solr 4.8 and have set up a SolrCloud with 6 dedicated machines
>> - 24 Core + 96 GB RAM each.
>> ii)There are over 190M docs in the SolrCloud at the moment (for all
>> replicas its consuming overall disk 2340GB which implies - each doc is at
>> about 5-8kb in size.)
>> iii) The docs are split into 36 Shards - and 3 replica per shard (in all
>> 108 Solr Jetty processes split over 6 Servers leaving about 18 Jetty JVMs
>> running on each host)
>> iv) There are 60 fields per doc and all fields are stored at the moment  :( 
>> (The backend is only Solr at the moment)
>> v) The current shard/routing key is a combination of Car Year, Make and
>> some other car level attributes that help classify the cars
>> vi) We are mostly using the default Solr config as of now - no heavy caching
>> as the search is pretty random in nature 
>> vii) Autocommit is on - with maxDocs = 1
>> 
>> Current throughput & Issues :
>> With the above mentioned deployment the daily throughout is only at about
>> 1.5M on average (Inserts + Updates) - falling way short of what is required.
>> Search is slow - Some queries take about 15 seconds to return - and since
>> insert is dependent on at least one Search that degrades the write
>> throughput too. (This is not a Solr issue - but the app demands it so)
>> 
>> Questions :
>> 
>> 1. Autocommit with maxDocs = 1 - is that a goof up and could that be slowing
>> down indexing? Its a requirement that all docs are available as soon as
>> indexed.
>> 
>> 2. Should I have been better served had I deployed a Single Jetty Solr
>> instance per server with multiple cores running inside? The servers do start
>> to swap out after a couple of days of Solr uptime - right now we reboot the
>> entire cluster every 4 days.
>> 
>> 3. The routing key is not able to effectively balance the docs on available
>> shards - There are a few shards with just about 2M docs - and others over
>> 11M docs. Shall I split the larger shards? But I do not have more nodes /
>> hardware to allocate to this deployment. In such case would splitting up the
>> large shards give better read-write throughput? 
>> 
>> 4. To remain with the current hardware - would it help if I remove 1 replica
>> each from a shard? But that would mean even when just 1 node goes down for a
>> shard there would be only 1 live node left that would not serve the write
>> requests.
>> 
>> 5. Also, is there a way to control where the Split Shard replicas would go?
>> Is there a pattern / rule that Solr follows when it creates replicas for
>> split shards?
>> 
>> 6. I read somewhere that creating a Core would cost the OS on

Re: SolrCloud Scale Struggle

2014-08-02 Thread Bill Bell
Seems way overkill. Are you using /get at all ? If you need the docs avail 
right away - why ? How about after 30 seconds ? How many docs do you get added 
per second during peak ? Even Google has a delay when you do Adwords. 

One idea is to have an empty core that you insert into and then shard into the 
queries. So one core would be called newdocs and then you would add this core 
into your query. There are a couple issues with this with scoring but it works 
nicely. I would not even use Solrcloud for that core.

Try to reduce the number of Java instances running. Reduce memory and use one java per 
machine. 

Then if you need faster avail of docs you really need to ask why. Why not 
later? Do you need search or just showing the user the info? If for showing 
maybe query an indexed table for the few not yet indexed?? Or just store in 
a db to show the user the info and index later?
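
A sketch of the newdocs idea, using plain (non-cloud) distributed search; host and core names are made up:

http://host1:8983/solr/maincore/select?q=*:*&shards=host1:8983/solr/maincore,host1:8983/solr/newdocs

New documents go only into the small newdocs core, which can commit very frequently, while the big core keeps a slow commit cycle; the shards parameter stitches the two together at query time.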

Bill Bell
Sent from mobile


> On Aug 1, 2014, at 4:19 AM, "anand.mahajan"  wrote:
> 
> Hello all,
> 
> Struggling to get this going with SolrCloud - 
> 
> Requirement in brief :
> - Ingest about 4M Used Cars listings a day and track all unique cars for
> changes
> - 4M automated searches a day (during the ingestion phase to check if a doc
> exists in the index (based on values of 4-5 key fields) or it is a new one
> or an updated version)
> - Of the 4 M - About 3M Updates to existing docs (for every non-key value
> change)
> - About 1M inserts a day (I'm assuming these many new listings come in
> every day)
> - Daily Bulk CSV exports of inserts / updates in last 24 hours of various
> snapshots of the data to various clients
> 
> My current deployment : 
> i) I'm using Solr 4.8 and have set up a SolrCloud with 6 dedicated machines
> - 24 Core + 96 GB RAM each.
> ii)There are over 190M docs in the SolrCloud at the moment (for all
> replicas its consuming overall disk 2340GB which implies - each doc is at
> about 5-8kb in size.)
> iii) The docs are split into 36 Shards - and 3 replica per shard (in all
> 108 Solr Jetty processes split over 6 Servers leaving about 18 Jetty JVMs
> running on each host)
> iv) There are 60 fields per doc and all fields are stored at the moment  :( 
> (The backend is only Solr at the moment)
> v) The current shard/routing key is a combination of Car Year, Make and
> some other car level attributes that help classify the cars
> vi) We are mostly using the default Solr config as of now - no heavy caching
> as the search is pretty random in nature 
> vii) Autocommit is on - with maxDocs = 1
> 
> Current throughput & Issues :
> With the above mentioned deployment the daily throughout is only at about
> 1.5M on average (Inserts + Updates) - falling way short of what is required.
> Search is slow - Some queries take about 15 seconds to return - and since
> insert is dependent on at least one Search that degrades the write
> throughput too. (This is not a Solr issue - but the app demands it so)
> 
> Questions :
> 
> 1. Autocommit with maxDocs = 1 - is that a goof up and could that be slowing
> down indexing? Its a requirement that all docs are available as soon as
> indexed.
> 
> 2. Should I have been better served had I deployed a Single Jetty Solr
> instance per server with multiple cores running inside? The servers do start
> to swap out after a couple of days of Solr uptime - right now we reboot the
> entire cluster every 4 days.
> 
> 3. The routing key is not able to effectively balance the docs on available
> shards - There are a few shards with just about 2M docs - and others over
> 11M docs. Shall I split the larger shards? But I do not have more nodes /
> hardware to allocate to this deployment. In such case would splitting up the
> large shards give better read-write throughput? 
> 
> 4. To remain with the current hardware - would it help if I remove 1 replica
> each from a shard? But that would mean even when just 1 node goes down for a
> shard there would be only 1 live node left that would not serve the write
> requests.
> 
> 5. Also, is there a way to control where the Split Shard replicas would go?
> Is there a pattern / rule that Solr follows when it creates replicas for
> split shards?
> 
> 6. I read somewhere that creating a Core would cost the OS one thread and a
> file handle. Since a core repsents an index in its entirty would it not be
> allocated the configured number of write threads? (The dafault that is 8)
> 
> 7. The Zookeeper cluster is deployed on the same boxes as the Solr instance
> - Would separating the ZK cluster out help?
> 
> Sorry for the long thread _ I thought of asking these all at once rather
> than posting separate ones.
> 
> Thanks,
> Anand
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrCloud-Scale-Struggle-tp4150592.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Latest jetty

2014-07-26 Thread Bill Bell
Since we are now on latest Java JDK can we move to Jetty 9?

Thoughts ?

Bill Bell
Sent from mobile



Re: stucked with log4j configuration

2014-04-12 Thread Bill Bell
Well I hope log4j2 is something Solr supports when GA

Bill Bell
Sent from mobile


> On Apr 12, 2014, at 7:26 AM, Aman Tandon  wrote:
> 
> I have upgraded my solr4.2 to solr 4.7.1 but in my logs there is an error
> for log4j
> 
> log4j: Could not find resource
> 
> Please find the attachment of the screenshot of the error console
> https://drive.google.com/file/d/0B5GzwVkR3aDzdjE1b2tXazdxcGs/edit?usp=sharing
> -- 
> With Regards
> Aman Tandon


Re: boost results within 250km

2014-04-09 Thread Bill Bell
Just take geodist() and use the map function and send it to bf or boost.
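
A sketch with edismax (point, field and query are placeholders):

q=pizza&defType=edismax&sfield=store&pt=45.15,-93.85&boost=map(geodist(),0,250,2,1)

map(geodist(),0,250,2,1) evaluates to 2 when the distance is between 0 and 250 km and 1 otherwise, so documents inside the radius get their score doubled and everything else is left alone; with bf the same function is added to the score instead.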

Bill Bell
Sent from mobile


> On Apr 9, 2014, at 8:26 AM, Erick Erickson  wrote:
> 
> Why do you want to do this? This sounds like an XY problem, you're
> asking how to do something specific without explaining why you care,
> perhaps there are other ways to do this.
> 
> Best,
> Erick
> 
>> On Tue, Apr 8, 2014 at 11:30 PM, Aman Tandon  wrote:
>> How can i gave the more boost to the results within 250km than others
>> without using result filtering.


Re: Luke 4.6.1 released

2014-02-16 Thread Bill Bell
Yes it works with Solr 

Bill Bell
Sent from mobile


> On Feb 16, 2014, at 3:38 PM, Alexandre Rafalovitch  wrote:
> 
> Does it work with Solr? I couldn't tell what the description was from
> this repo and it's Solr relevance.
> 
> I am sure all the long timers know, but for more recent Solr people,
> the additional information would be useful.
> 
> Regards,
>   Alex.
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
> 
> 
>> On Mon, Feb 17, 2014 at 3:02 AM, Dmitry Kan  wrote:
>> Hello!
>> 
>> Luke 4.6.1 has been just released. Grab it here:
>> 
>> https://github.com/DmitryKey/luke/releases/tag/4.6.1
>> 
>> fixes:
>> loading the jar from command line is now working fine.
>> 
>> --
>> Dmitry Kan
>> Blog: http://dmitrykan.blogspot.com
>> Twitter: twitter.com/dmitrykan


Status of 4.6.1?

2014-01-18 Thread Bill Bell
We just need the bug fix for Solr.xml 

https://issues.apache.org/jira/browse/SOLR-5543

Bill Bell
Sent from mobile



Re: Call to Solr via TCP

2013-12-10 Thread Bill Bell
Yeah, open a socket to the port and send correct GET syntax and Solr will respond with 
results...
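
At the socket level it is just HTTP text; e.g. (core name and query are illustrative):

GET /solr/collection1/select?q=*:*&wt=json HTTP/1.1
Host: localhost:8983
Connection: close

followed by a blank line; the response comes back on the same socket.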



Bill Bell
Sent from mobile


> On Dec 10, 2013, at 2:50 PM, Doug Turnbull 
>  wrote:
> 
> Zwer, is there a reason you need to do this? Its probably very hard to
> get solr to speak TCP. But if you're having a performance or
> infrastructure problem, the group might be able to help you with a far
> simpler solution.
> 
> Sent from my Windows Phone From: Zwer
> Sent: 12/10/2013 12:15 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Call to Solr via TCP
> Maybe I asked incorrectly.
> 
> 
> Solr is Web Application, hosted by some servlet container and is reachable
> via HTTP.
> 
> HTTP is an extension of TCP and I would like to know whether exists some
> lower way to communicate with application (i.e. Solr) hosted by Jetty?
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Call-to-Solr-via-TCP-tp4105932p4105935.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Reverse mm(min-should-match)

2013-11-22 Thread Bill Bell
This is an awesome idea!

Sent from my iPad

> On Nov 22, 2013, at 12:54 PM, Doug Turnbull 
>  wrote:
> 
> Instead of specifying a percentage or number of query terms must match
> tokens in a field, I'd like to do the opposite -- specify how much of a
> field must match a query.
> 
> The problem I'm trying to solve is to boost document titles that closely
> match the query string. If a title looks something like
> 
> *Title: *[solr] [the] [worlds] [greatest] [search] [engine]
> 
> I want to be able to specify how much of the field must match the query
> string. This differs from normal mm. Normal mm specifies a how much of the
> query must match a field.
> 
> As an example, with this title, if I use normal mm=100% and perform the
> following query:
> 
> mm=100%
> q=solr
> 
> This will match the title above, as 100% of [solr] matches the field
> 
> What I really want to get at is a reverse mm:
> 
> Rmm=100%
> q=solr
> 
> The title above will not match in this case. Only 1/6 of the tokens in the
> field match the query.
> 
> However an exact search would match:
> 
> Rmm=100%
> q=solr the worlds greatest search engine
> 
> Here 100% of the query matches the title, so I'm good.
> 
> Is there any way to achieve this in Solr?
> 
> -- 
> Doug Turnbull
> Search & Big Data Architect
> OpenSource Connections 


Re: NullPointerException

2013-11-22 Thread Bill Bell
It seems to be a modified row and referenced in EvaluatorBag.

I am not familiar with either.

Sent from my iPad

> On Nov 22, 2013, at 3:05 AM, Adrien RUFFIE  wrote:
> 
> Hello all,
> 
> I have performed a full indexation with Solr, but when I try to perform an 
> incremental indexation I get the following exception (cf attachment).
> 
> Anyone have an idea of the problem?
> 
> Great thanks
> 


Re: useColdSearcher in SolrCloud config

2013-11-22 Thread Bill Bell
Wouldn't true mean "use the cold searcher"? It seems backwards to me...

Sent from my iPad

> On Nov 22, 2013, at 2:44 AM, ade-b  wrote:
> 
> Hi
> 
> The definition of useColdSearcher config element in solrconfig.xml is
> 
> "If a search request comes in and there is no current registered searcher,
> then immediately register the still warming searcher and use it.  If "false"
> then all requests will block until the first searcher is done warming".
> 
> By the term 'block', I assume SOLR returns a non 200 response to requests.
> Does anybody know the exact response code returned when the server is
> blocking requests?
> 
> If a new SOLR server is introduced into an existing array of SOLR servers
> (in SOLR Cloud setup), it will sync it's index from the leader. To save you
> having to specify warm-up queries in the solrconfig.xml file for first
> searchers, would/could the new server not auto warm it's caches from the
> caches of an existing server?
> 
> Thanks
> Ade 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/useColdSearcher-in-SolrCloud-config-tp4102569.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to work with remote solr savely?

2013-11-22 Thread Bill Bell
Do you have a sample jetty XML to setup basic auth for updates in Solr?

Sent from my iPad

> On Nov 22, 2013, at 7:34 AM, "michael.boom"  wrote:
> 
> Use HTTP basic authentication, setup in your servlet container
> (jetty/tomcat).
> 
> That should work fine if you are *not* using SolrCloud.
> 
> 
> 
> -
> Thanks,
> Michael
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-work-with-remote-solr-savely-tp4102612p4102613.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Jetty 9?

2013-11-07 Thread Bill Bell
So no Jetty 9 until Solr 5? Java 7 is at release 40. Is that our commitment to 
not require Java 7 until Solr 5? 

Most people are probably already on Java 7...

Bill Bell
Sent from mobile


> On Nov 7, 2013, at 1:29 AM, Furkan KAMACI  wrote:
> 
> Here is an issue points to that:
> https://issues.apache.org/jira/browse/SOLR-4839
> 
> 
> 2013/11/7 William Bell 
> 
>> When are we moving Solr to Jetty 9?
>> 
>> --
>> Bill Bell
>> billnb...@gmail.com
>> cell 720-256-8076
>> 


Re: Performance of "rows" and "start" parameters

2013-11-04 Thread Bill Bell
Do you want to look through them all? Have you considered the Lucene API? Not sure if 
that is better but it might be.

Bill Bell
Sent from mobile


> On Nov 4, 2013, at 6:43 AM, "michael.boom"  wrote:
> 
> I saw that some time ago there was a JIRA ticket dicussing this, but still i
> found no relevant information on how to deal with it.
> 
> When working with big nr of docs (e.g. 70M) in my case, I'm using
> start=0&rows=30 in my requests.
> For the first req the query time is ok, the next one is visibily slower, the
> third even more slow and so on until i get some huge query times of up
> 140secs, after a few hundreds requests. My test were done with SolrMeter at
> a rate of 1000qpm. Same thing happens at 100qpm, tough.
> 
> Is there a best practice on how to do in this situation, or maybe an
> explanation why is the query time increasing, from request to request ?
> 
> Thanks!
> 
> 
> 
> -
> Thanks,
> Michael
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Performance-of-rows-and-start-parameters-tp4099194.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Core admin: create new core

2013-11-04 Thread Bill Bell
You could pre-create a bunch of directories and base configs, and create cores as needed. 
Then use the schemaless API to set it up... or make changes in a script and 
reload the core.

Bill Bell
Sent from mobile


> On Nov 4, 2013, at 6:06 AM, Erick Erickson  wrote:
> 
> Right, this has been an issue for a while, there's no current
> way to do this.
> 
> Someday, I'll be able to work on SOLR-4779 which should
> go some toward making this work more easily. It's still not
> exactly what you're looking for, but it might work.
> 
> Of course with SolrCloud you can specify a configuration
> set that is used for multiple collections.
> 
> People are using Puppet or similar to automate this over
> large numbers of nodes, but that's not entirely satisfactory
> either in our case I suspect.
> 
> FWIW,
> Erick
> 
> 
>> On Mon, Nov 4, 2013 at 4:00 AM, Bram Van Dam  wrote:
>> 
>> The core admin CREATE function requires that the new instance dir and
>> schema/config exist already. Is there a particular reason for this? It
>> would be incredible convenient if I could create a core with a new schema
>> and new config simply by calling CREATE (maybe providing the contents of
>> config.xml and schema.xml as base64 encoded strings in HTTP POST or
>> something?).
>> 
>> I'm guessing this isn't currently possible?
>> 
>> Ta,
>> 
>> - bram
>> 


Re: Proposal for new feature, cold replicas, brainstorming

2013-10-27 Thread Bill Bell
Yeah, replicating to a DR site would be good too. 

Bill Bell
Sent from mobile


> On Oct 24, 2013, at 6:27 AM, yriveiro  wrote:
> 
> I'm wondering some time ago if it's possible have replicas of a shard
> synchronized but in an state that they can't accept queries only updates. 
> 
> This replica in "replication" mode only awake to accept queries if it's the
> last alive replica and goes to replication mode when other replica becomes
> alive and synchronized.
> 
> The motivation of this is simple, I want have replication but I don't want
> have n replicas actives with full resources allocated (cache and so on).
> This is usefull in enviroments where replication is needed but a high query
> throughput is not fundamental and the resources are limited.
> 
> I know that right now is not possible, but I think that it's a feature that
> can be implemented in a easy way creating a new status for shards.
> 
> The bottom line question is, I'm the only one with this kind of
> requeriments? Does it make sense one functionality like this?
> 
> 
> 
> -
> Best regards
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Proposal-for-new-feature-cold-replicas-brainstorming-tp4097501.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr - what's the next big thing?

2013-10-26 Thread Bill Bell
Full JSON support: deep complex object indexing and search. Game changer. 

Bill Bell
Sent from mobile


> On Oct 26, 2013, at 1:04 PM, Otis Gospodnetic  
> wrote:
> 
> Hi,
> 
>> On Sat, Oct 26, 2013 at 5:58 AM, Saar Carmi  wrote:
>> LOL,  Jack.  I can imagine Otis saying that.
> 
> Funny indeed, but not really.
> 
>> Otis,  with these marriage,  are we going to see map reduce based queries?
> 
> Can you please describe what you mean by that?  Maybe with an example.
> 
> Thanks,
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
> 
> 
> 
>>> On Oct 25, 2013 10:03 PM, "Jack Krupansky"  wrote:
>>> 
>>> But a lot of that big yellow elephant stuff is in 4.x anyway.
>>> 
>>> (Otis: I was afraid that you were going to say that the next big thing in
>>> Solr is... Elasticsearch!)
>>> 
>>> -- Jack Krupansky
>>> 
>>> -Original Message- From: Otis Gospodnetic
>>> Sent: Friday, October 25, 2013 2:43 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Solr - what's the next big thing?
>>> 
>>> Saar,
>>> 
>>> The marriage with the big yellow elephant is a big deal. It changes the
>>> scale.
>>> 
>>> Otis
>>> Solr & ElasticSearch Support
>>> http://sematext.com/
>>> On Oct 25, 2013 5:32 AM, "Saar Carmi"  wrote:
>>> 
>>> If I am not mistaken the most impressive improvement of Solr 4.0 compared
>>>> to previous versions was the Solr Cloud architecture.
>>>> 
>>>> What would be the next big thing in Solr 5.0 ?
>>>> 
>>>> Saar
>>> 


Re: Spatial Distance Range

2013-10-22 Thread Bill Bell
Yes frange works 
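
For example, something like this, reusing the point and field from the question below:

fq={!frange l=10 u=20}geodist()&sfield=store&pt=45.15,-93.85

geodist() returns kilometers, so l and u bound the 10-20 km ring.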

Bill Bell
Sent from mobile


> On Oct 22, 2013, at 8:17 AM, Eric Grobler  wrote:
> 
> Hi Everyone,
> 
> Normally one would search for documents where the location is within a
> specified distance, for example widthin 5 km:
> fq={!geofilt pt=45.15,-93.85 sfield=store
> d=5}<http://localhost:8983/solr/select?wt=json&indent=true&fl=name,store&q=*:*&fq=%7B!geofilt%20pt=45.15,-93.85%20sfield=store%20d=5%7D>
> 
> It there a way to specify a range between 10 and 20 km?
> Something like:
> fq={!geofilt pt=45.15,-93.85 sfield=store distancefrom=10
> distanceupto=20}<http://localhost:8983/solr/select?wt=json&indent=true&fl=name,store&q=*:*&fq=%7B!geofilt%20pt=45.15,-93.85%20sfield=store%20d=5%7D>
> 
> Thanks
> Ericz


Re: Skipping caches on a /select

2013-10-17 Thread Bill Bell
But a global cache=false on a qt (request handler) would be awesome!!!
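
For reference, the per-query form Yonik describes below looks like this (query values are made up):

q={!cache=false}ipod&fq={!cache=false}category:electronics

Every q and fq has to carry its own local param, which is why a handler-level default would be handy.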

Bill Bell
Sent from mobile


> On Oct 17, 2013, at 2:43 PM, Yonik Seeley  wrote:
> 
> There isn't a global  "cache=false"... it's a local param that can be
> applied to any "fq" or "q" parameter independently.
> 
> -Yonik
> 
> 
>> On Thu, Oct 17, 2013 at 4:39 PM, Tim Vaillancourt  
>> wrote:
>> Thanks Yonik,
>> 
>> Does "cache=false" apply to all caches? The docs make it sound like it is
>> for filterCache only, but I could be misunderstanding.
>> 
>> When I force a commit and perform a /select a query many times with
>> "cache=false", I notice my query gets cached still, my guess is in the
>> queryResultCache. At first the query takes 500ms+, then all subsequent
>> requests take 0-1ms. I'll confirm this queryResultCache assumption today.
>> 
>> Cheers,
>> 
>> Tim
>> 
>> 
>>> On 16/10/13 06:33 PM, Yonik Seeley wrote:
>>> 
>>> On Wed, Oct 16, 2013 at 6:18 PM, Tim Vaillancourt
>>> wrote:
>>>> 
>>>> I am debugging some /select queries on my Solr tier and would like to see
>>>> if there is a way to tell Solr to skip the caches on a given /select
>>>> query
>>>> if it happens to ALREADY be in the cache. Live queries are being inserted
>>>> and read from the caches, but I want my debug queries to bypass the cache
>>>> entirely.
>>>> 
>>>> I do know about the "cache=false" param (that causes the results of a
>>>> select to not be INSERTED in to the cache), but what I am looking for
>>>> instead is a way to tell Solr to not read the cache at all, even if there
>>>> actually is a cached result for my query.
>>> 
>>> Yeah, cache=false for "q" or "fq" should already not use the cache at
>>> all (read or write).
>>> 
>>> -Yonik


Re: DIH

2013-10-15 Thread Bill Bell
We are NOW CPU bound. Thoughts???

Bill Bell
Sent from mobile


> On Oct 15, 2013, at 8:49 PM, Bill Bell  wrote:
> 
> We have a custom Field processor in DIH and we are not CPU bound on one 
> core... How do we thread it ?? We need to use more cores
> 
> The box has 32 cores and 1 is 100% CPU bound.
> 
> Ideas ?
> 
> Bill Bell
> Sent from mobile
> 


DIH

2013-10-15 Thread Bill Bell
We have a custom Field processor in DIH and we are not CPU bound on one core... 
How do we thread it ?? We need to use more cores

The box has 32 cores and 1 is 100% CPU bound.

Ideas ?

Bill Bell
Sent from mobile



Re: Solr 4.4.0 on Ubuntu 10.04 with Jetty 6.1 from package Repository

2013-10-10 Thread Bill Bell
Does this work ?
I can suggest -XX:-UseLoopPredicate to switch off predicates.

???

Which version of 7 is recommended ?

Bill Bell
Sent from mobile


> On Oct 10, 2013, at 11:29 AM, "Smiley, David W."  wrote:
> 
> *Don't* use JDK 7u40, it's been known to cause index corruption and
> SIGSEGV faults with Lucene: LUCENE-5212   This has not been unnoticed by
> Oracle.
> 
> ~ David
> 
>> On 10/10/13 12:34 PM, "Guido Medina"  wrote:
>> 
>> 2. Java version: There are huges performance winning between Java 5, 6
>>   and 7; we use Oracle JDK 7u40.
> 


Re: Field with default value and stored=false, will be reset back to the default value in case of updating other fields

2013-10-09 Thread Bill Bell
You have to update the whole record including all fields... atomic updates rebuild the document from its stored fields, so a stored=false field can't be carried over and falls back to its default when another field is updated.

Bill Bell
Sent from mobile


> On Oct 9, 2013, at 7:50 PM, deniz  wrote:
> 
> hi all,
> 
> I have encountered some problems and post it on stackoverflow here:
> http://stackoverflow.com/questions/19285251/solr-field-with-default-value-resets-itself-if-it-is-stored-false
>  
> 
> as you can see from the response, does it make sense to open a bug ticket
> for this? because, although i can workaround this by setting everything back
> to stored=true, it does not make sense to keep every field stored while i
> dont need to return them in the search result.. or will anyone can make more
> detailed explanations that this is expected and normal? 
> 
> 
> 
> -
> Zeki ama calismiyor... Calissa yapar...
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Field-with-default-value-and-stored-false-will-be-reset-back-to-the-default-value-in-case-of-updatins-tp4094508.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 4.5 spatial search - distance and score

2013-09-13 Thread Bill Bell
You can apply his 4.5 patches to 4.4 or take trunk and it is there

Bill Bell
Sent from mobile


On Sep 12, 2013, at 6:23 PM, Weber  wrote:

> I'm trying to get score by using a custom boost and also get the distance. I
> found David's code* to get it using "Intersects", which I want to replace by
> {!geofilt} or geodist()
> 
> *David's code: https://issues.apache.org/jira/browse/SOLR-4255
> 
> He told me geodist() will be available again for this kind of field, which
> is a geohash type.
> 
> Then, I'd like to know how it can be done today on 4.4 with {!geofilt} and
> how it will be done on 4.5 using geodist()
> 
> Thanks in advance.
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-4-5-spatial-search-distance-and-score-tp4089706.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Some highlighted snippets aren't being returned

2013-09-08 Thread Bill Bell
Zip up all your configs 
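
One parameter worth checking for symptoms like this (a guess on our part, not something established in the thread): hl.maxAnalyzedChars, which defaults to 51200, so a term that first appears deep inside a large extracted PDF can match the query yet produce no snippet. Raising it is a one-parameter test:

hl.maxAnalyzedChars=1000000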

Bill Bell
Sent from mobile


On Sep 8, 2013, at 3:00 PM, "Eric O'Hanlon"  wrote:

> Hi again Everyone,
> 
> I didn't get any replies to this, so I thought I'd re-send in case anyone 
> missed it and has any thoughts.
> 
> Thanks,
> Eric
> 
> On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon  wrote:
> 
>> Hi Everyone,
>> 
>> I'm facing an issue in which my solr query is returning highlighted snippets 
>> for some, but not all results.  For reference, I'm searching through an 
>> index that contains web crawls of human-rights-related websites.  I'm 
>> running solr as a webapp under Tomcat and I've included the query's solr 
>> params from the Tomcat log:
>> 
>> ...
>> webapp=/solr-4.2
>> path=/select
>> params={facet=true&sort=score+desc&group.limit=10&spellcheck.q=Unangan&f.mimetype_code.facet.limit=7&hl.simple.pre=&q.alt=*:*&f.organization_type__facet.facet.limit=6&f.language__facet.facet.limit=6&hl=true&f.date_of_capture_.facet.limit=6&group.field=original_url&hl.simple.post=&facet.field=domain&facet.field=date_of_capture_&facet.field=mimetype_code&facet.field=geographic_focus__facet&facet.field=organization_based_in__facet&facet.field=organization_type__facet&facet.field=language__facet&facet.field=creator_name__facet&hl.fragsize=600&f.creator_name__facet.facet.limit=6&facet.mincount=1&qf=text^1&hl.fl=contents&hl.fl=title&hl.fl=original_url&wt=ruby&f.geographic_focus__facet.facet.limit=6&defType=edismax&rows=10&f.domain.facet.limit=6&q=Unangan&f.organization_based_in__facet.facet.limit=6&q.op=AND&group=true&hl.usePhraseHighlighter=true}
>>  hits=8 status=0 QTime=108
>> ...
>> 
>> For the query above (which can be simplified to say: find all documents that 
>> contain the word "unangan" and return facets, highlights, etc.), I get five 
>> search results.  Only three of these are returning highlighted snippets.  
>> Here's the "highlighting" portion of the solr response (note: printed in 
>> ruby notation because I'm receiving this response in a Rails app):
>> 
>> 
>> "highlighting"=>
>> {"20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
>>   {},
>>  
>> "20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
>>   {},
>>  
>> "20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
>>   {},
>>  "20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
>>   {"contents"=>
>> ["...actual snippet is returned here..."]},
>>  "20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
>>   {"contents"=>
>> ["...actual snippet is returned here..."]},
>>  
>> "20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999"=>
>>   {"contents"=>
>> ["...actual snippet is returned here..."]},
>>  
>> "20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw"=>
>>   {"contents"=>
>> ["...actual snippet is returned here..."]},
>>  
>> "20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf"=>
>>   {}}
>> 
>> 
>> I have eight (as opposed to five) results above because I'm also doing a 
>> grouped query, grouping by a field called "original_url", and this leads to 
>> five grouped results.
>> 
>> I've confirmed that my highlight-lacking results DO contain the word 
>> "unangan", as expected, and this term is appearing in a text field that's 
>> indexed and stored, and being searched for all text searches.  For example, 
>> one of the search results is for a crawl of this document: 
>> http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf
>> 
>> And if you view that document on the web, you'll see that it does contain 
>> "unangan".
>> 
>> Has anyone seen this before?  And does anyone have any good suggestions for 
>> troubleshooting/fixing the problem?
>> 
>> Thanks!
>> 
>> - Eric
> 


Re: Solr 4.2.1 update to 4.3/4.4 problem

2013-08-27 Thread Bill Bell
You need the analysis applied at both index and query time, i.e. define both
<analyzer type="index"> and <analyzer type="query"> in the fieldType.
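
The configurations pasted below lost their tags in the archive, so for reference, a typical case-insensitive string-like type looks roughly like this (a sketch, not the original poster's exact config):

<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

That keeps the whole string (including spaces, '-' and '\') as a single token and lowercases it at both index and query time.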

Bill Bell
Sent from mobile


On Aug 26, 2013, at 5:42 AM, skorrapa  wrote:

> I have also re indexed the data and tried. And also tried with the belowl
>   sortMissingLast="true" omitNorms="true">
>  
>
>
>  
>
>
>
>  
>
>
>
>  
>
> This didnt work as well...
> 
> 
> 
> On Mon, Aug 26, 2013 at 4:03 PM, skorrapa [via Lucene] <
> ml-node+s472066n4086601...@n3.nabble.com> wrote:
> 
>> Hello All,
>> 
>> I am still facing the same issue. Case insensitive search isnot working on
>> Solr 4.3
>> I am using the below configurations in schema.xml
>> > sortMissingLast="true" omitNorms="true">
>>  
>>
>>
>>  
>>
>>
>>
>>  
>>
>>
>>
>>  
>>
>> Basically I want my string which could have spaces or characters like '-'
>> or \ to be searched upon case insensitively.
>> Please help.
>> 
>> 
>> --
>> If you reply to this email, your message will be added to the discussion
>> below:
>> 
>> http://lucene.472066.n3.nabble.com/Solr-4-2-1-update-to-4-3-4-4-problem-tp4081896p4086601.html
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-4-2-1-update-to-4-3-4-4-problem-tp4081896p4086606.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Concat 2 fields in another field

2013-08-27 Thread Bill Bell
If it is just for search, copyField into a multiValued field.

Or do it at indexing time using DIH or code. A Rhino script works too.

Bill Bell
Sent from mobile


On Aug 27, 2013, at 7:15 AM, "Jack Krupansky"  wrote:

> I have additional examples in the two most recent early access releases of my 
> book - variations on using the existing update processors.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Federico Chiacchiaretta
> Sent: Tuesday, August 27, 2013 8:39 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Concat 2 fields in another field
> 
> Hi,
> we do the same thing using an update request processor chain, this is the
> snippet from solrconfig.xml
> 
> <updateRequestProcessorChain>
>   <processor class="solr.CloneFieldUpdateProcessorFactory">
>     <str name="source">firstname</str>
>     <str name="dest">concatfield</str>
>   </processor>
>   <processor class="solr.CloneFieldUpdateProcessorFactory">
>     <str name="source">lastname</str>
>     <str name="dest">concatfield</str>
>   </processor>
>   <processor class="solr.ConcatFieldUpdateProcessorFactory">
>     <str name="fieldName">concatfield</str>
>     <str name="delimiter">_</str>
>   </processor>
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
> 
> 
> 
> Regards,
> Federico Chiacchiaretta
> 
> 
> 
> 2013/8/27 Markus Jelsma 
> 
>> You may be more interested in the ConcatFieldUpdateProcessorFactory:
>> 
>> http://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/update/processor/ConcatFieldUpdateProcessorFactory.html
>> 
>> 
>> 
>> -Original message-
>> > From:Alok Bhandari 
>> > Sent: Tuesday 27th August 2013 14:05
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: Concat 2 fields in another field
>> >
>> > Thanks for reply.
>> >
>> > But I don't want to introduce any scripting in my code so want to know > is
>> > there any Java component available for the same.
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://lucene.472066.n3.nabble.com/Concat-2-fields-in-another-field-tp4086786p4086791.html
>> > Sent from the Solr - User mailing list archive at Nabble.com.
>> >
> 


Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Bill Bell
This seems like a fairly large issue. Can you create a Jira issue?

Bill Bell
Sent from mobile


On Jul 30, 2013, at 12:34 PM, Dotan Cohen  wrote:

> On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal  wrote:
>> Does adding facet.mincount=2 help?
>> 
>> 
> 
> In fact, when adding facet.mincount=20 (I know that some dupes are in
> the hundreds) I got the OutOfMemoryError in seconds instead of
> minutes.
> 
> -- 
> Dotan Cohen
> 
> http://gibberish.co.il
> http://what-is-what.com
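For reference, a rough SolrJ sketch of walking the facet on the ID field in pages (facet.limit plus facet.offset) with facet.mincount=2, so only values that occur more than once come back. The core URL, field name, and page size are made up for illustration, and paging only keeps each response small; it does not remove the field-cache cost behind the OutOfMemoryError above.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DupeIdScan {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    int pageSize = 1000;                      // hypothetical page size
    for (int offset = 0; ; offset += pageSize) {
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(0);                           // only the facet counts are wanted
      q.setFacet(true);
      q.addFacetField("id");                  // the supposedly unique key field
      q.setFacetMinCount(2);                  // report only duplicated values
      q.setFacetLimit(pageSize);
      q.set("facet.offset", offset);
      QueryResponse rsp = server.query(q);
      FacetField ff = rsp.getFacetField("id");
      if (ff == null || ff.getValues() == null || ff.getValues().isEmpty()) {
        break;                                // no more duplicated values
      }
      for (FacetField.Count c : ff.getValues()) {
        System.out.println(c.getName() + " appears " + c.getCount() + " times");
      }
    }
    server.shutdown();
  }
}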


Re: Performance question on Spatial Search

2013-07-29 Thread Bill Bell
Can you compare with the old geo handler as a baseline?

Bill Bell
Sent from mobile


On Jul 29, 2013, at 4:25 PM, Erick Erickson  wrote:

> This is very strange. I'd expect slow queries on
> the first few queries while these caches were
> warmed, but after that I'd expect things to
> be quite fast.
> 
> For a 12G index and 256G RAM, you have on the
> surface a LOT of hardware to throw at this problem.
> You can _try_ giving the JVM, say, 18G but that
> really shouldn't be a big issue, your index files
> should be MMaped.
> 
> Let's try the crude thing first and give the JVM
> more memory.
> 
> FWIW
> Erick
> 
> On Mon, Jul 29, 2013 at 4:45 PM, Steven Bower  wrote:
>> I've been doing some performance analysis of a spacial search use case I'm
>> implementing in Solr 4.3.0. Basically I'm seeing search times alot higher
>> than I'd like them to be and I'm hoping people may have some suggestions
>> for how to optimize further.
>> 
>> Here are the specs of what I'm doing now:
>> 
>> Machine:
>> - 16 cores @ 2.8ghz
>> - 256gb RAM
>> - 1TB (RAID 1+0 on 10 SSD)
>> 
>> Content:
>> - 45M docs (not very big only a few fields with no large textual content)
>> - 1 geo field (using config below)
>> - index is 12gb
>> - 1 shard
>> - Using MMapDirectory
>> 
>> Field config:
>> 
>> > distErrPct="0.025" maxDistErr="0.00045"
>> spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
>> units="degrees"/>
>> 
>> > required="false" stored="true" type="geo"/>
>> 
>> 
>> What I've figured out so far:
>> 
>> - Most of my time (98%) is being spent in
>> java.nio.Bits.copyToByteArray(long,Object,long,long) which is being
>> driven by BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
>> which from what I gather is basically reading terms from the .tim file
>> in blocks
>> 
>> - I moved from Java 1.6 to 1.7 based upon what I read here:
>> http://blog.vlad1.com/2011/10/05/looking-at-java-nio-buffer-performance/
>> and it definitely had some positive impact (i haven't been able to
>> measure this independantly yet)
>> 
>> - I changed maxDistErr from 0.000009 (which is 1m precision per the docs)
>> to 0.00045 (50m precision) ..
>> 
>> - It looks to me that the .tim file are being memory mapped fully (ie
>> they show up in pmap output) the virtual size of the jvm is ~18gb
>> (heap is 6gb)
>> 
>> - I've optimized the index but this doesn't have a dramatic impact on
>> performance
>> 
>> Changing the precision and the JVM upgrade yielded a drop from ~18s
>> avg query time to ~9s avg query time.. This is fantastic but I want to
>> get this down into the 1-2 second range.
>> 
>> At this point it seems that basically i am bottle-necked on basically
>> copying memory out of the mapped .tim file which leads me to think
>> that the only solution to my problem would be to read less data or
>> somehow read it more efficiently..
>> 
>> If anyone has any suggestions of where to go with this I'd love to know
>> 
>> 
>> thanks,
>> 
>> steve


Re: How to setup SimpleFSDirectoryFactory

2012-07-22 Thread Bill Bell
I get a similar situation using Windows 2008 and Solr 3.6. Memory using mmap is 
never released. Even if I turn off traffic and commit and do a manual gc. If 
the size of the index is 3gb then memory used will be heap + 3gb of shared 
used. If I use a 6gb index I get heap + 6gb. If I turn off MMapDirectoryFactory 
it goes back down. When is MMap supposed to release memory? It only does 
it on JVM restart now.

Bill Bell
Sent from mobile


On Jul 22, 2012, at 6:21 AM, geetha anjali  wrote:

> It happens in 3.6; for this reason I thought of moving to Solandra.
> If I do a commit, all documents are persisted without any issues.
> There are no issues in terms of functionality; the only thing that happens is that
> physical RAM use goes higher and higher, stops at the maximum, and it
> never comes down.
> 
> Thanks
> 
> On Sun, Jul 22, 2012 at 3:38 AM, Lance Norskog  wrote:
> 
>> Interesting. Which version of Solr is this? What happens if you do a
>> commit?
>> 
>> On Sat, Jul 21, 2012 at 8:01 AM, geetha anjali 
>> wrote:
>>> Hi uwe,
>>> Great to know. We have files indexing 1/min. After 30 mins I see all
>>> my physical memory say its 100 percentage used(windows). On deep
>>> investigation found that mmap is not releasing os files handles. Do you
>>> find this behaviour?
>>> 
>>> Thanks
>>> 
>>> On 20 Jul 2012 14:04, "Uwe Schindler"  wrote:
>>> 
>>> Hi Bill,
>>> 
>>> MMapDirectory uses the file system cache of your operating system, which
>> has
>>> following consequences: In Linux, top & free should normally report only
>>> *few* free memory, because the O/S uses all memory not allocated by
>>> applications to cache disk I/O (and shows it as allocated, so having 0%
>> free
>>> memory is just normal on Linux and also Windows). If you have other
>>> applications or Lucene/Solr itself that allocate lots of heap space or
>>> malloc() a lot, then you are reducing free physical memory, so reducing
>> fs
>>> cache. This depends also on your swappiness parameter (if swappiness is
>>> higher, inactive processes are swapped out easier, default is 60% on
>> linux -
>>> freeing more space for FS cache - the backside is of course that maybe
>>> in-memory structures of Lucene and other applications get pages out).
>>> 
>>> You will only see no paging at all if all memory allocated all
>> applications
>>> + all mmapped files fit into memory. But paging in/out the mmapped Lucene
>>> index is much cheaper than using SimpleFSDirectory or
>> NIOFSDirectory. If
>>> you use SimpleFS or NIO and your index is not in FS cache, it will also
>> read
>>> it from physical disk again, so where is the difference. Paging is
>> actually
>>> cheaper as no syscalls are involved.
>>> 
>>> If you want as much as possible of your index in physical RAM, copy it to
>>> /dev/null regularly and buy more RUM :-)
>>> 
>>> 
>>> -
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: uwe@thetaphi...
>>> 
>>>> From: Bill Bell [mailto:billnb...@gmail.com]
>>>> Sent: Friday, July 20, 2012 5:17 AM
>>>> Subject: Re: ...
>>>> stop using it? The least used memory will be removed from the OS
>>>> automatically? I see some paging. Wouldn't paging slow down the querying?
>>> 
>>>> 
>>>> My index is 10gb and every 8 hours we get most of it in shared memory.
>> The
>>>> m=mory is 99 percent used, and that does not leave any room for other
>>> apps. =
>>> 
>>>> Other implications?
>>>> 
>>>> Sent from my mobile device
>>>> 720-256-8076
>>>> 
>>>> On Jul 19, 2012, at 9:49 A...
>>>> H=ap space or free system RAM:
>>> 
>>>>> 
>>>>> 
>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.htm
>>>>> l
>>>>> 
>>>>> Uwe
>>>>> ...
>>>>>> use i= since you might run out of memory on large indexes right?
>>> 
>>>>>> 
>>>>>> Here is how I got iSimpleFSDirectoryFactory to work. Just set -
>>>>>> Dsolr.directoryFactor...
>>>>>> set it=all up with a helper in solrconfig.xml...
>>> 
>>>>>> 
>>>>>> if (Constants.WINDOWS) {
>>>>>> if (MMapDirectory.UNMAP_SUPPORTED && Constants.JRE_IS_64...
>> 
>> 
>> 
>> --
>> Lance Norskog
>> goks...@gmail.com
>> 


Re: How to setup SimpleFSDirectoryFactory

2012-07-19 Thread Bill Bell
Thanks. Are you saying that if we run low on memory, the MMapDirectory will 
stop using it? The least used memory will be removed from the OS automatically? 
I see some paging. Wouldn't paging slow down the querying?

My index is 10gb and every 8 hours we get most of it in shared memory. The 
memory is 99 percent used, and that does not leave any room for other apps. 

Other implications?

Sent from my mobile device
720-256-8076

On Jul 19, 2012, at 9:49 AM, "Uwe Schindler"  wrote:

> Read this, then you will see that MMapDirectory will use 0% of your Java Heap 
> space or free system RAM:
> 
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> 
>> -Original Message-
>> From: William Bell [mailto:billnb...@gmail.com]
>> Sent: Tuesday, July 17, 2012 6:05 AM
>> Subject: How to setup SimpleFSDirectoryFactory
>> 
>> We all know that MMapDirectory is fastest. However we cannot always use it
>> since you might run out of memory on large indexes right?
>> 
>> Here is how I got iSimpleFSDirectoryFactory to work. Just set -
>> Dsolr.directoryFactory=solr.SimpleFSDirectoryFactory.
>> 
>> Your solrconfig.xml:
>> 
>> <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
>> 
>> You can check it with http://localhost:8983/solr/admin/stats.jsp
>> 
>> Notice that the default for Windows 64-bit is MMapDirectory, else
>> NIOFSDirectory except for Windows. It would be nicer if we just set it
>> all up
>> with a helper in solrconfig.xml...
>> 
>> if (Constants.WINDOWS) {
>>   if (MMapDirectory.UNMAP_SUPPORTED && Constants.JRE_IS_64BIT)
>>     return new MMapDirectory(path, lockFactory);
>>   else
>>     return new SimpleFSDirectory(path, lockFactory);
>> } else {
>>   return new NIOFSDirectory(path, lockFactory);
>> }
>> 
>> 
>> 
>> --
>> Bill Bell
>> billnb...@gmail.com
>> cell 720-256-8076
> 
> 


Re: Mmap

2012-07-16 Thread Bill Bell
Any thoughts on this? Is the default MMap?



Sent from my mobile device
720-256-8076

On Feb 14, 2012, at 7:16 AM, Bill Bell  wrote:

> Does someone have an example of using unmap in 3.5 and chunksize?
> 
> I am using Solr 3.5.
> 
> I noticed in solrconfig.xml:
> 
> <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
> 
> I don't see this parameter taking effect when I set 
> -Dsolr.directoryFactory=solr.MMapDirectoryFactory
> 
> How do I see the setting in the log or in stats.jsp ? I cannot find a place 
> that indicates it is set or not.
> 
> I would assume StandardDirectoryFactory is being used but I do see (when I 
> set it or NOT set it)
> 
> Bill Bell
> Sent from mobile
> 


Re: Problem with sorting solr docs

2012-07-04 Thread Bill Bell
Would all optional fields need the sortmissinglast and sortmissingfirst set 
even when not sorting on that field? Seems broken to me.

Sent from my Mobile device
720-256-8076

On Jul 3, 2012, at 6:45 AM, Shubham Srivastava 
 wrote:

> Just adding to the below--> If there is a field(say X) which is not populated 
> and in the query I am not sorting on this particular field but on another 
> field (say Y) still the result ordering would depend on X .
> 
> Infact in the below problem mentioned by Harsh making X as 
> sortMissingLast="false" sortMissingFirst="false" solved the problem while in 
> the query he was sorting on Y.  This seems a bit illogical.
> 
> Regards,
> Shubham
> 
> From: Harshvardhan Ojha [harshvardhan.o...@makemytrip.com]
> Sent: Tuesday, July 03, 2012 5:58 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Problem with sorting solr docs
> 
> Hi,
> 
> I have added  sortMissingLast="false" sortMissingFirst="false"/> to my schema.xml, although 
> I am searching on name field.
> It seems to be working fine. What is its default behavior?
> 
> Regards
> Harshvardhan Ojha
> 
> -Original Message-
> From: Rafał Kuć [mailto:r@solr.pl]
> Sent: Tuesday, July 03, 2012 5:35 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Problem with sorting solr docs
> 
> Hello!
> 
> But the latlng field is not taken into account when sorting with sort defined 
> such as in your query. You only sort on the name field and only that field. 
> You can also define Solr behavior when there is no value in the field, but 
> adding sortMissingLast="true" or sortMissingFirst="true" to your type 
> definition in the schema.xml file.
> 
> --
> Regards,
> Rafał Kuć
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch
> 
>> Hi,
> 
>> Thanks for reply.
>> I want to sort my docs on name field, it is working well only if I have all 
>> fields populated well.
>> But my latlng field is optional, every doc will not have this value.
>> So those docs are not getting sorted.
> 
>> Regards
>> Harshvardhan Ojha
> 
>> -Original Message-
>> From: Rafał Kuć [mailto:r@solr.pl]
>> Sent: Tuesday, July 03, 2012 5:24 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Problem with sorting solr docs
> 
>> Hello!
> 
>> Your query suggests that you are sorting on the 'name' field instead
>> of the latlng field (sort=name +asc).
> 
>> The question is what you are trying to achieve ? Do you want to sort
>> your documents from a given geographical point ? If that's the case
>> you may want to look here:
>> http://wiki.apache.org/solr/SpatialSearch/
>> and look at the possibility of sorting on the distance from a given point.
> 
>> --
>> Regards,
>> Rafał Kuć
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
>> ElasticSearch
> 
> 
>> Hi,
>> 
>> I have 260 docs which I want to sort on a single field latlng.
>> 
>> 1
>> Amphoe Khanom
>> 1.0,1.0
>> 
>> 
>> My query is :
>> http://localhost:8080/solr/select?q=*:*&sort=name +asc
>> 
>> This query sorts all documents except those which doesn’t have latlng,
>> and I can’t keep any default value for this field.
>> My question is how can I sort all docs on latlng?
>> 
>> Regards
>> Harshvardhan Ojha  | Software Developer - Technology Development
>>|  MakeMyTrip.com, 243 SP Infocity, Udyog Vihar Phase 1, Gurgaon,
>> Haryana - 122 016, India
> 
> 


Re: UI

2012-05-21 Thread Bill Bell
The php.net Solr extension is the best. SolrPHPClient is missing several features.

Sent from my Mobile device
720-256-8076

On May 21, 2012, at 6:35 AM, Tolga  wrote:

> Hi,
> 
> Can you recommend a good PHP UI to search? Is SolrPHPClient good?


Re: slave index not cleaned

2012-05-14 Thread Bill Bell
This is a known issue in 1.4, especially on Windows. Some of it was resolved in 
3.x.

Bill Bell
Sent from mobile


On May 14, 2012, at 5:54 AM, Erick Erickson  wrote:

> Hmmm, replication will require up to twice the space of the
> index _temporarily_, just checking if that's what you're seeing
> But that should go away reasonably soon. Out of curiosity, what
> happens if you restart your server, do the extra files go away?
> 
> But it sounds like your index is growing over a longer period of time
> than just a single replication, is that true?
> 
> Best
> Erick
> 
> On Fri, May 11, 2012 at 6:03 AM, Jasper Floor  wrote:
>> Hi,
>> 
>> On Thu, May 10, 2012 at 5:59 PM, Otis Gospodnetic
>>  wrote:
>>> Hi Jasper,
>> 
>> Sorry, I should've added more technical info wihtout being prompted.
>> 
>>> Solr does handle that for you.  Some more stuff to share:
>>> 
>>> * Solr version?
>> 
>> 1.4
>> 
>>> * JVM version?
>> 1.7 update 2
>> 
>>> * OS?
>> Debian (2.6.32-5-xen-amd64)
>> 
>>> * Java replication?
>> yes
>> 
>>> * Errors in Solr logs?
>> no
>> 
>>> * deletion policy section in solrconfig.xml?
>> missing I would say, but I don't see this on the replication wiki page.
>> 
>> This is what we have configured for replication:
>> 
>> 
>>
>> 
>>> name="masterUrl">${solr.master.url}/df-stream-store/replication
>> 
>>00:20:00
>>internal
>>5000
>>1
>> 
>> 
>> 
>> 
>> We will be updating to 3.6 fairly soon however. To be honest, from
>> what I've read, the Solr cloud is what we really want in the future
>> but we will have to be patient for that.
>> 
>> thanks in advance
>> 
>> mvg,
>> Jasper
>> 
>>> You may also want to look at your Index report in SPM 
>>> (http://sematext.com/spm) before/during/after replication and share what 
>>> you see.
>>> 
>>> Otis
>>> 
>>> Performance Monitoring for Solr / ElasticSearch / HBase - 
>>> http://sematext.com/spm
>>> 
>>> 
>>> 
>>> - Original Message -
>>>> From: Jasper Floor 
>>>> To: solr-user@lucene.apache.org
>>>> Cc:
>>>> Sent: Thursday, May 10, 2012 9:08 AM
>>>> Subject: slave index not cleaned
>>>> 
>>>> Perhaps I am missing the obvious but our slaves tend to run out of
>>>> disk space. The index sizes grow to multiple times the size of the
>>>> master. So I just toss all the data and trigger a replication.
>>>> However, can't solr handle this for me?
>>>> 
>>>> I'm sorry if I've missed a simple setting which does this for me, but
>>>> if its there then I have missed it.
>>>> 
>>>> mvg
>>>> Jasper
>>>> 


Re: Is it possible to limit the bandwidth of replication

2012-05-09 Thread Bill Bell
+1 as well especially for larger indexes

Sent from my Mobile device
720-256-8076

On May 9, 2012, at 9:46 AM, Jan Høydahl  wrote:

>> I think we have to add this for java based rep. 
> +1
> 


Re: Replication. confFiles and permissions.

2012-05-09 Thread Bill Bell
Why would you replicate the data import properties? The master does the importing, 
not the slave...

Sent from my Mobile device
720-256-8076

On May 9, 2012, at 7:23 AM, stockii  wrote:

> Hello.
> 
> 
> I am running Solr replication. It works well, but I need to replicate my
> dataimport-properties. 
> 
> When server1 replicates this file, it then creates a new file every time, with a
> *.timestamp suffix, because the first replication run created the file with the wrong
> permissions ...
> 
> How can I tell Solr replication to "chmod 755 dataimport-properties ..."?
> ;-)
> 
> thx
> 
> -
> --- System 
> 
> 
> One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 
> 1 Core with 45 Million Documents other Cores < 200.000
> 
> - Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
> - Solr2 for Update-Request  - delta every Minute - 4GB Xmx
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Replication-confFiles-and-permissions-tp3973825.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solritas in production

2012-05-08 Thread Bill Bell
I would not use Solritas except for very rudimentary solutions and prototypes.

Sent from my Mobile device
720-256-8076

On May 6, 2012, at 6:02 AM, András Bártházi  wrote:

> Hi,
> 
> We're currently evaluating Solr as a Sphinx replacement. Our site has
> 1.000.000+ pageviews a day, it's a real estate search engine. The
> development is almost done, and it seems to be working fine; however, some
> of my colleagues came up with the idea that we're using it wrong. We're using
> it as a service from PHP/Symfony.
> 
> They think we should use Solritas as a frontend, so site visitors will
> use it directly, no PHP will be involved, and it will use much less
> infrastructure. One of them said that even mobile.de is using it that way (I
> have found no evidence of this at all).
> 
> Do you think it is a good idea?
> 
> Do you know services using Solritas as a frontend on a public site?
> 
> My personal opinion is that using Solritas in production is a very bad idea
> for us, but I don't have much experience with Solr yet, and the Solritas
> documentation is far from detailed or up to date, so I don't really know
> what it is really usable for.
> 
> Thanks,
>  Andras


Re: Does Solr fit my needs?

2012-04-27 Thread Bill Bell
You could use SQL Server and External Fields in Solr to get what you need from 
the database for the results of the query.

Bill Bell
Sent from mobile


On Apr 27, 2012, at 8:31 AM, "G.Long"  wrote:

> Hi there :)
> 
> I'm looking for a way to save xml files into some sort of database and i'm 
> wondering if Solr would fit my needs.
> The xml files I want to save have a lot of child nodes which also contain 
> child nodes with multiple values. The depth level can be more than 10.
> 
> After having indexed the files, I would like to be able to query for subparts 
> of those xml files and be able to reconstruct them as xml files with all 
> their children included. However, I'm wondering if it is possible with an 
> index like solr lucene to keep or easily recover the structure of my xml data?
> 
> Thanks for your help,
> 
> Regards,
> 
> Gary


Re: commit stops

2012-04-27 Thread Bill Bell
We also see extreme slowness using Solr 3.6 when trying to commit a delete. We 
also get hangs. We do at most 1 commit a week. Rebuilding from scratch using 
DIH works fine and has never hung.

Bill Bell
Sent from mobile


On Apr 27, 2012, at 5:59 PM, "mav.p...@holidaylettings.co.uk" 
 wrote:

> Thanks for the reply
> 
> The client expects a response within 2 minutes and after that will report
> an error. When we build fresh it seems to work and the operation takes a
> second or two to complete. Once it gets to a stage it hangs it simply
> won't accept any further commits. I did an index check and all was ok.
> 
> I don't see any major commit happening at any time, it seems to just
> hang. Even starting up and shutting down takes ages.
> 
> We make 3 - 4 commits a day.
> 
> We use solr 3.5
> 
> No autocommit
> 
> 
> 
> On 28/04/2012 00:56, "Yonik Seeley"  wrote:
> 
>> On Fri, Apr 27, 2012 at 9:18 AM, mav.p...@holidaylettings.co.uk
>>  wrote:
>>> We have an index of about 3.5gb which seems to work fine until it
>>> suddenly stops accepting new commits.
>>> 
>>> Users can still search on the front end but nothing new can be
>>> committed and it always times out on commit.
>>> 
>>> Any ideas?
>> 
>> Perhaps the commit happens to cause a major merge which may take a
>> long time (and solr isn't going to allow overlapping commits).
>> How long does a commit request take to time out?
>> 
>> What Solr version is this?  Do you have any kind of auto-commit set
>> up?  How often are you manually committing?
>> 
>> -Yonik
>> lucenerevolution.com - Lucene/Solr Open Source Search Conference.
>> Boston May 7-10
> 


Ampersand issue

2012-04-27 Thread Bill Bell
We are indexing a simple XML field from SQL Server into Solr as a stored field. 
We have noticed that the & is output as &amp; when using wt=XML. When 
using wt=JSON we get the normal &. Is there a way to indicate that we don't 
want to encode the field, since it is already XML, when using wt=XML?

Bill Bell
Sent from mobile
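For what it's worth, this looks like plain XML escaping rather than Solr re-encoding the field: a bare & is not legal inside an XML text node, so the XML response writer has to emit &amp;, and any XML parser on the client side hands back the original & again. That is also why wt=JSON shows the plain character.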



Re: change index/store at indexing time

2012-04-27 Thread Bill Bell
Yes you can. Just use a script that is called for each row.

Bill Bell
Sent from mobile


On Apr 27, 2012, at 6:38 PM, "Vazquez, Maria (STM)"  
wrote:

> Hi,
> I'm migrating a project from Lucene 2.9 to Solr 3.4.
> There is a special case in the code that indexes the same field in two 
> different ways, which is completely legal in Lucene directly but I don't know 
> how to duplicate this same behavior in Solr:
> 
> if (isFirstGeo) {
> document.add(new Field("geoids", geoId, Field.Store.YES, 
> Field.Index.NOT_ANALYZED_NO_NORMS));
> isFirstGeo = false;
> } else {
> if (countProducts < 100)
>  document.add(new Field("geoids", geoId, Field.Store.NO, 
> Field.Index.NOT_ANALYZED_NO_NORMS));
> else
>  document.add(new Field("geoids", geoId, Field.Store.YES, 
> Field.Index.NO));
> }
> 
> Is there any way to do this in Solr in a Transformer? I'm using the DIH to 
> index and I can't see a way to do this other than having three fields in the 
> schema like geoids_store_index, geoids_nostore_index, and 
> geoids_store_noindex.
> 
> Thanks a lot in advance.
> Maria
> 
> 
> 


Question concerning date fields

2012-04-20 Thread Bill Bell
We are loading a long (number of seconds since 1970?) value into Solr using 
Java and SolrJ. What is the best way to convert this into the right Solr date 
field?

Sent from my Mobile device
720-256-8076
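A rough sketch of one way to do it with SolrJ, assuming the long really is seconds since the epoch (the field names and the sample value are made up): multiply by 1000 and pass a java.util.Date, which SolrJ serializes in the UTC ISO-8601 form a Solr date field expects.

import java.util.Date;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class EpochToSolrDate {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    long epochSeconds = 1334880000L;               // hypothetical value from the source system
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("created_at", new Date(epochSeconds * 1000L));  // seconds -> milliseconds
    server.add(doc);
    server.commit();
    server.shutdown();
  }
}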


Re: ExtractingRequestHandler

2012-04-01 Thread Bill Bell
I have had good luck with creating a separate core just for the raw data. This is 
a different core from the searchable, indexed core.

Very fast.

Bill Bell
Sent from mobile


On Apr 1, 2012, at 11:15 AM, Erick Erickson  wrote:

> Yes, you can. but Generally, storing the raw input in Solr is
> not the best approach. The problem here is that pretty soon
> you get a huge index that contains *everything*. Solr was not
> intended to be a data store.
> 
> Besides, you then need to store the binary form of the file. Solr
> only deals with text, not markup.
> 
> Most people index the text in Solr, and enough information
> so the application knows where to go to fetch the original
> document when the user drills down (e.g. file path, database
> PK, etc). Would that work for your situation?
> 
> Best
> Erick
> 
> On Sat, Mar 31, 2012 at 3:55 PM,   wrote:
>> Hi,
>> 
>> I want to index various filetypes in Solr; this can easily be done with
>> ExtractingRequestHandler. But I also need the extracted content back.
>> I know ext.extract.only but then nothing gets indexed, right?
>> 
>> Can I index the document AND get the content back as with ext.extract.only?
>> In a single request?
>> 
>> Thank you
>> 
>> 


Re: Empty facet counts

2012-03-29 Thread Bill Bell
Send schema.xml and did you apply any patches? What version of Solr?

Bill Bell
Sent from mobile


On Mar 29, 2012, at 5:26 AM, Youri Westerman  wrote:

> Hi,
> 
> I'm currently learning how to use solr and everything seems pretty straight
> forward. For some reason when I use faceted queries it returns only empty
> sets in the facet_count section.
> 
> The get params I'm using are:
>  ?q=*:*&rows=0&facet=true&facet.field=urn
> 
> The result:
>  "facet_counts": {
> 
>  "facet_queries": { },
>  "facet_fields": { },
>  "facet_dates": { },
>  "facet_ranges": { }
> 
>  }
> 
> The urn field is indexed and there are enough entries to be counted. When
> adding facet.method=Enum, nothing changes.
> Does anyone know why this is happening? Am I missing something?
> 
> Thanks in advance!
> 
> Youri


Re: DataImportHandler: backups prior to full-import

2012-03-28 Thread Bill Bell
You could use the Solr Command Utility SCU that runs from Windows and can be 
scheduled to run. 

https://github.com/justengland/Solr-Command-Utility

This is a Windows utility that will index into a core and swap it if the import 
succeeds. It works with Solr.

Let me know if you have any questions.

On Mar 28, 2012, at 10:11 PM, Shawn Heisey  wrote:

> On 3/28/2012 12:46 PM, Artem Shnayder wrote:
>> Does anyone know of any work done to automatically run a backup prior to a
>> DataImportHandler full-import?
>> 
>> I've asked this question on #solr and was pointed to
>> https://wiki.apache.org/solr/SolrReplication?highlight=%28backup%29#HTTP_API
>> which
>> is helpful but is not an automatic backup in the context of full-import's.
>> I'm wondering if anyone else has done this work yet.
> 
> I have located a previous message from you where you mention that you are on 
> Ubuntu.  If that's true, you can use hard links to make nearly instantaneous 
> backups with a single command:
> 
> ln /path/to/index/* /path/to/backup/.
> 
> One caveat to that - the backup must be on the same filesystem as the index.  
> If keeping backups on another filesystem (or even another computer) is 
> important, then treat the hard link backup as a temporary directory.  Copy 
> the files from that directory to your remote location, then delete them.
> 
> This works because of the way that Lucene (and by extension Solr) manages 
> files on disk - existing segment files are never modified.  If they get 
> merged, new files are created before the old ones are deleted.  There is only 
> one file in an index directory that does change without getting a new name - 
> segments.gen.  I have verified (on Solr 3.5) that even this file is properly 
> handled so that a hard link backup keeps the correct version.
> 
> For people running on Windows, this particular method won't work.  Newer 
> Windows server versions do have one feature that might actually make it 
> possible to do something similar - shadow copies.  I do not know how to 
> leverage the feature, though.
> 
> Thanks,
> Shawn
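If you would rather drive the same hard-link trick from Java (say, from whatever kicks off the full-import), java.nio.file.Files.createLink does what the ln command above does. A minimal sketch with made-up paths, and the same caveat that the backup must live on the same filesystem as the index:

import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HardLinkBackup {
  public static void main(String[] args) throws Exception {
    Path indexDir = Paths.get("/var/solr/data/index");          // hypothetical index directory
    Path backupDir = Paths.get("/var/solr/backup/index-" + System.currentTimeMillis());
    Files.createDirectories(backupDir);
    try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
      for (Path f : files) {
        if (Files.isRegularFile(f)) {
          // a hard link is near-instant, costs no extra space, and never copies data
          Files.createLink(backupDir.resolve(f.getFileName()), f);
        }
      }
    }
  }
}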
> 


Re: Performance Question

2012-03-19 Thread Bill Bell
The size of the index does matter practically speaking.

Bill Bell
Sent from mobile
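(The numFound*log(start+rows) quoted below comes from the collector's priority queue: it keeps only the top start+rows hits, and each of the numFound matching documents costs at most a log-sized queue update. Index size still matters in practice, as noted above, through what fits in RAM and per-term seek costs, so the two statements are not really in conflict.)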


On Mar 19, 2012, at 11:41 AM, Mikhail Khludnev  
wrote:

> Exactly. That's what I mean.
> 
> On Mon, Mar 19, 2012 at 6:15 PM, Jamie Johnson  wrote:
> 
>> Mikhail,
>> 
>> Thanks for the response.  Just to be clear you're saying that the size
>> of the index does not matter, it's more the size of the results?
>> 
>> On Fri, Mar 16, 2012 at 2:43 PM, Mikhail Khludnev
>>  wrote:
>>> Hello,
>>> 
>>> Frankly speaking, the computational complexity of Lucene search depends
>>> on the size of the search result: numFound*log(start+rows), not on the
>>> size of the index.
>>> 
>>> Regards
>>> 
>>> On Fri, Mar 16, 2012 at 9:34 PM, Jamie Johnson 
>> wrote:
>>> 
>>>> I'm curious if anyone tell me how Solr/Lucene performs in a situation
>>>> where you have 100,000 documents each with 100 tokens vs having
>>>> 1,000,000 documents each with 10 tokens.  Should I expect the
>>>> performance to be the same?  Any information would be greatly
>>>> appreciated.
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>> Lucid Certified
>>> Apache Lucene/Solr Developer
>>> Grid Dynamics
>>> 
>>> <http://www.griddynamics.com>
>>> 
>> 
> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
> Lucid Certified
> Apache Lucene/Solr Developer
> Grid Dynamics
> 
> <http://www.griddynamics.com>
> 


Re: Solr core swap after rebuild in HA-setup / High-traffic

2012-03-17 Thread Bill Bell
DIH sets the last update time to the start time of the import, not the end time.

So when the index is rebuilt, if you run a delta import using that update time you 
should be okay. We normally go back a few minutes, as a fail-safe, to make sure we 
have everything.

Sent from my Mobile device
720-256-8076
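For completeness, the swap in step 4 below can go through the CoreAdminHandler, e.g. something like http://master:8983/solr/admin/cores?action=SWAP&core=live&other=rebuild (the host name is a placeholder; the core names are the ones from the message below). The catch-up delta in step 3 is typically a DIH deltaQuery that compares an updated-at column against ${dataimporter.last_index_time}, minus the safety margin mentioned above.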

On Mar 14, 2012, at 12:58 PM, KeesSchepers  wrote:

> Hello everybody,
> 
> I am designing a new Solr architecture for one of my clients. This Solr
> architecture is for a high-traffic website with millions of visitors, but I am
> facing some design problems where I hope you guys could help me out.
> 
> In my situation there are 4 Solr servers running, 1 server is master and 3
> are slave. They are running Solr version 1.4.
> 
> I use two cores 'live' and 'rebuild' and I use Solr DIH to rebuild a core
> which goes like this:
> 
> 1. I wipe the reindex core
> 2. I run the DIH over the complete dataset (4 million documents) in pieces of
> 20,000 records (to prevent very long MySQL locks)
> 3. After the DIH is finished (2 hours) we also have to update the
> rebuild core with changes from the last two hours; this is a problem
> 4. After updating is done and the core is no more than some seconds behind,
> we want to SWAP the cores.
> 
> Everything goes well except for step 3. The rebuild and the core swap is all
> okay. 
> 
> Because the website is undergoing changes every minute, we cannot pause the
> delta-import on the live core and fall behind for 2 hours. The problem is that I
> can't figure out a clean scheme that avoids delaying the live core too long
> while still using the DIH instead of writing a lot of code.
> 
> Did anyone face this problem before or could give me some tips?
> 
> Thanks!
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-core-swap-after-rebuild-in-HA-setup-High-traffic-tp3826461p3826461.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: 3 Way Solr Join . . ?

2012-03-11 Thread Bill Bell
You can do concatenation joins and then put the result into Solr. You can denormalize the 
results. Everyone is telling you the same thing.

Select customer_name, (select group_concat(city) from address where 
nameid=customers.nameid) as state_bar from customers

DIH handler has a way to split on comma to add to a multiValued field.

As I also mentioned elsewhere you can concat into an XML field and store it 
in the index. That works fantastically for denormalizing.

Why do you need everything in the index? Why not do an external field to get it 
later? Are you trying to search on something? What? If you need the addresses 
searchable, then searching on city or state is pretty useful, as I showed above.

Sent from my Mobile device
720-256-8076

On Mar 11, 2012, at 10:59 AM, Angelyna Bola  wrote:

> Walter,
> 
> :: Fields can be multi-valued. Put multiple phone numbers in a field
> and match all of them.
> 
> Thank you for the suggestion, unfortunately I oversimplified my example =(
> 
> Let me try again:
> 
>I should have said that I need to match on 2 fields (as a set) from
> within a given child table.
> 
>Logically, I need to query in Solr for Customers who:
> 
>- Have an address in a given state (e.g. NY) and that address is of
> a given type (e.g. condo)
>- Have a phone in a given area code (e.g. 212) and of a given brand
> (e.g. Nokia)
>- Are a given gender (e.g. male)
> 
> Respectfully,
> 
> Angelyna
> 
> 
> 
> On Sat, Mar 10, 2012 at 7:58 PM, Angelina Bola  
> wrote:
>> Does "Solr" support a 3-way join? i.e.
>> http://wiki.apache.org/solr/Join (I have the 2-way join working)
>> 
>> For example, I am pulling 3 different tables from a RDBMS into one Solr core:
>> 
>>   Table#1: Customers (parent table)
>>   Table#2: Addresses  (child table with foreign key to customers)
>>   Table#3: Phones (child table with foreign key to customers)
>> 
>> with a ONE to MANY relationship between:
>> 
>>Customers and Addresses
>>Customers and Phones
>> 
>> When I pull them into Solr I cannot denormalize the relationships as a
>> given customers can have many addresses and many phones.
>> 
>> When they come into the my single core (customerInfo), each document
>> gets a customerInfo_type and a uid corresponding to that type, for
>> example:
>> 
>>Customer Document
>>customerInfo_type='customer'
>>customer_id
>> 
>>Address Document
>>customerInfo_type='address'
>>fk_address_customer_id
>> 
>>Phone Document
>>customerInfo_type='phone'
>>fk_phone_customer_id
>> 
>> Logically, I need to query in Solr for Customers who:
>> 
>>- Have an address in a given state
>>- Have a phone in a given area code
>>- Are a given gender
>> 
>> Syntactically, it would think it would look like:
>> 
>>  - http://localhost:8983/solr/customerInfo/select/?
>> q={!join from=fk_address_customer_id to=customer_id}address_State:Maine&
>> fq={!join from=customer_id to=fk_phone_customer_id}phone_area_code:212&
>> fq=customer_gender:female
>> 
>> But that does not work for me.
>> 
>> Appreciate any thoughts,
>> 
>> Angelyna


Re: 3 Way Solr Join . . ?

2012-03-11 Thread Bill Bell
Sure we do this a lot for smaller indexes.

Create a string field. Not text. Store it. Then it will come out when you do a 
simple select query.

  



Sent from my Mobile device
720-256-8076
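As a rough SolrJ illustration of that layout (the field names are invented; the core name comes from the thread). Note that flattening child rows into multiValued fields loses the pairing between, say, an address's state and its type; the stored string field is only there to get the original structure back out.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DenormalizedCustomer {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/customerInfo");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("customer_id", "42");
    doc.addField("customer_gender", "female");
    // searchable, flattened copies of the child rows (multiValued fields)
    doc.addField("address_state", "NY");
    doc.addField("address_state", "ME");
    doc.addField("phone_area_code", "212");
    // stored string field holding the original child records, e.g. as JSON or XML
    doc.addField("addresses_raw",
        "[{\"state\":\"NY\",\"type\":\"condo\"},{\"state\":\"ME\",\"type\":\"house\"}]");
    server.add(doc);
    server.commit();
    server.shutdown();
  }
}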

On Mar 11, 2012, at 11:09 AM, Angelyna Bola  wrote:

> William,
> 
> :: You can also use external fields, or store formatted info into a
> String field in json or xml format.
> 
> Thank you for the idea . . .
> 
> I have tried to load xml formatted data into Solr (not to be confused
> with the Solr XML load format), but not had any luck. Could you please
> point me to an example of how to load and take advatage of xml format
> in a solr core?
> 
> I can see it being straight forward to load json format into a solr
> core, but I do not see how I can leverage it for this problem?  Could
> you please point me to an example?
> 
> External fields are new to me. From what I'm reading I am not seeing
> how I can use them to help with this problem. Could you explain?
> 
> Respectfully,
> 
> Angelyna
> 
> 
> 
> On Sat, Mar 10, 2012 at 7:58 PM, Angelina Bola  
> wrote:
>> Does "Solr" support a 3-way join? i.e.
>> http://wiki.apache.org/solr/Join (I have the 2-way join working)
>> 
>> For example, I am pulling 3 different tables from a RDBMS into one Solr core:
>> 
>>  Table#1: Customers (parent table)
>>  Table#2: Addresses  (child table with foreign key to customers)
>>  Table#3: Phones (child table with foreign key to customers)
>> 
>> with a ONE to MANY relationship between:
>> 
>>   Customers and Addresses
>>   Customers and Phones
>> 
>> When I pull them into Solr I cannot denormalize the relationships as a
>> given customers can have many addresses and many phones.
>> 
>> When they come into the my single core (customerInfo), each document
>> gets a customerInfo_type and a uid corresponding to that type, for
>> example:
>> 
>>   Customer Document
>>   customerInfo_type='customer'
>>   customer_id
>> 
>>   Address Document
>>   customerInfo_type='address'
>>   fk_address_customer_id
>> 
>>   Phone Document
>>   customerInfo_type='phone'
>>   fk_phone_customer_id
>> 
>> Logically, I need to query in Solr for Customers who:
>> 
>>   - Have an address in a given state
>>   - Have a phone in a given area code
>>   - Are a given gender
>> 
>> Syntactically, it would think it would look like:
>> 
>> - http://localhost:8983/solr/customerInfo/select/?
>>q={!join from=fk_address_customer_id to=customer_id}address_State:Maine&
>>fq={!join from=customer_id to=fk_phone_customer_id}phone_area_code:212&
>>fq=customer_gender:female
>> 
>> But that does not work for me.
>> 
>> Appreciate any thoughts,
>> 
>> Angelyna



Re: Vector based queries

2012-03-11 Thread Bill Bell
It is way too slow

Sent from my Mobile device
720-256-8076

On Mar 11, 2012, at 12:07 PM, Pat Ferrel  wrote:

> I found a description here: 
> http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
> 
> If it is the same four years later, it looks like lucene is doing an index 
> lookup for each important term in the example doc boosting each term based on 
> the term weights. My guess would be that this is a little slower than a 2-3 word 
> query but still scalable.
> 
> Has anyone used this on a very large index?
> 
> Thanks,
> Pat
> 
> On 3/11/12 10:45 AM, Pat Ferrel wrote:
>> MoreLikeThis looks exactly like what I need. I would probably create a new 
>> "like" method to take a mahout vector and build a search? I build the vector 
>> by starting from a doc and reweighting certain terms. The prototype just 
>> reweights words but I may experiment with dirichlet clusters and reweighting 
>> an entire cluster of words so you could boost the importance of a topic in 
>> the results. Still the result of either algorithm would be a mahout vector.
>> 
>> Is there a description of how this works somewhere? Is it basically an index 
>> lookup? I always though the Google feature used precalculated results (and 
>> it probably does). I'm curious but mainly asking to see how fast it is.
>> 
>> Thanks
>> Pat
>> 
>> On 3/11/12 8:36 AM, Paul Libbrecht wrote:
>>> Maybe that's exactly it but... given a document with n tokens A, and m 
>>> tokens B, a query A^n B^m would find what you're looking for or?
>>> 
>>> paul
>>> 
>>> PS  I've always viewed queries as linear forms on the vector space and I'd 
>>> like to see this really mathematically written one day...
>>> Le 11 mars 2012 à 07:23, Lance Norskog a écrit :
>>> 
 Look at the MoreLikeThis feature in Lucene. I believe it does roughly
 what you describe.
 
 On Sat, Mar 10, 2012 at 9:58 AM, Pat Ferrel  wrote:
> I have a case where I'd like to get documents which most closely match a
> particular vector. The RowSimilarityJob of Mahout is ideal for
> precalculating similarity between existing documents but in my case the
> query is constructed at run time. So the UI constructs a vector to be used
> as a query. We have this running in prototype using a run time calculation
> of cosine similarity but the implementation is not scalable to large doc
> stores.
> 
> One thought is to calculate fairly small clusters. The UI will know which
> cluster to target for the vector query. So we might be able to narrow down
> the number of docs per query to a reasonable size.
> 
> It seems like a place for multiple hash functions maybe? Could we use some
> kind of hack of the boost feature of Solr or some other approach?
> 
> Does anyone have a suggestion?
 
 
 -- 
 Lance Norskog
 goks...@gmail.com
>>> 


Re: Dynamically Load Query Time Synonym File

2012-02-26 Thread Bill Bell
It would depend.

If the synonyms are used at indexing time, you need to re-index. Otherwise, you
could reload the core and use the new synonyms at query time.
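(A core RELOAD is usually enough to pick up an edited synonyms.txt for query-time analysis, e.g. something like http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1 with the core name adjusted to your setup, since the analyzers are rebuilt when the core loads.)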

On 2/26/12 4:05 AM, "Ahmet Arslan"  wrote:

>
>> Is there a way to dynamically load a synonym file without
>> restarting solr core ?
>
>There is an open jira for this :
>https://issues.apache.org/jira/browse/SOLR-1307
>




Re: Improving performance for SOLR geo queries?

2012-02-14 Thread Bill Bell
Can we get this back ported to 3x?

Bill Bell
Sent from mobile
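On trunk/4.0 the post-filter form Yonik describes below looks roughly like fq={!geofilt sfield=store pt=45.15,-93.85 d=5 cache=false cost=200}; the cache=false plus a high cost (the quoted message suggests 200) is what pushes the spatial filter into the cheaper post-filtering path. The field name and point here are just the standard example values.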


On Feb 14, 2012, at 3:45 AM, Matthias Käppler  wrote:

> hey thanks all for the suggestions, didn't have time to look into them
> yet as we're feature-sprinting for MWC, but will report back with some
> feedback over the next weeks (we will have a few more performance
> sprints in March)
> 
> Best,
> Matthias
> 
> On Mon, Feb 13, 2012 at 2:32 AM, Yonik Seeley
>  wrote:
>> On Thu, Feb 9, 2012 at 1:46 PM, Yonik Seeley  
>> wrote:
>>> One way to speed up numeric range queries (at the cost of increased
>>> index size) is to lower the precisionStep.  You could try changing
>>> this from 8 to 4 and then re-indexing to see how that affects your
>>> query speed.
>> 
>> Your issue, and the fact that I had been looking at the post-filtering
>> code again for another client, reminded me that I had been planning on
>> implementing post-filtering for spatial.  It's now checked into trunk.
>> 
>> If you have the ability to use trunk, you can add a high cost (like
>> cost=200) along with cache=false to trigger it.
>> 
>> More details here:
>> http://www.lucidimagination.com/blog/2012/02/10/advanced-filter-caching-in-solr/
>> 
>> -Yonik
>> lucidimagination.com
> 
> 
> 
> -- 
> Matthias Käppler
> Lead Developer API & Mobile
> 
> Qype GmbH
> Großer Burstah 50-52
> 20457 Hamburg
> Telephone: +49 (0)40 - 219 019 2 - 160
> Skype: m_kaeppler
> Email: matth...@qype.com
> 
> Managing Director: Ian Brotherston
> Amtsgericht Hamburg
> HRB 95913
> 
> This e-mail and its attachments may contain confidential and/or
> privileged information. If you are not the intended recipient (or have
> received this e-mail in error) please notify the sender immediately
> and destroy this e-mail and its attachments. Any unauthorized copying,
> disclosure or distribution of this e-mail and  its attachments is
> strictly forbidden. This notice also applies to future messages.


Mmap

2012-02-14 Thread Bill Bell
Does someone have an example of using unmap in 3.5 and chunksize?

 I am using Solr 3.5.

I noticed in solrconfig.xml:

<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

I don't see this parameter taking effect when I set 
-Dsolr.directoryFactory=solr.MMapDirectoryFactory

How do I see the setting in the log or in stats.jsp ? I cannot find a place 
that indicates it is set or not.

I would assume StandardDirectoryFactory is being used but I do see (when I set 
it or NOT set it)

Bill Bell
Sent from mobile



Debugging on 3.5

2012-02-14 Thread Bill Bell

I did find a solution, but the output is horrible. Why does explain look so 
bad?


6.351252 = (MATCH) boost(*:*,query(specialties_ids: #1;#0;#0;#0;#0;#0;#0;#0;#0; 
,def=0.0)), product of:
  1.0 = (MATCH) MatchAllDocsQuery, product of:
1.0 = queryNorm
  6.351252 = query(specialties_ids: #1;#0;#0;#0;#0;#0;#0;#0;#0; 
,def=0.0)=6.351252



defType=edismax&boost=query($param)&param=multi_field:87
--


We like the boost parameter in SOLR 3.5 with eDismax.

The question we have is that we would like to replace bq with boost, but we get 
the "multi-valued field issue" when we try to do this.

Bill Bell
Sent from mobile



FW: boost question. need boost to take a query like bq

2012-02-11 Thread Bill Bell


I did find a solution, but the output is horrible. Why does explain look so
bad?


6.351252 = (MATCH) boost(*:*,query(specialties_ids:
#1;#0;#0;#0;#0;#0;#0;#0;#0; ,def=0.0)), product of:
  1.0 = (MATCH) MatchAllDocsQuery, product of:
1.0 = queryNorm
  6.351252 = query(specialties_ids: #1;#0;#0;#0;#0;#0;#0;#0;#0;
,def=0.0)=6.351252



defType=edismax&boost=query($param)&param=multi_field:87
--


We like the boost parameter in SOLR 3.5 with eDismax.

The question we have is that we would like to replace bq with boost, but we
get the "multi-valued field issue" when we try to do the equivalent queries...
HTTP ERROR 400
Problem accessing /solr/providersearch/select. Reason:
can not use FieldCache on multivalued field: specialties_ids


q=*:*&bq=multi_field:87^2&defType=dismax

How do you do this using boost?

q=*:*&boost=multi_field:87&defType=edismax

We know we can use bq with edismax, but we like the "multiply" feature of
boost.

If I change it to a single valued field I get results, but they are all 1.0.


1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm


q=*:*&boost=single_field:87&defType=edismax  // this works, but I need it on
multivalued






boost question. need boost to take a query like bq

2012-02-11 Thread Bill Bell


We like the boost parameter in SOLR 3.5 with eDismax.

The question we have is that we would like to replace bq with boost, but we
get the "multi-valued field issue" when we try to do the equivalent queries...
HTTP ERROR 400
Problem accessing /solr/providersearch/select. Reason:
can not use FieldCache on multivalued field: specialties_ids


q=*:*&bq=multi_field:87^2&defType=dismax

How do you do this using boost?

q=*:*&boost=multi_field:87&defType=edismax

We know we can use bq with edismax, but we like the "multiply" feature of
boost.

If I change it to a single valued field I get results, but they are all 1.0.


1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm


q=*:*&boost=single_field:87&defType=edismax  // this works, but I need it on
multivalued






Re: Help with MMapDirectoryFactory in 3.5

2012-02-11 Thread Bill Bell
Also, does someone have an example of using unmap in 3.5 and chunksize?

From:  Bill Bell 
Date:  Sat, 11 Feb 2012 10:39:56 -0700
To:  
Subject:  Help with MMapDirectoryFactory in 3.5

 I am using Solr 3.5.

I noticed in solrconfig.xml:

<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

I don't see this parameter taking effect when I set
-Dsolr.directoryFactory=solr.MMapDirectoryFactory

How do I see the setting in the log or in stats.jsp ? I cannot find a place
that indicates it is set or not.

I would assume StandardDirectoryFactory is being used but I do see (when I
set it or NOT set it)

name:  searcher  class:  org.apache.solr.search.SolrIndexSearcher  version:
1.0  description:  index searcher  stats: searcherName : Searcher@71fc3828
main 
caching : true 
numDocs : 2121163 
maxDoc : 2121163 
reader : 
SolrIndexReader{this=1867ec28,r=ReadOnlyDirectoryReader@1867ec28,refCnt=1,se
gments=1} 
readerDir : 
org.apache.lucene.store.MMapDirectory@C:\solr\jetty\example\solr\providersea
rch\data\index 
lockFactory=org.apache.lucene.store.NativeFSLockFactory@45c1cfc1
indexVersion : 1324594650551
openedAt : Sat Feb 11 09:49:31 MST 2012
registeredAt : Sat Feb 11 09:49:31 MST 2012
warmupTime : 0 

Also, how do I set unmap and what is the purpose of chunkSize?




Help with MMapDirectoryFactory in 3.5

2012-02-11 Thread Bill Bell
 I am using Solr 3.5.

I noticed in solrconfig.xml:

<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

I don't see this parameter taking effect when I set
-Dsolr.directoryFactory=solr.MMapDirectoryFactory

How do I see the setting in the log or in stats.jsp ? I cannot find a place
that indicates it is set or not.

I would assume StandardDirectoryFactory is being used but I do see (when I
set it or NOT set it)

name:  searcher  class:  org.apache.solr.search.SolrIndexSearcher  version:
1.0  description:  index searcher  stats: searcherName :  Searcher@71fc3828
main 
caching :  true 
numDocs :  2121163 
maxDoc :  2121163 
reader :  
SolrIndexReader{this=1867ec28,r=ReadOnlyDirectoryReader@1867ec28,refCnt=1,se
gments=1} 
readerDir :  
org.apache.lucene.store.MMapDirectory@C:\solr\jetty\example\solr\providersea
rch\data\index 
lockFactory=org.apache.lucene.store.NativeFSLockFactory@45c1cfc1
indexVersion :  1324594650551
openedAt :  Sat Feb 11 09:49:31 MST 2012
registeredAt :  Sat Feb 11 09:49:31 MST 2012
warmupTime :  0

Also, how do I set unmap and what is the purpose of chunkSize?




Re: Performance issue: Frange with geodist()

2011-10-15 Thread Bill Bell
I added a Jira issue for this:

https://issues.apache.org/jira/browse/SOLR-2840



On 10/13/11 8:15 AM, "Yonik Seeley"  wrote:

>On Thu, Oct 13, 2011 at 9:55 AM, Mikhail Khludnev
> wrote:
>> is it possible with geofilt and facet.query?
>>
>> facet.query={!geofilt pt=45.15,-93.85 sfield=store d=5}
>
>Yes, that should be both possible and faster... something along the lines
>of:
>&sfield=store&pt=45.15,-93.85
>&facet.query={!geofilt d=10 key=d10}
>&facet.query={!geofilt d=20 key=d20}
>&facet.query={!geofilt d=50 key=d50}
>
>Eventually we should implement range faceting over functions and also
>add a max distance you care about to the geodist function.
>
>-Yonik
>http://www.lucene-eurocon.com - The Lucene/Solr User Conference
>
>
>> On Thu, Oct 13, 2011 at 4:20 PM, roySolr 
>>wrote:
>>
>>> I don't want to use some basic facets. When the user doesn't get any
>>> results
>>> i want
>>> to search in the radius of his search location. Example:
>>>
>>> apple store in Manchester gives no result. I want this:
>>>
>>> Click here to see 2 results in a radius of 10km.
>>> Click here to see 11 results in a radius of 50km.
>>> Click here to see 19 results in a radius of 100km.
>>>
>>> With geodist() and facet.query is this possible but the performance
>>>isn't
>>> very good..
>>>
>>>
>>> --
>>> View this message in context:
>>> 
>>>http://lucene.472066.n3.nabble.com/Performance-issue-Frange-with-geodist
>>>-tp3417962p3418429.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail (Mike) Khludnev
>> Developer
>> Grid Dynamics
>> tel. 1-415-738-8644
>> Skype: mkhludnev
>> 
>>  
>>




Re: what is the recommended way to store locations?

2011-10-06 Thread Bill Bell
You could do client-side Google geocoding on what the user typed in.
Then get the lat,long returned from Google, and do a geospatial search...



On 10/6/11 9:27 AM, "Jason Toy"  wrote:

>In our current system, we have 3 fields for location: city, state, and
>country. People in our system search for one of those 3 strings.
>So a user can search for "San Francisco" or "California".  In solr I store
>those 3 fields as strings and when a search happens I search with an OR
>statement across those 3 fields.
>
>Is there a more efficient way to store this data storage wise and/or speed
>wise?  We don't currently plan to use any spacial features like "3 miles
>near SF".




Re: is there a way to know which mm value was used?

2011-10-05 Thread Bill Bell
It would be good to output the mm value for debugging.

Something like mm_value = 2

Then you should know the results are right.

On 10/5/11 9:58 AM, "Shawn Heisey"  wrote:

>On 10/5/2011 9:06 AM, elisabeth benoit wrote:
>> thanks for answering.
>>
>> echoParams just echos mm value in solrconfig.xml (in my case mm = 4<-1
>> 6<-2), not the actual value of mm for one particular request.
>>
>> I think would be very useful to be able to know which mm value was
>> effectively used, in particular for request with stopwords.
>>
>> It's of course possible to calculate mm in my own code, but this would
>> necessitate to be synchronize with mm default value in solrconfig.xml +
>>with
>> stopwords.txt + identifying all stopwords in request.
>
>Just tried this on a Solr 3.4.0 server.  I have an edismax handler that
>includes echoParams, set to "all", as well as an mm parameter, set to
>"2<-1 4<-50%".  If I send a request with no mm parameter, that value is
>reflected in the response.  When I add "&mm=50%25" to the URL in my
>browser (%25 being the URL encoding for the percent symbol), the
>response changes the mm value to "50%" as expected, overriding the value
>in solrconfig.xml.  I have not tried it with SolrJ or any of the other
>client APIs, just a browser.
>
>Is this not happening for you?
>
>Thanks,
>Shawn
>




Re: Scoring of DisMax in Solr

2011-10-05 Thread Bill Bell
Markus,

The calculation is correct.

Look at your output.

Result = queryWeight(text:gb) * fieldWeight(text:gb in 1)

Result = (idf(docFreq=6, numDocs=26) * queryNorm) *
(tf(termFreq(text:gb)=2) * idf(docFreq=6, numDocs=26) *
fieldNorm(field=text, doc=1))

Thus you should notice that idf(docFreq=6, numDocs=26) is repeated twice.

This is just how the weight() is calculated.




> > 0.18314168 = (MATCH) sum of:
> >   0.18314168 = (MATCH) weight(text:gb in 1), product of:
> > 0.35845062 = queryWeight(text:gb), product of:
> >   2.3121865 = idf(docFreq=6, numDocs=26)
> >   0.15502669 = queryNorm
> >
> > 0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
> >   1.4142135 = tf(termFreq(text:gb)=2)
> >   2.3121865 = idf(docFreq=6, numDocs=26)
> >   0.15625 = fieldNorm(field=text, doc=1)
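Plugging the numbers in makes the double idf visible: queryWeight = 2.3121865 * 0.15502669 = 0.35845062, fieldWeight = 1.4142135 * 2.3121865 * 0.15625 = 0.5109258, and their product 0.35845062 * 0.5109258 = 0.18314168. So idf enters once through queryWeight and once through fieldWeight, which is exactly the tf x idf^2 shape being asked about.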





On 10/5/11 11:42 AM, "Markus Jelsma"  wrote:

>Hi,
>
>I don't see 2.3121865 * 2 anywhere in your debug output or something that
>looks like that.
>
>
>> Hi Markus,
>> 
>> The idf calculation itself is correct.
>> What I am trying to understand here is  why idf value is multiplied
>>twice
>> in the final score calculation. Essentially,  tf x idf^2 is used instead
>> of tf x idf.
>> I'd like to understand the rational behind that.
>> 
>> On Wed, Oct 5, 2011 at 9:43 AM, Markus Jelsma
>wrote:
>> > In Lucene's default similarity idf = 1 + ln (numDocs / df + 1).
>> > 1 + ln(26 / 7) =~ 2.3121865
>> > 
>> > I don't see a problem.
>> > 
>> > > Hi,
>> > > 
>> > > 
>> > > When I examine the score calculation of DisMax in Solr,   it looks
>>to
>> > > me that DisMax is using  tf x idf^2 instead of tf x idf.
>> > > Does anyone have insight why tf x idf is not used here?
>> > > 
>> > > Here is the score contribution from one one field:
>> > > 
>> > > score(q,c) =  queryWeight x fieldWeight
>> > > 
>> > >= tf x idf x idf x queryNorm x fieldNorm
>> > > 
>> > > Here is the example that I used to derive the formula above.
>>Clearly,
>> > > idf is multiplied twice in the score calculation.
>> > > *
>> > 
>> > 
>>http://localhost:8983/solr/select/?q=GB&version=2.2&start=0&rows=10&inden
>> > t=
>> > 
>> > > on&debugQuery=true&fl=id,score *
>> > > 
>> > > 
>> > > 
>> > > 0.18314168 = (MATCH) sum of:
>> > >   0.18314168 = (MATCH) weight(text:gb in 1), product of:
>> > > 0.35845062 = queryWeight(text:gb), product of:
>> > >   2.3121865 = idf(docFreq=6, numDocs=26)
>> > >   0.15502669 = queryNorm
>> > > 
>> > > 0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
>> > >   1.4142135 = tf(termFreq(text:gb)=2)
>> > >   2.3121865 = idf(docFreq=6, numDocs=26)
>> > >   0.15625 = fieldNorm(field=text, doc=1)
>> > > 
>> > > 
>> > > 
>> > > 
>> > > Thanks!




Re: Scoring of DisMax in Solr

2011-10-04 Thread Bill Bell
This seems like a bug to me.

On 10/4/11 6:52 PM, "David Ryan"  wrote:

>Hi,
>
>
>When I examine the score calculation of DisMax in Solr,   it looks to me
>that DisMax is using  tf x idf^2 instead of tf x idf.
>Does anyone have insight why tf x idf is not used here?
>
>Here is the score contribution from one one field:
>
>score(q,c) =  queryWeight x fieldWeight
>   = tf x idf x idf x queryNorm x fieldNorm
>
>Here is the example that I used to derive the formula above. Clearly, idf
>is
>multiplied twice in the score calculation.
>*
>http://localhost:8983/solr/select/?q=GB&version=2.2&start=0&rows=10&indent
>=on&debugQuery=true&fl=id,score
>*
>
>
>0.18314168 = (MATCH) sum of:
>  0.18314168 = (MATCH) weight(text:gb in 1), product of:
>0.35845062 = queryWeight(text:gb), product of:
>  2.3121865 = idf(docFreq=6, numDocs=26)
>  0.15502669 = queryNorm
>0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
>  1.4142135 = tf(termFreq(text:gb)=2)
>  2.3121865 = idf(docFreq=6, numDocs=26)
>  0.15625 = fieldNorm(field=text, doc=1)
>
>
>
>Thanks!




Re: Solr stopword problem in Query

2011-09-26 Thread Bill Bell
This is a pretty serious issue.
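One thing I would check (a guess from the debug output below, not a verified
fix): the "?" in the parsed phrase query is the position gap left by the
StopFilter removing "at". If the index-time and query-time analyzers for
textForQuery don't treat that gap the same way, the phrase can't match.
Making the stop filter identical on both sides, something like

  <filter class="solr.StopFilterFactory" words="stopwords.txt"
          ignoreCase="true" enablePositionIncrements="true"/>

(or simply dropping "at" from stopwords.txt) and then reindexing would be my
first try.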

Bill Bell
Sent from mobile


On Sep 26, 2011, at 4:09 AM, Isan Fulia  wrote:

> Hi all,
> 
> I have a text field named* textForQuery* .
> Following content has been indexed into solr in field textForQuery
> *Coke Studio at MTV*
> 
> when i fired the query as
> *textForQuery:("coke studio at mtv")* the results showed 0 documents
> 
> After runing the same query in debugMode i got the following results
> 
> 
> 
> textForQuery:("coke studio at mtv")
> textForQuery:("coke studio at mtv")
> PhraseQuery(textForQuery:"coke studio ? mtv")
> textForQuery:"coke studio *? *mtv"
> 
> Why the query did not matched any document even when there is a document
> with value of textForQuery as *Coke Studio at MTV*?
> Is this because of the stopword *at* present in stopwordList?
> 
> 
> 
> -- 
> Thanks & Regards,
> Isan Fulia.


Re: Search query doesn't work in solr/browse pnnel

2011-09-24 Thread Bill Bell
Yes. It appears that "&" cannot be encoded in the URL, or else there are really
bad results.
For example, we get an error on the first request, but if we refresh it goes
away.
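Roughly, the rule is that parameter separators stay literal and only the values
get percent-encoded:

  /solr/browse?q=something&fl=content     <- two parameters; & and = stay as-is
  /solr/browse?q=AT%26T&fl=content        <- an & inside a value must become %26

Typing the whole string "something&fl=content" into the /browse search box makes
the UI encode it as a single q value, which is why nothing matches.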



On 9/23/11 2:57 PM, "hadi"  wrote:

>When I create a query like "something&fl=content" in solr/browse the "&"
>and
>"=" in URL converted to %26 and %3D and no result occurs. but it works in
>solr/admin advanced search and also in URL bar directly, How can I solve
>this problem?  Thanks
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/Search-query-doesn-t-work-in-solr-brows
>e-pnnel-tp3363032p3363032.html
>Sent from the Solr - User mailing list archive at Nabble.com.




Best Solr escaping?

2011-09-24 Thread Bill Bell
What is the best algorithm for escaping strings before sending to Solr? Does
someone have some code?

A few things I have witnessed in "q" using the DIH handler:
* Double quotes - a " that is not balanced can cause several issues, from an
error (strip the double quote?) to no results.
* Should we use + or %20 - and what cases make sense:
> * "Dr. Phil Smith" or "Dr.+Phil+Smith" or "Dr.%20Phil%20Smith" - also what is
> the impact of double quotes?
* Unmatched parenthesis I.e. Opening ( and not closing.
> * (Dr. Holstein
> * Cardiologist+(Dr. Holstein
Regular encoding of strings does not always work for the whole string due to
several issues like white space:
* White space works better when we escape it with a backslash, e.g. "Bill\ Bell",
especially when using facets.

Thoughts? Code? Ideas? Better Wikis?
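For the query-syntax part, one option is to let SolrJ do the escaping (a sketch,
assuming the SolrJ jar is on the client's classpath):

  import org.apache.solr.client.solrj.util.ClientUtils;

  public class EscapeDemo {
      public static void main(String[] args) {
          String userInput = "Cardiologist (Dr. Holstein";        // unbalanced paren
          // Escapes Lucene/Solr query special characters and whitespace
          String escaped = ClientUtils.escapeQueryChars(userInput);
          System.out.println(escaped);   // Cardiologist\ \(Dr.\ Holstein
      }
  }

That covers quotes, parentheses and the "Bill\ Bell" whitespace case; URL
encoding (+ vs %20) is a separate step applied to the whole parameter value
afterwards.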





Re: indexing a xml file

2011-09-24 Thread Bill Bell
Send us the example "solr.xml" and "schema.xml". You are missing fields
in the schema.xml that you are referencing.
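For comparison, the stock example schema that matches exampledocs/solr.xml
declares the field post.jar is complaining about, along the lines of (the exact
type name varies by version):

  <field name="name" type="text_general" indexed="true" stored="true"/>

If your schema.xml has no "name" field (and no dynamicField that covers it),
you get exactly that 400 error.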

On 9/24/11 8:15 AM, "ahmad ajiloo"  wrote:

>hello
>Solr Tutorial page explains about index a xml file. but when I try to
>index
>a xml file with this command:
>~/Desktop/apache-solr-3.3.0/example/exampledocs$ java -jar post.jar
>solr.xml
>I get this error:
>SimplePostTool: FATAL: Solr returned an error #400 ERROR:unknown field
>'name'
>
>can anyone help me?
>thanks




Re: Distinct elements in a field

2011-09-17 Thread Bill Bell
SOLR-2242 can do it.
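For reference, the grouping approach from the quoted message would look
something like this (group.field value is a placeholder):

  .../select?q=*:*&group=true&group.field=myfield&group.ngroups=true&group.limit=0

with ngroups in the grouped response giving the distinct count - heavier than a
purpose-built solution, as noted below.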

On 9/16/11 2:15 AM, "swiss knife"  wrote:

>I could get this number by using
>
> group.ngroups=true&group.limit=0
>
> but doing grouping for this seems like an overkill
>
> Would you advise using JIRA SOLR-1814 ?
>
>- Original Message -
>From: swiss knife
>Sent: 09/15/11 12:43 PM
>To: solr-user@lucene.apache.org
>Subject: Distinct elements in a field
>
> Simple question: I want to know how many distinct elements I have in a
>field and these verify a query. Do you know if there's a way to do it
>today in 3.4. I saw SOLR-1814 and SOLR-2242. SOLR-1814 seems fairly easy
>to use. What do you think ? Thank you




Re: Re; DIH Scheduling

2011-09-12 Thread Bill Bell
You can easily use cron with curl to do what you want to do.
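A minimal sketch, assuming the DIH handler is registered at /dataimport and the
core is called mycore (both placeholders):

  # crontab entries
  # full rebuild nightly at 2am
  0 2 * * *    curl -s "http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=true" > /dev/null
  # pick up changes every 15 minutes
  */15 * * * * curl -s "http://localhost:8983/solr/mycore/dataimport?command=delta-import" > /dev/null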

On 9/12/11 2:47 PM, "Pulkit Singhal"  wrote:

>I don't see anywhere in:
>http://issues.apache.org/jira/browse/SOLR-2305
>any statement that shows the code's inclusion was "decided against"
>when did this happen and what is needed from the community before
>someone with the powers to do so will actually commit this?
>
>2011/6/24 Noble Paul നോബിള്‍ नोब्ळ् 
>
>> On Thu, Jun 23, 2011 at 9:13 PM, simon  wrote:
>> > The Wiki page describes a design for a scheduler, which has not been
>> > committed to Solr yet (I checked). I did see a patch the other day
>> > (see https://issues.apache.org/jira/browse/SOLR-2305) but it didn't
>> > look well tested.
>> >
>> > I think that you're basically stuck with something like cron at this
>> > time. If your application is written in java, take a look at the
>> > Quartz scheduler - http://www.quartz-scheduler.org/
>>
>> It was considered and decided against.
>> >
>> > -Simon
>> >
>>
>>
>>
>> --
>> -
>> Noble Paul
>>




Re: pagination with grouping

2011-09-08 Thread Bill Bell
There are 2 use cases:

1. rows=10 means 10 groups.
2. rows=10 means 10 results (regardless of groups).

I thought there was a total number of groups (ngroups) for case #1.

I don't believe case #2 has been coded.
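For case #1, something like

  .../select?q=*:*&group=true&group.field=site&group.ngroups=true&group.limit=5&rows=10&start=0

pages through ten groups at a time, and ngroups gives the total number of groups
for working out the last page (group.field=site is just a placeholder).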

On 9/8/11 2:22 PM, "alx...@aim.com"  wrote:

>
> 
>
> Hello,
>
>When trying to implement pagination as in the case without grouping I see
>two issues.
>1. with rows=10 solr feed displays 10 groups not 10 results
>2. there is no total number of results with grouping  to show the last
>page.
>
>In detail:
>1. I need to display only 10 results in one page. For example if I have
>group.limit=5 and the first group has 5 docs, the second 3 and the third
>2 then only these 3 group must be displayed in the first page.
>Currently specifying rows=10, shows 10 groups and if we have 5 docs in
>each group then in the first page we will have 50 docs.
>
>2.I need to show the last page, for which I need total number of results
>with grouping. For example if I have 5 groups with number of docs 5, 4,
>3,2 1 then this total number must be 15.
>
>Any ideas how to achieve this.
>
>Thanks in advance.
>Alex.
>
>
>




Boost or BQ?

2011-08-22 Thread Bill Bell
What is the difference between boost= and bq= ?

I cannot find any documentation…




Re: copyField for big indexes

2011-08-22 Thread Bill Bell
It depends.

copyField may be good if you want to copy into a Soundex field, and then
boost the soundex field lower than the tokenized field.

What are you trying to do ?
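As a sketch of that Soundex pattern (field and type names are made up):

  <fieldType name="text_phonetic" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PhoneticFilterFactory" encoder="Soundex" inject="false"/>
    </analyzer>
  </fieldType>

  <field name="name_phonetic" type="text_phonetic" indexed="true" stored="false"/>
  <copyField source="name" dest="name_phonetic"/>

and then in a dismax handler, qf=name^2.0 name_phonetic^0.3 so the phonetic
match counts for less than the regular tokenized match.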

On 8/22/11 11:14 AM, "Tom"  wrote:

>Is it a good rule of thumb, that when dealing with large indexes copyField
>should not be used.  It seems to duplicate the indexing of data.
>
>You don't need copyField to be able to search on multiple fields.
>Example,
>if I have two fields: title and post and I want to search on both, I could
>just query 
>title: OR post:
>
>So it seems to me if you have lot's of data and a large indexes, copyField
>should be avoided.
>
>Any thoughts?
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/copyField-for-big-indexes-tp3275712p327
>5712.html
>Sent from the Solr - User mailing list archive at Nabble.com.




Re: hierarchical faceting in Solr?

2011-08-22 Thread Bill Bell
Naomi,

Just create a login and update it!!


On 8/22/11 12:27 PM, "Erick Erickson"  wrote:

>Try searching the Solr user's list for "hierarchical", this topic
>has been discussed numerous times.
>
>It would be great if you could collate the various solutions
>and update the wiki, all you have to do is create a
>login...
>
>Best
>Erick
>
>On Mon, Aug 22, 2011 at 1:49 PM, Naomi Dushay 
>wrote:
>> Chris,
>>
>> Is there a document somewhere on how to do this?  If not, might you
>>create
>> one?   I could even imagine such a document living on the Solr wiki ...
>>  this one has mostly ancient content:
>>
>> http://wiki.apache.org/solr/HierarchicalFaceting
>>
>> - Naomi
>>




Re: Terms.regex performance issue

2011-08-22 Thread Bill Bell
We do something like:

http://localhost:8983/solr/provs/terms?terms.fl=payor&terms.regex.flag=case
_insensitive&terms.regex=%28.*%29WHAT USER TYPES%28.*%29&terms.limit=-1


We want matches not just on the prefix but anywhere in the term.



On 8/19/11 5:21 PM, "Chris Hostetter"  wrote:

>
>: Subject: Terms.regex performance issue
>: 
>: As I want to use it in an Autocomplete it has to be fast. Terms.prefix
>gets
>: results in around 100 milliseconds, while terms.regex is 10 to 20 times
>: slower.
>
>can you elaborate on how you are using terms.regex?  what does your regex
>look like? .. particularly if your usecase is autocomplete terms.prefix
>seems like an odd choice.
>
>Possible XY Problem?
>https://people.apache.org/~hossman/#xyproblem
>
>Have you looked at using the Suggester plugin?
>
>https://wiki.apache.org/solr/Suggester
>
>
>-Hoss




Re: OOM due to JRE Issue (LUCENE-1566)

2011-08-16 Thread Bill Bell
Send the GC log and force a heap dump if you can when it happens.
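If you can, start the JVM with

  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp

so a dump is written automatically when it happens, or force one while it is
happening with

  jmap -dump:format=b,file=/tmp/solr.hprof <pid>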

Bill Bell
Sent from mobile


On Aug 16, 2011, at 5:27 AM, Pranav Prakash  wrote:

>> 
>> 
>> AFAIK, solr 1.4 is on Lucene 2.9.1 so this patch is already applied to
>> the version you are using.
>> maybe you can provide the stacktrace and more deatails about your
>> problem and report back?
>> 
> 
> Unfortunately, I have only this much information with me. However following
> is my speficiations, if they are any helpful :-
> 
> /usr/bin/java -d64 -Xms5000M -Xmx5000M -XX:+UseParallelGC -verbose:gc
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:$GC_LOGFILE
> -XX:+CMSPermGenSweepingEnabled -Dsolr.solr.home=multicore
> -Denable.slave=true -jar start.jar
> 
> 32GiB RAM
> 
> 
> Any thoughts? Will a switch to ConcurrentGC help in any means?


Score

2011-08-15 Thread Bill Bell
How do I change the score to scale it between 0 and 100 regardless of the raw
score?

q.alt=*:*&bq=lang:Spanish&defType=dismax

Bill Bell
Sent from mobile



Re: exceeded limit of maxWarmingSearchers ERROR

2011-08-14 Thread Bill Bell
I understand.

Have you looked at Mark's patch? From his performance tests, it looks
pretty good.

When would RA work better?

Bill


On 8/14/11 8:40 PM, "Nagendra Nagarajayya" 
wrote:

>Bill:
>
>The technical details of the NRT implementation in Apache Solr with
>RankingAlgorithm (SOLR-RA) is available here:
>
>http://solr-ra.tgels.com/papers/NRT_Solr_RankingAlgorithm.pdf
>
>(Some changes for Solr 3.x, but for most it is as above)
>
>Regarding support for 4.0 trunk, should happen sometime soon.
>
>Regards
>
>- Nagendra Nagarajayya
>http://solr-ra.tgels.org
>http://rankingalgorithm.tgels.org
>
>
>
>
>
>On 8/14/2011 7:11 PM, Bill Bell wrote:
>> OK,
>>
>> I'll ask the elephant in the room….
>>
>> What is the difference between the new UpdateHandler from Mark and the
>> SOLR-RA?
>>
>> The UpdateHandler works with 4.0 does SOLR-RA work with 4.0 trunk?
>>
>> Pros/Cons?
>>
>>
>> On 8/14/11 8:10 PM, "Nagendra
>>Nagarajayya"
>> wrote:
>>
>>> Naveen:
>>>
>>> NRT with Apache Solr 3.3 and RankingAlgorithm does need a commit for a
>>> document to become searchable. Any document that you add through update
>>> becomes  immediately searchable. So no need to commit from within your
>>> update client code.  Since there is no commit, the cache does not have
>>> to be cleared or the old searchers closed or  new searchers opened, and
>>> warmed (error that you are facing).
>>>
>>> Regards
>>>
>>> - Nagendra Nagarajayya
>>> http://solr-ra.tgels.org
>>> http://rankingalgorithm.tgels.org
>>>
>>>
>>>
>>> On 8/14/2011 10:37 AM, Naveen Gupta wrote:
>>>> Hi Mark/Erick/Nagendra,
>>>>
>>>> I was not very confident about NRT at that point of time, when we
>>>> started
>>>> project almost 1 year ago, definitely i would try NRT and see the
>>>> performance.
>>>>
>>>> The current requirement was working fine till we were using
>>>> commitWithin 10
>>>> millisecs in the XMLDocument which we were posting to SOLR.
>>>>
>>>> But due to which, we were getting very poor performance (almost 3 mins
>>>> for
>>>> 15,000 docs) per user. There are many paraller user committing to our
>>>> SOLR.
>>>>
>>>> So we removed the commitWithin, and hence performance was much much
>>>> better.
>>>>
>>>> But then we are getting this maxWarmingSearcher Error, because we are
>>>> committing separately as a curl request after once entire doc is
>>>> submitted
>>>> for indexing.
>>>>
>>>> The question here is what is difference between commitWithin and
>>>>commit
>>>> (apart from the fact that commit takes memory and processes and
>>>> additional
>>>> hardware usage)
>>>>
>>>> Why we want it to be visible as soon as possible, since we are
>>>>applying
>>>> many
>>>> business rules on top of the results (older indexes as well as new
>>>>one)
>>>> and
>>>> apply different filters.
>>>>
>>>> upto 5 mins is fine for us. but more than that we need to think then
>>>> other
>>>> optimizations.
>>>>
>>>> We will definitely try NRT. But please tell me other options which we
>>>> can
>>>> apply in order to optimize.?
>>>>
>>>> Thanks
>>>> Naveen
>>>>
>>>>
>>>> On Sun, Aug 14, 2011 at 9:42 PM, Erick
>>>> Ericksonwrote:
>>>>
>>>>> Ah, thanks, Mark... I must have been looking at the wrong JIRAs.
>>>>>
>>>>> Erick
>>>>>
>>>>> On Sun, Aug 14, 2011 at 10:02 AM, Mark Miller
>>>>> wrote:
>>>>>> On Aug 14, 2011, at 9:03 AM, Erick Erickson wrote:
>>>>>>
>>>>>>> You either have to go to near real time (NRT), which is under
>>>>>>> development, but not committed to trunk yet
>>>>>> NRT support is committed to trunk.
>>>>>>
>>>>>> - Mark Miller
>>>>>> lucidimagination.com
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>
>>
>>
>




Re: Cache replication

2011-08-14 Thread Bill Bell
OK. But SOLR has built-in caching. Do you not like the caching? What do
you think we should change in the SOLR cache?
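For reference, those built-in caches are configured per core in solrconfig.xml,
roughly (sizes here are just placeholders):

  <filterCache      class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="32"/>
  <documentCache    class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="0"/>

They are thrown away (modulo autowarming) whenever a new searcher opens after a
commit or replication, which is the main trade-off against an external cache.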

Bill


On 8/10/11 9:16 AM, "didier deshommes"  wrote:

>Consider putting a cache (memcached, redis, etc) *in front* of your
>solr slaves. Just make sure to update it when replication occurs.
>
>didier
>
>On Tue, Aug 9, 2011 at 6:07 PM, arian487  wrote:
>> I'm wondering if the caches on all the slaves are replicated across
>>(such as
>> queryResultCache).  That is to say, if I hit one of my slaves and cache
>>a
>> result, and I make a search later and that search happens to hit a
>>different
>> slave, will that first cached result be available for use?
>>
>> This is pretty important because I'm going to have a lot of slaves and
>>if
>> this isn't done, then I'd have a high chance of running a lot uncached
>> queries.
>>
>> Thanks :)
>>
>> --
>> View this message in context:
>>http://lucene.472066.n3.nabble.com/Cache-replication-tp3240708p3240708.ht
>>ml
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>




Re: exceeded limit of maxWarmingSearchers ERROR

2011-08-14 Thread Bill Bell
OK,

I'll ask the elephant in the room….

What is the difference between the new UpdateHandler from Mark and the
SOLR-RA?

The UpdateHandler works with 4.0 does SOLR-RA work with 4.0 trunk?

Pros/Cons?


On 8/14/11 8:10 PM, "Nagendra Nagarajayya" 
wrote:

>Naveen:
>
>NRT with Apache Solr 3.3 and RankingAlgorithm does need a commit for a
>document to become searchable. Any document that you add through update
>becomes  immediately searchable. So no need to commit from within your
>update client code.  Since there is no commit, the cache does not have
>to be cleared or the old searchers closed or  new searchers opened, and
>warmed (error that you are facing).
>
>Regards
>
>- Nagendra Nagarajayya
>http://solr-ra.tgels.org
>http://rankingalgorithm.tgels.org
>
>
>
>On 8/14/2011 10:37 AM, Naveen Gupta wrote:
>> Hi Mark/Erick/Nagendra,
>>
>> I was not very confident about NRT at that point of time, when we
>>started
>> project almost 1 year ago, definitely i would try NRT and see the
>> performance.
>>
>> The current requirement was working fine till we were using
>>commitWithin 10
>> millisecs in the XMLDocument which we were posting to SOLR.
>>
>> But due to which, we were getting very poor performance (almost 3 mins
>>for
>> 15,000 docs) per user. There are many paraller user committing to our
>>SOLR.
>>
>> So we removed the commitWithin, and hence performance was much much
>>better.
>>
>> But then we are getting this maxWarmingSearcher Error, because we are
>> committing separately as a curl request after once entire doc is
>>submitted
>> for indexing.
>>
>> The question here is what is difference between commitWithin and commit
>> (apart from the fact that commit takes memory and processes and
>>additional
>> hardware usage)
>>
>> Why we want it to be visible as soon as possible, since we are applying
>>many
>> business rules on top of the results (older indexes as well as new one)
>>and
>> apply different filters.
>>
>> upto 5 mins is fine for us. but more than that we need to think then
>>other
>> optimizations.
>>
>> We will definitely try NRT. But please tell me other options which we
>>can
>> apply in order to optimize.?
>>
>> Thanks
>> Naveen
>>
>>
>> On Sun, Aug 14, 2011 at 9:42 PM, Erick
>>Ericksonwrote:
>>
>>> Ah, thanks, Mark... I must have been looking at the wrong JIRAs.
>>>
>>> Erick
>>>
>>> On Sun, Aug 14, 2011 at 10:02 AM, Mark Miller
>>> wrote:
 On Aug 14, 2011, at 9:03 AM, Erick Erickson wrote:

> You either have to go to near real time (NRT), which is under
> development, but not committed to trunk yet
 NRT support is committed to trunk.

 - Mark Miller
 lucidimagination.com









>




Loggly support

2011-08-14 Thread Bill Bell
How do you setup log4j to work with Loggly for SOLR logs?

Anyone have this set up?

Bill





Re: Problem with xinclude in solrconfig.xml

2011-08-13 Thread Bill Bell
What was it?
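For anyone who lands on this thread later, the include itself looks like

  <xi:include href="extra-handlers.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/>

(file name is just an example). Relative hrefs are resolved against the base URI
of solrconfig.xml, so the parser has to know where that file actually lives; an
absolute href sidesteps the question, which matches what was reported here.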

Bill Bell
Sent from mobile


On Aug 10, 2011, at 2:21 PM, Way Cool  wrote:

> Sorry for the spam. I just figured it out. Thanks.
> 
> On Wed, Aug 10, 2011 at 2:17 PM, Way Cool  wrote:
> 
>> Hi, Guys,
>> 
>> Based on the document below, I should be able to include a file under the
>> same directory by specifying relative path via xinclude in solrconfig.xml:
>> http://wiki.apache.org/solr/SolrConfigXml
>> 
>> However I am getting the following error when I use relative path (absolute
>> path works fine though):
>> SEVERE: org.xml.sax.SAXParseException: Error attempting to parse XML file
>> 
>> Any ideas?
>> 
>> Thanks,
>> 
>> YH
>> 


Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Bill Bell
You could send the PDFs for processing using a queue solution like Amazon SQS. Kick
off Amazon instances to process the queue.

Once you have processed them to text with Tika, just send the updates to Solr.
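A rough sketch of what a queue worker could do (class names are from Tika and
SolrJ 3.x; the URL, id and field names are placeholders):

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.InputStream;

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.BodyContentHandler;

  public class PdfWorker {
      public static void main(String[] args) throws Exception {
          SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

          File pdf = new File(args[0]);
          BodyContentHandler text = new BodyContentHandler(-1);   // -1 = no size limit
          Metadata meta = new Metadata();
          InputStream in = new FileInputStream(pdf);
          try {
              // Extract the PDF body and metadata with Tika
              new AutoDetectParser().parse(in, text, meta, new ParseContext());
          } finally {
              in.close();
          }

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", pdf.getName());
          doc.addField("title", meta.get(Metadata.TITLE));
          doc.addField("text", text.toString());
          solr.add(doc);   // commit separately (or via autoCommit), not per document
      }
  }

Commits would be issued from one place rather than per document, so the Solr
side isn't overwhelmed by the workers.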

Bill Bell
Sent from mobile


On Aug 13, 2011, at 10:13 AM, Erick Erickson  wrote:

> Yeah, parsing PDF files can be pretty resource-intensive, so one solution
> is to offload it somewhere else. You can use the Tika libraries in SolrJ
> to parse the PDFs on as many clients as you want, just transmitting the
> results to Solr for indexing.
> 
> HOw are all these docs being submitted? Is this some kind of
> on-the-fly indexing/searching or what? I'm mostly curious what
> your projected max ingestion rate is...
> 
> Best
> Erick
> 
> On Sat, Aug 13, 2011 at 4:49 AM, Rode Gonzalez (libnova)
>  wrote:
>> Hi all,
>> 
>> I want to ask about the best way to implement a solution for indexing a
>> large amount of pdf documents between 10-60 MB each one. 100 to 1000 users
>> connected simultaneously.
>> 
>> I actually have 1 core of solr 3.3.0 and it works fine for a few number of
>> pdf docs but I'm afraid about the moment when we enter in production time.
>> 
>> some possibilities:
>> 
>> i. clustering. I have no experience in this, so it will be a bad idea to
>> venture into this.
>> 
>> ii. multicore solution. make some kind of hash to choose one core at each
>> query (exact queries) and thus reduce the size of the individual indexes to
>> consult or to consult all the cores at same time (complex queries).
>> 
>> iii. do nothing more and wait for the catastrophe in the response times :P
>> 
>> 
>> Someone with experience can help a bit to decide?
>> 
>> Thanks a lot in advance.
>> 


Re: SOLR 3.3.0 multivalued field sort problem

2011-08-13 Thread Bill Bell
I have a different use case. Consider a spatial multivalued field with latlong
values for addresses. I would want sorting by geodist() to return the closest
distance in each group. For example, find me the closest restaurant, with each
doc being a chain name like Pizza Hut. Or doctors with multiple offices.

Bill Bell
Sent from mobile


On Aug 13, 2011, at 12:31 PM, Martijn v Groningen  
wrote:

> The first solution would make sense to me. Some kind of a strategy
> mechanism
> for this would allow anyone to define their own rules. Duplicating results
> would be confusing to me.
> 
> On 13 August 2011 18:39, Michael Lackhoff  wrote:
> 
>> On 13.08.2011 18:03 Erick Erickson wrote:
>> 
>>> The problem I've always had is that I don't quite know what
>>> "sorting on multivalued fields" means. If your field had tokens
>>> a and z, would sorting on that field put the doc
>>> at the beginning or end of the list? Sure, you can define
>>> rules (first token, last token, average of all tokens (whatever
>>> that means)), but each solution would be wrong sometime,
>>> somewhere, and/or completely useless.
>> 
>> Of course it would need rules but I think it wouldn't be too hard to
>> find rules that are at least far better than the current situation.
>> 
>> My wish would include an option that decides if the field can be used
>> just once or every value on its own. If the option is set to FALSE, only
>> the first value would be used, if it is TRUE, every value of the field
>> would get its place in the result list.
>> 
>> so, if we have e.g.
>> record1: ccc and bbb
>> record2: aaa and zzz
>> it would be either
>> record2 (aaa)
>> record1 (ccc)
>> or
>> record2 (aaa)
>> record1 (bbb)
>> record1 (ccc)
>> record2 (zzz)
>> 
>> I find these two outcomes most plausible so I would allow them if
>> technical possible but whatever rule looks more plausible to the
>> experts: some solution is better than no solution.
>> 
>> -Michael
>> 
> 
> 
> 
> -- 
> Met vriendelijke groet,
> 
> Martijn van Groningen


Re: getting result count only

2011-08-06 Thread Bill Bell
q=*:*&rows=0

The numFound attribute in the response gives the total count without returning
any documents.



On 8/6/11 8:24 PM, "Jason Toy"  wrote:

>How can I run a query to get the result count only? I only need the count
>and so I dont need solr to send me all the results back.




Re: Problem with making Solr query

2011-08-05 Thread Bill Bell
The string type does no manipulation (no tokenizing), so you might need to switch
the field type. Also make sure your default field is title, or add
title:implementation to your search.
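For example (a sketch, not your actual schema), switch title to a tokenized text
type and name the field in the query:

  <field name="title" type="text_general" indexed="true" stored="true" required="true"/>

  http://localhost:8983/solr/db/select/?q=title:implementation

(The type name varies by version.) Left as a plain string type, title only
matches on the exact, complete stored value.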

Bill Bell
Sent from mobile


On Aug 5, 2011, at 8:43 AM, dhryvastov  wrote:

> Hi -
> 
> I am new to Solr and Lucene and I have started to research its capabilities
> this week. My current task seems very simple (and I believe it is) but I
> have some issue.
> 
> I have successfully done indexing of MSSQL database table. The table has
> probably 20 fields and I have indexed two of them: id and title.
> The question is: how can I get all records from this table (I mean the id's
> of them) were the word specifies in search appears???
> 
> I send the following get request to get result:
> http://localhost:8983/solr/db/select/?q=implementation. The response
> contains 0 results (numFound="0") but there are at least 5 records among the
> first 10 which contains this word in its title.
> 
> My schema.xml contains:
> 
>   <field name="id" type="string" ... required="true" /> 
>   <field name="title" type="string" ... required="true" /> 
> 
> 
> What get request should I do to get the expected results?
> 
> I feel that I have omitted something simple but it is the second day that I
> can't found what.
> Please help.
> 
> Thanks for your response.
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Problem-with-making-Solr-query-tp3228877p3228877.html
> Sent from the Solr - User mailing list archive at Nabble.com.

