Solr Wiki and mailing lists

2012-02-17 Thread Lance Norskog
The Apache Solr main page does not mention the mailing lists. The wiki
main page has a broken link. I have had to search my incoming mail to
find out how to unsubscribe from solr-user.

Someone with full access, please fix these problems.

Thanks,

-- 
Lance Norskog
goks...@gmail.com


UpdateRequestHandler coding

2012-02-16 Thread Lance Norskog
If I want to write a complex UpdateRequestHandler, should I do it on
trunk or the 3.x branch? The criteria are a stable, debugged,
full-featured environment.

-- 
Lance Norskog
goks...@gmail.com


Re: Solr 3.5 not starting on CentOS 6 or RHEL 5

2012-02-13 Thread Lance Norskog
Is /tmp a separate file system? There are known problems when /tmp is
mounted with 'noexec' as a security precaution, which then causes Solr
to fail at startup.
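
A quick way to check is to look at the mount flags; if noexec is the culprit,
you can point the JVM at a different temp directory instead of remounting /tmp
(the path below is just an example):

  mount | grep /tmp                              # look for the 'noexec' flag
  mkdir -p /opt/solr/tmp
  java -Djava.io.tmpdir=/opt/solr/tmp -jar start.jar

Jetty unpacks solr.war into java.io.tmpdir, so moving that directory somewhere
executable usually gets past this.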



On Mon, Feb 13, 2012 at 4:06 PM, Bernhardt, Russell (CIV)
 wrote:
> A software package we use recently upgraded to Solr 3.5 (from 1.4.1) and now 
> we're having problems getting the Solr server to start up under RHEL 5 or 
> CentOS 6.
>
> I upgraded our local install of Java to the latest from Oracle and it didn't 
> help, even removed the local OpenJDK just to be sure.
>
> When starting jetty manually (with java -jar start.jar) I get the following 
> messages:
>
> 2012-02-13 07:52:55.954::INFO:  Logging to STDERR via 
> org.mortbay.log.StdErrLog
> 2012-02-13 07:52:56.120::INFO:  jetty-6.1.11
> 2012-02-13 07:52:56.184::INFO:  Extract 
> jar:file:/opt/vufind/solr/jetty/webapps/solr.war!/ to 
> /tmp/Jetty_0_0_0_0_8080_solr.war__solr__7k9npr/webapp
> 2012-02-13 07:52:56.702::WARN:  Failed startup of context 
> org.mortbay.jetty.webapp.WebAppContext@15aaf0b3{/solr,jar:file:/opt/vufind/solr/jetty/webapps/solr.war!/}
> java.util.zip.ZipException: error in opening zip file
>        at java.util.zip.ZipFile.open(Native Method)
>        at java.util.zip.ZipFile.(Unknown Source)
>        at java.util.jar.JarFile.(Unknown Source)
>        at java.util.jar.JarFile.(Unknown Source)
>        at 
> org.mortbay.jetty.webapp.TagLibConfiguration.configureWebApp(TagLibConfiguration.java:168)
>        at 
> org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1217)
>        at 
> org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:513)
>        at 
> org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
>        at 
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:39)
>        at 
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
>        at 
> org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
>        at 
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:39)
>        at 
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
>        at 
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:39)
>        at 
> org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>        at org.mortbay.jetty.Server.doStart(Server.java:222)
>        at 
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:39)
>        at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:977)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>        at java.lang.reflect.Method.invoke(Unknown Source)
>        at org.mortbay.start.Main.invokeMain(Main.java:194)
>        at org.mortbay.start.Main.start(Main.java:512)
>        at org.mortbay.start.Main.main(Main.java:119)
> 2012-02-13 07:52:56.713::INFO:  Opened 
> /opt/vufind/solr/jetty/logs/2012_02_13.request.log
> 2012-02-13 07:52:56.740::INFO:  Started SelectChannelConnector@0.0.0.0:8080
>
> Jetty starts up just fine but shows a 503 error when attempting to access 
> localhost:8080/solr/. The temp directory structure does exist in /tmp/. Any 
> ideas?
>
> Thanks,
>
> Russ Bernhardt
> Systems Analyst
> Library Information Systems
> Naval Postgraduate School, Monterey CA
>



-- 
Lance Norskog
goks...@gmail.com


Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-07 Thread Lance Norskog
Experience has shown that it is much faster to run Solr with a small
amount of memory and let the rest of the RAM be used by the operating
system's disk cache. That is, the OS is very good at keeping the right
disk blocks in memory, much better than Solr is.

How much RAM is in the server, and how much of it does the JVM get? How
big are the documents, and how large is the term index for your
searches? How many documents do you return with each search? And do you
use filter queries? They are very effective at limiting searches.
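
As a rough illustration only (the numbers are invented, not a recommendation):
on a machine with 32 GB of RAM you might give Solr a modest heap and leave the
rest to the OS,

  java -Xms4g -Xmx4g -jar start.jar    # roughly 28 GB left for the OS disk cache

and only grow the heap if you actually see OutOfMemoryErrors or heavy GC.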

2012/2/7 James :
> Is there any practice to load the index into RAM to accelerate Solr performance?
> There are about 100 million documents overall and search time is around 100ms. I
> am looking for some way to speed up Solr's response time.
> I have seen the practice of using SSD disks, but SSDs are also expensive. I want
> to know whether there is some way to load the index files into RAM, keep the RAM
> index and the disk index synchronized, and then search on the RAM index.



-- 
Lance Norskog
goks...@gmail.com


Re: Indexing failover and replication

2012-01-29 Thread Lance Norskog
You could just have each Solr instance index and query its own local
index, and not copy indexes at all.

On Wed, Jan 25, 2012 at 11:24 AM, Anderson vasconcelos
 wrote:
> Thanks for the Reply Erick
> I will make the replication to both master manually.
>
> Thanks
>
> 2012/1/25, Erick Erickson :
>> No, there no good ways to have a single slave know about
>> two masters and just use the right one. It sounds like you've
>> got each machine being both a master and a slave? This is
>> not supported. What you probably want to do is either set
>> up a repeater or just index to the two masters and manually
>> change the back to the primary if the primary goes down, having
>> all replication happen from the master.
>>
>> Best
>> Erick
>>
>> On Tue, Jan 24, 2012 at 11:36 AM, Anderson vasconcelos
>>  wrote:
>>> Hi
>>> I'm now doing a test with replication using Solr 1.4.1. I configured
>>> two servers (server1 and server2) as master/slave to synchronize
>>> both. I put Apache on the front side, and we index sometimes on server1
>>> and sometimes on server2.
>>>
>>> I realized that both index servers are now confused. In the Solr data
>>> folder, many index folders were created with the timestamp of the
>>> synchronization (example: index.20120124041340), with some segments
>>> inside.
>>>
>>> I thought it was possible to index on two master servers and then
>>> synchronize both using replication. Is it really possible to do this
>>> with the replication mechanism? If it is possible, what have I done wrong?
>>>
>>> I need more than one node for indexing to guarantee failover for
>>> indexing. Is multi-master the best way to guarantee failover for
>>> indexing?
>>>
>>> Thanks
>>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr Warm-up performance issues

2012-01-28 Thread Lance Norskog
Another trick is to read in the parts of the index files that you
search against: the term dictionary and maybe a few others. (The Lucene
wiki describes the various files.) That is, you copy the new index to
the server and then say "cat files > /dev/null". This pre-caches the
interesting files into memory.
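
For example, on a pre-4.0 index the term dictionary lives in the .tis/.tii
files, so something like this (the index path is only an example) pulls them
into the OS cache:

  cat /var/solr/data/index/*.tii /var/solr/data/index/*.tis > /dev/null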

This leads to: how large is your JVM heap, and how much memory do you
leave to the OS? The OS is much better at managing memory against the
hard disk than Solr/the JVM is. The JVM should have enough memory to run
your Solr comfortably without slowdowns, and that is the most it should get.

You might find autowarming less useful than just picking a series of
queries that warm what you want to get rolling early: sort on the
fields you want, do a series of facet queries, search for words you
get a lot, etc.
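
Those warming queries can be wired into solrconfig.xml as firstSearcher and
newSearcher listeners; a sketch, with placeholder queries and field names:

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">popular query</str><str name="sort">date desc</str></lst>
      <lst><str name="q">*:*</str><str name="facet">true</str><str name="facet.field">category</str></lst>
    </arr>
  </listener>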

Another problem is that you might be fighting garbage collection when
switching from the old collection to the new one. Just shut down Solr,
switch the index directory, cat the files mentioned above, and restart.


On Fri, Jan 27, 2012 at 10:36 PM, Otis Gospodnetic
 wrote:
> Hi Dan,
>
> I think this may be your problem:
>
>> Every day we produce a new dataset of 40 GB and have to switch one for the 
>> other
>
> If you really replace an index with a new index once a day, you throw away all
> the hard work the OS has been doing to cache hot parts of your index in
> memory.  It apparently takes about 30 minutes in your case to re-cache things.
> Check the link in my signature.  If you use that and if I'm right about
> this, you will see a big spike in Disk Reads after you switch to the new
> index.  You want to minimize that spike.
>
>
> So see if you can avoid replacing the whole index and if that is really not 
> doable, you can try warmup queries, but of course while you run them, if they 
> are expensive, they will hurt system performance somewhat.
>
> Otis
>
> 
> Performance Monitoring SaaS for Solr - 
> http://sematext.com/spm/solr-performance-monitoring/index.html
>
>
>
>
>>
>> From: dan sutton 
>>To: solr-user 
>>Sent: Friday, January 27, 2012 9:44 AM
>>Subject: Solr Warm-up performance issues
>>
>>Hi List,
>>
>>We use Solr 4.0.2011.12.01.09.59.41 and have a dataset of roughly 40 GB.
>>Every day we produce a new dataset of 40 GB and have to switch one for
>>the other.
>>
>>Once the index switch over has taken place, it takes roughly 30 min for Solr
>>to reach maximum performance. Are there any hardware or software solutions
>>to reduce the warm-up time ? We tried warm-up queries but it didn't change
>>much.
>>
>>Our hardware specs is:
>>   * Dell Poweredge 1950
>>   * 2 x Quad-Core Xeon E5405 (2.00GHz)
>>   * 48 GB RAM
>>   * 2 x 146 GB SAS 3 Gb/s 15K RPM disk configured in RAID mirror
>>
>>One thing that does seem to take a long time is un-inverting a set of
>>multivalued fields, are there any optimizations we might be able to
>>use here?
>>
>>Thanks for your help.
>>Dan
>>
>>
>>



-- 
Lance Norskog
goks...@gmail.com


Re: SolrCloud on Trunk

2012-01-28 Thread Lance Norskog
If this is for load balancing, the usual solution is to use many
small shards, so you can just move one or two without doing any
surgery on indexes.

On Sat, Jan 28, 2012 at 2:46 PM, Yonik Seeley
 wrote:
> On Sat, Jan 28, 2012 at 3:45 PM, Jamie Johnson  wrote:
>> Second question, I know there are discussion about storing the shard
>> assignments in ZK (i.e. shard 1 is responsible for hashed values
>> between 0 and 10, shard 2 is responsible for hashed values between 11
>> and 20, etc), this isn't done yet right?  So currently the hashing is
>> based on the number of shards instead of having the assignments being
>> calculated the first time you start the cluster (i.e. based on
>> numShards) so it could be adjusted later, right?
>
> Right.  Storing the hash range for each shard/node is something we'll
> need to dynamically change the number of shards (as opposed to
> replicas), so we'll need to start doing it sooner or later.
>
> -Yonik
> http://www.lucidimagination.com



-- 
Lance Norskog
goks...@gmail.com


Re: Permgen Space - GC

2012-01-28 Thread Lance Norskog
Correct. Each war file instance uses its own classloader, and in this
case pulling in Solr and all of its dependent jars uses that much
memory. This also occurs when you deploy/undeploy/redeploy the same
war file: doing that over and over fills up PermGen. According to the
discussion below, you should use both CMSPermGenSweepingEnabled and
CMSClassUnloadingEnabled:

http://stackoverflow.com/questions/3717937/cmspermgensweepingenabled-vs-cmsclassunloadingenabled
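
In other words, the permgen-related startup options would look roughly like
this (sizes taken from the settings quoted below):

  -XX:MaxPermSize=1024m
  -XX:+UseConcMarkSweepGC
  -XX:+CMSClassUnloadingEnabled
  -XX:+CMSPermGenSweepingEnabled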

On Fri, Jan 27, 2012 at 10:59 PM, Sujatha Arun  wrote:
> When loading multiple Solr instances in a JVM, we see the permgen space going
> up by about 13 MB per instance, but when we remove the instances that are
> no longer needed, we do not see the memory being released. These are our
> current JVM startup options:
>
> -Xms20g
> -Xmx20g
> -XX:NewSize=128m
> -XX:MaxNewSize=128m
> -XX:MaxPermSize=1024m
> -XX:+UseConcMarkSweepGC
> -XX:+CMSClassUnloadingEnabled
> -XX:+UseTLAB
> -XX:+UseParNewGC
> -XX:MaxTenuringThreshold=0
> -XX:SurvivorRatio=128
>
> Will enabling permgen GC (-XX:+CMSPermGenSweepingEnabled) help here? Will the
> classes not be unloaded unless we do a server restart?
>
> Regards
> Sujatha



-- 
Lance Norskog
goks...@gmail.com


Re: JSON response truncated

2012-01-27 Thread Lance Norskog
.ladles' slippers, $3.00 and$3.50 values,
>> at..children's slippers, in black, tan andwhite, to go
>> at..costa lot of ladies' house drosses, $1.25, $1.50 and $1 75
>> values at œj qqa lot of 15c and 25c summer dress goods, to go at only, 1
>> (\nper yard.. l"t25c $2.98 $2.78specials for 3 cans early juno
>> peasspecial for saturday .. 3 5c plcga. any wash powderspecial for
>> saturday-7 bars pearl white laundrysoap, special-...1 10c bars
>> of lloud soap forspecial for saturday . . ^þ6dl*25c can poaches, heavy
>> syruf special for saturday .saturday25c 10c 25c 25c 19c'crakd prize > win
>> ncr 'at, san franciscpthe low cloverleaf gives the manure two healthy
>> beatings1x7e used to think that if we threw manure onto the ground any old
>> way and plowed it under, we were doing a good job. but now we know that
>> won't do. to do any real good, the manure must be broken up into small
>> pieces and spread evenly.the low cloverleaf spreader is the one that does
>> this \vork best. it gives the manure two healthy beatings, one with the
>> regular beater, the other with the wide spread disks. when the manure
>> reaches the ground in that condition your soil gets all the good there is
>> in it, and gets it quickly. the low cloverleaf is one spreader it will pay
>> you to see before you buy. see tbe local dealer who has one set up for you
>> to look at.international harvester company of america(bcarpontcj) low
>> clorerlesf spreaders are sold byj. w. mangun, plover, iowai 1i 1!i*cheapest
>> in the long run best from the startimckenzie steel fence posts give
>> lln-rijjhl iippriinimx lo i he fiirm. tintl ttrc rapidly roming inlo
>> general use in ihe best loctil-ilies, where liind values arc highest, sleel
>> fence posls are very popularmckenzik steel fence posts can be ([tiickly
>> set. simpk drive ihcm into ihe ground; and can he easily moved lo another
>> location. this feature alone is worth more lltiin will first appear lo
>> you.mckenzie steel fence posts admit of ihe farm being kepi free from
>> noxious weeds by making it possible to burn over the ground along the fence
>> as often as desired, without injury to the fence.mckenzie steel fence posts
>> are much superior lo other sleel fence posls in many ways. they are heavy,
>> and the shape is such thai tbe utmost strength is assured. uigidky when in
>> the ground is sure, as the steel side plate is large .ind so secured that
>> it ciinnol become detached, or shear off ihe rivets holding il, when the
>> post is being driven into the bard ground. all wires attached lo tbe posts
>> are grounded, conse-tpienlh then- is no danger- of live stock being killed
>> by lightning, in coming m eon-lacl with the tenec.mckenzie steel cop.ner
>> and strain pos'i's are long and heavy, with large anchor plates, and make a
>> strong, permanent antl attractive anchorage for the fence. a mckenzik steel
>> fence is the fence hint you ought to liave on your farm.the principle is
>> righl-the posts are right-the prices are right-c-^jplease call and see
>> these postsplover lumber companyplover,
>> iowa-x":--xx-v<®x®.x-k-:^:-:®:-:-x><->:"><^ii!v v vit!vtii*iiivo trill havo
>> for sntiirtny i-'rosli stran-licrrlcs, celery, lotiicp, had-ishes,
>> aflpnragns, oranges ltnnanas antl appleshighest prices paidjfor butter and
>> eggs. call us uj saturday and find out what th>y are.yours for a square
>> dealed drljryby the clerksyear, to pay 6ald bonds and the interest thereon
>> until said ponds and the interest thereon are. completely paid"the polls of
>> said election will be open from el^ht o'clock a. m . until seven o'clock v
>> m, in the several election precincts of said county, at the corresponding
>> polling places, as fnllowb'(pri-cliictl- ll'ollinr places) -bclhillp ..town
>> of palmer t'edar.. . town of fonda. icenter ...townof pocahonui> iclinton
>> .town of rolfe.colfax . . .-center school housr cummins ..center school
>> house dos moines.. center school housedover.town of
>> varina.garfield_center school house.grant-___center school house.lake
>> no. 1 . -center school house. lake no. 2...town of gilmore
>> citylincoln_center school house.uzard...school
>> houhe.walsutjlst.marshallcenter school house.powhatan. . town of
>> ploverrooseveltcenter school houseshermanvillage of ware.swan
>> lake.. town of laurens. washington-town of havelock. at -which election all
>> of tho legal1 voters of said county are notified tr appear at said tlrno
>> and place.this nouce is given by order of tb< hoard of supervisors.dated at
>> pocahontas this 38th da.-of april. 1sk.m. m. noah.shcrif(((seall)i i,.
>> 0'donnel.l.county audito' first published in the arrow urn 27th day of
>> april, 1916. 2-*"how's tjiis?wo offer one hundred dollars reward for any.
>> case of catarit that cannot be cured by ham's-catarrh cure., f. j. cheney &
>> co.. toledo, o.we, the undersigned, have known i-.. > cheney for the last
>> 35 ycors, and bc"øv! him perfectly honor.italo in all buslnew transaction!)
>> und financially able to carrs out any obligations made by hla
>> ]jj!"i,national bank ok commerce, toledo, ohall's catarrh cure is taken
>> interna"* acting directly upon th'e blood ana mv cous fturfacrc-of the
>> system. testimonial sent free. price 75 cents per bottlo. b""i by nil
>> druggists. .,,.,ta>.t¯k® hull-* faml>t pllt¯ for eon¯up®.ut-
>> 
>> United States of America
>> 1916-01-01
>> scan_0003
>> English
>> 1
>> Iowa
>> The Arrow
>> Arrow/19160518_19200415/
>> 
>> 
>> 
>>
>> --
>> Sean



-- 
Lance Norskog
goks...@gmail.com


Re: DataImportHandler fails silently

2012-01-27 Thread Lance Norskog
Do all of the documents have unique id fields?
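
If several rows produce the same uniqueKey value, Solr silently overwrites the
earlier documents, so the Added/Updated count can be smaller than the number of
rows in the database. The key is declared in schema.xml, for example (the field
name is just the conventional one):

  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  ...
  <uniqueKey>id</uniqueKey>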

On Fri, Jan 27, 2012 at 10:44 AM, mathieu lacage
 wrote:
> On Fri, Jan 27, 2012 at 7:39 PM, mathieu lacage
> wrote:
>
>>
>> It seems to work but the following command reports that only 499 documents
>> were indexed (yes, there are many more documents in my database):
>>
>
> And before anyone asks:
> <lst name="statusMessages">
>   <str name="Total Requests made to DataSource">1</str>
>   <str name="Total Rows Fetched">499</str>
>   <str name="Total Documents Skipped">0</str>
>   <str name="Full Dump Started">2012-01-27 19:37:16</str>
>   <str name="">Indexing completed. Added/Updated: 499 documents. Deleted 0
> documents.</str>
>   <str name="Committed">2012-01-27 19:37:17</str>
>   <str name="Optimized">2012-01-27 19:37:17</str>
>   <str name="Total Documents Processed">499</str>
>   <str name="Time taken">0:0:1.52</str>
> </lst>
>
>
> --
> Mathieu Lacage 



-- 
Lance Norskog
goks...@gmail.com


Re: Trying to understand SOLR memory requirements

2012-01-17 Thread Lance Norskog
apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.isFrequent(HighFrequencyDictionary.java:75)
>>> >  at
>>> >
>>> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.hasNext(HighFrequencyDictionary.java:125)
>>> > at
>>> org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:157)
>>> >  at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70)
>>> > at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133)
>>> >  at
>>> >
>>> org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:109)
>>> > at
>>> >
>>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
>>> >  at
>>> >
>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>>> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
>>> >  at
>>> >
>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>>> > at
>>> >
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>>> >  at
>>> >
>>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>> > at
>>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>> >  at
>>> >
>>> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>>> > at
>>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>>> >  at
>>> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>>> > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>>> >  at
>>> >
>>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>>> > at
>>> >
>>> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>>> >  at
>>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>>> > at org.mortbay.jetty.Server.handle(Server.java:326)
>>> >  at
>>> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>>> > at
>>> >
>>> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>>> >
>>> >
>>> > I also get an error if after the dataimport command completes, I just
>>> exit
>>> > the SOLR process and restart it:
>>> >
>>> > Jan 16, 2012 4:06:15 PM org.apache.solr.common.SolrException log
>>> > SEVERE: java.lang.OutOfMemoryError: Java heap space
>>> > at org.apache.lucene.util.fst.NodeHash.rehash(NodeHash.java:158)
>>> > at org.apache.lucene.util.fst.NodeHash.add(NodeHash.java:128)
>>> >  at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:161)
>>> > at org.apache.lucene.util.fst.Builder.compilePrevTail(Builder.java:247)
>>> >  at org.apache.lucene.util.fst.Builder.add(Builder.java:364)
>>> > at
>>> >
>>> org.apache.lucene.search.suggest.fst.FSTLookup.buildAutomaton(FSTLookup.java:486)
>>> >  at
>>> org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:179)
>>> > at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70)
>>> >  at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133)
>>> > at org.apache.solr.spelling.suggest.Suggester.reload(Suggester.java:153)
>>> >  at
>>> >
>>> org.apache.solr.handler.component.SpellCheckComponent$SpellCheckerListener.newSearcher(SpellCheckComponent.java:675)
>>> > at org.apache.solr.core.SolrCore$3.call(SolrCore.java:1181)
>>> >  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>> > at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>> >  at
>>> >
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>> > at
>>> >
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>> >  at java.lang.Thread.run(Thread.java:662)
>>> >
>>> > Jan 16, 2012 4:06:15 PM org.apache.solr.core.SolrCore registerSearcher
>>> > INFO: [places] Registered new searcher Searcher@34b0ede5 main
>>> >
>>> >
>>> >
>>> > Basically this means once I've run a full-import, I cannot exit the SOLR
>>> > process because I receive this error no matter what when I restart the
>>> > process. I've tried with different -Xmx arguments, and I'm really at a
>>> loss
>>> > at this point. Is there any guideline to how much RAM I need? I've got
>>> 8GB
>>> > on this machine, although that could be increased if necessary. However,
>>> I
>>> > can't understand why it would need so much memory. Could I have something
>>> > configured incorrectly? I've been over the configs several times, trying
>>> to
>>> > get them down to the bare minimum.
>>> >
>>> > Thanks for any assistance!
>>> >
>>> > Dave
>>>
>>>
>>>
>>> --
>>> lucidimagination.com
>>>
>
>
>
> --
> lucidimagination.com



-- 
Lance Norskog
goks...@gmail.com


Re: Solr Cloud Indexing

2012-01-17 Thread Lance Norskog
Cloud upload bandwidth is free, but download bandwidth costs money. If
you upload a lot of data but do not query it often, Amazon can make
sense.  You can also rent much cheaper hardware in other hosting
services where you pay by the month or even by the year. If you know
you have a cap on how much resource you will need at once, the cheaper
sites make more sense.

On Tue, Jan 17, 2012 at 7:36 AM, Erick Erickson  wrote:
> This only really makes sense if you don't have enough in-house resources
> to do your indexing locally, but it certainly is possible.
>
> Amazon's EC2 has been used, but really any hosting service should do.
>
> Best
> Erick
>
> On Tue, Jan 17, 2012 at 12:09 AM, Sujatha Arun  wrote:
>> Would it make sense to  Index on the cloud and periodically [2-4 times
>> /day] replicate the index at  our server for searching .Which service to go
>> with for solr Cloud Indexing ?
>>
>> Any good and tried services?
>>
>> Regards
>> Sujatha



-- 
Lance Norskog
goks...@gmail.com


Re: GermanAnalyzer

2012-01-14 Thread Lance Norskog
Has the GermanAnalyzer behavior changed at all? This is another kind
of mismatch, and it can cause very subtle problems.  If text is
indexed and queried using different Analyzers, queries will not do
what you think they should.

On Sat, Jan 14, 2012 at 1:38 PM, Robert Muir  wrote:
> On Sat, Jan 14, 2012 at 12:58 PM,   wrote:
>> Hi,
>>
>> I'm switching from Lucene 2.3 to Solr 3.5. I want to reuse the existing
>> indexes (huge...).
>
> If you want to use a Lucene 2.3 index, then you should set this in
> your solrconfig.xml:
>
> <luceneMatchVersion>LUCENE_23</luceneMatchVersion>
>
>>
>> In Lucene I use an untweaked org.apache.lucene.analysis.de.GermanAnalyzer.
>>
>> What is an equivalent fieldType definition in Solr 3.5?
>
>    <fieldType name="text_de" class="solr.TextField">
>      <analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer"/>
>    </fieldType>
>
> --
> lucidimagination.com



-- 
Lance Norskog
goks...@gmail.com


Re: Filtered search for subset of ids

2012-01-06 Thread Lance Norskog
If you want the Nth result in a result set, that would be:
start=N&rows=1

A document 'id' is a field containing a unique value for each document. It
is not normally used for relevance scoring. You would instead search
for
id:value
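
To restrict a search to a known set of ids, the usual approach is a filter
query; something like this (field names and ids are placeholders):

  q=text:"search term"&fq=id:(101 OR 102 OR 103 OR 104)

For very large id lists the query string gets long, so send the request as a
POST rather than a GET.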

On Thu, Jan 5, 2012 at 9:55 PM, solr_noob  wrote:
> Hello,
>
> I'm new to SOLR. I am facing the same set of problem to solve. The idea is
> to search for key phrase(s) within a set of documents. I understand the
> query syntax somewhat. What if the list of document ids to search gets to
> about say, 1 documents? what is the best way to craft the query?
>
> so it would be,in relational DB
>
>    SELECT * FROM documents WHERE query ='search term' and document_id in
> [.];
>
> Thanks :)
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-tp502245p3637150.html
> Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Lance Norskog
goks...@gmail.com


Re: Solr Distributed Search vs Hadoop

2011-12-28 Thread Lance Norskog
Here is an example of schema design: a PDF file of 5 MB might have
maybe 50 KB of actual text. The Solr ExtractingRequestHandler will find
that text and index only that. If you set the field to stored=true, that
content is also saved in the index; if stored=false, it is not saved,
and instead you would store a link to the original PDF.

One problem with indexing is that Solr continually copies data into
"segments" (index parts) while you index. So, each 5 MB PDF might get
copied 50 times during a full index job. If you can strip the index
down to what you really want to search on, terabytes become gigabytes.
Solr seems to handle 100-200 GB fine on modern hardware.
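
A sketch of what that looks like in schema.xml (field and type names are only
examples): the extracted text is indexed but not stored, and a small stored
field keeps the link back to the original file:

  <field name="text" type="text_general" indexed="true" stored="false"/>
  <field name="url"  type="string"       indexed="true" stored="true"/>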

Lance

On Fri, Dec 23, 2011 at 1:54 AM, Nick Vincent  wrote:
> For data of this size you may want to look at something like Apache
> Cassandra, which is made specifically to handle data at this kind of
> scale across many machines.
>
> You can still use Hadoop to analyse and transform the data in a
> performant manner, however it's probably best to do some research on
> this on the relevant technical forums for those technologies.
>
> Nick



-- 
Lance Norskog
goks...@gmail.com


Re: Migration from Solr 1.4 to Solr 3.5

2011-12-28 Thread Lance Norskog
> me to reindex millions of data?
>> >>
>> >> • Are there any migration tool (or any other means?) available that would
>> >> convert old indexes (1.4) to new format (3.5)?
>> >>
>> >> • Consider this case.
>> >> http://myserver:8080/solr/mainindex/select/?q=solr&start=0&rows=10&shards=myserver:8080/solr/index1,myserver:8080/solr/mainindex,remoteserver:8080/solr/remotedata.
>> >> In this example, consider that 'myserver' has been upgraded with Solr 3.5,
>> >> but 'remoteserver' is still using Solr 1.4. The question is, would data
>> >> from remoteserver's Solr instance come/parsed fine or, would it cause
>> >> issues? If it results into issues, then of what type? how to resolve them?
>> >> Please suggest.
>> >>
>> >> • We are using various features of Solr like, searching, faceting,
>> >> spellcheck and highlighting. Will migrating from 1.4 to 3.5 cause any 
>> >> break
>> >> in functionality? is there anything changed in response XML format of here
>> >> mentioned features?
>> >>
>> >>  Thanks in advance,
>> >>
>> >> Bhavnik
>> >> **
>>
>>



-- 
Lance Norskog
goks...@gmail.com


Re: How to run the solr dedup for the document which match 80% or match almost.

2011-12-28 Thread Lance Norskog
You would have to implement this yourself in your indexing code. Solr
has an analysis plugin which does the analysis for your text and then
returns the result, but does not query or index. You can use this to
calculate the fuzzy hash, then search against the index.

You might be able to code this in an UpdateRequestProcessor.
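
The built-in fuzzy hash is TextProfileSignature, which can be plugged into an
update chain via SignatureUpdateProcessorFactory. A rough, untested sketch for
solrconfig.xml (field names are placeholders, and the chain still has to be
referenced from your update handler):

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">content</str>
      <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

With overwriteDupes=true, a new document whose fuzzy signature matches an
existing one replaces it; the signature field must also exist in schema.xml.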

On Tue, Dec 27, 2011 at 9:45 PM, vibhoreng04  wrote:
> Hi Shashi,
>
> That's correct! But I need something for index-time comparison. Can cosine
> similarity compare the already indexed documents against the incrementally
> indexed files?
>
>
>
> Regards,
>
>
> Vibhor
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-run-the-solr-dedup-for-the-document-which-match-80-or-match-almost-tp3614239p3615787.html
> Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Lance Norskog
goks...@gmail.com


Re: solr keep old docs

2011-12-28 Thread Lance Norskog
The SignatureUpdateProcessor is for exactly this problem:

http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication

On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
 wrote:
> I get docs from external sources and the only place I keep them is the Solr
> index. I have no database or other means to track indexed docs (my
> personal opinion is that it would be a huge headache).
>
> Some docs might change slightly in their original sources, but I don't need
> those changes. In fact I need the original data only.
>
> So I have no other way but to either check whether a document is already in
> the index before I put it into the SolrJ array (read: query Solr), or develop
> my own update chain processor, implement an ID check there, and skip such docs.
>
> Maybe this is the wrong place to argue, and it has probably been discussed
> before, but I wonder why the simple overwrite parameter doesn't work here.
>
> My opinion is that it suits this case perfectly. In combination with a unique
> ID it can cover all possible variants.
>
> cases:
>
> 1. overwrite=true and uniqueID exists, then the newer doc should overwrite the
> old one.
>
> 2. overwrite=false and uniqueID exists then newer doc must be skipped since
> old exists.
>
> 3. uniqueID doesn't exist then newer doc just gets added regardless if old
> exists or not.
>
>
> Best Regards
> Alexander Aristov
>
>
> On 27 December 2011 22:53, Erick Erickson  wrote:
>
>> Mikhail is right as far as I know, the assumption built into Solr is that
>> duplicate IDs (when  is defined) should trigger the old
>> document to be replaced.
>>
>> what is your system-of-record? By that I mean what does your SolrJ
>> program do to send data to Solr? Is there any way you could just
>> *not* send documents that are already in the Solr index based on,
>> for instance, any timestamp associated with your system-of-record
>> and the last time you did an incremental index?
>>
>> Best
>> Erick
>>
>> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
>>  wrote:
>> > Hi
>> >
>> > I am not using database. All needed data is in solr index that's why I
>> want
>> > to skip excessive checks.
>> >
>> > I will check DIH but not sure if it helps.
>> >
>> > I am fluent with Java and it's not a problem for me to write a class or
>> so
>> > but I want to check first  maybe there are any ways (workarounds) to make
>> > it working without codding, just by playing around with configuration and
>> > params. I don't want to go away from default solr implementation.
>> >
>> > Best Regards
>> > Alexander Aristov
>> >
>> >
>> > On 27 December 2011 09:33, Mikhail Khludnev > >wrote:
>> >
>> >> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
>> >> alexander.aris...@gmail.com> wrote:
>> >>
>> >> > Hi people,
>> >> >
>> >> > I urgently need your help!
>> >> >
>> >> > I have Solr 3.3 configured and running. I do incremental indexing 4
>> >> times a
>> >> > day using bulk updates. Some documents are identical to some extent
>> and I
>> >> > wish to skip them, not to index.
>> >> > But here is the problem as I could not find a way to tell solr ignore
>> new
>> >> > duplicate docs and keep old indexed docs. I don't care that it's new.
>> >> Just
>> >> > determine by ID that such document is in the index already and that's
>> it.
>> >> >
>> >> > I use solrj for indexing. I have tried setting overwrite=false and
>> dedupe
>> >> > apprache but nothing helped me. I either have that a newer doc
>> overwrites
>> >> > old one or I get duplicate.
>> >> >
>> >> > I think it's a very simple and basic feature and it must exist. What
>> did
>> >> I
>> >> > make wrong or didn't do?
>> >> >
>> >>
>> >> I guess, because  the mainstream approach is delta-import , when you
>> have
>> >> "updated" timestamps in your DB and "last-import" timestamp stored
>> >> somewhere. You can check how it works in DIH.
>> >>
>> >>
>> >> >
>> >> > Tried Google but I couldn't find a solution there, although many people
>> >> > have encountered this problem.
>> >> >
>> >> >
>> >

Re: Identifying common text in documents

2011-12-24 Thread Lance Norskog
Great topic!

1) SignatureUpdateProcessor creates a hash of the exact byte stream of
the document. Often your crawling software can't do an incremental
update of your data, but can only re-index the entire corpus. The SUP
makes the hash, searches for it, and if it is already there the document
indexer says "all done, give me the next document" without doing
anything.

2) TextProfileSignature does roughly the same, but operates on a
version of the document that has been analyzed. I'm not sure what
inspired it, but here is a wild guess: if you change some formatting in
an HTML page and re-index it, then since TextProfileSignature only sees
the text, it will ignore the formatting change and the hashes will still
match. (Maybe.)

3) The Mahout project includes a batch process: it takes all of your
documents, cuts them up into pieces in the same way that
TextProfileSignature does, and then compares all of them to each other.
It uses Bayes' theorem to score the distances probabilistically. This
can be run on many machines simultaneously via Hadoop. I don't know if
it has been run on Wikipedia, but it should work.

Something like #3 could be done in Solr.

On Sat, Dec 24, 2011 at 12:41 PM, Mike O'Leary  wrote:
> I am looking for a way to identify blocks of text that occur in several 
> documents in a corpus for a research project with electronic medical records. 
> They can be copied and pasted sections inserted into another document, text 
> from a previous email in the corpus that is repeated in a follow-up email, 
> text templates that get inserted into groups of documents, and occurrences of 
> the same template more than once in the same document. Any of these 
> duplicated text blocks may contain minor differences from one instance to 
> another.
>
> I read in a document called "What's new in Solr 1.4" that there has been 
> support since 1.4 came out for duplicate text detection using the 
> SignatureUpdateProcessor and TextProfileSignature classes. Can these be used 
> to detect portions of documents that are alike or nearly alike, or are they 
> intended to detect entire documents that are alike or nearly alike? Has 
> additional support for duplicate detection been added to Solr since 1.4? It 
> seems like some of the features of Solr and Lucene such as term positions and 
> shingling could help in finding sections of matching or nearly matching text 
> in documents. Does anyone have any experience in this area that they would be 
> willing to share?
> Thanks,
> Mike



-- 
Lance Norskog
goks...@gmail.com


Re: UUID field changed when document is updated

2011-12-07 Thread Lance Norskog
http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/UniqueKey

On Wed, Dec 7, 2011 at 5:04 PM, Lance Norskog  wrote:

> Yes, the SignatureUpdateProcessor is what you want. The 128-bit hash is
> exactly what you want to use in this situation.  You will never get the
> same ID for two urls- collisions have never been observed "in the wild" for
> this hash algorithm.
>
> Another cool thing about using hash-codes as fields is this: you can give
> the first few letters of the code and a wildcard to get a random subset of
> the index with a given size. For example, 0a0* gives 1/(16^3) of the index.
>
> On Wed, Dec 7, 2011 at 2:48 AM, blaise thomson wrote:
>
>> Hi Hoss,
>>
>> Thanks for getting back to me on this.
>>
>> : I've been trying to use the UUIDField in solr to maintain ids of the
>> >: pages I've crawled with nutch (as per
>> >: http://wiki.apache.org/solr/UniqueKey). The use case is that I want to
>> >: have the server able to use these ids in another database for various
>> >: statistics gathering. So I want the link url to act like a primary key
>> >: for determining if a page exists, and if it doesn't exist to generate a
>> >: new uuid.
>> >
>> >
>> >i'm confused ... if you want the URL to be the primary key, then use the
>> >URL as the primary key, why use the UUID Field at all?
>>
>> I do use the URL as the primary key. The thing is that I want to have a
>> fixed length id for the document so that I can reference it in another
>> database. For example, if I want to count clicks of the url, then I was
>> thinking of using a mysql database along with solr, where each document id
>> has a count of the clicks. I didn't want to use the url itself in that db
>> because of its arbitrary length.
>>
>>
>> : 2. Looking at the code for UUIDField (relevant bit pasted below), it
>> >: seems that the UUID is just generated randomly. There is no check if
>> the
>> >: generated UUID has already been used.
>> >
>> >
>> >correct ... if you specify "NEW" then it generates a new UUID for you --
>> >if you wnat to update an existing doc with an existing UUID then you need
>> >to send the real, existing, value of the UUID for the doc you are
>> >updating.
>> >
>> >
>> >: I can sort of solve this problem by generating the UUID myself, as a
>> >: hash of the link url, but that doesn't help me for those random cases
>> >: when the hash might happen to generate the same UUID.
>> >:
>> >: Does anyone know if there is a way for solr to only add a uuid if the
>> >: document doesn't already exist?
>> >
>> >
>> >I don't really understand your second sentence, but based on that first
>> >sentence it sounds like what you want may be to use something like the
>> >SignatureUpdateProcessor to generate a hash based on the URL...
>> >
>> >
>> >https://wiki.apache.org/solr/Deduplication
>>
>>
>> I didn't know actually about this, so thanks for sharing. I'm not sure it
>> does exactly what I want though. I think it is more for checking if the two
>> docs are the same, which for my purposes, the url works fine for.
>>
>> I think I've sort of come to realise that generating a uuid from the url
>> might be the way to go. There is a chance of getting the same uuid from
>> different urls, but it's only 1 in 2^128, so it's basically non-existant.
>>
>> Thanks again,
>> Blaise
>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
>


-- 
Lance Norskog
goks...@gmail.com


Re: UUID field changed when document is updated

2011-12-07 Thread Lance Norskog
Yes, the SignatureUpdateProcessor is what you want. The 128-bit hash is
exactly what you want to use in this situation. You will never get the
same ID for two URLs; collisions have never been observed "in the wild"
for this hash algorithm.

Another cool thing about using hash-codes as fields is this: you can give
the first few letters of the code and a wildcard to get a random subset of
the index with a given size. For example, 0a0* gives 1/(16^3) of the index.
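
For example, if the hash lives in a field called 'signature' (the field name is
just an example), a filter like this pulls roughly a 1/4096 sample:

  q=*:*&fq=signature:0a0*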

On Wed, Dec 7, 2011 at 2:48 AM, blaise thomson  wrote:

> Hi Hoss,
>
> Thanks for getting back to me on this.
>
> : I've been trying to use the UUIDField in solr to maintain ids of the
> >: pages I've crawled with nutch (as per
> >: http://wiki.apache.org/solr/UniqueKey). The use case is that I want to
> >: have the server able to use these ids in another database for various
> >: statistics gathering. So I want the link url to act like a primary key
> >: for determining if a page exists, and if it doesn't exist to generate a
> >: new uuid.
> >
> >
> >i'm confused ... if you want the URL to be the primary key, then use the
> >URL as the primary key, why use the UUID Field at all?
>
> I do use the URL as the primary key. The thing is that I want to have a
> fixed length id for the document so that I can reference it in another
> database. For example, if I want to count clicks of the url, then I was
> thinking of using a mysql database along with solr, where each document id
> has a count of the clicks. I didn't want to use the url itself in that db
> because of its arbitrary length.
>
>
> : 2. Looking at the code for UUIDField (relevant bit pasted below), it
> >: seems that the UUID is just generated randomly. There is no check if the
> >: generated UUID has already been used.
> >
> >
> >correct ... if you specify "NEW" then it generates a new UUID for you --
> >if you wnat to update an existing doc with an existing UUID then you need
> >to send the real, existing, value of the UUID for the doc you are
> >updating.
> >
> >
> >: I can sort of solve this problem by generating the UUID myself, as a
> >: hash of the link url, but that doesn't help me for those random cases
> >: when the hash might happen to generate the same UUID.
> >:
> >: Does anyone know if there is a way for solr to only add a uuid if the
> >: document doesn't already exist?
> >
> >
> >I don't really understand your second sentence, but based on that first
> >sentence it sounds like what you want may be to use something like the
> >SignatureUpdateProcessor to generate a hash based on the URL...
> >
> >
> >https://wiki.apache.org/solr/Deduplication
>
>
> I didn't know actually about this, so thanks for sharing. I'm not sure it
> does exactly what I want though. I think it is more for checking if the two
> docs are the same, which for my purposes, the url works fine for.
>
> I think I've sort of come to realise that generating a uuid from the url
> might be the way to go. There is a chance of getting the same uuid from
> different urls, but it's only 1 in 2^128, so it's basically non-existant.
>
> Thanks again,
> Blaise




-- 
Lance Norskog
goks...@gmail.com


Re: is there a way using 1.4 index at 4.0 trunk?

2011-11-30 Thread Lance Norskog
No, you will have to upgrade your index. See the wiki for more information.
(To my knowledge, you should be able to drop in your 1.4 (.1?) schema.xml
and re-index.)

On Wed, Nov 30, 2011 at 6:44 PM, Jason, Kim  wrote:

> Hello,
> I'm using solr 1.4 version.
> I want to use some plugin in trunk version.
> But I got IndexFormatTooOldException when it run old version index at
> trunk.
> Is there a way using 1.4 index at 4.0 trunk?
>
> Thanks,
> Jason
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/is-there-a-way-using-1-4-index-at-4-0-trunk-tp3550430p3550430.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: [Profiling] How to profile/tune Solr server

2011-11-06 Thread Lance Norskog
NewRelic also offers an online Solr monitoring tool. You can sign up via
the Lucid Imagination site.

http://www.lucidimagination.com/search/?q=new+relic#/s:lucid

On Sun, Nov 6, 2011 at 3:31 AM, Sujatha Arun  wrote:

> hi ,
>
> I am planning to try Sematext Monitoring. Is there anything to watch out
> for ?
>
> Regards
> Sujatha
>
>
>
> On Fri, Nov 4, 2011 at 9:21 PM,  wrote:
>
> > Hi Spark,
> >
> > 2009 there was a monitor from lucidimagination:
> >
> >
> http://www.lucidimagination.com/about/news/releases/lucid-imagination-releases-performance-monitoring-utility-open-source-apache-lucene
> >
> > A colleague of mine calls the sematext-monitor "trojan" because "SPM
> phone
> > home":
> > "Easy in, easy out - if you try SPM and don't like it, simply stop and
> > remove the small client-side piece that sends us your data"
> > http://sematext.com/spm/solr-performance-monitoring/index.html
> >
> > Looks like other people using a "real profiler" like YourKit Java
> Profiler
> > http://forums.yourkit.com/viewtopic.php?f=3&t=3850
> >
> > There is also an article about Zabbix
> >
> >
> http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/
> >
> > In your case any profiler would do, but if you find out a Profiler with
> > solr-specific default-filter let me know.
> >
> >
> >
> > Best regrads
> >  Karsten
> >
> > P.S. eMail in context
> >
> >
> http://lucene.472066.n3.nabble.com/Profiling-How-to-profile-tune-Solr-server-td3467027.html
> >
> >  Original-Nachricht 
> > > Datum: Mon, 31 Oct 2011 18:35:32 +0800
> > > Von: yu shen 
> > > An: solr-user@lucene.apache.org
> > > Betreff: Re: [Profiling] How to profile/tune Solr server
> >
> > > No idea so far, try to figure out.
> > >
> > > Spark
> > >
> > > 2011/10/31 Jan Høydahl 
> > >
> > > > Hi,
> > > >
> > > > There are no official tools other than looking at the built-in stats
> > > pages
> > > > and perhaps using JConsole or similar JVM monitoring tools. Note that
> > > > Solr's JMX capabilities may let you hook your enterprise's existing
> > > > monitoring dashboard up with Solr.
> > > >
> > > > Also check out the new monitoring service from Sematext which will
> give
> > > > you graphs and all. So far it's free evaluation:
> > > > http://sematext.com/spm/index.html
> > > >
> > > > Do you have a clue for why the indexing is slow?
> > > >
> > > > --
> > > > Jan Høydahl, search solution architect
> > > > Cominvent AS - www.cominvent.com
> > > > Solr Training - www.solrtraining.com
> > > >
> > > > On 31. okt. 2011, at 04:59, yu shen wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > I am a solr newbie. I find solr documents easy to access and use,
> > > which
> > > > is
> > > > > really good thing. While my problem is I did not find a solr home
> > > grown
> > > > > profiling/monitoring tool.
> > > > >
> > > > > I set up the server as a multi-core server, each core has
> > > approximately
> > > > 2GB
> > > > > index. And I need to update solr and re-generate index in a real
> time
> > > > > manner (In java code, using SolrJ). Sometimes the update operation
> is
> > > > slow.
> > > > > And it is expected that in a year, the index size may increase to
> > 4GB.
> > > > And
> > > > > I need to do something to prevent performance downgrade.
> > > > >
> > > > > Is there any solr official monitoring & profiling tool for this?
> > > > >
> > > > > Spark
> > > >
> > > >
> >
>



-- 
Lance Norskog
goks...@gmail.com


Re: Stream still in memory after tika exception? Possible memoryleak?

2011-11-06 Thread Lance Norskog
Yes, please open a JIRA for this, with as much info as possible.

Lance

On Thu, Nov 3, 2011 at 9:48 AM, P Williams
wrote:

> Hi All,
>
> I'm experiencing a similar problem to the other's in the thread.
>
> I've recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to
> apache-solr-4.0-2011-10-14_08-56-59.war and then
> apache-solr-4.0-2011-10-30_09-00-00.war to index ~5300 pdfs, of various
> sizes, using the TikaEntityProcessor.  My indexing would run to completion
> and was completely successful under the June build.  The only error was
> readability of the fulltext in highlighting.  This was fixed in Tika 0.10
> (TIKA-611).  I chose to use the October 14 build of Solr because Tika 0.10
> had recently been included (SOLR-2372).
>
> On the same machine without changing any memory settings my initial problem
> is a Perm Gen error.  Fine, I increase the PermGen space.
>
> I've set the "onError" parameter to "skip" for the TikaEntityProcessor.
>  Now I get several (6)
>
> *SEVERE: Exception thrown while getting data*
> *java.net.SocketTimeoutException: Read timed out*
> *SEVERE: Exception in entity :
> tika:org.apache.solr.handler.dataimport.DataImport*
> *HandlerException: Exception in invoking url  # 2975*
>
> pairs.  And after ~3881 documents, with auto commit set unreasonably
> frequently I consistently get an Out of Memory Error
>
> *SEVERE: Exception while processing: f document :
> null:org.apache.solr.handle**r.dataimport.DataImportHandlerException:
> java.lang.OutOfMemoryError: Java heap s**pace*
>
> The stack trace points
> to
> org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
> and org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
> r.java:718).
>
> The October 30 build performs identically.
>
> Funny thing is that monitoring via JConsole doesn't reveal any memory
> issues.
>
> Because the out of Memory error did not occur in June, this leads me to
> believe that a bug has been introduced to the code since then.  Should I
> open an issue in JIRA?
>
> Thanks,
> Tricia
>
> On Tue, Aug 30, 2011 at 12:22 PM, Marc Jacobs  wrote:
>
> > Hi Erick,
> >
> > I am using Solr 3.3.0, but with 1.4.1 the same problems.
> > The connector is a homemade program in the C# programming language and is
> > posting via http remote streaming (i.e.
> >
> >
> http://localhost:8080/solr/update/extract?stream.file=/path/to/file.doc&literal.id=1
> > )
> > I'm using Tika to extract the content (comes with the Solr Cell).
> >
> > A possible problem is that the filestream needs to be closed, after
> > extracting, by the client application, but it seems that there is going
> > something wrong while getting a Tika-exception: the stream never leaves
> the
> > memory. At least that is my assumption.
> >
> > What is the common way to extract content from officefiles (pdf, doc,
> rtf,
> > xls etc) and index them? To write a content extractor / validator
> yourself?
> > Or is it possible to do this with the Solr Cell without getting a huge
> > memory consumption? Please let me know. Thanks in advance.
> >
> > Marc
> >
> > 2011/8/30 Erick Erickson 
> >
> > > What version of Solr are you using, and how are you indexing?
> > > DIH? SolrJ?
> > >
> > > I'm guessing you're using Tika, but how?
> > >
> > > Best
> > > Erick
> > >
> > > On Tue, Aug 30, 2011 at 4:55 AM, Marc Jacobs 
> wrote:
> > > > Hi all,
> > > >
> > > > Currently I'm testing Solr's indexing performance, but unfortunately
> > I'm
> > > > running into memory problems.
> > > > It looks like Solr is not closing the filestream after an exception,
> > but
> > > I'm
> > > > not really sure.
> > > >
> > > > The current system I'm using has 150GB of memory and while I'm
> indexing
> > > the
> > > > memoryconsumption is growing and growing (eventually more then 50GB).
> > > > In the attached graph I indexed about 70k of office-documents
> > > (pdf,doc,xls
> > > > etc) and between 1 and 2 percent throws an exception.
> > > > The commits are after 64MB, 60 seconds or after a job (there are 6
> > evenly
> > > > divided jobs).
> > > >
> > > > After indexing the memoryconsumption isn't dropping. Even after an
> > > optimize
> > > > command it's still there.
> > > > What am I doing wrong? I can't imagine I'm the only one with this
> > > problem.
> > > > Thanks in advance!
> > > >
> > > > Kind regards,
> > > >
> > > > Marc
> > > >
> > >
> >
>



-- 
Lance Norskog
goks...@gmail.com


Re: DIH doesn't handle bound namespaces?

2011-11-04 Thread Lance Norskog
Yes, the XPath support in DIH is a custom, lightweight implementation
built for speed.

There is a separate option for running a full XSL processor:
http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1

I think this lets you run real XSL on input files, and I assume it lets
you plug in your favorite XSL implementation.
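
If I remember the wiki correctly, it hangs off XPathEntityProcessor via an
'xsl' attribute; a sketch, with invented entity and file names:

  <entity name="page"
          processor="XPathEntityProcessor"
          url="${files.fileAbsolutePath}"
          xsl="xslt/flatten-namespaces.xsl"
          useSolrAddSchema="true"/>

With useSolrAddSchema set, the XSL is expected to transform the input into the
standard Solr <add><doc> format.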

On Thu, Nov 3, 2011 at 12:45 PM, Chris Hostetter
wrote:

>
> : *It does not support namespaces , but it can handle xmls with namespaces
> .
>
> The real crux of hte issue is that XPathEntityProcessor is terribly named.
> it should have been called "LimitedXPathishSyntaxEntityProcessor" or
> something like that because it doesn't support full xpath syntax...
>
> "The XPathEntityProcessor implements a streaming parser which supports a
> subset of xpath syntax. Complete xpath syntax is not supported but most of
> the common use cases are covered..."
>
> ...i thought there was a DIH FAQ about this, but if not there really
> should be.
>
>
> -Hoss
>



-- 
Lance Norskog
goks...@gmail.com


Re: Xsl for query output

2011-10-13 Thread Lance Norskog
http://wiki.apache.org/solr/XsltResponseWriter

This is for the single-core example. It is easiest to just go to
solr/example, run java -jar start.jar, and hit the URL in the above wiki
page. Then poke around in solr/example/solr/conf/xslt. There is no
solrconfig.xml change needed.
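
With the example running, a request like this returns the transformed output
(stock port and stylesheet name):

  http://localhost:8983/solr/select?q=*:*&wt=xslt&tr=example.xsl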

It is generally easiest to use the solr/example 'java -jar start.jar'
example to test out features. It is easy to break configuration linkages.

Lance

On Thu, Oct 13, 2011 at 12:42 PM, Jeremy Cunningham <
jeremy.cunningham.h...@statefarm.com> wrote:

> I am new to solr and not a web developer.  I am a data warehouse guy trying
> to use solr for the first time.  I am familiar with xsl but I can't figure
> out how to get the example.xsl to be applied to my xml results.  I am
> running tomcat and have solr working.  I copied over the solr mulitiple core
> example to the conf directory on my tomcat server. I also added the war file
> and the search is fine.  I can't seem to figure out what I need to add to
> the solrcofig.xml or where ever so that the example.xsl is used.  Basically
> can someone tell me where to put the xsl and where to configure its usage?
>
> Thanks
>



-- 
Lance Norskog
goks...@gmail.com


Re: SOLR HttpCache Qtime

2011-10-04 Thread Lance Norskog
Solr supports having the browser cache the results. If your client code
supports this caching, or your code goes through an HTTP cacher like Squid,
it could return a cached page for a query. Is this what you mean?
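
For reference, the headers Solr sends are controlled by the httpCaching element
in solrconfig.xml; the stock example looks roughly like this:

  <requestDispatcher handleSelect="true">
    <httpCaching lastModifiedFrom="openTime" etagSeed="Solr">
      <cacheControl>max-age=30, public</cacheControl>
    </httpCaching>
  </requestDispatcher>

Setting <httpCaching never304="true"/> instead tells Solr not to send
cache-validation headers at all, which is the simplest way to keep HTTP caches
from serving stale responses (and stale QTimes).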

On Tue, Oct 4, 2011 at 4:55 PM, Nicholas Chase  wrote:

> Seems to me what you're asking is how to have an accurate query time when
> you're getting a response that's been cached by an HTTP cache.  This might
> be from the browser, or from a proxy, or from something else, but it's not
> from Solr.  The reason that the QTime doesn't change is because it's the
> entire response -- results, parameters, Qtime, and all -- that's cached.
>  Solr isn't making a new request; it doesn't even know that a request has
> been made.  So if you do 6 requests, and the last 5 come from the cache,
> Solr has done only one request, with one Qtime.
>
> So it sounds to me that you are looking for the RESPONSE time, which would
> be different from the QTime, and would, I suppose, come from your
> application, and not from Solr.
>
>   Nick
>
>
> On 10/4/2011 7:44 PM, Erick Erickson wrote:
>
>> Still doesn't make sense to me. There is no
>> Solr HTTP cache that I know of. There is a
>> queryResultCache. There is a filterCache.
>> There is a documentCache. There's may
>> even be custom cache implementations.
>> There's a fieldValueCache. There's
>> no http cache internal to Solr as far as I
>> can tell.
>>
>> If you're asking if documents returned from
>> the queryResultCache have QTimes that
>> reflect the actual time spent (near 0), I'm
>> pretty sure the answer is "yes".
>>
>> If this doesn't answer your question, please
>> take the time to formulate a complete question.
>> It'll get you your answers quicker than multiple
>> twitter-style exchanges.
>>
>> Best
>> Erick
>>
>> On Tue, Oct 4, 2011 at 2:22 PM, Lord Khan Han
>>  wrote:
>>
>>> I just want to be sure..  because its solr internal HTTP cache.. not an
>>> outside httpcacher
>>>
>>> On Tue, Oct 4, 2011 at 5:39 PM, Erick 
>>> Erickson
>>> >wrote:
>>>
>>>  But if the HTTP cache is what's returning the value,
>>>> Solr never sees anything at all, right? So Solr
>>>> doesn't have a chance to do anything here.
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Tue, Oct 4, 2011 at 9:24 AM, Lord Khan Han
>>>> wrote:
>>>>
>>>>> We are using this QTime field and publishing it in our front-end web app. Even
>>>>> though the httpCache reduces the real query time, it still returns the cached
>>>>> old QTime value. We can use our internal query time instead of Solr's, but I
>>>>> just wonder whether there is any way to tell Solr to re-calculate the QTime
>>>>> when the response comes from the httpCache.
>>>>>
>>>>>> On Tue, Oct 4, 2011 at 4:16 AM, Erick Erickson wrote:
>>>>>
>>>>>  Why do you want to? QTime is the time Solr
>>>>>> spends searching. The cached value will,
>>>>>> indeed, be from the query that filled
>>>>>> in the HTTP cache. But what are you doing
>>>>>> with that information that you want to "correct"
>>>>>> it?
>>>>>>
>>>>>> That said, I have no clue how you'd attempt to
>>>>>> do this.
>>>>>>
>>>>>> Best
>>>>>> Erick
>>>>>>
>>>>>> On Sat, Oct 1, 2011 at 5:55 PM, Lord Khan Han>>>>> >
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Is there any way to get the correct QTime when we use HTTP caching? I think
>>>>>>> Solr is also caching the QTime, so it gives the same QTime in the response no
>>>>>>> matter how long the query actually takes. How can I set the QTime correctly
>>>>>>> from Solr when I have HTTP caching on?
>>>>>>>
>>>>>>> thanks
>>>>>>>
>>>>>>>
>>


-- 
Lance Norskog
goks...@gmail.com


Re: Bug in DIH?

2011-10-01 Thread Lance Norskog
Should bugs in the LogProcessor be ignored by DIH? It is not required
to index data, right?

Please open an issue for this. The fix should have two parts:
1) fix the exception
2) log and ignore exceptions in the LogProcessor

On Sat, Oct 1, 2011 at 2:02 PM, Pulkit Singhal wrote:

> It's a rather strange stacktrace (at the bottom).
> An entire 1+ dataset finishes up only to end up crashing & burning
> due to a log statement :)
>
> Based on what I can tell from the stacktrace and the 4.x trunk source
> code, it seems that the following log statement dies:
>//LogUpdateProcessorFactory.java:188
>log.info( ""+toLog + " 0 " + (elapsed) );
>
> Eventually at the strict cast:
>//NamedList.java:127
>return (String)nvPairs.get(idx << 1);
>
> I was wondering what kind of mistaken data would I have ended up
> getting misplaced into:
>//LogUpdateProcessorFactory.java:76
>private final NamedList toLog;
>
> To cause the java.util.ArrayList cannot be cast to java.lang.String issue?
> Could it be due to the multivalued fields that I'm trying to index?
> Is this a bug or just a mistake in how I use DIH, please let me know
> your thoughts!
>
> SEVERE: Full Import failed:java.lang.ClassCastException:
> java.util.ArrayList cannot be cast to java.lang.String
>at org.apache.solr.common.util.NamedList.getName(NamedList.java:127)
>at
> org.apache.solr.common.util.NamedList.toString(NamedList.java:263)
>at java.lang.String.valueOf(String.java:2826)
>at java.lang.StringBuilder.append(StringBuilder.java:115)
>at
> org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:188)
>at
> org.apache.solr.handler.dataimport.SolrWriter.close(SolrWriter.java:57)
>at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:265)
>at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:372)
>at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:440)
>at
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:421)
>



-- 
Lance Norskog
goks...@gmail.com


Re: strange performance issue with many shards on one server

2011-09-28 Thread Lance Norskog
Some cache hit problems can be fixed with the Large Pages feature.

http://www.google.com/search?q=large+pages

On Wed, Sep 28, 2011 at 3:30 PM, Federico Fissore wrote:

> Frederik Kraus, on 28/09/2011 23:16, wrote:
>
>   Yep, I'm not getting more than 50-60% CPU during those load tests.
>>
>>
> I would try reducing the number of shards. Apart from the memory
> discussion, this really seems to me a concurrency issue: too many threads
> waiting for other threads to complete, too many context switches...
>
> recently, on a lots-of-cores database server, we INCREASED speed by
> REDUCING the number of cores/threads each query was allowed to use (making
> sense of our customer investment)
> maybe you can get a similar effect by reducing the number of pieces your
> distributed search has to merge
>
> my 2 eurocents
>
> federico
>



-- 
Lance Norskog
goks...@gmail.com


Re: ClassCastException: SmartChineseWordTokenFilterFactory to TokenizerFactory

2011-09-15 Thread Lance Norskog
Tokenizers and TokenFilters are different. Look in the schema for how other
TokenFilterFactory classes are used.
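Something along these lines should work (an untested sketch; the factory names come
from the analysis-extras contrib, so double-check them against the jars you actually
have on the classpath):

  <fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
      <filter class="solr.SmartChineseWordTokenFilterFactory"/>
    </analyzer>
  </fieldType>

The word filter belongs in a <filter> element; only a TokenizerFactory can go in
<tokenizer>, which is exactly what the ClassCastException is complaining about.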

On Thu, Sep 15, 2011 at 8:05 PM, Xue-Feng Yang  wrote:

> Hi all,
>
> I am trying to use SmartChineseWordTokenFilterFactory in solr 3.4.0, but
> come to the error
>
> SEVERE: java.lang.ClassCastException:
> org.apache.solr.analysis.SmartChineseWordTokenFilterFactory cannot be cast
> to org.apache.solr.analysis.TokenizerFactory
>
>
> Any thought?




-- 
Lance Norskog
goks...@gmail.com


Re: Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Lance Norskog
http://aws.amazon.com/datasets

DBPedia might be the easiest to work with:
http://aws.amazon.com/datasets/2319

Amazon has a lot of these things.
Infochimps.com is a marketplace for free & pay versions.


Lance

On Thu, Sep 15, 2011 at 6:55 PM, Pulkit Singhal wrote:

> Ah missing } doh!
>
> BTW I still welcome any ideas on how to build an e-commerce test base.
> It doesn't have to be amazon, that was just my approach. Anyone?
>
> - Pulkit
>
> On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal 
> wrote:
> > Thanks for all the feedback thus far. Now to get a little technical about
> it :)
> >
> > I was thinking of putting all the amazon tags that yield close to
> > roughly 5 results each into a file and then running my RSS DIH off of
> > that. I came up with the following config but something is amiss; can
> > someone please point out what is off about this?
> >
> >
> > >processor="LineEntityProcessor"
> >url="file:///xxx/yyy/zzz/amazonfeeds.txt"
> >rootEntity="false"
> >dataSource="myURIreader1"
> >transformer="RegexTransformer,DateFormatTransformer"
> >>
> > >pk="link"
> >url="${amazonFeeds.rawLine"
> >processor="XPathEntityProcessor"
> >forEach="/rss/channel | /rss/channel/item"
> >
> >
> transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
> > ...
> >
> > The rawline should feed into the url key but instead i get:
> >
> > Caused by: java.net.MalformedURLException: no protocol:
> > null${amazonFeeds.rawLine
> >at
> org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)
> >
> > Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2
> rollback
> > INFO: start rollback
> >
> > Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter
> rollback
> > SEVERE: Exception while solr rollback.
> >
> > Thanks in advance!
> >
> > On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
> >  wrote:
> >> If we want to test with huge amounts of data we feed portions of the
> internet.
> >> The problem is it takes a lot of bandwith and lots of computing power to
> get
> >> to a `reasonable` size. On the positive side, you deal with real text so
> it's
> >> easier to tune for relevance.
> >>
> >> I think it's easier to create a simple XML generator with mock data,
> prices,
> >> popularity rates etc. It's fast to generate millions of mock products
> and once
> >> you have a large quantity of XML files, you can easily index, test,
> change
> >> config or schema and reindex.
> >>
> >> On the other hand, the sample data that comes with the Solr example is a
> good
> >> set as well as it proves the concepts well, especially with the stock
> Velocity
> >> templates.
> >>
> >> We know Solr will handle enormous sets but quantity is not always a part
> of a
> >> PoC.
> >>
> >>> Hello Everyone,
> >>>
> >>> I have a goal of populating Solr with a million unique products in
> >>> order to create a test environment for a proof of concept. I started
> >>> out by using DIH with Amazon RSS feeds but I've quickly realized that
> >>> there's no way I can glean a million products from one RSS feed. And
> >>> I'd go mad if I just sat at my computer all day looking for feeds and
> >>> punching them into DIH config for Solr.
> >>>
> >>> Has anyone ever had to create large mock/dummy datasets for test
> >>> environments or for POCs/Demos to convince folks that Solr was the
> >>> wave of the future? Any tips would be greatly appreciated. I suppose
> >>> it sounds a lot like crawling even though it started out as innocent
> >>> DIH usage.
> >>>
> >>> - Pulkit
> >>
> >
>



-- 
Lance Norskog
goks...@gmail.com


Re: MMapDirectory failed to map a 23G compound index segment

2011-09-09 Thread Lance Norskog
I remember now: by memory-mapping one block of address space that big, the
garbage collector has problems working around it. If the OOM is repeatable,
you could try watching the app with jconsole and keeping an eye on the memory spaces.

Lance

On Thu, Sep 8, 2011 at 8:58 PM, Lance Norskog  wrote:

> Do you need to use the compound format?
>
> On Thu, Sep 8, 2011 at 3:57 PM, Rich Cariens wrote:
>
>> I should add some more context:
>>
>>   1. the problem index included several cfs segment files that were around
>>   4.7G, and
>>   2. I'm running four SOLR instances on the same box, all of which have
>>   similiar problem indeces.
>>
>> A colleague thought perhaps I was bumping up against my 256,000 open files
>> ulimit. Do the MultiMMapIndexInput ByteBuffer arrays each consume a file
>> handle/descriptor?
>>
>> On Thu, Sep 8, 2011 at 5:19 PM, Rich Cariens 
>> wrote:
>>
>> > FWiW I optimized the index down to a single segment and now I have no
>> > trouble opening an MMapDirectory on that index, even though the 23G cfx
>> > segment file remains.
>> >
>> >
>> > On Thu, Sep 8, 2011 at 4:27 PM, Rich Cariens > >wrote:
>> >
>> >> Thanks for the response. "free -g" reports:
>> >>
>> >>              total       used       free     shared    buffers     cached
>> >> Mem:           141         95         46          0          0         93
>> >> -/+ buffers/cache:          2        139
>> >> Swap:            3          0          3
>> >>
>> >> 2011/9/7 François Schiettecatte 
>> >>
>> >>> My memory of this is a little rusty but isn't mmap also limited by mem
>> +
>> >>> swap on the box? What does 'free -g' report?
>> >>>
>> >>> François
>> >>>
>> >>> On Sep 7, 2011, at 12:25 PM, Rich Cariens wrote:
>> >>>
>> >>> > Ahoy ahoy!
>> >>> >
>> >>> > I've run into the dreaded OOM error with MMapDirectory on a 23G cfs
>> >>> compound
>> >>> > index segment file. The stack trace looks pretty much like every
>> other
>> >>> trace
>> >>> > I've found when searching for OOM & "map failed"[1]. My
>> configuration
>> >>> > follows:
>> >>> >
>> >>> > Solr 1.4.1/Lucene 2.9.3 (plus
>> >>> > SOLR-1969<https://issues.apache.org/jira/browse/SOLR-1969>
>> >>> > )
>> >>> > CentOS 4.9 (Final)
>> >>> > Linux 2.6.9-100.ELsmp x86_64 yada yada yada
>> >>> > Java SE (build 1.6.0_21-b06)
>> >>> > Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
>> >>> > ulimits:
>> >>> >core file size (blocks, -c) 0
>> >>> >data seg size(kbytes, -d) unlimited
>> >>> >file size (blocks, -f) unlimited
>> >>> >pending signals(-i) 1024
>> >>> >max locked memory (kbytes, -l) 32
>> >>> >max memory size (kbytes, -m) unlimited
>> >>> >open files(-n) 256000
>> >>> >pipe size (512 bytes, -p) 8
>> >>> >POSIX message queues (bytes, -q) 819200
>> >>> >stack size(kbytes, -s) 10240
>> >>> >cpu time(seconds, -t) unlimited
>> >>> >max user processes (-u) 1064959
>> >>> >virtual memory(kbytes, -v) unlimited
>> >>> >file locks(-x) unlimited
>> >>> >
>> >>> > Any suggestions?
>> >>> >
>> >>> > Thanks in advance,
>> >>> > Rich
>> >>> >
>> >>> > [1]
>> >>> > ...
>> >>> > java.io.IOException: Map failed
>> >>> > at sun.nio.ch.FileChannelImpl.map(Unknown Source)
>> >>> > at
>> org.apache.lucene.store.MMapDirectory$MMapIndexInput.(Unknown
>> >>> > Source)
>> >>> > at
>> org.apache.lucene.store.MMapDirectory$MMapIndexInput.(Unknown
>> >>> > Source)
>> >>> > at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
>> >>> > at org.apache.lucene.index.SegmentReader$CoreReaders.(Unknown
>> >>> Source)
>> >>> >
>> >>> > at org.apache.lucene.index.SegmentReader.get(Unknown Source)
>> >>> > at org.apache.lucene.index.SegmentReader.get(Unknown Source)
>> >>> > at org.apache.lucene.index.DirectoryReader.(Unknown Source)
>> >>> > at org.apache.lucene.index.ReadOnlyDirectoryReader.(Unknown
>> >>> Source)
>> >>> > at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
>> >>> > at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown
>> >>> > Source)
>> >>> > at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
>> >>> > at org.apache.lucene.index.IndexReader.open(Unknown Source)
>> >>> > ...
>> >>> > Caused by: java.lang.OutOfMemoryError: Map failed
>> >>> > at sun.nio.ch.FileChannelImpl.map0(Native Method)
>> >>> > ...
>> >>>
>> >>>
>> >>
>> >
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
>


-- 
Lance Norskog
goks...@gmail.com


Re: MMapDirectory failed to map a 23G compound index segment

2011-09-08 Thread Lance Norskog
Do you need to use the compound format?

On Thu, Sep 8, 2011 at 3:57 PM, Rich Cariens  wrote:

> I should add some more context:
>
>   1. the problem index included several cfs segment files that were around
>   4.7G, and
>   2. I'm running four SOLR instances on the same box, all of which have
>   similiar problem indeces.
>
> A colleague thought perhaps I was bumping up against my 256,000 open files
> ulimit. Do the MultiMMapIndexInput ByteBuffer arrays each consume a file
> handle/descriptor?
>
> On Thu, Sep 8, 2011 at 5:19 PM, Rich Cariens 
> wrote:
>
> > FWiW I optimized the index down to a single segment and now I have no
> > trouble opening an MMapDirectory on that index, even though the 23G cfx
> > segment file remains.
> >
> >
> > On Thu, Sep 8, 2011 at 4:27 PM, Rich Cariens  >wrote:
> >
> >> Thanks for the response. "free -g" reports:
> >>
> >>              total       used       free     shared    buffers     cached
> >> Mem:           141         95         46          0          0         93
> >> -/+ buffers/cache:          2        139
> >> Swap:            3          0          3
> >>
> >> 2011/9/7 François Schiettecatte 
> >>
> >>> My memory of this is a little rusty but isn't mmap also limited by mem
> +
> >>> swap on the box? What does 'free -g' report?
> >>>
> >>> François
> >>>
> >>> On Sep 7, 2011, at 12:25 PM, Rich Cariens wrote:
> >>>
> >>> > Ahoy ahoy!
> >>> >
> >>> > I've run into the dreaded OOM error with MMapDirectory on a 23G cfs
> >>> compound
> >>> > index segment file. The stack trace looks pretty much like every
> other
> >>> trace
> >>> > I've found when searching for OOM & "map failed"[1]. My configuration
> >>> > follows:
> >>> >
> >>> > Solr 1.4.1/Lucene 2.9.3 (plus
> >>> > SOLR-1969<https://issues.apache.org/jira/browse/SOLR-1969>
> >>> > )
> >>> > CentOS 4.9 (Final)
> >>> > Linux 2.6.9-100.ELsmp x86_64 yada yada yada
> >>> > Java SE (build 1.6.0_21-b06)
> >>> > Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
> >>> > ulimits:
> >>> >core file size (blocks, -c) 0
> >>> >data seg size(kbytes, -d) unlimited
> >>> >file size (blocks, -f) unlimited
> >>> >pending signals(-i) 1024
> >>> >max locked memory (kbytes, -l) 32
> >>> >max memory size (kbytes, -m) unlimited
> >>> >open files(-n) 256000
> >>> >pipe size (512 bytes, -p) 8
> >>> >POSIX message queues (bytes, -q) 819200
> >>> >stack size(kbytes, -s) 10240
> >>> >cpu time(seconds, -t) unlimited
> >>> >max user processes (-u) 1064959
> >>> >virtual memory(kbytes, -v) unlimited
> >>> >file locks(-x) unlimited
> >>> >
> >>> > Any suggestions?
> >>> >
> >>> > Thanks in advance,
> >>> > Rich
> >>> >
> >>> > [1]
> >>> > ...
> >>> > java.io.IOException: Map failed
> >>> > at sun.nio.ch.FileChannelImpl.map(Unknown Source)
> >>> > at
> org.apache.lucene.store.MMapDirectory$MMapIndexInput.(Unknown
> >>> > Source)
> >>> > at
> org.apache.lucene.store.MMapDirectory$MMapIndexInput.(Unknown
> >>> > Source)
> >>> > at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
> >>> > at org.apache.lucene.index.SegmentReader$CoreReaders.(Unknown
> >>> Source)
> >>> >
> >>> > at org.apache.lucene.index.SegmentReader.get(Unknown Source)
> >>> > at org.apache.lucene.index.SegmentReader.get(Unknown Source)
> >>> > at org.apache.lucene.index.DirectoryReader.(Unknown Source)
> >>> > at org.apache.lucene.index.ReadOnlyDirectoryReader.(Unknown
> >>> Source)
> >>> > at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
> >>> > at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown
> >>> > Source)
> >>> > at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
> >>> > at org.apache.lucene.index.IndexReader.open(Unknown Source)
> >>> > ...
> >>> > Caused by: java.lang.OutOfMemoryError: Map failed
> >>> > at sun.nio.ch.FileChannelImpl.map0(Native Method)
> >>> > ...
> >>>
> >>>
> >>
> >
>



-- 
Lance Norskog
goks...@gmail.com


Re: category tree navigation with the help of solr

2011-09-05 Thread Lance Norskog
First rule is: denormalize when possible. Just store a separate document
with each combination of attributes: if the Reebok HF comes in red and blue,
store two documents:
Reebok HF,red
Reebok HF,blue

Then, use grouping and facets to decide what to show.

"Category is hierarchical": Pivot facets, facet.pivot=field1,field2,field3
gives a tree-structured list of all documents with shared values of
this&that facet.
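For example (the field names here are only placeholders for whatever is in your
schema):

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.pivot=category,brand,color

returns counts for each category, broken down by brand and then by color, which maps
directly onto a drill-down tree in the UI.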

"More than one category can share the same name.": In your database you
should have a unique key for every variation of this problem: even if 5 keys
show as "Reebok HF", that is ok. You know internally exactly what these are,
even if they show the same text string.

To search on these strings from clicking on a facet, you need a way to
combine a string to show in the browser with the unique search key. One way
is to add the display string to the key in javascript encoding: the UI
strips off the key and shows the display string, but sends the entire key in
with a search.

On Mon, Sep 5, 2011 at 1:34 AM, Ranveer  wrote:

> Hi Priti,
>
> You can do this by adding an extra field (string type) for facet on which
> you need to send query.
>
>
> 1.One product can belong to more than one categories.
>
> You can put internal flag for that category at index time, and at the time
> of query you can send that flag to query.
>
> More of less same thing I am doing in www.jagranjosh.com and
> videos.jagran.com
>
> regards
> Ranveer
>
>
>
>
> On Monday 05 Septyember 2011 01:09 PM, Tony Qiu wrote:
>
>> Dear Gupta,
>>
>> In my case, I am doing something similar to you.
>> I use two cores: in one core I build the category tree, and in the other I
>> index the product information, including each product's leaf category. When
>> a search comes in, I facet on the leaf category, then fetch the category tree
>> from the category core via the product's leaf cat (or you can cache the
>> category tree in memory).
>>
>> Hope this can help u.
>>
>> 2011/9/5 Priti Gupta
>>
>>  Hi,
>>>
>>> We are using Solr in our ecommerce application. We are indexing on
>>> different attributes of products.
>>> We want to create category tree with the help of solr.
>>>
>>> The following points about categories and products need to be considered:
>>> 1.One product can belong to more than one categories.
>>> 2.category is a hierarchical facet.
>>> 3.More than one categories can share same name.
>>>
>>> It would be a great help if someone can suggest a way to index and query
>>> data based on the above architecture.
>>>
>>> Thanks,
>>> Priti
>>>
>>>
>


-- 
Lance Norskog
goks...@gmail.com


Re: Solr Geodist

2011-08-30 Thread Lance Norskog
Lucid also has an online forum for questions about the LucidWorksEnterprise
product:

http://www.lucidimagination.com/forum/lwe

The Lucid Imagination engineers all read the forum and endeavor to answer
questions like this quickly.

On Tue, Aug 30, 2011 at 6:09 PM, solrnovice  wrote:

> hi Erik, today i had the distance working. Since the solr version under
> LucidImagination is not returning geodist(),  I downloaded Solr 4.0 from
> the
> nightly build. On lucid we had the full schema defined. So i copied that
> schema to the "example" directory of solr-4 and removed all references to
> Lucid and started the index.
> I wanted to try our schema under solr-4.
>
> Then i had the data indexed ( we have a rake written in ruby to index the
> contents) and ran the geodist queries and they all run like a charm. I do
> get distance as a pseudo column.
>
> Is there any documentation that gives me all the arguments of geodist()? I
> couldn't find it online.
>
>
> Erick, thanks for your help in going through my examples. Now they all work
> on my solr installation.
>
>
> thanks
> SN
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Geodist-tp3287005p3297088.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: how to differentiate multiple datasources when building solr query....

2011-08-26 Thread Lance Norskog
Did you mean datasource-1 and datasource-2 ?

On Fri, Aug 26, 2011 at 2:41 AM, vighnesh  wrote:

> hi all
>
> I have two data sources in my data-config file and I need data from the first
> datasource, from the second datasource, and from both. How can I achieve this
> in a Solr query?
>
> example like: first datasource:
>
>
> http://localhost:8983/solr/db/select/?q=newthread&version=2.2&start=0&rows=200&indent=on&datasource=datasource-1
>
> example like: second datasource:
>
>
> http://localhost:8983/solr/db/select/?q=newthread&version=2.2&start=0&rows=200&indent=on&datasource=datasource-1
>
> example like: both datasources:
>
>
> http://localhost:8983/solr/db/select/?q=newthread&version=2.2&start=0&rows=200&indent=on&datasource=datasource-1&datasource=datasource-1
>
> Are these queries correct or not? Please let me know.
>
>
> thanks in advance.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/how-to-differentiate-multiple-datasources-when-building-solr-query-tp3286309p3286309.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Query vs Filter Query Usage

2011-08-25 Thread Lance Norskog
The point of filter queries is that they are applied very early in the
searching algorithm, and thus cut the amount of work later on. Some
complex queries take a lot of time and so this pre-trimming helps a
lot.
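A quick sketch of the idea (field names are made up): keep the relevance part in q
and move the restrictive, frequently reused constraints into fq parameters:

  http://localhost:8983/solr/select?q=ipod&fq=category:electronics&fq=inStock:true

Each fq is applied as a filter and cached separately in the filterCache, so the
expensive part of the query only has to score documents that survive the filters.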

On Thu, Aug 25, 2011 at 2:37 PM, Yonik Seeley
 wrote:
> On Thu, Aug 25, 2011 at 5:19 PM, Michael Ryan  wrote:
>>> 10,000,000 document index
>>> Internal Document id is 32 bit unsigned int
>>> Max Memory Used by a single cache slot in the filter cache = 32 bits x
>>> 10,000,000 docs = 320,000,000 bits or 38 MB
>>
>> I think it depends on where exactly the result set was generated. I believe 
>> the result set will usually be represented by a BitDocSet, which requires 1 
>> bit per doc in your index (result set size doesn't matter), so in your case 
>> it would be about 1.2MB.
>
> Right - and Solr switches between the implementation depending on set
> size... so if the number of documents in the set were 100, then it
> would only take up 400 bytes.
>
> -Yonik
> http://www.lucidimagination.com
>



-- 
Lance Norskog
goks...@gmail.com


Re: Optimize requires 50% more disk space when there are exactly 20 segments

2011-08-24 Thread Lance Norskog
Which Solr version do you have? In 3.x and trunk, the TieredMergePolicy and
BalancedSegmentMergePolicy are there for exactly this reason.

In Solr 1.4, your only trick is to do a partial optimize with
maxSegments. This lets you say "optimize until there are 15 segments,
then stop". Do this with smaller and smaller numbers.
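A partial optimize is just an optimize with the maxSegments attribute, e.g. (adjust
the host and core to your setup):

  curl 'http://localhost:8983/solr/update' -H 'Content-type:text/xml' \
       --data-binary '<optimize maxSegments="15"/>'

then repeat with maxSegments="10", "5", and so on.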

On Wed, Aug 24, 2011 at 8:35 PM, Michael Ryan  wrote:
> I'm using Solr 3.2 with a mergeFactor of 10 and no merge policy configured, 
> thus using the default LogByteSizeMergePolicy.  Before I do an optimize, 
> typically the largest segment will be about 90% of the total index size.
>
> When I do an optimize, the total disk space required is usually about 2x the 
> index size.  But about 10% of the time, the disk space required is about 3x 
> the index size - when this happens, I see a very large segment created, 
> roughly the size of the original index size, followed by another slightly 
> larger segment.
>
> After some investigating, I found that this would happen when there were 
> exactly 20 segments in the index when the optimize started.  My hypothesis is 
> that this is a side-effect of the 20 segments being evenly divisible by the 
> mergeFactor of 10.  I'm thinking that when there are 20 segments, the largest 
> segment is being merged twice - first when merging the 20 segments down to 2, 
> then again when merging from 2 to 1.
>
> I would like to avoid this if at all possible, as it requires 50% more disk 
> space and takes almost twice as long to optimize.  Would using 
> TieredMergePolicy help me here, or some other config I can change?
>
> -Michael
>



-- 
Lance Norskog
goks...@gmail.com


Re: Problem using stop words

2011-08-24 Thread Lance Norskog
A note: in the first schema, you had the stopwords after the stemmer.
This would not work, since the stopwords are not stemmed.
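In other words, keep the StopFilter ahead of the stemmer. A minimal sketch of the
usual ordering (stock Solr factories; adjust to your own schema):

  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>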

On Wed, Aug 24, 2011 at 12:59 AM, _snake_  wrote:
> I forgot to say that my stopwords file is in the same location as the schema
> file and the solrconfig file.
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Problem-using-stop-words-tp3274598p3280319.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr DIH import - Date Question

2011-08-04 Thread Lance Norskog
You might have to do this with an external script. The DIH lets you
process fields with javascript or Groovy.

Also, somewhere in the DIH you can give an XSL stylesheet instead of
just an XPath.
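A rough sketch of the script approach (untested; 'modified_ms' and 'last_modified'
are made-up field names, and it assumes the epoch-milliseconds value arrives as a
string):

  <dataConfig>
    <script><![CDATA[
      function millisToDate(row) {
        var ms = row.get('modified_ms');
        if (ms != null) {
          row.put('last_modified', new java.util.Date(java.lang.Long.parseLong(ms)));
        }
        return row;
      }
    ]]></script>
    ...
    <entity name="doc" transformer="script:millisToDate" ...>

The entity just adds script:millisToDate to its transformer list.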

On Thu, Aug 4, 2011 at 10:31 PM, solruser@9913  wrote:
> This is perhaps a 'truly newbie' question.  I am processing some files via
> DIH handler/XPATH Processor.  Some of the date fields in the XML are in
> 'Java Long format' i.e. just a big long number.  I am wondering how to map
> them Solr Date field.  I used the DIH  DateFormatTransformer for some other
> 'date' fields that were written out in a regular date format.
>
> However I am stumped on this - thought it would be simple but I was not able
> to find a solution 
>
> Any help would be much appreciated
>
> -g
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-DIH-import-Date-Question-tp3227720p3227720.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Best practices for Calculationg QPS

2011-07-18 Thread Lance Norskog
Easiest way to count QPS:

Take one Solr log file. Make it have date stamps and log entries on
the same line.
Grab all lines with 'QTime='.
Strip these lines of all text after the timestamp.
Run this Unix program to get a count of how many times each
timestamp appears in a row:
uniq -c
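Roughly, assuming you have already reformatted the log so each request line starts
with a timestamp like '2011-07-18 17:08:01,123' (the exact cut depends on your log
format):

  grep 'QTime=' solr.log | cut -c1-19 | uniq -c > qps-per-second.txt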

Works a treat. After this I make charts in Excel. Use the "X-Y" or
"Scatter plot" chart. Make the timestamp the X dimension, and the
count the Y dimension. This gets you a plot of QPS.

On Mon, Jul 18, 2011 at 5:08 PM, Erick Erickson  wrote:
> Measure. On an index with your real data with real queries ...
>
> Being a smart-aleck aside, using something like jMeter is useful. The
> basic idea is that you can use such a tool to fire queries at a Solr
> index, configuring it with some number of threads that all run
> in parallel, and keep upping the number of threads until the server
> falls over.
>
> But it's critical that you use your real data. All of it (i.e. don't run with
> a partial set of data and expect the results to hold when you add the
> rest of the data). It's equally critical that you use real queries that
> reflect what the users actually send at your index.
>
> Of course, with a new app, getting "real" user queries isn't possible,
> and you're forced to guess. Which is much better than nothing, but
> you need to monitor what happens when real users do start using your
> system...
>
> Do be aware that what I have seen when doing this is that your
> QPS will plateau, but the response time for each query will
> increase at some threshold...
>
> FWIW
> Erick
>
> On Mon, Jul 18, 2011 at 10:35 AM, Siddhesh Shirode
>  wrote:
>> Hi Everyone,
>>
>> I would like to know the best practices or  best tools for Calculating QPS  
>> in Solr. Thanks.
>>
>> Thanks,
>> SIDDHESH SHIRODE
>> Technical Consultant
>>
>> M +1 240 274 5183
>>
>> SEARCH TECHNOLOGIES
>> THE EXPERT IN THE SEARCH SPACE
>> www.searchtechnologies.com<http://www.searchtechnologies.com>
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: ' invisible ' words

2011-07-11 Thread Lance Norskog
German/Germany is being changed by an English-language stemmer.
Strip your analysis chains down to the minimum and walk through what
happens when you add each step.

On Mon, Jul 11, 2011 at 8:43 PM, deniz  wrote:
> Thank you Erick, I did what you said but unfortunately nothing has
> changed with the problem... Even if I use the admin interface the result is
> still the same...
>
> I have tried removing stopword and synonym files too  but still i got that
> weird result...
>
> another fact is that i can match those invisible words partially... i mean
> something like this:
>
> I work in Germany.
>
> when i make a search with the word German, I got partial match, which
> should not be matched actually... but when i make a search with Germany, no
> match
>
> -
> Zeki ama calismiyor... Calissa yapar...
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/invisible-words-tp3158060p3161306.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Join and Range Queries

2011-07-09 Thread Lance Norskog
Does the Join feature work with Range queries?

Given a time series of events stored as documents with time ranges, is
it possible to do a search that finds certain events, and then add
other documents whose time ranges overlap?

-- 
Lance Norskog
goks...@gmail.com


Re: Local Search – Rank results by businesses density

2011-07-09 Thread Lance Norskog
Voting precinct maps are a good proxy for this. US voting precincts
tend to have similar numbers of voters, so the number of retail
businesses in each should also be similar. Precincts shrink the closer
you get to downtown.

On Fri, Jul 8, 2011 at 5:38 PM, SR  wrote:
> Hi there,
>
> For local business searches in big cities (e.g., Restaurant in NYC), I’d like 
> to sort the results by the density of the businesses in the underlying 
> neighborhoods (e.g., return x restaurants from the neighborhood that has the 
> highest density of restaurants).
>
> A solution would be to search against each neighborhood in NYC, then return 
> those restaurants that belong to the neighborhood that has the highest 
> density. To my knowledge, this solution requires to have in the geo database 
> all the neighborhoods with their lat & long. Unfortunately, I don't have 
> these data in my database.
>
> I'm writing to see whether some of you have already faced this problem. If 
> so, I'd be thankful if you could share your thoughts with me.
>
> Many thanks,
>
> -SR



-- 
Lance Norskog
goks...@gmail.com


Re: Solr sentiment analysis

2011-07-09 Thread Lance Norskog
There is much more to learn about sentiment analysis than about Solr.
I suggest getting one of these toolkits yourself, writing some code, and
making some charts.

Classification is a two-part process: first make a large dataset of
"positive" & "negative" text and train a model to understand the
difference. Second, use the model to evaluate unknown text. The second
part could be added to Solr as an updating process, to include a
positive/negative score with each document.  The first part, training
the model, is a batch process done outside.

The Weka toolkit has more features than Lingpipe, OpenNLP, UIMA etc. I
would start with that. And learn how to program with R.

Lance

On Fri, Jul 8, 2011 at 8:13 PM, Zheng Qin  wrote:
> Thanks, Bruno and Matthew. I saw that tutorial before and Lingpipe requires
> a license while we are looking at open source solutions. We are not clear
> yet on how to use Solr to do sentiment analysis. Does a NLP or learning tool
> have to be used to accomplish this task? If a tool is needed, how it can be
> integrated with Solr? Then, what are the steps? By using classification? We
> are new to sentiment analysis and any suggestion is welcomed.
>
> On Sat, Jul 9, 2011 at 4:07 AM, Matthew Painter
> wrote:
>
>> Note you can't use lingpipe commercially without a license though I
>> believe.
>>
>> Sent from my iPhone
>>
>> On 8 Jul 2011, at 18:20, Bruno Adam Osiek  wrote:
>>
>> > Try Lingpipe. They use Language Models as their engine for sentiment
>> analysis. At (http://alias-i.com/lingpipe/) you will find a step-by-step
>> tutorial on how to implement it.
>> >
>> > On 07/08/2011 07:14 AM, Zheng Qin wrote:
>> >> Hi,
>> >>
>> >> We are starting a project on Twitter data sentiment analysis. We have
>> >> installed LucidWorks, which also has a Solr admin page. By reading the
>> >> posts, it seems that sentiment analysis can be done by using OpenNLP or
>> >> machine learning (Mahout or Weka). Can you share with us which tool is
>> good
>> >> at classifying positive/negative tweets? Also how to use it together
>> with
>> >> Solr (we only found one posted by Grant on March 16 2010 about
>> integrating
>> >> Solr with Mahout). Your reply will be appreciated. Thanks.
>> >>
>> >
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: How to import dynamic fields

2011-07-01 Thread Lance Norskog
SOLR-1499 is a DIH plugin that reads from another Solr.

https://issues.apache.org/jira/browse/SOLR-1499

It is not in active development, but is being updated to current source trees.

Lance

On Fri, Jul 1, 2011 at 12:51 PM, randolf.julian
 wrote:
> I am trying to import from one SOLR index to another (with different schema)
> using the data import handler via HTTP. However, there are dynamic fields in the
> source that I need to import. In the schema.xml, this field has been
> declared as:
>
>  
>
> When I query SOLR, this comes up:
>
> 2011-05-31T00:00:00Z name="END_DATE_1171485">2011-05-31T00:00:00Z name="END_DATE_14211203">2011-07-26T08:15:25Z name="END_DATE_163969688">2011-05-31T00:00:00Z name="END_DATE_215089986">2011-07-26T08:15:25Z name="END_DATE_355673498">2011-05-31T00:00:00Z name="END_DATE_4329407">2011-07-26T08:15:25Z name="END_DATE_660666924">2011-07-19T21:00:35Z name="END_DATE_669781160">2011-07-26T08:15:25Z name="END_DATE_793694814">2011-07-26T08:15:25Z name="END_DATE_824977178">2011-07-26T08:15:25Z
>
> How can I import these to the other SOLR index using dataimporthandler via
> http?
>
> Thanks,
> Randolf
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-import-dynamic-fields-tp3130553p3130553.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Using RAMDirectoryFactory in Master/Slave setup

2011-06-28 Thread Lance Norskog
Using RAMDirectory really does not help performance. Java garbage
collection has to work around all of the memory taken by the segments.
In practice Solr performs better (for most indexes) without using
the RAMDirectory.



On Sun, Jun 26, 2011 at 2:07 PM, nipunb  wrote:
> PS: Sorry if this is a repost, I was unable to see my message in the mailing
> list - this may have been due to my outgoing email different from the one I
> used to subscribe to the list with.
>
> Overview – Trying to evaluate if keeping the index in memory using
> RAMDirectoryFactory can help query performance. I am trying to perform the
> indexing on the master using solr.StandardDirectoryFactory and make those
> indexes accessible to the slave using solr.RAMDirectoryFactory.
>
> Details:
> We have set up Solr in a master/slave environment. The index is built on the
> master and then replicated to slaves which are used to serve the query.
> The replication is done using the in-built Java replication in Solr.
> On the master, in solrconfig.xml we have
>        <directoryFactory class="solr.StandardDirectoryFactory"/>
>
> On the slave, I tried to use the following in solrconfig.xml:
>
>        <directoryFactory class="solr.RAMDirectoryFactory"/>
>
> My slave shows no data for any queries. In solrconfig.xml it is mentioned
> that replication doesn’t work when using RAMDirectoryFactory, however this (
> https://issues.apache.org/jira/browse/SOLR-1379) mentions that you can use
> it to have the index on disk and then load into memory.
>
> To test the sanity of my set-up, I changed solrconfig.xml in the slave to
>        <directoryFactory class="solr.StandardDirectoryFactory"/>
> and replicated:
> I was able to see the results.
>
> Shouldn’t RAMDirectoryFactory be used for reading index from disk into
> memory?
>
> Any help/pointers in the right direction would be appreciated.
>
> Thanks!
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Using-RAMDirectoryFactory-in-Master-Slave-setup-tp3111792p3111792.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: SolrCloud Feedback

2011-06-11 Thread Lance Norskog
s. While the files in conf are 
>>> normally well below the 1M limit ZK imposes, we should perhaps consider 
>>> using a lightweight distributed object or k/v store for holding the 
>>> /CONFIGS and let ZK store a reference only
>>>
>>> d) How are admins supposed to update configs in ZK? Install their favourite 
>>> ZK editor?
>>>
>>> e) We should perhaps not be so afraid to make ZK a requirement for Solr in 
>>> v4. Ideally you should interact with a 1-node Solr in the same manner as 
>>> you do with a 100-node Solr. An example is the Admin GUI where the "schema" 
>>> and "solrconfig" links assume local file. This requires decent tool support 
>>> to make ZK interaction intuitive, such as "import" and "export" commands.
>>>
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>>
>>> On 19. jan. 2011, at 21.07, Mark Miller wrote:
>>>
>>>> Hello Users,
>>>>
>>>> About a little over a year ago, a few of us started working on what we 
>>>> called SolrCloud.
>>>>
>>>> This initial bit of work was really a combination of laying some base work 
>>>> - figuring out how to integrate ZooKeeper with Solr in a limited way, 
>>>> dealing with some infrastructure - and picking off some low hanging search 
>>>> side fruit.
>>>>
>>>> The next step is the indexing side. And we plan on starting to tackle that 
>>>> sometime soon.
>>>>
>>>> But first - could you help with some feedback? Some people are using our
>>>> SolrCloud start - I have seen evidence of it ;) Some, even in production.
>>>>
>>>> I would love to have your help in targeting what we now try and improve. 
>>>> Any suggestions or feedback? If you have sent this before, I/others likely 
>>>> missed it - send it again!
>>>>
>>>> I know anyone that has used SolrCloud has some feedback. I know it because 
>>>> I've used it too ;) It's too complicated to setup still. There are still 
>>>> plenty of pain points. We accepted some compromise trying to fit into what 
>>>> Solr was, and not wanting to dig in too far before feeling things out and 
>>>> letting users try things out a bit. Thinking that we might be able to 
>>>> adjust Solr to be more in favor of SolrCloud as we go, what is the ideal 
>>>> state of the work we have currently done?
>>>>
>>>> If anyone using SolrCloud helps with the feedback, I'll help with the 
>>>> coding effort.
>>>>
>>>> - Mark Miller
>>>> -- lucidimagination.com
>>>
>>
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: 400 MB Fields

2011-06-07 Thread Lance Norskog
The Salesforce book is 2800 pages of PDF, last I looked.

What can you do with a field that big? Can you get all of the snippets?

On Tue, Jun 7, 2011 at 5:33 PM, Fuad Efendi  wrote:
> Hi Otis,
>
>
> I am recalling "pagination" feature, it is still unresolved (with default
> scoring implementation): even with small documents, searching-retrieving
> documents 1 to 10 can take 0 milliseconds, but from 100,000 to 100,010 can
> take few minutes (I saw it with trunk version 6 months ago, and with very
> small documents, total 100 mlns docs); it is advisable to restrict search
> results to top-1000 in any case (as with Google)...
>
>
>
> I believe things can get wrong; yes, most plain-text retrieved from books
> should be 2kb per page, 500 pages, :=> 1,000,000 bytes (or double it for
> UTF-8)
>
> Theoretically, it doesn't make any sense to index BIG document containing
> all terms from dictionary without any "terms frequency" calcs, but even
> with it... I can't imagine we should index 1000s docs and each is just
> (different) version of whole Wikipedia, should be wrong design...
>
> Ok, use case: index single HUGE document. What will we do? Create index
> with _the_only_ document? And all search will return the same result (or
> nothing)? Paginate it; split into pages. I am pragmatic...
>
>
> Fuad
>
>
>
> On 11-06-07 8:04 PM, "Otis Gospodnetic"  wrote:
>
>>Hi,
>>
>>
>>> I think the question is strange... May be you are wondering about
>>>possible
>>> OOM exceptions?
>>
>>No, that's an easier one. I was more wondering whether with 400 MB Fields
>>(indexed, not stored) it becomes incredibly slow to:
>>* analyze
>>* commit / write to disk
>>* search
>>
>>> I think we can pass to Lucene single document  containing
>>> comma separated list of "term, term, ..." (few billion times)...  Except
>>> "stored" and "TermVectorComponent"...
>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: SOLR Custom datasource integration

2011-05-19 Thread Lance Norskog
What is JPA?

You are better off pulling from JPA yourself than coding with the
DataImportHandler. It will be much easier.

EmbeddedSolr is just like web Solr: when you commit, the data is on the
disk. If you crash during indexing, data that was not committed may or may
not still be there. EmbeddedSolr does not do anything special with index storage.

Lance

On Thu, May 19, 2011 at 2:08 AM, amit.b@gmail.com
 wrote:
> Hi,
>
> We are trying to build an enterprise search solution using SOLR; our data source
> is a database which is interfaced with JPA.
>
> The solution looks like
>
> SOLR INDEX > JPA > Oracle database.
>
> We need help to find out the best approach to integrate the Solr index with
> JPA.
>
> We tried out two approaches.
>
> Approach 1 -
> 1> Populating SolrInputDocument with data from JPA
> 2> Updating EmbeddedSolrServer with the captured data using the SolrJ API.
>
> Approach 2 -
> 1> Customizing the dataimporthandler of HTTPSolrServer
> 2> Retrieving data in the dataimporthandler using a JPA entity.
>
> Functional requirements -
> 1> The solution should be performant for a huge magnitude of data
> 2> It should be scalable
>
> We have a few questions which will help us decide on a solution:
> - We would like to know which approach better meets our requirements.
> - Is it a good idea to integrate with Lucene directly instead of using
>   EmbeddedSolrServer + JPA?
> - If the JVM crashes, will the EmbeddedSolrServer content be lost on reboot?
> - Can we get support from the Jasper Experts team? Can we buy it? How?
>
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SOLR-Custom-datasource-integration-tp2960475p2960475.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: upper limit to boost weight/value ?

2011-05-08 Thread Lance Norskog
There is no upper limit. These are floats. But they can be small too.
Boosts < 1 are 'under normal'.

One radix sorting trick is to boost one field 1000 and another field
5. If the first field is a string facet, this gives each group of
results in one long query. Lucene sorting does radix also, of course,
but sometimes it is not the best tool.




On 5/8/11, Ravi Gidwani  wrote:
> Hello:
>
> Is there any upper limit to the boost weight/value ? For example in the
> following query :
>
> &qf=exact_title^2000+exact_category^1900+exact_tags^1700
>
> are these boost values acceptable and work as expected ?
>
> Thanks,
> ~Ravi Gidwani
>


-- 
Lance Norskog
goks...@gmail.com


Re: How to take differential backup of Solr Index

2011-05-02 Thread Lance Norskog
The Replication feature does this. If you configure a query server as
a 'backup' server, it downloads changes but does not read them.
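A minimal sketch of the pulling side (the host name is a placeholder; this is the
standard ReplicationHandler slave config rather than a special 'backup' mode):

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://indexer-host:8983/solr/replication</str>
      <str name="pollInterval">00:00:60</str>
    </lst>
  </requestHandler>

Point it at the master and it keeps pulling only the changed index files, which is
effectively an incremental backup of the index.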

On Mon, May 2, 2011 at 9:56 PM, Gaurav Shingala
 wrote:
>
> Hi,
>
> Is there any way to take differential backup of Solr Index?
>
> Thanks,
> Gaurav
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Searching performance suffers tremendously during indexing

2011-05-01 Thread Lance Norskog
http://lucene.apache.org/java/3_0_3/api/contrib-misc/org/apache/lucene/index/BalancedSegmentMergePolicy.html

Look in solrconfig.xml for where MergePolicy is configured.
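Roughly, something like this (untested; the class lives in the Lucene misc contrib,
so that jar must be on Solr's classpath) goes where the example solrconfig.xml
already shows a <mergePolicy> element, inside the index settings section:

  <mergePolicy class="org.apache.lucene.index.BalancedSegmentMergePolicy"/>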

On Sun, May 1, 2011 at 6:31 PM, Lance Norskog  wrote:
> Yes, indexing generally slows down querying. Most sites do indexing in
> one Solr and queries from another. The indexing system does 'merging',
> which involves copying data around in files.
>
> With 10g allocated to a 1g load, the JVM is doing a lot more garbage
> collection than it would with a 1.5g allocation. Suggest dropping the
> memory allocation. Also, the operating system is very good at managing
> the disk cache, and this actually works better than Solr caching the
> index data.
>
> The main problem is index merging. If you are on 3.1, there is an
> alternate merging "policy" called the BalancedSegmentMergePolicy. This
> is fine-tuned to mix indexing and querying on one Solr; it was written
> at LinkedIn because their web-facing searchers do indexing & searching
> simultaneously.
>
> 2011/5/1 François Schiettecatte :
>> Couple of things. One, you are not swapping, which is a good thing. Second (and
>> I am not sure what delay you selected for dstat, I would assume the default 
>> of 1 second) there is some pretty heavy write activity like this:
>>
>> 26   1  71   2   0   0 |4096B 1424k|   0     0 | 719   415 | 197M  11G|1.00  
>> 46.0 |4.0 9.0   0  13
>>
>> where you are writing out 1.4GB for example, this is happening pretty 
>> regularly so I suspect you are swamping your drive.
>>
>> You might also want to run atop and check the drive busy percentage, I would 
>> guess that you hitting high percentages.
>>
>> François
>>
>> On May 1, 2011, at 4:29 PM, Daniel Huss wrote:
>>
>>>
>>> Thanks for the tool recommendation! This is the dstat output during
>>> commit bombardment / concurrent search requests:
>>>
>>> total-cpu-usage -dsk/total- ---paging-- ---system-- swap---
>>> --io/total- ---file-locks--
>>> usr sys idl wai hiq siq| read  writ|  in   out | int   csw | used  free|
>>> read  writ|pos lck rea wri
>>> 11   1  87   1   0   0|1221k  833k| 538B  828B| 784   920 | 197M
>>> 11G|16.8  15.5 |4.0 9.0   0  13
>>> 60   0  40   0   0   0|   0     0 |   0     0 | 811   164 | 197M
>>> 11G|   0     0 |4.0 9.0   0  13
>>> 25   0  75   0   0   0|   0     0 |   0     0 | 576    85 | 197M
>>> 11G|   0     0 |4.0 9.0   0  13
>>> 25   0  75   0   0   0|   0     0 |   0     0 | 572    90 | 197M
>>> 11G|   0     0 |4.0 9.0   0  13
>>> 25   0  74   0   0   0|   0     0 |   0     0 | 730   204 | 197M
>>> 11G|   0     0 |4.0 9.0   0  13
>>> 26   1  71   2   0   0|4096B 1424k|   0     0 | 719   415 | 197M
>>> 11G|1.00  46.0 |4.0 9.0   0  13
>>> 31   1  68   0   0   0|   0   136k|   0     0 | 877   741 | 197M
>>> 11G|   0  6.00 |5.0 9.0   0  14
>>> 70   6  24   0   0   0|   0   516k|   0     0 |1705  1027 | 197M
>>> 11G|   0  46.0 |5.0  11 1.0  15
>>> 72   3  25   0   0   0|4096B  384k|   0     0 |1392   910 | 197M
>>> 11G|1.00  25.0 |5.0 9.0   0  14
>>> 60   2  25  12   0   0| 688k  108k|   0     0 |1162   509 | 197M
>>> 11G|79.0  9.00 |4.0 9.0   0  13
>>> 94   1   5   0   0   0| 116k    0 |   0     0 |1271   654 | 197M
>>> 11G|4.00     0 |4.0 9.0   0  13
>>> 57   0  43   0   0   0|   0     0 |   0     0 |1076   238 | 197M
>>> 11G|   0     0 |4.0 9.0   0  13
>>> 26   0  73   0   0   0|   0    16k|   0     0 | 830   188 | 197M
>>> 11G|   0  2.00 |4.0 9.0   0  13
>>> 29   1  70   0   0   0|   0     0 |   0     0 |1088   360 | 197M
>>> 11G|   0     0 |4.0 9.0   0  13
>>> 29   1  70   0   0   1|   0   228k|   0     0 | 890   590 | 197M
>>> 11G|   0  21.0 |4.0 9.0   0  13
>>> 81   6  13   0   0   0|4096B 1596k|   0     0 |1227   441 | 197M
>>> 11G|1.00  52.0 |5.0 9.0   0  14
>>> 48   2  48   1   0   0| 172k    0 |   0     0 | 953   292 | 197M
>>> 11G|21.0     0 |5.0 9.0   0  14
>>> 25   0  74   0   0   0|   0     0 |   0     0 | 808   222 | 197M
>>> 11G|   0     0 |5.0 9.0   0  14
>>> 25   0  74   0   0   0|   0     0 |   0     0 | 607    90 | 197M
>>> 11G|   0     0 |5.0 9.0   0  14
>>> 25   0  75   0   0   0|   0     0 |   0     0 | 603   106 | 197M
>>> 11G|   0     0 |5.0 9.0   0  14
>>> 25   0  75   0   0   0|   0   144k|   0     0 | 625   104 | 197M
>>> 11G|   0  7.00 |5.0 9.0   0  14
>>> 85   3   9   2   0   0| 248k   92k|   0     0 |1441   887 | 197M
&

Re: solr sorting problem

2011-05-01 Thread Lance Norskog
Two scenarios:
1) You call the indexing API but do not call 'commit'. The commit is what
makes new data visible to searching.
2) The default solrconfig has HTTP caching turned on, so you redo the
search URL and you get the old result. This is really, really
annoying, and you have to change the configuration to specifically not
cache, as shown in the comments.
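For (2), the stock fix is in the <requestDispatcher> section of solrconfig.xml
(a sketch; see the comments there for the alternatives):

  <httpCaching never304="true"/>

With never304 set, Solr stops sending cache-validator headers and 304 responses, so
clients always get a freshly computed result.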

On Sun, May 1, 2011 at 6:44 AM, Pratik  wrote:
> Hello,
>
> I got over that problem but now I am facing a new problem.
>
> Indexing works but search does not.
>
> I used the following line in the schema:-
> 
> and
> 
>
> I'm trying to use the default "alphaOnlySort" in the sample schema.xml.
> Database is MySQL, there is a column/field named ColXYZ
> My data-config looks like :-
> 
> 
>
> In which scenarios would Solr index the records/documents but searching
> would not work?
>
> Thanks --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/solr-sorting-problem-tp486144p2886248.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Searching performance suffers tremendously during indexing

2011-05-01 Thread Lance Norskog
 if
>>>>> bumping up the ram (-Xms512m -Xmx1024m) alleviates it.  Even if this isn't
>>>>> the fix, you can at least isolate if it's a memory issue, or if your issue
>>>>> is related to a disk I/O issue (e.g. running optimization on every 
>>>>> commit).
>>>>>
>>>>>
>>>>> Also, is worth having a look in your logs to see if the server is having
>>>>> complaints about memory or issues with your schema, or some other 
>>>>> unexpected
>>>>> issue.
>>>>>
>>>>> A resource that has been helpful for me
>>>>> http://wiki.apache.org/solr/SolrPerformanceFactors
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -Original Message-
>>>>> From: Daniel Huss [mailto:hussdl1985-solrus...@yahoo.de]
>>>>> Sent: Sunday, 1 May 2011 5:35 AM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Searching performance suffers tremendously during indexing
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> our Solr-based search is unresponsive while documents are being indexed.
>>>>> The documents to index (results of a DB query) are sent to Solr by a
>>>>> daemon in batches of varying size. The number of documents per batch may
>>>>> vary between one and several hundreds of thousands.
>>>>>
>>>>> Before investigating any further, I would like to ask if this can be
>>>>> considered an issue at all. I was expecting Solr to handle concurrent
>>>>> indexing/searching quite well, in fact this was one of the main reasons
>>>>> for chosing Solr over the searching capabilities of our RDMS.
>>>>>
>>>>> Is searching performance *supposed* to drop while documents are being
>>>>> indexed?
>>>>>
>>>>>
>>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Automatic synonyms for multiple variations of a word

2011-04-25 Thread Lance Norskog
This has come up with stemming: you can stem your synonym list with
the FieldAnalyzer Solr http call, then save the final chewed-up terms
as a new synonym file. You then use that one in the analyzer stack
below the stemmer filter.
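A minimal sketch of the index-time analyzer after doing that (stemmed-synonyms.txt
is the pre-stemmed file you generated; the filters are the stock ones, adjust to
your schema):

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="stemmed-synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>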

On Mon, Apr 25, 2011 at 9:15 PM, Otis Gospodnetic
 wrote:
> Hi Otis & Robert,
>
>  - Original Message 
>
>>
>> How do people handle cases where synonyms are used and there are  multiple
>> version of the original word that really need to point to the same  set of
>> synonyms?
>>
>> For example:
>> Consider singular and plural of the  word "responsibility".  One might have
>> synonyms defined like  this:
>>
>>   responsibility, obligation, duty
>>
>> But the plural  "responsibilities" is not in there, and thus it will not get
>> expanded to the  synonyms above! That's a problem.
>>
>> Sure, one could change the synonyms  file to look like this:
>>
>>   responsibility, responsibilities,  obligation, duty
>>
>> But that means somebody needs to think of all variations  of the word!
>
> Yes, that seems to be the case now, as it was in 2008:
> http://search-lucene.com/m/gLwUCV0qU02&subj=Re+Synonyms+and+stemming+revisited
> http://search-lucene.com/m/7lqdp1ldrqx (Hoss replied, but I think that
> suggestion doesn't actually work)
>
>> Is there a something one can do to get all variations of  the word to map to
>>the
>>
>> same synonyms without having to explicitly specify  all variations of the
> word?
>
> I think this is where Robert's 2+2lemma pointer may help because the 2+lemma
> list contains "records" where a headword is followed by a list of other
> variations of the word.  The way I think this would help is by simply taking
> that list and turning it into the synonyms file format, and then merging in 
> the
> actual synonyms.
>
> For example, if I have the word "responsibility", then from 2+2lemma I should 
> be
> able to get that "responsibilities" is one of the variants of 
> "responsibility".
> I should then be able to take those 2 words and stick them in synonyms file 
> like
> this:
>
>  responsibility, responsibilities
>
> And then append actual synonyms to that:
>
>  responsibility, responsibilities, obligation, duty
>
> But I may then need to actually expand synonyms themselves, too (again using
> data from 2+2lemma):
>
>  responsibility, responsibilities, obligation, obligations, duty, duties
>
>
> I haven't tried this yet.  Just theorizing and hoping for feedback.
>
> Does this sound about right?
>
> Thanks,
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: tika/pdfbox knobs & levers

2011-04-14 Thread Lance Norskog
Tika creates document-level metadata and text from the input file.
That's it. If you want to use PDFbox directly, you need your own Solr
plugin.

On 4/13/11, Markus Jelsma  wrote:
> Hi,
>
> I'm not sure how Solr allows for adjusting these Tika settings to get the
> desired output. At least a few desirable Tika subsystems cannot be called
> from
> the ExtractingRequestHandler such as Tika's BoilerPlateContentHandler. I'm
> also not really sure if it's a good idea to normalize diacritics in Tika
> output, this way the stored data would also be normalized which is not
> desirable.
>
> You can, however, normalize diacritics in your field analyzer. This way your
> search is normalized but the returned data still holds diacritics which is
> good.
>
> Cheers,
>
>> Hi all,
>>
>> I'm wondering if there are any knobs or levers i can set in
>> solrconfig.xml that affect how pdfbox text extraction is performed by
>> the extraction handler. I would like to take advantage of pdfbox's
>> ability to normalize diacritics and ligatures [1], but that doesn't
>> seem to be the default behavior. Is there a way to enable this?
>>
>> Thanks,
>> --jay
>>
>> [1]
>> http://pdfbox.apache.org/apidocs/index.html?org/apache/pdfbox/util/TextNor
>> malize.html
>


-- 
Lance Norskog
goks...@gmail.com


Re: Vetting Our Architecture: 2 Repeaters and Slaves.

2011-04-14 Thread Lance Norskog
y first time working
>>> with Solr, so I want to be sure what I'm planning isn't totally weird,
>>> unsupported, etc.
>>>
>>> We've got a a pair of F5 loadbalancers and 4 hosts.  2 of those hosts
>>>will
>>> be repeaters (master+slave), and 2 of those hosts will be pure slaves.
>>>One
>>> of the F5 vips, "Index-vip" will have members HOST1 and HOST2, but HOST2
>>> will be "downed" and not taking traffic from that vip.  The second vip,
>>> "Search-vip" will have 3 members: HOST2, HOST3, and HOST4.  The
>>> "Index-vip" is intended to be used to post and commit index changes.
>>>The
>>> "Search-vip" is intended to be customer facing.
>>>
>>> Here is some ASCII art.  The line with the "X"'s thru it denotes a
>>> "downed" member of a vip, one that isn't taking any traffic.  The "M:"
>>> denotes the value in the solrconfig.xml that the host uses as the
>>>master.
>>>
>>>
>>>  Index-vip Search-vip
>>> / \ /   |   \
>>>/   X   /|\
>>>   / \ / | \
>>>  /   X   /  |  \
>>> / \ /   |   \
>>>/   X   /|\
>>>   / \ / | \
>>> HOST1  HOST2  HOST3  HOST4
>>>   REPEATERREPEATERSLAVE  SLAVE
>>>  M:Index-vipM:Index-vip M:Index-vip  M:Index-vip
>>>
>>>
>>> I've been working through a couple failure scenarios.  Recovering from a
>>> failure of HOST2, HOST3, or HOST4 is pretty straightforward.  Losing
>>> HOST1 is my major concern.  My plan for recovering from a failure of
>>>HOST1
>>> is as follows: Enable HOST2 as a member of the Index-vip, while
>>>disabling
>>> member HOST1.  HOST2 effectively becomes the Master.  HOST2, 3, and 4
>>> continue fielding customer requests and pulling indexes from
>>>"Index-vip."
>>> Since HOST2 is now in charge of crunching indexes and fielding customer
>>> requests, I assume load will increase on that box.
>>>
>>> When we recover HOST1, we will simply make sure it has replicated
>>>against
>>> "Index-vip" and then re-enable HOST1 as a member of the Index-vip and
>>> disable HOST2.
>>>
>>> Hopefully this makes sense.  If all goes correctly, I've managed to keep
>>> all services up and running without losing any index data.
>>>
>>> So, I have a few questions:
>>>
>>> 1. Has anyone else tried this dual repeater approach?
>>> 2. Am I going to have any semaphore/blocking issues if a repeater is
>>> pulling index data from itself?
>>> 3. Is there a better way to do this?
>>>
>>>
>>> Thanks,
>>> Parker
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>
>
>
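For reference, a repeater in this setup is just a node whose ReplicationHandler is configured as both master and slave; a rough sketch, with the VIP host name below as a placeholder:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
  <lst name="slave">
    <!-- every node, repeaters included, polls through the Index-vip -->
    <str name="masterUrl">http://index-vip:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>

When the active repeater ends up polling itself through the VIP, the index-version check should find nothing newer to copy, but that is exactly question 2 above and worth testing.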


-- 
Lance Norskog
goks...@gmail.com


Re: Solr 3.1 performance compared to 1.4.1

2011-04-11 Thread Lance Norskog
Marius: "I have copied the configuration from 1.4.1 to the 3.1."

Does the Directory implementation show up in the JMX beans? In
admin/statistics.jsp ? Or the Solr startup logs? (Sorry, don't have a
Solr available.)

Yonik:
> What platform are you on?  I believe the Lucene Directory
> implementation now tries to be smarter (compared to lucene 2.9) about
> picking the best default (but it may not be working out for you for
> some reason)

Lance

On Sun, Apr 10, 2011 at 12:46 PM, Yonik Seeley
 wrote:
> On Fri, Apr 8, 2011 at 9:53 AM, Marius van Zwijndregt
>  wrote:
>> Hello !
>>
>> I'm new to the list, have been using SOLR for roughly 6 months and love it.
>>
>> Currently i'm setting up a 3.1 installation, next to a 1.4.1 installation
>> (Ubuntu server, same JVM params). I have copied the configuration from 1.4.1
>> to the 3.1.
>> Both versions are running fine, but one thing I've noticed is that the QTime
>> on 3.1, is much slower for initial searches than on the (currently
>> production) 1.4.1 installation.
>>
>> For example:
>>
>> Searching with 3.1; http://mysite:9983/solr/select?q=grasmaaier: QTime
>> returns 371
>> Searching with 1.4.1: http://mysite:8983/solr/select?q=grasmaaier: QTime
>> returns 59
>>
>> Using debugQuery=true, i can see that the main time is spend in the query
>> component itself (org.apache.solr.handler.component.QueryComponent).
>>
>> Can someone explain this, and how can i analyze this further ? Does it take
>> time to build up a decent query, so could i switch to 3.1 without having to
>> worry ?
>
> Thanks for the report... there's no reason that anything should really
> be much slower, so it would be great to get to the bottom of this!
>
> Is this using the same index as the 1.4.1 server, or did you rebuild it?
>
> Are there any other query parameters (that are perhaps added by
> default, like faceting or anything else that could take up time) or is
> this truly just a term query?
>
> What platform are you on?  I believe the Lucene Directory
> implementation now tries to be smarter (compared to lucene 2.9) about
> picking the best default (but it may not be working out for you for
> some reason).
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>



-- 
Lance Norskog
goks...@gmail.com


Re: Tika, Solr running under Tomcat 6 on Debian

2011-04-11 Thread Lance Norskog
Ah! Did you set the UTF-8 parameter in Tomcat?

On Mon, Apr 11, 2011 at 2:49 AM, Mike  wrote:
> Hi Roy,
>
> Thank you for the quick reply. When i tried to index the PDF file i was able
> to see the response:
>
>
> 0
> 479
>
>
>
> Query:
> http://localhost:8080/solr/update/extract?stream.file=D:\mike\lucene\apache-solr-1.4.1\example\exampledocs\Struts%202%20Design%20and%20Programming1.pdf&stream.contentType=application/pdf&literal.id=Struts%202%20Design%20and%20Programming1.pdf&defaultField=text&commit=true
>
> But when i tried to search the content in the pdf i could not get any
> results:
>
>
>
> 0
> 2
> −
>
> on
> 0
> struts
> 10
> 2.2
>
>
>
>
>
> Could you please let me know if I am doing anything wrong. It works fine
> when i tried with default jetty server prior to integrating on the tomcat6.
>
> I have followed installation steps from
> http://wiki.apache.org/solr/SolrTomcat
> (Tomcat on Windows Single Solr app).
>
> Thanks,
> Mike
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Tika-Solr-running-under-Tomcat-6-on-Debian-tp993295p2805974.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Indexing Best Practice

2011-04-11 Thread Lance Norskog
SOLR-1499 is a plug-in for the DIH that uses Solr as a DataSource.
This means that you can read the database and PDFs separately. You
could index all of the PDF content in one DIH script. Then, when
there's a database update, you have a separate DIH scripts that reads
the old row from Solr, and pulls the stripped text from the PDF, and
then re-indexes the whole thing. This would cut out the need to
reparse the PDF.

Lance
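A very rough sketch of that flow, assuming the SolrEntityProcessor from the SOLR-1499 patch is applied (attribute names vary between patch versions, and the field and parameter names here are only placeholders):

<document>
  <!-- fresh database data for the changed row -->
  <entity name="row" query="SELECT id, title, permissions FROM docs
                            WHERE id='${dataimporter.request.id}'">
    <!-- pull the previously extracted PDF text back out of Solr
         instead of decrypting and re-parsing the PDF -->
    <entity name="old" processor="SolrEntityProcessor"
            url="http://localhost:8983/solr"
            query="id:${row.id}"
            fl="pdf_text"/>
  </entity>
</document>

This only works if the stripped PDF text is kept as a stored field in the index.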

On Mon, Apr 11, 2011 at 8:48 AM, Shaun Campbell
 wrote:
> If it's of any help I've split the processing of PDF files from the
> indexing. I put the PDF content into a text file (but I guess you could load
> it into a database) and use that as part of the indexing.  My processing of
> the PDF files also compares timestamps on the document and the text file so
> that I'm only processing documents that have changed.
>
> I am a newbie so perhaps there's more sophisticated approaches.
>
> Hope that helps.
> Shaun
>
> On 11 April 2011 07:20, Darx Oman  wrote:
>
>> Hi guys
>>
>> I'm wondering how to best configure solr to fulfills my requirements.
>>
>> I'm indexing data from 2 data sources:
>> 1- Database
>> 2- PDF files (password encrypted)
>>
>> Every file has related information stored in the database.  Both the file
>> content and the related database fields must be indexed as one document in
>> solr.  Among the DB data is *per-user* permissions for every document.
>>
>> The file contents nearly never change, on the other hand, the DB data and
>> especially the permissions change very frequently which require me to
>> re-index everything for every modified document.
>>
>> My problem is in process of decrypting the PDF files before re-indexing
>> them
>> which takes too much time for a large number of documents, it could span to
>> days in full re-indexing.
>>
>> What I'm trying to accomplish is eliminating the need to re-index the PDF
>> content if not changed even if the DB data changed.  I know this is not
>> possible in solr, because solr doesn't update documents.
>>
>> So how to best accomplish this:
>>
>> Can I use 2 indexes one for PDF contents and the other for DB data and have
>> a common id field for both as a link between them, *and results are treated
>> as one Document*?
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr under Tomcat

2011-04-11 Thread Lance Norskog
Hi Mike-

Please start a new thread for this.

On Mon, Apr 11, 2011 at 2:47 AM, Mike  wrote:
> Hi All,
>
> I have installed solr instance on tomcat6. When i tried to index the PDF
> file i was able to see the response:
>
>
> 0
> 479
>
>
> Query:
> http://localhost:8080/solr/update/extract?stream.file=D:\mike\lucene\apache-solr-1.4.1\example\exampledocs\Struts%202%20Design%20and%20Programming1.pdf&stream.contentType=application/pdf&literal.id=Struts%202%20Design%20and%20Programming1.pdf&defaultField=text&commit=true
>
> But when i tried to search the content in the pdf i could not get any
> results:
>
>
>
> 0
> 2
> −
>
> on
> 0
> struts
> 10
> 2.2
>
>
>
>
>
> Could you please let me know if I am doing anything wrong. It works fine
> when i tried with default jetty server prior to integrating on the tomcat6.
>
> I have followed installation steps from
> http://wiki.apache.org/solr/SolrTomcat
> (Tomcat on Windows Single Solr app).
>
> Thanks,
> Mike
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-under-Tomcat-tp2613501p2805970.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1

2011-04-11 Thread Lance Norskog
The DIH has multi-threading. You can have one thread fetching files
and then give them to different threads.
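A sketch of how that is expressed in the trunk DIH of the time, via the entity-level "threads" attribute (paths, the forEach expression, and the per-file processor details are placeholders, and the feature was still rough around the edges):

<entity name="files" processor="FileListEntityProcessor"
        baseDir="/data/xml" fileName=".*\.xml"
        rootEntity="false" threads="4">
  <entity name="doc" processor="XPathEntityProcessor"
          url="${files.fileAbsolutePath}" forEach="/record" stream="true">
    ...
  </entity>
</entity>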

On Mon, Apr 11, 2011 at 11:40 AM,   wrote:
> Hi Lance,
>
> I used XPathEntityProcessor with the "xsl" attribute and generated an XML file "in
> the form of the standard Solr update schema".
> I lost a lot of performance; it is a pity that XPathEntityProcessor only uses
> one thread.
>
> My tests with a collection of 350T documents:
> 1. use of XPathRecordReader without xslt:  28min
> 2. use of XPathEntityProcessor with xslt (Standard solr-war / Xalan): 44min
> 3. use of XPathEntityProcessor with saxon-xslt: 36min
>
>
> Best regards
>  Karsten
>
>
>
>  Lance
>> There is an option somewhere to use the full XML DOM implementation
>> for using xpaths. The purpose of the XPathEP is to be as simple and
>> dumb as possible and handle most cases: RSS feeds and other open
>> standards.
>>
>> Search for xsl(optional)
>>
>> http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1
>>
> --karsten
>> > Hi Folks,
>> >
>> > does anyone improve DIH XPathRecordReader to deal with nested xpaths?
>> > e.g.
>> > data-config.xml with
>> >   <field column="alltext" xpath="//body" flatten="true"/>
>> >   <field column="title" xpath="//body/h1"/>
>> > and the XML stream contains
>> >  /html/body/h1...
>> > will only fill field “alltext” but field “title” will be empty.
>> >
>> > This is a known issue from 2009
>> >
>> https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose
>> >
>> > So three questions:
>> > 1. How to fill a “search over all”-Field without nested xpaths?
>> >   (schema.xml <copyField> will not help,
>> >   because we lose the original token order)
>> > 2. Does anyone try to improve XPathRecordReader to deal with nested
>> xpaths?
>> > 3. Does anyone else need this feature?
>> >
>> >
>> > Best regards
>> >  Karsten
>> >
>
> http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html
>



-- 
Lance Norskog
goks...@gmail.com


Re: DIH OutOfMemoryError?

2011-04-10 Thread Lance Norskog
Make sure streaming is on.
Try using autoCommit in solrconfig.xml. This will push documents out
of memory onto disk at a regular interval.
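A sketch of both knobs (the numbers are only examples, and the stream flag is the XPathEntityProcessor streaming switch referred to above):

<!-- data-config.xml: stream the big file instead of buffering it in memory -->
<entity name="doc" processor="XPathEntityProcessor" stream="true" ... >

<!-- solrconfig.xml: commit buffered documents at a regular interval -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime> <!-- milliseconds -->
  </autoCommit>
</updateHandler>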

On Thu, Mar 31, 2011 at 8:51 AM, Markus Jelsma
 wrote:
> Try splitting the files into smaller chunks. It'll help.
>
>> Hi,
>>
>> I'm trying to index a big XML file (800Mo) using DIH, but i'm getting an
>> OutOfMemoryError!
>>
>> I've got 2048mo of RAM on this server, obviously it's not enough... How
>> much RAM is recomended for indexing big files?
>>
>> Thanks for your help
>>
>>
>> Here is the error from DIH mode verbose:
>>
>> java.lang.ClassCastException:
>> java.lang.OutOfMemoryError cannot be cast to java.lang.Exception
>>      at
>> org.apache.solr.handler.dataimport.DebugLogger.log(DebugLogger.java:139)
>>      at
>> org.apache.solr.handler.dataimport.SolrWriter.log(SolrWriter.java:237)
>>      at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java
>> :422) at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java
>> :383) at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:24
>> 2) at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
>>      at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.j
>> ava:331) at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:38
>> 9) at
>> org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(Data
>> ImportHandler.java:203) at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase
>> .java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>>      at
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:
>> 338) at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java
>> :241) at
>> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Applicatio
>> nFilterChain.java:235) at
>> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterC
>> hain.java:206) at
>> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.j
>> ava:233) at
>> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.j
>> ava:191) at
>> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:12
>> 7) at
>> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:10
>> 2) at
>> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.jav
>> a:109) at
>> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
>>      at
>> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
>>      at
>> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Htt
>> p11Protocol.java:588) at
>> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
>>      at java.lang.Thread.run(Thread.java:636)
>> 
>



-- 
Lance Norskog
goks...@gmail.com


Re: Concatenate multivalued DIH fields

2011-04-10 Thread Lance Norskog
The XPathEntityProcessor allows you to use an external XSL transform
file. In that you can do anything you want. Another option is to use
the script transformer:

http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer
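A sketch of the ScriptTransformer route for the first/last name case; the column names are guesses and would need to match the real XML fields:

<dataConfig>
  <script><![CDATA[
    function concatAuthor(row) {
      var first = row.get('firstName');   // hypothetical columns
      var last  = row.get('lastName');
      if (first != null && last != null) {
        row.put('author', first + ' ' + last);
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="authors" transformer="script:concatAuthor" ... >
      ...
    </entity>
  </document>
</dataConfig>

For a multi-valued source the row values come back as a java.util.List, so the function would loop over the two lists and put a list of concatenated names back.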

On Wed, Apr 6, 2011 at 12:16 PM, alexei  wrote:
> Hi Everyone,
>
> I am having an identical problem with concatenating author's first and last
> names stored in an xml blob.
> Because this field is multivalued copyfield does not work.
>
> Does anyone have a solution?
>
> Regards,
> Alexei
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Concatenate-multivalued-DIH-fields-tp2749988p2786506.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solrj performance bottleneck

2011-04-10 Thread Lance Norskog
There is a separate auto-suggest tool that creates a simple in-memory
database outside of the Lucene index. This is called TST.
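For reference, the 3.1-era wiring for that component looks roughly like this; the source field is a placeholder:

<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">title</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.count">5</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

Because the TST structure is built in memory from indexed terms, lookups avoid the main index entirely, which is what makes it suitable for per-keystroke traffic.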

On Tue, Apr 5, 2011 at 3:36 AM, rahul  wrote:
> Thanks Stefan and Victor ! we are using GWT for front end. We stopped issuing
> multiple asynchronous queries and issue a request and fetch results and then
> filter the results based on what has
> been typed subsequent to the request and then re trigger the request only if
> we don't get the expected results.
>
> Thanks Victor, I appreciate the link to the Jquery example and we will look
> into it as a reference.
>
> Regards,
> Rahul.
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solrj-performance-bottleneck-tp2682797p2779387.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: difference between geospatial search from database angle and from solr angle

2011-04-10 Thread Lance Norskog
Wait! How can you do distance calculations across different shards efficiently?

On Thu, Apr 7, 2011 at 7:19 AM, Smiley, David W.  wrote:
> I haven't used PostGIS so I can't offer a real comparison. I think if you 
> were to try out both, you'd be impressed with Solr's performance/scalability 
> thanks in large part to its sharding.  But for "functionality richness" in so 
> far as geospatial is concerned, that's where Solr currently comes short. It 
> just has the basic stuff 80% of people want.
>
> ~ David Smiley
> Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/
>
> On Apr 7, 2011, at 2:24 AM, Sean Bigdatafun wrote:
>
>> Thanks, David.
>>
>> I am thinking of a scenario that billions of objects, whose indices are too
>> big for a single machine to serve the indexing, to serve the querying. Is
>> there any sharding mechanism?
>>
>>
>> Can you give a comparison between solr-based geospatial search and PostGIS
>> based geospatial search?
>>          * scalability
>>          * functionality richness
>>          * incremental indexing (re-indexing) cost
>>          * query cost
>>          * sharding scheme support
>
>
>
>
>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Problems indexing very large set of documents

2011-04-10 Thread Lance Norskog
There is a library called iText. It parses and writes PDFs very very
well, and a simple program will let you do a batch conversion.  PDFs
are made by a wide range of programs, not just Adobe code. Many of
these do weird things and make small mistakes that Tika does not know
to handle. In other words there is "dirty PDF" just like "dirty HTML".

A percentage of PDFs will fail and that's life. One site that gets
press releases from zillions of sites (and thus a wide range of PDF
generators) has a 15% failure rate with Tika.

Lance

On Fri, Apr 8, 2011 at 9:44 AM, Brandon Waterloo
 wrote:
> I think I've finally found the problem.  The files that work are PDF version 
> 1.6.  The files that do NOT work are PDF version 1.4.  I'll look into 
> updating all the old documents to PDF 1.6.
>
> Thanks everyone!
>
> ~Brandon Waterloo
> 
> From: Ezequiel Calderara [ezech...@gmail.com]
> Sent: Friday, April 08, 2011 11:35 AM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> Maybe those files are created with a different Adobe Format version...
>
> See this: 
> http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo 
> mailto:brandon.water...@matrix.msu.edu>> 
> wrote:
> A second test has revealed that it is something to do with the contents, and 
> not the literal filenames, of the second set of files.  I renamed one of the 
> second-format files and tested it and Solr still failed.  However, the 
> problem still only applies to those files of the second naming format.
> 
> From: Brandon Waterloo 
> [brandon.water...@matrix.msu.edu<mailto:brandon.water...@matrix.msu.edu>]
> Sent: Friday, April 08, 2011 10:40 AM
> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> Subject: RE: Problems indexing very large set of documents
>
> I had some time to do some research into the problems.  From what I can tell, 
> it appears Solr is tripping up over the filename.  These are strictly 
> examples, but, Solr handles this filename fine:
>
> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>
> However, it fails with either a parsing error or an EOF exception on this 
> filename:
>
> 32-130-A08-84-al.sff.document.nusa197102.pdf
>
> The only significant difference is that the second filename contains multiple 
> periods.  As there are about 1700 files whose filenames are similar to the 
> second format it is simply not possible to change their filenames.  In 
> addition they are being used by other applications.
>
> Is there something I can change in Solr configs to fix this issue or am I 
> simply SOL until the Solr dev team can work on this? (assuming I put in a 
> ticket)
>
> Thanks again everyone,
>
> ~Brandon Waterloo
>
>
> 
> From: Chris Hostetter 
> [hossman_luc...@fucit.org<mailto:hossman_luc...@fucit.org>]
> Sent: Tuesday, April 05, 2011 3:03 PM
> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> Subject: RE: Problems indexing very large set of documents
>
> : It wasn't just a single file, it was dozens of files all having problems
> : toward the end just before I killed the process.
>       ...
> : That is by no means all the errors, that is just a sample of a few.
> : You can see they all threw HTTP 500 errors.  What is strange is, nearly
> : every file succeeded before about the 2200-files-mark, and nearly every
> : file after that failed.
>
> ..the root question is: do those files *only* fail if you have already
> indexed ~2200 files, or do they fail if you start up your server and index
> them first?
>
> there may be a resource issued (if it only happens after indexing 2200) or
> it may just be a problem with a large number of your PDFs that your
> iteration code just happens to get to at that point.
>
> If it's the former, then there may e something buggy about how Solr is
> using Tika to cause the problem -- if it's the later, then it's a straight
> Tika parsing issue.
>
> : > now, commit is set to false to speed up the indexing, and I'm assuming 
> that
> : > Solr should be auto-committing as necessary.  I'm using the default
> : > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once
>
> solr does no autocommitting by default, you need to check your
> solrconfig.xml
>
>
> -Hoss
>
>
>
> --
> __
> Ezequiel.
>
> Http://www.ironicnet.com
>



-- 
Lance Norskog
goks...@gmail.com


Re: How to index PDF file stored in SQL Server 2008

2011-04-10 Thread Lance Norskog
You have to upgrade completely to the Apache Solr 3.1 release. It is
worth the effort. You cannot copy any jars between Solr releases.
Also, you cannot copy over jars from newer Tika releases.

On Fri, Apr 8, 2011 at 10:47 AM, Darx Oman  wrote:
> Hi again
> what you are missing is field mapping
> 
> 
>
>
> no need for TikaEntityProcessor  since you are not accessing pdf files
>



-- 
Lance Norskog
goks...@gmail.com


Re: Lucid Works

2011-04-10 Thread Lance Norskog
Just to be clear, we are talking about two different Lucid Imagination products.

The Certified Distribution is a repackaging of the public Solr
releases with various add-on goodies that Lucid and others have
written over the years. This is the "drop-in replacement" for the
Apache release of Solr.

LucidWorks is a commercial product with a free download for
developers. LucidWorks has a core server and an external Ruby-based
UI.

On Fri, Apr 8, 2011 at 3:14 PM, Erik Hatcher  wrote:
>
> On Apr 8, 2011, at 17:32 , Mark wrote:
>
>> How come this new version is bundled with rails and why is there no .war 
>> output format?
>
> Rails, via JRuby, is used in LucidWorks Enterprise for both the admin and 
> search interfaces. (and also powers the Alerts REST API).
>
>> I wanted a simple drop in replacement for my current war :(
>
> LucidWorks Enterprise is a standalone install, and intentionally controls the 
> full environment (using Jetty).  While it *is* Solr, it is also a lot more 
> around it and a product we are supporting thoroughly.   It's better for us, 
> and our customers, to have as controlled an environment as possible.
>
> As a search engine user, especially with LucidWorks Enterprise, the idea is 
> to speak an API to it (a REST control API, as well as standard Solr HTTP 
> interface)... making dropping in newer versions relatively straightforward.
>
> [an aside, there are plans to release a LucidWorks for Solr 3.1/.x certified 
> distribution too - I can't speak to the timeline of such a release though]
>
>        Erik
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr architecture diagram

2011-04-10 Thread Lance Norskog
Very cool! "The Life Cycle of the IndexSearcher" would also be a great
diagram. The whole dance that happens during a commit is hard to
explain. Also, it would help show why garbage collection can act up
around commits.

Lance

On Sun, Apr 10, 2011 at 2:05 AM, Jan Høydahl  wrote:
>> Looks really good, but two bits that i think might confuse people are
>> the implications that a "Query Parser" then invokes a series of search
>> components; and that "analysis" (and the pieces of an analyzer chain)
>> are what to lookups in the underlying lucene index.
>>
>> the first might just be the ambiguity of "Query" .. using the term
>> "request parser" might make more sense, in comparison to the "update
>> parsing" from the other side of hte diagram.
>
> Thanks for commenting.
>
> Yea, the purpose is more to show a conceptual rather than actual relation
> between the different components, focusing on the flow. A 100% technical
> correct diagram would be too complex for beginners to comprehend,
> although it could certainly be useful for developers.
>
> I've removed the arrow between QueryParser and search components to clarify.
> The boxes first and foremost show that query parsing and response writers
> are within the realm of search request handler.
>
>> the analysis piece is a little harder to fix cleanly.  you really want the
>> end of the analysis chain to feed back up to the searh components, and
>> then show it (most of hte search components really) talking to the Lucene
>> index.
>
> Yea, I know. Showing how Faceting communicate with the main index and
> spellchecker with its spellchecker index could also be useful, but I think
> that would be for another more detailed diagram.
>
> I felt it was more important for beginners to realize visually that
> analysis happens both at index and search time, and that the analyzers
> align 1:1. At this stage in the digram I often explain the importance
> of matching up the analysis on both sides to get a match in the index.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1

2011-04-10 Thread Lance Norskog
There is an option somewhere to use the full XML DOM implementation
for using xpaths. The purpose of the XPathEP is to be as simple and
dumb as possible and handle most cases: RSS feeds and other open
standards.

Search for xsl(optional)

http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1

On Sat, Apr 9, 2011 at 5:32 AM,   wrote:
> Hi Folks,
>
> does anyone improve DIH XPathRecordReader to deal with nested xpaths?
> e.g.
> data-config.xml with
>   <field column="alltext" xpath="//body" flatten="true"/>
>   <field column="title" xpath="//body/h1"/>
> and the XML stream contains
>  /html/body/h1...
> will only fill field “alltext” but field “title” will be empty.
>
> This is a known issue from 2009
> https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose
>
> So three questions:
> 1. How to fill a “search over all”-Field without nested xpaths?
>   (schema.xml <copyField> will not help, because
> we lose the original token order)
> 2. Does anyone try to improve XPathRecordReader to deal with nested xpaths?
> 3. Does anyone else need this feature?
>
>
> Best regards
>  Karsten
>



-- 
Lance Norskog
goks...@gmail.com


Re: How to index PDF file stored in SQL Server 2008

2011-04-07 Thread Lance Norskog
You need the TikaEntityProcessor to unpack the PDF image. You are
sticking binary blobs into the index. Tika unpacks the text out of the
file.

TikaEP is not in Solr 1.4, but it is in the new Solr 3.1 release.
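A rough sketch of a 3.1-style config for the table described below, assuming FieldStreamDataSource is available to hand the blob column to Tika (connection details are copied from the original question):

<dataConfig>
  <dataSource name="db" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost:1433;databaseName=master"
              user="user" password="pw"/>
  <dataSource name="blob" type="FieldStreamDataSource"/>
  <document>
    <entity name="attachment" dataSource="db"
            query="select id, title, attachment from attachment">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <!-- unpack the PDF bytes with Tika instead of indexing the raw blob -->
      <entity name="pdf" dataSource="blob" processor="TikaEntityProcessor"
              dataField="attachment.attachment" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>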

On Thu, Apr 7, 2011 at 7:14 PM, Roy Liu  wrote:
> Hi,
>
> I have a table named *attachment *in MS SQL Server 2008.
>
> COLUMN    TYPE
> -     
> id               int
> title            varchar(200)
> attachment image
>
> I need to index the attachment(store pdf files) column from database via
> DIH.
>
> After access this URL, it returns "Indexing completed. Added/Updated: 5
> documents. Deleted 0 documents."
> http://localhost:8080/solr/dataimport?command=full-import
>
> However, I can not search anything.
>
> Anyone can help me ?
>
> Thanks.
>
>
> 
> *data-config-sql.xml*
> 
>                driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
>              url="jdbc:sqlserver://localhost:1433;databaseName=master"
>              user="user"
>              password="pw"/>
>  
>                query="select id,title,attachment from attachment">
>    
>  
> 
>
> *schema.xml*
> 
>
>
>
> Best Regards,
> Roy Liu
>



-- 
Lance Norskog
goks...@gmail.com


Re: SOLR - problems with non-english symbols when extracting HTML

2011-04-06 Thread Lance Norskog
Tomcat has to be configured to use UTF-8.

http://wiki.apache.org/solr/SolrTomcat?highlight=%28tomcat%29#URI_Charset_Config
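Concretely, that is the URIEncoding attribute on the HTTP connector in Tomcat's server.xml; the port shown is just the common default:

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>

This covers characters passed in the URL and query string; documents sent in a POST body also need the right charset declared in their Content-Type header.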

On Fri, Mar 25, 2011 at 6:58 PM, kushti  wrote:
>
> Grijesh wrote:
>>
>> Try to send HTML data using format CDATA .
>>
> Doesn't work with
>
>
>> $content = "";
>>
>
> And my goal is not to avoid extraction, but have no problems with
> non-english chars
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SOLR-problems-with-non-english-symbols-when-extracting-HTML-tp2729126p2733858.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Using MLT feature

2011-04-06 Thread Lance Norskog
>> > > Hi again,
>> > > I guess I was wrong on my early post... There's no automated way to
>> >
>> > avoid
>> >
>> > > the indexation of the duplicate doc.
>> >
>> > Yes there is, try set overwriteDupes to true and documents yielding the
>> > same
>> > signature will be overwritten. If you have need both fuzzy and exact
>> > matching
>> > then add a second update processor inside the chain and create another
>> > signature field.
>> >
>> > > I guess I have 2 options:
>> > >
>> > > 1. Create a temp index with signatures and then have an app that for
>> >
>> > each
>> >
>> > > new doc verifies if sig exists on my primary index. If not, add the
>> > > article.
>> > >
>> > > 2. Before adding the doc, create a signature (using the same algorithm
>> >
>> > that
>> >
>> > > SOLR uses) on my indexing app and then verify if signature exists
>> >
>> > before
>> >
>> > > adding.
>> > >
>> > > I'm way thinking the right way here? :)
>> > >
>> > > Thank you,
>> > > Frederico
>> > >
>> > >
>> > >
>> > > -Original Message-----
>> > > From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com]
>> > > Sent: segunda-feira, 4 de Abril de 2011 11:59
>> > > To: solr-user@lucene.apache.org
>> > > Subject: RE: Using MLT feature
>> > >
>> > > Thank you Markus it looks great.
>> > >
>> > > But the wiki is not very detailed on this.
>> > > Do you mean if I:
>> > >
>> > > 1. Create:
>> > > <updateRequestProcessorChain name="dedupe">
>> > >   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>> > >     <bool name="enabled">true</bool>
>> > >     <bool name="overwriteDupes">false</bool>
>> > >     <str name="signatureField">signature</str>
>> > >     <str name="fields">headline,body,medianame</str>
>> > >     <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
>> > >   </processor>
>> > >   <processor class="solr.LogUpdateProcessorFactory" />
>> > >   <processor class="solr.RunUpdateProcessorFactory" />
>> > > </updateRequestProcessorChain>
>> > >
>> > > 2. Add the request as the default update request
>> > > 3. Add a "signature" indexed field to my schema.
>> > >
>> > > Then,
>> > > When adding a new doc to my index, it is only added of not considered
>> >
>> > a
>> >
>> > > duplicate using a Lookup3Signature on the field defined? All
>> >
>> > duplicates
>> >
>> > > are ignored and not added to my index?
>> > > Is it so simple as that?
>> > >
>> > > Does it works even if the medianame should be an exact match (not
>> >
>> > similar
>> >
>> > > match as the headline and bodytext are)?
>> > >
>> > > Thank you for your help,
>> > >
>> > > 
>> > > Frederico Azeiteiro
>> > > Developer
>> > >
>> > >
>> > >
>> > > -Original Message-
>> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
>> > > Sent: segunda-feira, 4 de Abril de 2011 10:48
>> > > To: solr-user@lucene.apache.org
>> > > Subject: Re: Using MLT feature
>> > >
>> > > http://wiki.apache.org/solr/Deduplication
>> > >
>> > > On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
>> > > > Hi,
>> > > >
>> > > > The ideia is don't index if something similar (headline+bodytext)
>> >
>> > for
>> >
>> > > > the same exact medianame.
>> > > >
>> > > > Do you mean I would need to index the doc first (maybe in a temp
>> >
>> > index)
>> >
>> > > > and then use the MLT feature to find similar docs before adding to
>> >
>> > final
>> >
>> > > > index?
>> > > >
>> > > > Thanks,
>> > > > Frederico
>> > > >
>> > > >
>> > > > -Original Message-
>> > > > From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
>> > > > Sent: segunda-feira, 4 de Abril de 2011 10:22
>> > > > To: solr-user@lucene.apache.org
>> > > > Subject: Re: Using MLT feature
>> > > >
>> > > > Do you want to not index if something similar? Or don't index if
>> >
>> > exact.
>> >
>> > > > I would look into a hash code of the document if you don't want to
>> >
>> > index
>> >
>> > > > exact.    Similar though, I think has to be based off a document in
>> >
>> > the
>> >
>> > > > index.
>> > > >
>> > > > On Apr 4, 2011, at 5:16, Frederico Azeiteiro
>> > > >
>> > > >  wrote:
>> > > > > Hi,
>> > > > >
>> > > > >
>> > > > >
>> > > > > I would like to hear your opinion about the MLT feature and if
>> >
>> > it's a
>> >
>> > > > > good solution to what I need to implement.
>> > > > >
>> > > > >
>> > > > >
>> > > > > My index has fields like: headline, body and medianame.
>> > > > >
>> > > > > What I need to do is, before adding a new doc, verify if a similar
>> >
>> > doc
>> >
>> > > > > exists for this media.
>> > > > >
>> > > > >
>> > > > >
>> > > > > My idea is to use the MorelikeThisHandler
>> > > > > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following
>> > > >
>> > > > way:
>> > > > > For each new doc, perform a MLT search with q= medianame and
>> > > > > stream.body=headline+bodytext.
>> > > > >
>> > > > > If no similar docs are found than I can safely add the doc.
>> > > > >
>> > > > >
>> > > > >
>> > > > > Is this feasible using the MLT handler? Is it a good approach? Are
>> > > >
>> > > > there
>> > > >
>> > > > > a better way to perform this comparison?
>> > > > >
>> > > > >
>> > > > >
>> > > > > Thank you for your help.
>> > > > >
>> > > > >
>> > > > >
>> > > > > Best regards,
>> > > > >
>> > > > > 
>> > > > >
>> > > > > Frederico Azeiteiro
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
Lance Norskog
goks...@gmail.com


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-06 Thread Lance Norskog
I would not use replication. LinkedIn consumer search is a flat system
where one process indexes new entries and does queries simultaneously.
It's a custom Lucene app called Zoie. Their stuff is on GitHub.

I would get documents to indexers via a multicast IP-based queueing
system. This scales very well and there's a lot of hardware support.

The problem with distributed search is that it is a) inherently slower
and b) has inherently more and longer jitter. The "airplane wing"
distribution of query times becomes longer and flatter.

This is going to have to be a "federated" system, where the front-end
app aggregates results rather than Solr.

On Mon, Apr 4, 2011 at 6:25 PM, Jens Mueller  wrote:
> Hello Experts,
>
>
>
> I am a Solr newbie but read quite a lot of docs. I still do not understand
> what would be the best way to setup very large scale deployments:
>
>
>
> Goal (threoretical):
>
>  A.) Index-Size: 1 Petabyte (1 Document is about 5 KB in Size)
>
>  B) Queries: 10 Queries/ per Second
>
>  C) Updates: 10 Updates / per Second
>
>
>
>
> Solr offers:
>
> 1.)    Replication => Scales Well for B)  BUT  A) and C) are not satisfied
>
>
> 2.)    Sharding => Scales well for A) BUT B) and C) are not satisfied (=> As
> I understand the Sharding approach all goes through a central server, that
> dispatches the updates and assembles the queries retrieved from the different
> shards. But this central server has also some capacity limits...)
>
>
>
>
> What is the right approach to handle such large deployments? I would be
> thankfull for just a rough sketch of the concepts so I can experiment/search
> further…
>
>
> Maybe I am missing something very trivial as I think some of the “Solr
> Users/Use Cases” on the homepage are that kind of large deployments. How are
> they implemented?
>
>
>
> Thank you very much!!!
>
> Jens
>



-- 
Lance Norskog
goks...@gmail.com


Re: DIH Issue(newbie to solr)

2011-03-20 Thread Lance Norskog
In general, it works best to start with one of the examples and slowly
change it to match what you want.

On Sun, Mar 20, 2011 at 10:16 PM, Gora Mohanty  wrote:
> (Sorry for the delay in replying: Was a little busy this weekend.)
> On Sun, Mar 20, 2011 at 5:20 AM, neha  wrote:
>> The path is correct and also the base dir points to the list of .xml files
>> not just a single file.
>
> OK.
>
>> This is the link to the xml file:
>
> Another thing that I see is that the xpath in your field entries
> is a relative one. I think, though am not 100% sure, that it needs
> to be an absolute one, i.e., xpath="/sciserv/sci_article/journal/issn",
> instead of xpath="/journal/issn".
>
> Please take a look at the setup for XPathEntityProcessor in
> example/example-DIH/solr/rss/conf/rss-data-config.xml under
> the main Solr data directory. You can compare the xpath
> definitions to the structure of the XML in the Atom feed that
> it indexes.
>
> Regards,
> Gora
>



-- 
Lance Norskog
goks...@gmail.com


Re: DIH : modify document in sibling entity of root entity

2011-03-10 Thread Lance Norskog
The DIH is strictly tree-structured. Data flows down the tree. If the
first sibling is the root entity, nothing is used from the second
sibling. This configuration is something that the DIH should fail on.
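The usual tree-shaped alternative is the cached sub-entity that Gora points to below; a minimal sketch with made-up table and column names:

<document>
  <entity name="item" query="SELECT id, title FROM item">
    <!-- the second source is read once, cached in memory,
         and then looked up per parent row -->
    <entity name="extra" processor="CachedSqlEntityProcessor"
            query="SELECT item_id, extra_text FROM extra_data"
            where="item_id=item.id"/>
  </entity>
</document>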

On Thu, Mar 10, 2011 at 9:14 AM, Chantal Ackermann
 wrote:
> Hi Gora,
>
> thanks for making me read this part of the documentation again!
> This processor probably cannot do what I need out of the box but I will
> try to extend it to allow specifying a regular expression in its "where"
> attribute.
>
> Thanks!
> Chantal
>
> On Thu, 2011-03-10 at 17:39 +0100, Gora Mohanty wrote:
>> On Thu, Mar 10, 2011 at 8:42 PM, Chantal Ackermann
>>  wrote:
>> [...]
>> > Is this supposed to work at all? I haven't found anything so far on the
>> > net but I could have used the wrong keywords for searching, of course.
>> >
>> > As answer to the maybe obvious question why I'm not using a subentity:
>> > I thought that this solution might be faster because it iterates over
>> > the second data source instead of hitting it with a query per each
>> > document.
>> [...]
>>
>> I think that what you are after can be handled by Solr's
>> CachedSqlEntityProcessor:
>> http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor
>>
>> Two major caveats here:
>> * I am not 100% sure that I have understood your requirements.
>> * The documentation for CachedSqlEntityProcessor needs to be improved.
>>   Will see if I can test it, and come up with a better example. As I have
>>   not actually used this, it could be that I have misunderstood its purpose.
>>
>> Regards,
>> Gora
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: NRT in Solr

2011-03-09 Thread Lance Norskog
Please start new threads for new conversations.

On Wed, Mar 9, 2011 at 2:27 AM, stockii  wrote:
> question: http://wiki.apache.org/solr/NearRealtimeSearchTuning
>
>
> 'PERFORMANCE WARNING: Overlapping onDeckSearchers=x
>
> i got this message.
> in my solrconfig.xml: maxWarmingSearchers=4, if i set this to 1 or 2 i got
> exception. with 4 i got nothing, but the Performance Warning. the
> wiki-articel says, that the best solution is to set the warmingSearcher to
> 1!!! how can this work ?
>
> -
> --- System 
> 
>
> One Server, 12 GB RAM, 2 Solr Instances, 7 Cores,
> 1 Core with 31 Million Documents other Cores < 100.000
>
> - Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
> - Solr2 for Update-Request  - delta every Minute - 4GB Xmx
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/NRT-in-Solr-tp2652689p2654696.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: StreamingUpdateSolrServer

2011-03-08 Thread Lance Norskog
Yes. Each thread uses its own connection, and each becomes a new
thread in the servlet container.

On Mon, Mar 7, 2011 at 2:54 AM, Isan Fulia  wrote:
> Hi all,
> I am using StreamingUpdateSolrServer with queuesize = 5 and threadcount=4
> The no. of connections created are same as threadcount.
> Is it that it creates a new connection for every thread.
>
>
> --
> Thanks & Regards,
> Isan Fulia.
>



-- 
Lance Norskog
goks...@gmail.com


Re: memory leak during undeploying

2011-03-05 Thread Lance Norskog
Classes get saved in PermGen and are never freed. Apparently there are
JVM options to fix this.
I'm not sure if the old String.intern() use in Lucene had this problem.

Lance

On Wed, Mar 2, 2011 at 10:23 PM, Chris Hostetter
 wrote:
>
> : When I did heap analysis, the culprit always seems to
> : be TimeLimitedCollector thread. Because of this, considerable amount of
> : classes are not getting unloaded.
>        ...
> : > > There are couple of JIRA's related to this:
> : > > https://issues.apache.org/jira/browse/LUCENE-2237,
> : > > https://issues.apache.org/jira/browse/SOLR-1735. Even after applying
> : > these
> : > > patches, the issue still remains.
>
> can you clarify what you mean by this -- are you still seeing that
> TimeLimitedCollector is the culprit in your heap analysis (even with the
> patches) or are you still getting problems with PermGen running out, but
> it's caused by other classes and TimeLimitedCollector is no longer the
> culprit? (and if so: which other classes)
>
> FYI: LUCENE-2822 is realted to LUCENE-2237 and has attracted some more
> attention/comments (i suspect largely because it was filed as a bug
> instead of an improvement)
>
>
> -Hoss
>



-- 
Lance Norskog
goks...@gmail.com


Re: Blacklist keyword list on dataimporter

2011-02-27 Thread Lance Norskog
The basic Solr document ingestion process does not currently support this.

In the DataImportHandler you can configure it to skip any document
that fails a processor. You would have to write your own processor
that hunts for that word and throws an exception.
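Short of writing a Java processor, the ScriptTransformer together with DIH's $skipDoc flag can do the same job; a sketch with a made-up field and keyword:

<dataConfig>
  <script><![CDATA[
    function dropBlacklisted(row) {
      var text = row.get('description');      // hypothetical field to inspect
      if (text != null && text.toLowerCase().indexOf('badword') != -1) {
        row.put('$skipDoc', 'true');          // DIH drops this document
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="ads" transformer="script:dropBlacklisted" query="...">
      ...
    </entity>
  </document>
</dataConfig>

A real blacklist would be loaded from a file or table rather than hard-coded in the script.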


On Sat, Feb 26, 2011 at 2:12 PM, Rosa (Anuncios)
 wrote:
> Hi,
>
> Is there a way to drop document when indexing based of a blacklist keyword
> list?
>
> Something like the stopwords.txt...
>
> But in this case when one keyword is detected in a specific field at
> indexing, the whole doc would be skipped.
>
> Regards
>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Question Solr Index main in RAM

2011-02-27 Thread Lance Norskog
This sounds like a great idea but rarely works out. Garbage collection
has to work around the data stored in memory, and most of the data you
want to hit frequently is already indexed and cached. The operating
system is very smart about keeping the popular parts of the index in
memory, and there is no garbage collection there.

I do not know if the RAMDirectoryFactory in current development has
disk-backed persistence.
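To answer the "how to use this" question: on a release that ships it, it is a one-line change in solrconfig.xml:

<directoryFactory name="DirectoryFactory" class="solr.RAMDirectoryFactory"/>

The index then lives only in the JVM heap, so unless a persistence option exists it will not survive a restart, which is one more reason the caveats above matter.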

On Thu, Feb 24, 2011 at 7:26 AM, Bill Bell  wrote:
> How to use this?
>
> Bill Bell
> Sent from mobile
>
>
> On Feb 24, 2011, at 7:19 AM, Koji Sekiguchi  wrote:
>
>> (11/02/24 21:38), Andrés Ospina wrote:
>>>
>>> Hi,
>>>
>>> My name is Felipe and i want to use the index main of solr in RAM memory.
>>>
>>> How it's possible? I have solr 1.4
>>>
>>> Thank you!
>>>
>>> Felipe
>>
>> Welcome Felipe!
>>
>> If I understand your question correctly, you can use RAMDirectoryFactory:
>>
>> https://hudson.apache.org/hudson/job/Solr-3.x/javadoc/org/apache/solr/core/RAMDirectoryFactory.html
>>
>> But I believe it is available 3.1 (to be released soon...).
>>
>> Koji
>> --
>> http://www.rondhuit.com/en/
>



-- 
Lance Norskog
goks...@gmail.com


Problems with JSP pages?

2011-02-25 Thread Lance Norskog
I'm on Windows Vista, using the trunk. Some of the JSP pages do not
execute, but instead Jetty downloads them.

solr/admin/get-properties.jsp for example. This is called by the 'JAVA
PROPERTIES' button in the main admin page.

Is this a known problem/quirk for Windows? Or fallout from a jetty
change? Or...?

-- 
Lance Norskog
goks...@gmail.com


Re: MailEntityProcessor

2011-02-23 Thread Lance Norskog
The DIH config does not mention port numbers, or security options. I
recently wrote a custom app to download and index mail; there were
several complexities (the above problems, attachments, calculating
mail threads).

I think your best bet is to find a utility that downloads mail into
mbox files. Tika has a parser for these files. The
ExtractingRequestHandler will parse them, but I don't know if it makes
separate documents for each email.

Solr 3.x (unreleased) has Tika in the DataImportHandler. This might be
more flexible.
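For reference, the wiki's basic MailEntityProcessor entity looks roughly like this (credentials and host are placeholders; note there is no separate port attribute, the protocol choice implies it):

<document>
  <entity processor="MailEntityProcessor"
          user="someone@example.com"
          password="secret"
          host="imap.example.com"
          protocol="imaps"
          folders="inbox"/>
</document>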

On Wed, Feb 23, 2011 at 3:52 PM, Smiley, David W.  wrote:
> I assume you found this?: http://wiki.apache.org/solr/MailEntityProcessor
> You don't provide enough information to get assistance when you simply say "I 
> couldn't get it working".
>
> (disclaimer: I haven't used DIH's mail feature)
> ~ David
>
> On Feb 23, 2011, at 5:15 PM, Husrev Yilmaz wrote:
>
>> Hi,
>>
>> I am new to Solr, without any Java knowledge.
>>
>> I downloaded and run Solr under Tomcat. At the other hand I have a working
>> IMAP server on the same machine. I want to index Date, From, To, Cc, Bcc,
>> Subject, Body.
>>
>> How can I set up Solr to do this? Could you write a small guide to help me?
>> (where to put which xml by which content). There is enough documentation
>> about DBs, but I couldn't get it working for IMAP?
>>
>> Regards.
>>
>> --
>> Husrev Yilmaz
>> +90 554 3304911
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: [ANN] new SolrMeter release

2011-02-23 Thread Lance Norskog
Cool!

On 2/23/11, Tomás Fernández Löbbe  wrote:
> Hi All, I'm happy to announce a new release of SolrMeter, an open source
> stress test tool for Solr.
>
> You can obtain the code or executable jar from the google code page at:
>
> http://code.google.com/p/solrmeter
>
> There have been a lot of improvements since the last release, you can see
> what's new by checking the "issues" tool or entering here:
>
> http://code.google.com/p/solrmeter/issues/list?can=1&q=Milestone%3DRelease-0.2.0+&colspec=ID+Type+Status+Priority+Milestone+Owner+Summary&cells=tiles
>
>
> Best Regards,
>
> Tomás
>


-- 
Lance Norskog
goks...@gmail.com


Re: DIH threads

2011-02-20 Thread Lance Norskog
It does not substitute values correctly in many cases. It seems to
work at a top level, but not in sub-levels.

On Fri, Feb 18, 2011 at 5:41 PM, Bill Bell  wrote:
> I used it on 4,0 and it did not help us. We were bound on SQL io
>
> Bill Bell
> Sent from mobile
>
>
> On Feb 18, 2011, at 4:47 PM, Mark  wrote:
>
>> Has anyone applied the DIH threads patch on 1.4.1 
>> (https://issues.apache.org/jira/browse/SOLR-1352)?
>>
>> Does anyone know if this works and/or does it improve performance?
>>
>> Thanks
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: My Plan to Scale Solr

2011-02-17 Thread Lance Norskog
Or even better, search with 'LSA'.

On Thu, Feb 17, 2011 at 9:22 AM, Walter Underwood  wrote:
> http://lmgtfy.com/?q=SLA
>
> wunder
>
> On Feb 17, 2011, at 11:04 AM, Dennis Gearon wrote:
>
>> What's an 'LSA'
>>
>> Dennis Gearon
>>
>>
>> Signature Warning
>> 
>> It is always a good idea to learn from your own mistakes. It is usually a 
>> better
>> idea to learn from others’ mistakes, so you do not have to make them 
>> yourself.
>> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>>
>>
>> EARTH has a Right To Life,
>> otherwise we all die.
>>
>>
>>
>>
>> 
>> From: Stijn Vanhoorelbeke 
>> To: solr-user@lucene.apache.org; bing...@asu.edu
>> Sent: Thu, February 17, 2011 4:28:13 AM
>> Subject: Re: My Plan to Scale Solr
>>
>> Hi,
>>
>> I'm currently looking at SolrCloud. I've managed to set up a scalable
>> cluster with ZooKeeper.
>> ( see the examples in http://wiki.apache.org/solr/SolrCloud for a quick
>> understanding )
>> This way, all different shards / replicas are stored in a centralised
>> configuration.
>>
>> Moreover the ZooKeeper contains out-of-the-box loadbalancing.
>> So, lets say - you have 2 different shards and each is replicated 2 times.
>> Your zookeeper config will look like this:
>>
>> \config
>> ...
>>   /live_nodes (v=6 children=4)
>>          lP_Port:7500_solr (ephemeral v=0)
>>          lP_Port:7574_solr (ephemeral v=0)
>>          lP_Port:8900_solr (ephemeral v=0)
>>          lP_Port:8983_solr (ephemeral v=0)
>>     /collections (v=20 children=1)
>>          collection1 (v=0 children=1) "configName=myconf"
>>               shards (v=0 children=2)
>>                    shard1 (v=0 children=3)
>>                         lP_Port:8983_solr_ (v=4)
>> "node_name=lP_Port:8983_solr url=http://lP_Port:8983/solr/";
>>                         lP_Port:7574_solr_ (v=1)
>> "node_name=lP_Port:7574_solr url=http://lP_Port:7574/solr/";
>>                         lP_Port:8900_solr_ (v=1)
>> "node_name=lP_Port:8900_solr url=http://lP_Port:8900/solr/";
>>                    shard2 (v=0 children=2)
>>                         lP_Port:7500_solr_ (v=0)
>> "node_name=lP_Port:7500_solr url=http://lP_Port:7500/solr/";
>>                         lP_Port:7574_solr_ (v=1)
>> "node_name=lP_Port:7574_solr url=http://lP_Port:7574/solr/";
>>
>> --> This setup can be realised, by 1 ZooKeeper module - the other solr
>> machines need just to know the IP_Port were the zookeeper is active & that's
>> it.
>> --> So no configuration / installing is needed to realise quick a scalable /
>> load balanced cluster.
>>
>> Disclaimer:
>> ZooKeeper is a relative new feature - I'm not sure if it will work out in a
>> real production environment, which has a tight SLA pending.
>> But - definitely keep your eyes on this stuff - this will mature quickly!
>>
>> Stijn Vanhoorelbeke
>
> --
> Walter Underwood
> Venture ASM, Troop 14, Palo Alto
>
>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Which version of Solr?

2011-02-14 Thread Lance Norskog
II! I feel your pain!

On Mon, Feb 14, 2011 at 3:27 PM, Jeff Schmidt  wrote:
> Wow,  okay, it's Cassandra's fault... :)
>
> I create unit tests to use HttpClient and even HttpURLConnection, and the 
> former got the non-response from the server, and the latter got unexpected 
> end of file.  But, if I use curl or telnet, things would work. Anyway, I 
> noticed (Mac OS X 10.6.6):
>
> [imac:apache/cassandra/apache-cassandra-0.7.0] jas% netstat -an | grep 8080
> tcp4       0      0  *.8080                 *.*                    LISTEN
> tcp46      0      0  *.8080                 *.*                    LISTEN
> [imac:apache/cassandra/apache-cassandra-0.7.0] jas%
>
> After shutting down tomcat, the tcp4 line would still show up. Only after 
> also shutting down Cassandra were there no listeners on port 8080. Starting 
> tomcat and Cassandra in either order, neither failed to bind to 8080.  Why my 
> Java programs tried to talk to Cassandra, and telnet, Firefox, curl etc. 
> managed to hook up with Solr, I don't know.
>
> I moved tomcat to port 8090 and things are good... Sigh..  What a big waste 
> of time.
>
> Cheers,
>
> Jeff
>
> On Feb 14, 2011, at 2:29 PM, Jeff Schmidt wrote:
>
>> I figured instead of trying to index content, I'd simply issue a query via 
>> SolrJ. This seems related to my problem below.  I create a 
>> CommonsHttpSolrServer instance in the manner already described and in a new 
>> method:
>>
>>       @Override
>>       public List getNodeIdsForProductId(final String productId, 
>> final String partnerId) {
>>
>>               final List nodes = new ArrayList();
>>
>>               final CommonsHttpSolrServer solrServer = 
>> (CommonsHttpSolrServer)getSolrServer(partnerId);
>>               final SolrQuery query = new SolrQuery();
>>               query.setQuery("productId:" + productId);
>>               query.addField("nodeId");
>>               try {
>>                       final QueryResponse response = solrServer.query(query);
>>                       final SolrDocumentList docs = response.getResults();
>>                       log.info(String.format("getNodeIdsForProductId - got 
>> %d nodes for productId: %s",
>>                                       docs.getNumFound(), productId));
>>                       for (SolrDocument doc : docs) {
>>                               log.info(doc);
>>                       }
>>               } catch (SolrServerException ex) {
>>                       final String msg = String.format("Unable to query Solr 
>> server %s, for query: %s", solrServer.getBaseURL(), query);
>>                       log.error(msg);
>>                       throw new ServiceException(msg, ex);
>>               }
>>
>>               return nodes;
>>       }
>>
>> When issuing the query I get:
>>
>> 2011-02-14 13:13:28 INFO  solr.SolrProductIndexService - getSolrServer - 
>> Solr url: http://localhost:8080/solr/partner-tmo
>> 2011-02-14 13:13:28 INFO  solr.SolrProductIndexService - getSolrServer - 
>> construct server for url: http://localhost:8080/solr/partner-tmo
>> 2011-02-14 13:13:28 ERROR solr.SolrProductIndexService - Unable to query 
>> Solr server http://localhost:8080/solr/partner-tmo, for query: 
>> q=productId%3Aproduct4&fl=nodeId
>> ...
>> Caused by: org.apache.solr.client.solrj.SolrServerException: 
>> org.apache.commons.httpclient.NoHttpResponseException: The server localhost 
>> failed to respond
>>       at 
>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:484)
>> ...
>> Caused by: org.apache.commons.httpclient.NoHttpResponseException: The server 
>> localhost failed to respond
>>       at 
>> org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1976)
>>
>> If I run this through the proxy again, I can see the request being made as:
>>
>> GET 
>> /solr/partner-tmo/select?q=productId%3Aproduct4&fl=nodeId&wt=xml&version=2.2 
>> HTTP/1.1
>> User-Agent: Solr[org.apache.solr.client.solrj.impl.CommonsHttpSolrServer] 1.0
>> Host: localhost:8080
>>
>> And I get no response from Solr.  If instead I use this URL in Firefox:
>>
>> http://localhost:8080/solr/partner-tmo/select?q=productId%3Aproduct4&fl=nodeId&wt=xml&version=2.2
>>
>> I get search results.  What is it about SolrJ that is just not working out?  
>> What basic thing am I missing? Using Firefox here, or curl below, I can talk 
>> to

Re: Difference between Solr and Lucidworks distribution

2011-02-14 Thread Lance Norskog
Right. LWE binaries are distributed for free, and may be used for
non-production purposes. For a production deployment, separate
subscriptions are required. It is effectively the same as requiring
payment for a production deployment license bundled with a support
subscription.

On Sun, Feb 13, 2011 at 8:29 AM, Adam Estrada
 wrote:
> I believe that the Lucid Works distro for Solr is free and as you mentioned 
> they only appear to sell their services for it. I have used that version for 
> several demos because it does seem to have all the bells and whistles already 
> included and it's super easy to set up. The only downside in my case is that 
> they are still on the official release version 1.4.1 which has an older 
> version of PDFBox that doesn't parse PDF's generated from newer adobe 
> software. Thanks Adobe ;-) It's easy enough to just rebuild Tika, PDFBox, 
> FontBox, etc. and swap them out...If you want spatial support, you can use 
> the plugin from the Spatial Solr project out of the Netherlands which is 
> designed to support 1.4.1 and from what I can tell seems to work pretty well.
>
> Anyway, when 4.0 is released, hopefully with the extended spatial support 
> from projects like SIS and JTS, I hope to see the office distro version 
> change from Lucid.
>
> Thanks for all hard work the Lucid Team has provided over the years!
>
> Adam
>
> On Feb 12, 2011, at 10:55 PM, Andy wrote:
>
>> Now I'm confused.
>>
>> In http://www.lucidimagination.com/lwe/subscriptions-and-pricing, the price 
>> of LucidWorks Enterprise Software is stated as "FREE". I thought the price 
>> for "Production" was for the support service, not for the software.
>>
>> But you seem to be saying that 'LucidWorks Enterprise' is a separate 
>> software that isn't free. Did I misunderstand?
>>
>> --- On Sat, 2/12/11, Lance Norskog  wrote:
>>
>>> From: Lance Norskog 
>>> Subject: Re: Difference between Solr and Lucidworks distribution
>>> To: solr-user@lucene.apache.org, markus.jel...@openindex.io
>>> Date: Saturday, February 12, 2011, 8:10 PM
>>> There are two distributions.
>>>
>>> The company is Lucid Imagination. 'Lucidworks for Solr' is the
>>> certified distribution of Solr 1.4.1, with several enhancements.
>>>
>>> Markus refers to 'LucidWorks Enterprise', which is LWE. This is a
>>> separate app with tools and a REST API for managing a Solr instance.
>>>
>>> Lance Norskog
>>>
>>> On Fri, Feb 11, 2011 at 8:36 AM, Markus Jelsma
>>> 
>>> wrote:
>>>> It is not free for production environments.
>>>> http://www.lucidimagination.com/lwe/subscriptions-and-pricing
>>>>
>>>> On Friday 11 February 2011 17:31:22 Greg Georges wrote:
>>>>> Hello all,
>>>>>
>>>>> I just started watching the webinars from Lucidworks, and they mention
>>>>> their distribution which has an installer, etc.. Is there any other
>>>>> differences? Is it a good idea to use this free distribution?
>>>>>
>>>>> Greg
>>>>
>>>> --
>>>> Markus Jelsma - CTO - Openindex
>>>> http://www.linkedin.com/in/markus17
>>>> 050-8536620 / 06-50258350
>>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>
>>
>>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Monitor the QTime.

2011-02-12 Thread Lance Norskog
If you're a unix shell scripting wiz, here are a few strategies.

Tail the logfile and filter for the string 'QTime'. The QTime value is the
very last token on the line. So, strip the text between the timestamp and
that number, then sort by the timestamp first and by the number (descending)
second, and grab the first QTime for each timestamp. I don't know a single
command for that last step. This gives you the longest query time for each
second.

As a separate trick: tail the logfile and filter for QTime, then strip out
all text after the timestamp. Now you have a stream of lines containing only
timestamps. Run this through 'uniq -c' and voila! you get the queries per
second for each timestamp.
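
For anyone who would rather not fight the shell, here is a rough Java sketch of
the same idea. It assumes a log4j-style layout where each line starts with a
'yyyy-MM-dd HH:mm:ss' timestamp and request lines end in 'QTime=<millis>';
adjust the parsing to whatever your log actually looks like.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Map;
import java.util.TreeMap;

// Prints, for each second seen in the log, the number of queries
// and the longest QTime observed in that second.
public class QTimeStats {
    public static void main(String[] args) throws Exception {
        Map<String, long[]> perSecond = new TreeMap<String, long[]>(); // second -> {count, maxQTime}
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            int q = line.lastIndexOf("QTime=");
            if (q < 0) continue;                              // not a request line
            long qtime = Long.parseLong(line.substring(q + "QTime=".length()).trim());
            String second = line.substring(0, 19);            // e.g. "2011-02-14 13:13:28"
            long[] stats = perSecond.get(second);
            if (stats == null) perSecond.put(second, stats = new long[2]);
            stats[0]++;                                       // queries this second
            stats[1] = Math.max(stats[1], qtime);             // worst QTime this second
        }
        in.close();
        for (Map.Entry<String, long[]> e : perSecond.entrySet()) {
            System.out.println(e.getKey() + "  qps=" + e.getValue()[0]
                + "  maxQTime=" + e.getValue()[1]);
        }
    }
}

Run it as 'java QTimeStats solr.log' and you get one line per second with the
query count and the worst QTime.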

On Sat, Feb 12, 2011 at 1:51 AM, Gora Mohanty  wrote:
> On Sat, Feb 12, 2011 at 4:54 AM, Stijn Vanhoorelbeke
>  wrote:
> [...]
>> Can you access this URL from a web browser (tried but doesn't work ) ? Or
>> must this used in jConsole / custom made java program.
>
> Please try http://localhost:8983/solr/admin/stats.jsp (change hostname/port as
> needed).
>
>> Could you please point me to a good guide to implement this JMX stuff, cause
>> I'm a newbie for JMX.
>
> The easiest way to get access to JMX is indeed a Java console, like jconsole.
> There are various open-source JMX clients available, but we could find none
> that met our needs, and were being actively maintained. We have been
> toying with the idea of a JMX client that offers a REST API to Solr MBeans
> (or even to any generic MBeans). This would be a more natural interface for
> people used to web development.
>
> Regards,
> Gora
>



-- 
Lance Norskog
goks...@gmail.com


Re: Difference between Solr and Lucidworks distribution

2011-02-12 Thread Lance Norskog
There are two distributions.

The company is Lucid Imagination. 'Lucidworks for Solr' is the
certified distribution of Solr 1.4.1, with several enhancements.

Markus refers to 'LucidWorks Enterprise', which is LWE. This is a
separate app with tools and a REST API for managing a Solr instance.

Lance Norskog

On Fri, Feb 11, 2011 at 8:36 AM, Markus Jelsma
 wrote:
> It is not free for production environments.
> http://www.lucidimagination.com/lwe/subscriptions-and-pricing
>
> On Friday 11 February 2011 17:31:22 Greg Georges wrote:
>> Hello all,
>>
>> I just started watching the webinars from Lucidworks, and they mention
>> their distribution which has an installer, etc.. Is there any other
>> differences? Is it a good idea to use this free distribution?
>>
>> Greg
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
Lance Norskog
goks...@gmail.com


Re: Which version of Solr?

2011-02-12 Thread Lance Norskog
There is momentum towards doing a release of 3.x. I would be
comfortable using the 3.x branch.

--- But I'm unable to get SolrJ to work due to the 'javabin' version
mismatch. I'm using the 1.4.1 version of SolrJ, but I always get an
HTTP response code of 200, but the return entity is simply a null
byte, which does not match the version number of 1 defined in Solr
common.  ---

I've never seen this problem. At this point you are better off
starting with 3.x instead of chasing this problem down.
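
If you do stay on 1.4.1 for a while, one hedged suggestion (not something I
have verified against your setup): tell SolrJ to use the XML response parser,
so the javabin version check never comes into play at all. A minimal sketch,
with a made-up core URL:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.client.solrj.response.QueryResponse;

public class XmlWireCheck {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");  // hypothetical URL
        // Request wt=xml instead of javabin, sidestepping the binary format entirely.
        server.setParser(new XMLResponseParser());

        SolrQuery q = new SolrQuery("*:*");
        QueryResponse rsp = server.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}

If that works while the default javabin parser does not, the mismatch really
is in the binary format and not in your HTTP setup.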

On Sat, Feb 12, 2011 at 1:37 PM, Jeff Schmidt  wrote:
> Hello:
>
> I'm working on incorporating Solr into a SaaS based life sciences semantic 
> search project. This will be released in about six months. I'm trying to 
> determine which version of Solr makes the most sense. When going to the Solr 
> download page, there are 1.3.0, 1.4.0, and 1.4.1. I've been using 1.4.1 while 
> going through some examples in my Packt book ("Solr 1.4 Enterprise Search 
> Server").
>
> But, I also see that Solr 3.1 and 4.0 are in the works.  According to:
>
>        
> https://issues.apache.org/jira/browse/#selectedTab=com.atlassian.jira.plugin.system.project%3Aroadmap-panel
>
> there is a high degree of progress on both of those releases; including a 
> slew of bug fixes, new features, performance enhancements etc. Should I be 
> making use of one of the newer versions?  The hierarchical faceting seems 
> like it could be quite useful.  Are there any guesses on when either 3.1 or 
> 4.0 will be officially released?
>
> So far, 1.4.1 has been good. But I'm unable to get SolrJ to work due to the 
> 'javabin' version mismatch. I'm using the 1.4.1 version of SolrJ, but I 
> always get an HTTP response code of 200, but the return entity is simply a 
> null byte, which does not match the version number of 1 defined in Solr 
> common.  Anyway, I can follow up on that issue if 1.4.1 is still the most 
> appropriate version to use these days. Otherwise, I'll try again with 
> whatever version you suggest.
>
> Thanks a lot!
>
> Jeff
> --
> Jeff Schmidt
> 535 Consulting
> j...@535consulting.com
> (650) 423-1068
>
>
>
>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: DIH keeps failing during full-import

2011-02-07 Thread Lance Norskog
.runCmd(DataImporter.java:391)
>>    at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
>> Feb 7, 2011 7:03:29 AM org.apache.solr.handler.dataimport.JdbcDataSource
>> closeConnection
>> SEVERE: Ignoring Error when closing connection
>> com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException:
>> Communications link failure during rollback(). Transaction resolution
>> unknown.
>>    at sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown
>> Source)
>>    at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>    at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
>>    at com.mysql.jdbc.Util.handleNewInstance(Util.java:407)
>>    at com.mysql.jdbc.Util.getInstance(Util.java:382)
>>    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1013)
>>    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987)
>>    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982)
>>    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
>>    at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4751)
>>    at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4345)
>>    at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1564)
>>    at
>> org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:399)
>>    at
>> org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:390)
>>    at
>> org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:174)
>>    at
>> org.apache.solr.handler.dataimport.DataConfig.clearCaches(DataConfig.java:332)
>>    at
>> org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:360)
>>    at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
>>    at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
>> Feb 7, 2011 7:03:29 AM org.apache.solr.handler.dataimport.JdbcDataSource
>> closeConnection
>> SEVERE: Ignoring Error when closing connection
>> com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException:
>> Communications link failure during rollback(). Transaction resolution
>> unknown.
>>    at sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown
>> Source)
>>    at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>    at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
>>    at com.mysql.jdbc.Util.handleNewInstance(Util.java:407)
>>    at com.mysql.jdbc.Util.getInstance(Util.java:382)
>>    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1013)
>>    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987)
>>    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982)
>>    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
>>    at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4751)
>>    at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4345)
>>    at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1564)
>>    at
>> org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:399)
>>    at
>> org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:390)
>>    at
>> org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:174)
>>    at
>> org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:165)
>>    at
>> org.apache.solr.handler.dataimport.DataConfig.clearCaches(DataConfig.java:332)
>>    at
>> org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:360)
>>    at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
>>    at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: prices

2011-02-05 Thread Lance Norskog
Jonathan- right in one!

Using floats for prices will lead to madness. My mortgage UI kept
changing the loan's interest rate.
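
For illustration, a small Java sketch of both the problem and the
cents-as-an-integer approach Yonik describes below (the field names here are
invented):

import java.math.BigDecimal;
import org.apache.solr.common.SolrInputDocument;

public class PriceFields {
    public static void main(String[] args) {
        // Binary floats cannot represent most decimal prices exactly.
        float sum = 0f;
        for (int i = 0; i < 100; i++) sum += 0.01f;   // add one hundred cents
        System.out.println(sum);                      // prints something like 0.99999934, not 1.0

        // Safer: index whole cents in an integer field and keep a stored
        // string copy purely for display.
        String display = "19.90";
        int cents = new BigDecimal(display).movePointRight(2).intValueExact();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("price_cents", cents);        // hypothetical tint field: range queries, sorting
        doc.addField("price_display", display);    // hypothetical stored string field: display only
    }
}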

On Fri, Feb 4, 2011 at 12:13 PM, Dennis Gearon  wrote:
> That's a good idea, Yonik. So, fields that aren't stored don't get displayed, 
> so
> the float field in the schema never gets seen by the user. Good, I like it.
>
>  Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a 
> better
> idea to learn from others’ mistakes, so you do not have to make them yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>
>
> - Original Message 
> From: Yonik Seeley 
> To: solr-user@lucene.apache.org
> Sent: Fri, February 4, 2011 10:49:42 AM
> Subject: Re: prices
>
> On Fri, Feb 4, 2011 at 12:56 PM, Dennis Gearon  wrote:
>> Using solr 1.4.
>>
>> I have a price in my schema. Currently it's a tfloat. Somewhere along the way
>> from php, json, solr, and back, extra zeroes are getting truncated along with
>> the decimal point for even dollar amounts.
>>
>> So I have two questions, neither of which seemed to be findable with google.
>>
>> A/ Any way to keep both zeroes going into a float field? (In the analyzer,
>>with
>> XML output, the values are shown with 1 zero)
>> B/ Can strings be used in range queries like a float and work well for 
>> prices?
>
> You could do a copyField into a stored string field and use the tfloat
> (or tint and store cents)
> for range queries, searching, etc, and the string field just for display.
>
> -Yonik
> http://lucidimagination.com
>
>
>
>
>>
>>  Dennis Gearon
>>
>>
>> Signature Warning
>> 
>> It is always a good idea to learn from your own mistakes. It is usually a
>>better
>> idea to learn from others’ mistakes, so you do not have to make them 
>> yourself.
>> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>>
>>
>> EARTH has a Right To Life,
>> otherwise we all die.
>>
>>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: SolrCloud Questions for MultiCore Setup

2011-01-27 Thread Lance Norskog
Hello-

I have not used SolrCloud.

On 1/27/11, Em  wrote:
>
> Hi,
>
> excuse me for pushing this for a second time, but I can't figure it out by
> looking at the source code...
>
> Thanks!
>
>
>
>> Hi Lance,
>>
>> thanks for your explanation.
>>
>> As far as I know in distributed search i have to tell Solr what other
>> shards it has to query. So, if I want to query a specific core, present in
>> all my shards, i could tell Solr this by using the shards-param plus
>> specified core on each shard.
>>
>> Using SolrCloud's distrib=true feature (it sets all the known shards
>> automatically?), a collection should consist only of one type of
>> core-schema, correct?
>> How does SolrCloud knows that shard_x and shard_y are replicas of
>> eachother (I took a look at the  possibility to specify alternative shards
>> if one is not available)? If it does not know that they are replicas of
>> eachother, I should use the syntax of specifying alternative shards for
>> failover due to performance-reasons, because querying 2 identical and
>> available cores seems to be wasted capacity, no?
>>
>> Thank you!
>>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolrCloud-Questions-for-MultiCore-Setup-tp2309443p2363396.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


-- 
Lance Norskog
goks...@gmail.com


Re: Solr for noSQL

2011-01-27 Thread Lance Norskog
There are no special connectors available for reading from key-value
stores like memcached/Cassandra/MongoDB. You would have to get a Java
client library for the database and code your own DataImportHandler
DataSource. I cannot recommend this; you are better off writing your own
program that reads the data and uploads it to Solr with one of the Solr
client libraries.
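
For example, here is a bare-bones sketch of such a loader using the MongoDB
Java driver and SolrJ. The database, collection, and field names are invented,
and batching and error handling are left out:

import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.Mongo;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MongoToSolr {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017);
        DB db = mongo.getDB("shop");                       // hypothetical database name
        DBCollection products = db.getCollection("products");

        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        DBCursor cursor = products.find();
        while (cursor.hasNext()) {
            DBObject row = cursor.next();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", row.get("_id").toString());
            doc.addField("name", row.get("name"));         // hypothetical fields
            doc.addField("price", row.get("price"));
            solr.add(doc);
        }
        solr.commit();
        mongo.close();
    }
}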

Lance

On 1/27/11, Jianbin Dai  wrote:
> Hi,
>
>
>
> Do we have data import handler to fast read in data from noSQL database,
> specifically, MongoDB I am thinking to use?
>
> Or a more general question, how does Solr work with noSQL database?
>
> Thanks.
>
>
>
> Jianbin
>
>
>
>


-- 
Lance Norskog
goks...@gmail.com


Re: Tika config in ExtractingRequestHandler

2011-01-27 Thread Lance Norskog
The tika.config file is obsolete. I don't know what replaces it.

On 1/27/11, Erlend Garåsen  wrote:
>
> If this configuration file is the same as the tika-mimetypes.xml file
> inside Nutch' conf file, I have an example.
>
> I was trying to implement language detection for Solr and thought I had
> to invoke some Tika functionality by this configuration file in order to
> do so, but found out that I could rewrite some of the
> ExtractingRequestHandler classes instead.
>
> Erlend
>
> On 27.01.11 16.12, Adam Estrada wrote:
>> I believe that as long as Tika is included in a folder that is
>> referenced by solrconfig.xml you should be good. Solr will
>> automatically throw mime types to Tika for parsing. Can anyone else
>> add to this?
>>
>> Thanks,
>> Adam
>>
>> On Thu, Jan 27, 2011 at 5:06 AM, Erlend Garåsen
>> wrote:
>>>
>>> The wiki page for the ExtractingRequestHandler says that I can add the
>>> following configuration:
>>> <str name="tika.config">/my/path/to/tika.config</str>
>>>
>>> I have tried to google for an example of such a Tika config file, but
>>> haven't found anything.
>>>
>>> Erlend
>>>
>>> --
>>> Erlend Garåsen
>>> Center for Information Technology Services
>>> University of Oslo
>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
>>> 31050
>>>
>
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>


-- 
Lance Norskog
goks...@gmail.com


Re: Delta Import occasionally missing records.

2011-01-26 Thread Lance Norskog
The SolrEntityProcessor would be a top-level entity. You would do a
query like this: &sort=timestamp+desc&rows=1&fl=timestamp. This gives
you one data item: the timestamp of the last item added to the index.

With this, the JDBC sub-entity would create a query that chooses all
rows with a timestamp >= this latest timestamp. It will not be easy to
put this together, but it is possible :)
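
For reference, that one-row "latest timestamp" query looks like this when
issued from SolrJ; the SolrEntityProcessor entity would make the equivalent
request. It assumes the index has a stored date field literally named
'timestamp' (and at least one document):

import java.util.Date;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class LatestTimestamp {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.addSortField("timestamp", SolrQuery.ORDER.desc);   // newest document first
        q.setRows(1);
        q.setFields("timestamp");
        QueryResponse rsp = solr.query(q);
        Date latest = (Date) rsp.getResults().get(0).getFieldValue("timestamp");
        // Hand 'latest' to the JDBC sub-entity, e.g. WHERE sys_time_stamp >= ?
        System.out.println("last indexed: " + latest);
    }
}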

Good luck!

Lance

On Mon, Jan 24, 2011 at 2:04 AM, btucker  wrote:
>
> Thank you for your response.
>
> In what way is 'timestamp' not perfect?
>
> I've looked into the SolrEntityProcessor and added a timestamp field to our
> index.
> However i'm struggling to work out a query to get the max value od the
> timestamp field
> and does the SolrEntityProcessor entity appear before the root entity or
> does it wrap around the root entity.
>
> On 22 January 2011 07:24, Lance Norskog-2 [via Lucene] wrote:
>
>> The timestamp thing is not perfect. You can instead do a search
>> against Solr and find the latest timestamp in the index. SOLR-1499
>> allows you to search against Solr in the DataImportHandler.
>>
>> On Fri, Jan 21, 2011 at 2:27 AM, btucker wrote:
>>
>> >
>> > Hello
>> >
>> > We've just started using solr to provide search functionality for our
>> > application with the DataImportHandler performing a delta-import every 1
>> > fired by crontab, which works great, however it does occasionally miss
>> > records that are added to the database while the delta-import is running.
>>
>> >
>> > Our data-config.xml has the following queries in its root entity:
>> >
>> > query="SELECT id, date_published, date_created, publish_flag FROM Item
>> WHERE
>> > id > 0
>> >
>> > AND record_type_id=0
>> >
>> > ORDER BY id DESC"
>> > preImportDeleteQuery="SELECT item_id AS Id FROM
>> > gnpd_production.item_deletions"
>> > deletedPkQuery="SELECT item_id AS id FROM gnpd_production.item_deletions
>> > WHERE deletion_date >=
>> >
>> > SUBDATE('${dataimporter.last_index_time}', INTERVAL 5 MINUTE)"
>> > deltaImportQuery="SELECT id, date_published, date_created, publish_flag
>> FROM
>> > Item WHERE id > 0
>> >
>> > AND record_type_id=0
>> >
>> > AND id=${dataimporter.delta.id}
>> >
>> > ORDER BY id DESC"
>> > deltaQuery="SELECT id, date_published, date_created, publish_flag FROM
>> Item
>> > WHERE id > 0
>> >
>> > AND record_type_id=0
>> >
>> > AND sys_time_stamp >=
>> >
>> > SUBDATE('${dataimporter.last_index_time}', INTERVAL 1 MINUTE) ORDER BY id
>>
>> > DESC">
>> >
>> > I think the problem i'm having comes from the way solr stores the
>> > last_index_time in conf/dataimport.properties as stated on the wiki as
>> >
>> > ""When delta-import command is executed, it reads the start time stored
>> in
>> > conf/dataimport.properties. It uses that timestamp to run delta queries
>> and
>> > after completion, updates the timestamp in conf/dataimport.properties.""
>> >
>> > Which to me seems to indicate that any records with a time-stamp between
>> > when the dataimport starts and ends will be missed as the last_index_time
>> is
>> > set to when it completes the import.
>> >
>> > This doesn't seem quite right to me. I would have expected the
>> > last_index_time to refer to when the dataimport was last STARTED so that
>> > there was no gaps in the timestamp covered.
>> >
>> > I changed the deltaQuery of our config to include the SUBDATE by INTERVAL
>> 1
>> > MINUTE statement to alleviate this problem, but it does only cover times
>> > when the delta-import takes less than a minute.
>> >
>> > Any ideas as to how this can be overcome? ,other than increasing the
>> > INTERVAL to something larger.
>> >
>> > Regards
>> >
>> > Barry Tucker
>> > --
>> > View this message in context:
>> http://lucene.472066.n3.nabble.com/Delta-Import-occasionally-missing-records-tp2300877p2300877.html
>

Re: SolrCloud Questions for MultiCore Setup

2011-01-22 Thread Lance Norskog
A "collection" is your data, like newspaper articles or movie titles.
It is a user-level concept, not really a Solr design concept.

A "core" is a Solr/Lucene index. It is addressable as
solr/collection-name on one machine.

You can use a core to store a collection, or you can break it up among
multiple cores (usually for performance reasons). When you use a core
like this, it is called a "shard". Together, the shards make up the
collection.

Solr has a feature called Distributed Search that presents the
separate shards as if they were one Solr collection. You should set up
Distributed Search first. It does not use SolrCloud, but it shows you how
these ideas work. After that, SolrCloud will make more sense.
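
For example, a distributed request is just an ordinary query plus a 'shards'
parameter listing the cores to fan out to. A sketch with placeholder host
names:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedQuery {
    public static void main(String[] args) throws Exception {
        // Send the request to any one of the shards...
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://host1:8983/solr");
        SolrQuery q = new SolrQuery("title:lucene");
        // ...and tell it which cores together make up the collection.
        q.set("shards", "host1:8983/solr,host2:8983/solr");
        QueryResponse rsp = solr.query(q);
        System.out.println("total hits across shards: "
            + rsp.getResults().getNumFound());
    }
}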

Lance

On Sat, Jan 22, 2011 at 9:35 AM, Em  wrote:
>
> Hello list,
>
> i want to experiment with the new SolrCloud feature. So far, I got
> absolutely no experience in distributed search with Solr.
> However, there are some things that remain unclear to me:
>
> 1 ) What is the usecase of a collection?
> As far as I understood: A collection is the same as a core but in a
> distributed sense. It contains a set of cores on one or multiple machines.
> It makes sense that all the cores in a collection got the same schema and
> solrconfig - right?
> Can someone tell me if I understood the concept of a collection correctly?
>
> 2 ) The wiki says this will cause an update
> -Durl=http://localhost:8983/solr/collection1/update
> However, as far as I know this cause an update to a CORE named "collection1"
> at localhost:8983, not to the full collection. Am I correct here?
> So *I* have to care about consistency between the different replicas inside
> my cloud?
>
> 3 ) If I got replicas of the same shard inside a collection, how does
> SolrCloud determine that two documents in a result set are equal? Is it
> neccessary to define a unique key? Is it random which of the two documents
> is picked into the final resultset?
>
> ---
> I think these are my most basic questions.
> However, there is one more tricky thing:
>
> If I understood the collection-idea correctly: What happens if I create two
> cores and each core belongs to a different collection and THEN I do a SWAP.
> Say: core1->collection1, core2->collection2
> SWAP core1,core2
> Does core2 now maps to collection1?
>
> Thank you!
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrCloud-Questions-for-MultiCore-Setup-tp2309443p2309443.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com

