Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Markus Jelsma

> [X] ASF Mirrors (linked in our release announcements or via the Lucene
> website)
> 
> [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
> 
> [X] I/we build them from source via an SVN/Git checkout.
> 
> [] Other (someone in your company mirrors them internally or via a
> downstream project)


Re: How to find Master & Slave are in sync

2011-01-19 Thread Markus Jelsma
Notice the index version number? If it's equal on master and slave, they are in sync.

On Wednesday 19 January 2011 13:37:32 Shanmugavel SRD wrote:
> How to find Master & Slave are in sync?
> Is there a way apart from checking the index version of master and slave
> using below two HTTP APIs?
> 
> http://master_host:port/solr/replication?command=indexversion
> http://slave_host:port/solr/replication?command=details
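For example, hitting both hosts and comparing the numbers (a sketch; the exact
response layout may differ per version, but it contains the index version and
generation):

  http://master_host:port/solr/replication?command=indexversion

  <long name="indexversion">1295466104884</long>
  <long name="generation">4</long>

If both hosts report the same indexversion (and generation), the slave has caught up.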

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Replication: abort-fetch and restarting

2011-01-19 Thread Markus Jelsma
Issue created:
https://issues.apache.org/jira/browse/SOLR-2323

On Tuesday 04 January 2011 20:08:40 Markus Jelsma wrote:
> Hi,
> 
> It seems abort-fetch nicely removes the index directory to which I'm
> replicating, which is fine. Restarting, however, does not trigger the
> same behaviour as the abort-fetch command does. At least, that's what my
> tests seem to tell me.
> 
> Shouldn't a restart of Solr nicely clean up the mess before exiting? And
> shouldn't starting Solr also look for mess left behind by a sudden
> shutdown of the server, in which case the mess obviously cannot have been
> cleaned up?
> 
> If I now stop, clean and start my slave it will attempt to download an
> existing index. If I abort-fetch it will clean up the mess and (due to the
> low polling interval) make another attempt. If I, however, restart (instead
> of abort-fetch) the old temporary directory will stay and needs to be deleted
> manually.
> 
> Cheers,

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Switching existing solr indexes from Segment to Compound Style index files

2011-01-19 Thread Markus Jelsma
Indeed, wouldn't reducing the number of segments be a better idea? Speeds up 
searching too! Do you happen to have a very high mergeFactor value for each 
core?
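For example, in solrconfig.xml (a 1.4-era sketch; the values are only illustrations):

  <mainIndex>
    <useCompoundFile>true</useCompoundFile>
    <mergeFactor>10</mergeFactor>
  </mainIndex>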

On Wednesday 19 January 2011 17:53:12 Erick Erickson wrote:
> You're perhaps exactly right in your approach, but with a bit more info
> we may be able to suggest other alternatives.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Documentaion: For newbies and recent newbies

2011-01-19 Thread Markus Jelsma
That someone should just visit the wiki:
http://wiki.apache.org/solr/SolrResources

> If someone is looking for good documentation and getting started guides, I
> am putting this in the newsgroups to be searched upon. I recommend:
> 
> A/ The Wikis: (FREE)
>http://wiki.apache.org/solr/FrontPage
> 
> B/ The book and eBook: (COSTS $45.89)
>   https://www.packtpub.com/solr-1-4-enterprise-search-server/book
> 
> C/ The (seemingly) total reference guide:(FREE, with registration)
> 
> http://www.lucidimagination.com/software_downloads/certified/cdrg/lucidwork
> s-solr-refguide-1.4.pdf
> 
> 
> 
> D/ The webinar on optimizing the search engine to Do a GOOD search,
>  based on YOUR needs, not general ones: (FREE, with registration)
> 
> http://www.lucidimagination.com/Solutions/Webinars/Analyze-This-Tips-and-tr
> icks-getting-LuceneSolr-Analyzer-index-and-search-your-content
> 
> 
> Personally, I am working on being more than barely informed on items A & B
> :-)
> 
> Dennis Gearon
> 
> 
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
> better idea to learn from others’ mistakes, so you do not have to make
> them yourself. from
> 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> 
> 
> EARTH has a Right To Life,
> otherwise we all die.


Re: How to index my users info

2011-01-19 Thread Markus Jelsma
http://lucene.apache.org/solr/#getstarted

> I would like to index the information of my employees to be able to get
> through some fields such as: e-mail, registration, ID, cell phone, name.
> 
> I am very new to SOLR and would like to know how to index these fields this
> way and how to search filtering by some of these fields.
> 
> Thanks in advance
> 
> Jota.


Re: Mem allocation - SOLR vs OS

2011-01-19 Thread Markus Jelsma
You only need to give Solr as much RAM as it needs to do its thing. Faceting can take 
quite some memory on a large index, but sorting can be a really big RAM consumer.

As Erick pointed out, inspect and tune the cache settings and adjust RAM 
allocated to the JVM if required. Using tools like JConsole you can monitor 
various things via JMX including RAM consumption.
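For example, with remote JMX enabled so JConsole can attach from another machine 
(a sketch; lock this down outside a trusted network):

  -Xms2g -Xmx4g
  -Dcom.sun.management.jmxremote
  -Dcom.sun.management.jmxremote.port=9999
  -Dcom.sun.management.jmxremote.authenticate=false
  -Dcom.sun.management.jmxremote.ssl=false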

> Hi,
> 
> I know this is a subjective topic but from what I have read it seems more
> RAM should be spared for OS caching and much less for SOLR/Tomcat even on a
> dedicated SOLR server.
> 
> Can someone give me an idea about the theoretically ideal proportion b/w
> them for a dedicated Windows server with 32GB RAM? Also the index is
> updated every hour.


Re: facet or filter based on user's history

2011-01-19 Thread Markus Jelsma
Hi,

I've never seen Solr's behaviour with a huge number of values in a multi-valued 
field, but I think it should work alright. You can then store a list of user IDs 
along with each book document and use filter queries to include or exclude the 
books from the result set.
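Something along these lines (a sketch; the field name and user ID are made up):

  <field name="read_by" type="string" indexed="true" stored="false" multiValued="true"/>

  fq=read_by:12345     (only titles the user has already read)
  fq=-read_by:12345    (exclude titles the user has already read)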

Cheers,

> Hi,
> 
> I'm looking for ideas on how to make an efficient facet query on a
> user's history with respect to the catalog of documents (something
> like "Read document already: yes / no"). The catalog is around 100k
> titles and there are several thousand users. Of course, each user has
> a different history, many having read fewer than 500 titles, but some
> heavy users having read perhaps 50k titles.
> 
> Performance is not terribly important right now so all I did was bump
> up the boolean query limit and put together a big string of document
> id's that the user has read. The first query is slow but once it's in
> the query cache it's fine. I would like to find a better way of doing
> it though.
> 
> What type of solr plugin would be best suited to helping in this
> situation? I could make a function plugin that provides something like
> hasHadBefore() - true/false, but would that be efficient for faceting
> and filtering? Another idea is a QParserPlugin that looks for a field
> like hasHadBefore:userid and somehow substitutes in the list of docs.
> But I'm not sure how a new parser plugin would interact with the
> existing parser. Can solr use a parser plugin to only handle one
> field, and leave all the other fields to the default parser?
> 
> Thanks,
> Jon


Re: No system property or default value specified for...

2011-01-19 Thread Markus Jelsma
Hi,

I'm unsure if I completely understand, but you first had the error for 
local.code and then set that property in solr.xml? Then of course it will give 
an error for the next undefined property that has no default set.

If you use a property without a default it _must_ be defined in solr.xml or 
solrcore.properties. And since you don't use defaults in your dataconfig, they 
all must be explicitly defined.

This is proper behaviour.
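For example, per core in solr.xml, or with a default directly where the property 
is used via the ${name:default} syntax (a sketch; the value is made up):

  <core name="items" instanceDir="items">
    <property name="local.code" value="shop1"/>
  </core>

  ${local.code:shop1}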

Cheers,

> I'm trying to dynamically add a core to a multi core system using the
> following command:
> 
> http://localhost:8983/solr/admin/cores?action=CREATE&name=items&instanceDir
> =items&config=data-config.xml&schema=schema.xml&dataDir=data&persist=true
> 
> the data-config.xml looks like this:
> 
> 
>   url="jdbc:mysql://localhost/"
>...
>name="server"/>
>   
>query="select code from master.locals"
>rootEntity="false">
>  query="select '${local.code}' as localcode,
> items.*
> FROM ${local.code}_meta.item
> WHERE
>   item.lastmodified > '${dataimporter.last_index_time}'
> OR
>   '${dataimporter.request.clean}' != 'false'
> order by item.objid"
> />
> 
> 
> 
> 
> this same configuration works for a core that is already imported into the
> system, but when trying to add the core with the above command I get the
> following error:
> 
> No system property or default value specified for local.code
> 
> so I added a  tag in the solr.xml figuring that it needed some
> type of default value for this to work, then I restarted solr, but now when
> I try the import I get:
> 
> No system property or default value specified for
> dataimporter.last_index_time
> 
> Do I have to define a default value for every variable I will conceivably
> use for future cores? is there a way to bypass this error?
> 
> Thanks in advance


Re: Mem allocation - SOLR vs OS

2011-01-19 Thread Markus Jelsma
Sorting on field X will build an array the size of maxDoc. The data type 
equals the one used by the field you're sorting on. Also, if you have a very 
high number of deletes per update it might be a good idea to optimize as well, 
since that reduces maxDoc to the number of documents that can actually be found.
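An optimize can be triggered with a plain update request, for example (a sketch):

  curl 'http://localhost:8983/solr/update' --data-binary '<optimize/>' -H 'Content-type:text/xml'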

> We do have sorting but not faceting. OK so I guess there is no 'hard and
> fast rule' as such so I will play with it and see.
> 
> Thanks for the help
> 
> On Wed, Jan 19, 2011 at 11:48 PM, Markus Jelsma
> 
> wrote:
> > You only need so much for Solr so it can do its thing. Faceting can take
> > quite
> > some memory on a large index but sorting can be a really big RAM
> > consumer.
> > 
> > As Erick pointed out, inspect and tune the cache settings and adjust RAM
> > allocated to the JVM if required. Using tools like JConsole you can
> > monitor various things via JMX including RAM consumption.
> > 
> > > Hi,
> > > 
> > > I know this is a subjective topic but from what I have read it seems
> > > more RAM should be spared for OS caching and much less for SOLR/Tomcat
> > > even on
> > 
> > a
> > 
> > > dedicated SOLR server.
> > > 
> > > Can someone give me an idea about the theoretically ideal proportion
> > > b/w them for a dedicated Windows server with 32GB RAM? Also the index
> > > is updated every hour.


Re: dataDir in solr.xml

2011-01-19 Thread Markus Jelsma
You have already set the property, but I haven't seen you use that same 
property for the dataDir setting in solrconfig.xml.
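For example (a sketch; the property name is an example, the paths are taken from your mail):

  <!-- solr.xml -->
  <core name="staff" instanceDir="cores/staff">
    <property name="data.dir" value="/usr/local/solr/cores/staff/data"/>
  </core>

  <!-- cores/staff/conf/solrconfig.xml -->
  <dataDir>${data.dir}</dataDir>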

> I've checked the archive, and plenty of people have suggested an
> arrangement where you can have two cores which share a configuration but
> maintain separate data paths.  But I can't seem to get solr to stop
> thinking solrconfig.xml is the first and last word for any value
> regarding data.  I am running 1.4
> 
> solr.xml:
> 
>  persistent="true">
> 
> 
> 
> 
> 
> 
> 
> .
> .
> .
> 
> In all other respects, my multicore setup is working as it should.  So
> the setup is finding solr.xml at the value set for solr home as it
> should.  I can get into admin, etc.  However, if I comment out the
>  stanza in cores/staff/conf/solrconfig.xml, and restart, I just
> get this:
> 
> WARNING: [staff] Solr index directory
> '/usr/local/solr/cores/staff/data/index' doesn't exist. Creating new
> index...
> 
> Ignoring the value set in solr.xml.
> 
> Is there some other override I'm ignoring?
> 
> thanks,
> 
> Fred


Re: performance during index switch

2011-01-19 Thread Markus Jelsma
> Hi,
>  
> Are there performance issues during the index switch?

What do you mean by index switch?

>  
> As the size of index gets bigger, response time slows down?  Are there any
> studies on this? 

I haven't seen any studies as of yet, but response time will slow down for some 
components. Sorting and faceting tend to consume more RAM and CPU cycles as the 
number of documents and unique values increases. Queries also become increasingly 
slow for very high start values. And, of course, cache warming queries usually 
take more time as well, increasing the latency between commit and availability.
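Warming queries are the ones configured on the newSearcher event in solrconfig.xml, 
for example (a sketch; the query itself is just an illustration):

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">some popular query</str>
      </lst>
    </arr>
  </listener>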

> Thanks,
>  
> Tri


Re: No system property or default value specified for...

2011-01-19 Thread Markus Jelsma
No, you only need defaults if you use properties that are not defined in 
solr.xml or solrcore.properties.

What would the value of local.code be if you neither define it nor specify a 
default? Quite unpredictable I guess =)

> i even have to define default values for the dataimport.delta values? that
> doesn't seem right
> 
> On Wed, Jan 19, 2011 at 11:57 AM, Markus Jelsma
> 
> wrote:
> > Hi,
> > 
> > I'm unsure if i completely understand but you first had the error for
> > local.code and then set the property in solr.xml? Then of course it will
> > give
> > an error for the next undefined property that has no default set.
> > 
> > If you use a property without default it _must_ be defined in solr.xml or
> > solrcore.properties. And since you don't use defaults in your dataconfig
> > they
> > all must be explicitely defined.
> > 
> > This is proper behaviour.
> > 
> > Cheers,
> > 
> > > I'm trying to dynamically add a core to a multi core system using the
> > 
> > > following command:
> > http://localhost:8983/solr/admin/cores?action=CREATE&name=items&instanceD
> > ir
> > 
> > > =items&config=data-config.xml&schema=schema.xml&dataDir=data&persist=tr
> > > ue
> > > 
> > > the data-config.xml looks like this:
> > > 
> > > 
> > > 
> > >> >   
> > >url="jdbc:mysql://localhost/"
> > >...
> > >name="server"/>
> > >   
> > >   
> > >   
> > > > >
> > >query="select code from master.locals"
> > >rootEntity="false">
> > > 
> > >  > > 
> > > query="select '${local.code}' as localcode,
> > > items.*
> > > 
> > > FROM ${local.code}_meta.item
> > > WHERE
> > > 
> > >   item.lastmodified > '${dataimporter.last_index_time}'
> > > 
> > > OR
> > > 
> > >   '${dataimporter.request.clean}' != 'false'
> > > 
> > > order by item.objid"
> > > />
> > > 
> > > 
> > > 
> > > 
> > > this same configuration works for a core that is already imported into
> > 
> > the
> > 
> > > system, but when trying to add the core with the above command I get
> > > the following error:
> > > 
> > > No system property or default value specified for local.code
> > > 
> > > so I added a  tag in the solr.xml figuring that it needed
> > > some type of default value for this to work, then I restarted solr,
> > > but now
> > 
> > when
> > 
> > > I try the import I get:
> > > 
> > > No system property or default value specified for
> > > dataimporter.last_index_time
> > > 
> > > Do I have to define a default value for every variable I will
> > > conceivably use for future cores? is there a way to bypass this error?
> > > 
> > > Thanks in advance


Re: No system property or default value specified for...

2011-01-19 Thread Markus Jelsma
Ok, have you defined dataimporter.last_index_time in solr.xml or 
solrcore.properties? If not, then either define a default value for it or set 
it in solrcore.properties or solr.xml.

Maybe a catch-up on the wiki will clear things up:
http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution
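For example, in the core's conf/solrcore.properties (a sketch; the values are made up):

  local.code=shop1
  dataimporter.last_index_time=1970-01-01 00:00:00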

> there error I am getting is that I have no default value
> for ${dataimporter.last_index_time}
> 
> should I just define -00-00 00:00:00 as the default for that field?
> 
> On Wed, Jan 19, 2011 at 12:45 PM, Markus Jelsma
> 
> wrote:
> > No, you only need defaults if you use properties that are not defined in
> > solr.xml or solrcore.properties.
> > 
> > What would the value for local.core be if you don't define it anyway and
> > you
> > don't specify a default? Quite unpredictable i gues =)
> > 
> > > i even have to define default values for the dataimport.delta values?
> > 
> > that
> > 
> > > doesn't seem right
> > > 
> > > On Wed, Jan 19, 2011 at 11:57 AM, Markus Jelsma
> > > 
> > > wrote:
> > > > Hi,
> > > > 
> > > > I'm unsure if i completely understand but you first had the error for
> > > > local.code and then set the property in solr.xml? Then of course it
> > 
> > will
> > 
> > > > give
> > > > an error for the next undefined property that has no default set.
> > > > 
> > > > If you use a property without default it _must_ be defined in
> > > > solr.xml
> > 
> > or
> > 
> > > > solrcore.properties. And since you don't use defaults in your
> > 
> > dataconfig
> > 
> > > > they
> > > > all must be explicitely defined.
> > > > 
> > > > This is proper behaviour.
> > > > 
> > > > Cheers,
> > > > 
> > > > > I'm trying to dynamically add a core to a multi core system using
> > > > > the
> > 
> > > > > following command:
> > http://localhost:8983/solr/admin/cores?action=CREATE&name=items&instanceD
> > 
> > > > ir
> > 
> > =items&config=data-config.xml&schema=schema.xml&dataDir=data&persist=tr
> > 
> > > > > ue
> > > > > 
> > > > > the data-config.xml looks like this:
> > > > > 
> > > > > 
> > > > > 
> > > > >> > > >   
> > > > >url="jdbc:mysql://localhost/"
> > > > >...
> > > > >name="server"/>
> > > > >   
> > > > >   
> > > > >   
> > > > > > > > >
> > > > >query="select code from master.locals"
> > > > >rootEntity="false">
> > > > > 
> > > > >  > > > > 
> > > > > query="select '${local.code}' as localcode,
> > > > > items.*
> > > > > 
> > > > > FROM ${local.code}_meta.item
> > > > > WHERE
> > > > > 
> > > > >   item.lastmodified > '${dataimporter.last_index_time}'
> > > > > 
> > > > > OR
> > > > > 
> > > > >   '${dataimporter.request.clean}' != 'false'
> > > > > 
> > > > > order by item.objid"
> > > > > />
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > this same configuration works for a core that is already imported
> > 
> > into
> > 
> > > > the
> > > > 
> > > > > system, but when trying to add the core with the above command I
> > > > > get the following error:
> > > > > 
> > > > > No system property or default value specified for local.code
> > > > > 
> > > > > so I added a  tag in the solr.xml figuring that it
> > > > > needed some type of default value for this to work, then I
> > > > > restarted solr, but now
> > > > 
> > > > when
> > > > 
> > > > > I try the import I get:
> > > > > 
> > > > > No system property or default value specified for
> > > > > dataimporter.last_index_time
> > > > > 
> > > > > Do I have to define a default value for every variable I will
> > > > > conceivably use for future cores? is there a way to bypass this
> > 
> > error?
> > 
> > > > > Thanks in advance


Re: using dismax

2011-01-20 Thread Markus Jelsma
Did I write wt? Oh dear, the q and w are too close on the keyboard =)
> Markus,
> 
> Its not wt its qt, wt for response type,
> Also qt is not for Query Parser its for Request Handler ,In solrconfig.xml
> there are many Request Handlers can be Defined using "dismax" Query Parser
> Or Using "lucene" Query Parser.
> 
> If you want to change Query parser then its "defType"  parameter for
> defining Query Parser .
> And you are right if defType=dismax ,then there must be "qf" parameter to
> be given.
> 
> -
> Thanx:
> Grijesh
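For example, a dismax request as described above (a sketch; field names and boosts are made up):

  http://localhost:8983/solr/select?q=apple+milk&defType=dismax&qf=title^2+description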


Re: Multicore Search "Map size must not be negative"

2011-01-20 Thread Markus Jelsma
That looks like this issue:
https://issues.apache.org/jira/browse/SOLR-2278

On Thursday 20 January 2011 13:02:41 Jörg Agatz wrote:
> Hallo..
> 
> I have create multicore search and will search in more then one Core!
> 
> Now i have done:
> 
> http://192.168.105.59:8080/solr/mail/select?wt=phps&q=*:*&shards=192.168.10
> 5.59:8080/solr/mail,192.168.105.59:8080/solr/mail11
> 
> But Error...
> 
> HTTP Status 500 - Map size must not be negative
> java.lang.IllegalArgumentException: Map size must not be negative at
> org.apache.solr.request.PHPSerializedWriter.writeMapOpener(PHPSerializedRes
> ponseWriter.java:224) at
> org.apache.solr.request.JSONWriter.writeSolrDocument(JSONResponseWriter.jav
> a:398) at
> org.apache.solr.request.JSONWriter.writeSolrDocumentList(JSONResponseWriter
> .java:553) at
> org.apache.solr.request.TextResponseWriter.writeVal(TextResponseWriter.java
> :148) at
> org.apache.solr.request.JSONWriter.writeNamedListAsMapMangled(JSONResponseW
> riter.java:154) at
> org.apache.solr.request.PHPSerializedWriter.writeNamedList(PHPSerializedRes
> ponseWriter.java:100) at
> org.apache.solr.request.PHPSerializedWriter.writeResponse(PHPSerializedResp
> onseWriter.java:95) at
> org.apache.solr.request.PHPSerializedResponseWriter.write(PHPSerializedResp
> onseWriter.java:69) at
> org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter
> .java:325) at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java
> :254) at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Applicatio
> nFilterChain.java:235) at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterC
> hain.java:206) at
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.j
> ava:233) at
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.j
> ava:191) at
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:12
> 7) at
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:10
> 2) at
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.jav
> a:109) at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
> at
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
> at
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Htt
> p11Protocol.java:588) at
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at
> java.lang.Thread.run(Thread.java:636)
> 
> When i search
> http://192.168.105.59:8080/solr/mail/select?wt=php&q=*:*&shards=192.168.105
> .59:8080/solr/mail,192.168.105.59:8080/solr/mail11
> 
> it works but i need wt=phps it is important!
> 
> but i dont understand the Problem!!!
> 
> 
> Jörg

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Master and Slaves

2011-01-21 Thread Markus Jelsma
You can use a property and define it per slave in solrcore.properties.
http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution
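For example, on each slave (a sketch; the property name and path are examples):

  # conf/solrcore.properties
  data.dir=/path/to/slave/data

and then reference ${data.dir} where needed in the solrconfig.xml that gets replicated.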

On Friday 21 January 2011 14:04:28 Ezequiel Calderara wrote:
> I have setup a Master with two slaves. Let's call the Master "Jabba" and
> the slaves "Leia" and "C3PO"  (very nerdy! lol).
> Well, i have setup in Jabba the replication, with the following confFiles
> <str name="confFiles">solrconfig_slave.xml:solrconfig.xml,schema.xml,stopwords.txt,elevate.xml</str>
> 
> But in the slaves i want to override the "dataDir" value of the
> solrconfig.xml, but it get overrided by the one replicated.
> Is there a way to have the slaves having their solrconfig replicated, but
> with some "special" configurations?
> 
> I want to avoid having to enter to each slave to configure it, i prefer to
> do it in a centralized way.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Master and Slaves

2011-01-21 Thread Markus Jelsma
You have defined the property and its value, but you're not using it anywhere. 
Reference the property in solrconfig.xml.
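For example (a sketch):

  <dataDir>${data.dir}</dataDir>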

On Friday 21 January 2011 15:38:58 Ezequiel Calderara wrote:
> Somehow it's not working :(
> 
> i have set it up like:
> >  #solrcore.properties
> >  
> >  data.dir=D:\Solr\PAU\data
> > 
> > But it keeps going to the dataDir configured in the solrconfig.xml.
> 
> Also, when i go to the replication admin i see this:
>   *Master* http://10.11.33.180:8787/solr/replication  *Poll Interval*
> 00:00:60 *Local Index* Index Version: 1295466104884, Generation: 4 
> Location: C:\Program Files\Apache Software Foundation\Tomcat
> 7.0\data\index  Size: 6,99 KB  Times Replicated Since Startup: 50 
> Previous Replication Done At: Fri Jan 21 11:36:19 ART 2011  *Config Files
> Replicated At: null * ** *Config Files Replicated: null * ** *Times Config
> Files Replicated Since Startup: null*  Next Replication Cycle At: Fri Jan
> 21 11:37:19 ART 2011
> 
> And i know that the files were replicated ok. i see the solrconfig backup
> with name "solrconfig.xml.20110120030345", and the datadir changed also...
> 
> So i don't understand why isn't figuring as replicated.
> Maybe i'm doing something wrong. Don't know
> 
> On Fri, Jan 21, 2011 at 10:16 AM, Ezequiel Calderara 
wrote:
> > Thanks!, thats what i needed!
> > 
> > There is always some much to learn about Solr/Lucene!
> > 
> > 
> > On Fri, Jan 21, 2011 at 10:08 AM, Markus Jelsma <
> > 
> > markus.jel...@openindex.io> wrote:
> >> solrcore.properties
> > 
> > --
> > __
> > Ezequiel.
> > 
> > Http://www.ironicnet.com <http://www.ironicnet.com/>

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Indexing FTP Documents through SOLR??

2011-01-21 Thread Markus Jelsma
Hi,

Please take a look at Apache Nutch. It can crawl through a file system over FTP. 
After crawling, it can use Tika to extract the content from your PDF files and 
other formats. Finally, it can send the data to your Solr server for indexing.

http://nutch.apache.org/

> Hi All,
>   Is there is any way in SOLR or any plug-in through which the folders and
> documents in FTP location can be indexed.
> 
> / Pankaj Bhatt.


Re: Is solr 4.0 ready for prime time? (or other ways to use geo distance in search)

2011-01-21 Thread Markus Jelsma
Hi,

You can use Solr 1.4.1 and a third-party plugin [1]. It does a pretty good job 
at spatial search. You could also try the Solr 3.1 branch, which has some 
spatial features on board. It does not, however, return computed distances, but 
it can filter and sort using the great-circle algorithm or a bounding box, and 
it also handles the problem of the poles.

I would migrate to the 3.1 branch first and see how 4.0 behaves once it has 
been released and has received a few bugfix updates.

[1]: http://blog.jteam.nl/2010/12/22/ssp-2-0/
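With the 3.1 branch a filter and sort on distance would look roughly like this 
(a sketch; the field name and point are made up, and the field must be of a 
spatial type such as LatLonType):

  &sfield=location&pt=52.37,4.89
  &fq={!geofilt d=5}
  &sort=geodist() asc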

Cheers,

> Hi all,
>  I've been using solr 1.4 and it's working great for what I'm
> doing.  However, I'm now finding a need to filter results by location.
> Searching around, I see that the distance functions are implemented in
> solr 4.0, but there's no full release yet.
> 
> So my question is, is solr 4.0-dev ready to be used in prime time?  My
> other option would appear to be using the cartesian distance, which
> isn't totally accurate, but it probably good enough for my purposes.
> Something like including this in my filter query:
> sum(pow(sub(input_latitiude,stored_latitude),2),pow(sub(input_longitude,sto
> red_longitude),2)) 
> What's anyone else out there using?
> 
> Thanks in advance,
> Alex


Re: fieldType textgen. tokens > 2

2011-01-24 Thread Markus Jelsma
It is not the fieldType but your query that is giving you trouble. You only 
specify the field name for the value name1, so Solr will use the 
defaultSearchField for the values name2 and name3. You also omitted an operator, 
so Solr will use the defaultOperator instead. See your schema.xml for the 
defaults and use debugQuery=true to, well, debug queries.
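To match all three terms against the sender field you could, for example, write:

  q=sender:(name1 name2 name3)
  q=sender:(+name1 +name2 +name3)

The first uses the default operator between the terms; the second requires all three.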

On Monday 24 January 2011 11:48:07 stockii wrote:
> Hello.
> 
> my field sender with fieldType=textgen cannot find any documents wich are
> more than 2 tokens long.
> 
> ->q=sender:name1 name2 name3 => 0 Documents found
> 
> WHY ???
> 
> that is my field (original from default schema.xml)
> 
> <fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> -
> --- System
> --------
> 
> One Server, 12 GB RAM, 2 Solr Instances, 7 Cores,
> 1 Core with 31 Million Documents other Cores < 100.000
> 
> - Solr1 for Search-Requests - commit every Minute  - 4GB Xmx
> - Solr2 for Update-Request  - delta every 2 Minutes - 4GB Xmx

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: How data is replicating from Master to Slave?

2011-01-24 Thread Markus Jelsma
It's all explained on the wiki:
http://wiki.apache.org/solr/SolrReplication#How_does_the_slave_replicate.3F


On Monday 24 January 2011 11:25:45 dhanesh wrote:
> Hi,
> I'm currently facing an issue with SOLR (exactly with the slaves
> replication) and after having spent quite a few time reading online I
> find myself having to ask for some enlightenment.
> To be more factual, here is the context that led me to this question.
> If the website administrator edited  an existing category name, then I
> need to re-index all the documents with the newly edited category.
> Suppose the category is linked with more than 10 million records.I need
> to re-index all the 10 million documents in SOLR
> 
> In the case of MySQL it should be like master server writes updates to
> its binary log files and maintains an index of those files.These binary
> log files serve as a record of updates to be sent to slave servers.
> My doubt is in SOLR how the data is replicating from Master to  Slave?
> I'd like to know the internal process of data replication.
> Is that huge amount of data(10 million records) is copying from Master
> to slave?
> This is my first work with Solr. So I'm not sure how to tackle this issue.
> 
> Regds
> dhanesh s.r

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Migrating from 1.4.0 to 1.4.1 solr

2011-01-24 Thread Markus Jelsma
We can't guess what's wrong with the cores but you need to reindex anyway:
http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4/CHANGES.txt

On Monday 24 January 2011 12:06:10 Prasad Joshi wrote:
> Hi,
> I want to migrate from 1.4.0 to 1.4.1 . Tried keeping the same conf for the
> cores as in 1.4.0, added the relevant core names in solr.xml and restarted
> solr but the old cores dont show up on the browser "localhost:8983". There
> were a few cores in examples/multicore/ in the solr1.4.1 source from where
> I downloaded, these cores when included in solr.xml do show up on the
> browser.
> 
> Pl do let me know the reason. Is there anything I need to do for the core
> migration? I dont have any data in these cores. Also if there was data is
> there a nice way of migrating from 1.4.0 to 1.4.1 (Which does not involve
> reindexing) ?
> Regards,
> Prasad

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-01-24 Thread Markus Jelsma
Are you using 3rd-party plugins?

> We have two slaves replicating off one master every 2 minutes.
> 
> Both using the CMS + ParNew Garbage collector. Specifically
> 
> -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
> 
> but periodically they both get into a GC storm and just keel over.
> 
> Looking through the GC logs the amount of memory reclaimed in each GC
> run gets less and less until we get a concurrent mode failure and then
> Solr effectively dies.
> 
> Is it possible there's a memory leak? I note that later versions of
> Lucene have fixed a few leaks. Our current versions are relatively old
> 
>   Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17
> 18:06:42
> 
>   Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
> 
> so I'm wondering if upgrading to later version of Lucene might help (of
> course it might not but I'm trying to investigate all options at this
> point). If so what's the best way to go about this? Can I just grab the
> Lucene jars and drop them somewhere (or unpack and then repack the solr
> war file?). Or should I use a nightly solr 1.4?
> 
> Or am I barking up completely the wrong tree? I'm trawling through heap
> logs and gc logs at the moment trying to to see what other tuning I can
> do but any other hints, tips, tricks or cluebats gratefully received.
> Even if it's just "Yeah, we had that problem and we added more slaves
> and periodically restarted them"
> 
> thanks,
> 
> Simon


Re: Solr set up issues with Magento

2011-01-24 Thread Markus Jelsma
Hi,

You haven't defined the field in Solr's schema.xml configuration so it needs to 
be added first. Perhaps following the tutorial might be a good idea.

http://lucene.apache.org/solr/tutorial.html
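As a sketch, Magento's 'in_stock' attribute needs a matching field in schema.xml, 
something like the following (the type is a guess; use whatever matches the values 
Magento sends):

  <field name="in_stock" type="boolean" indexed="true" stored="true"/>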

Cheers.

> Hello Team:
> 
> 
>   I am in the process of setting up Solr 1.4 with Magento ENterprise
> Edition 1.9.
> 
> When I try to index the products I get the following error message.
> 
> Jan 24, 2011 3:30:14 PM org.apache.solr.update.processor.LogUpdateProcessor
> fini
> sh
> INFO: {} 0 0
> Jan 24, 2011 3:30:14 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field
> 'in_stock' at
> org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.jav
> a:289)
> at
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpd
> ateProcessorFactory.java:60)
> at
> org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
> at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Co
> ntentStreamHandlerBase.java:54)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandl
> erBase.java:131)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter
> .java:338)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilte
> r.java:241)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appl
> icationFilterChain.java:244)
> at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationF
> ilterChain.java:210)
> at
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperV
> alve.java:240)
> at
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextV
> alve.java:161)
> at
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.j
> ava:164)
> at
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.j
> ava:100)
> at
> org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:
> 550)
> at
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineVal
> ve.java:118)
> at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.jav
> a:380)
> at
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java
> 
> :243)
> 
> at
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce
> ss(Http11Protocol.java:188)
> at
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce
> ss(Http11Protocol.java:166)
> at
> org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoin
> t.java:288)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExec
> utor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
> .java:908)
> at java.lang.Thread.run(Thread.java:662)
> 
> Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update params={wt=json} status=400 QTime=0
> Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2
> rollback INFO: start rollback
> Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2
> rollback INFO: end_rollback
> Jan 24, 2011 3:30:14 PM org.apache.solr.update.processor.LogUpdateProcessor
> fini
> sh
> INFO: {rollback=} 0 16
> Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute
> 
> I am a new to both Magento and SOlr. I could have done some thing stupid
> during installation. I really look forward for your help.
> 
> Thank you,
> Sandhya


Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-01-25 Thread Markus Jelsma
Hi,

Are you sure you need CMS incremental mode? It's only advised when running on 
a machine with one or two processors. If you have more, you should consider 
dropping the incremental flags.
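That is, something like (a sketch):

  -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

i.e. without -XX:+CMSIncrementalMode and -XX:+CMSIncrementalPacing.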

Cheers,

On Monday 24 January 2011 19:32:38 Simon Wistow wrote:
> We have two slaves replicating off one master every 2 minutes.
> 
> Both using the CMS + ParNew Garbage collector. Specifically
> 
> -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
> 
> but periodically they both get into a GC storm and just keel over.
> 
> Looking through the GC logs the amount of memory reclaimed in each GC
> run gets less and less until we get a concurrent mode failure and then
> Solr effectively dies.
> 
> Is it possible there's a memory leak? I note that later versions of
> Lucene have fixed a few leaks. Our current versions are relatively old
> 
>   Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17
> 18:06:42
> 
>   Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
> 
> so I'm wondering if upgrading to later version of Lucene might help (of
> course it might not but I'm trying to investigate all options at this
> point). If so what's the best way to go about this? Can I just grab the
> Lucene jars and drop them somewhere (or unpack and then repack the solr
> war file?). Or should I use a nightly solr 1.4?
> 
> Or am I barking up completely the wrong tree? I'm trawling through heap
> logs and gc logs at the moment trying to to see what other tuning I can
> do but any other hints, tips, tricks or cluebats gratefully received.
> Even if it's just "Yeah, we had that problem and we added more slaves
> and periodically restarted them"
> 
> thanks,
> 
> Simon

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Recommendation on RAM-/Cache configuration

2011-01-25 Thread Markus Jelsma
On Tuesday 25 January 2011 11:54:55 Martin Grotzke wrote:
> Hi,
> 
> recently we're experiencing OOMEs (GC overhead limit exceeded) in our
> searches. Therefore I want to get some clarification on heap and cache
> configuration.
> 
> This is the situation:
> - Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit
> - JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G
> -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC
> -XX:+UseParallelGC

Consider switching to the HotSpot server VM; pass -server as the first switch.

> - The machine has 32 GB RAM
> - Currently there are 4 processors/cores in the machine, this shall be
> changed to 2 cores in the future.
> - The index size in the filesystem is ~9.5 GB
> - The index contains ~ 5.500.000 documents
> - 1.500.000 of those docs are available for searches/queries, the rest are
> inactive docs that are excluded from searches (via a flag/field), but
> they're still stored in the index as need to be available by id (solr is
> the main document store in this app)

How do you exclude them? That should be done with filter queries. I also seem to 
remember (but I cannot find the reference, so please correct me if I'm wrong) that 
in 1.4.x sorting is done before filtering; it would be an improvement if filtering 
were done before sorting. If you use sorting, it takes up a huge amount of RAM 
when filtering is not done first.
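For example, instead of adding the flag to q, exclude the inactive documents with 
a filter query (a sketch; the field name is made up):

  q=some query&fq=active:true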

> - Caches are configured with a big size (the idea was to prevent filesystem
> access / disk i/o as much as possible):

There is only disk I/O if the kernel can't keep the index (or parts) in its 
page cache.

>   - filterCache (solr.LRUCache): size=20, initialSize=3,
> autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99
>   - documentCache (solr.LRUCache): size=20, initialSize=10,
> autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74
>   - queryResultCache (solr.LRUCache): size=20, initialSize=3,
> autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71

You should decrease the initialSize values. But your hit ratios look very good.
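For example (a sketch; the sizes are only illustrations):

  <filterCache class="solr.LRUCache" size="65536" initialSize="512" autowarmCount="1024"/>
  <documentCache class="solr.LRUCache" size="65536" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="65536" initialSize="512" autowarmCount="512"/>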

> - Searches are performed using a catchall text field using standard request
> handler, all fields are fetched (no fl specified)
> - Normally ~ 5 concurrent requests, peaks up to 30 or 40 (mostly during GC)
> - Recently we also added a feature that adds weighted search for special
> fields, so that the query might become s.th. like this
>   q=(some query) OR name_weighted:(some query)^2.0 OR brand_weighted:(some
> query)^4.0 OR longDescription_weighted:(some query)^0.5
>   (it seemed as if this was the cause of the OOMEs, but IMHO it only
> increased RAM usage so that now GC could not free enough RAM)
> 
> The OOMEs that we get are of type "GC overhead limit exceeded", one of the
> OOMEs was thrown during auto-warming.

Warming takes additional RAM. The current searcher still has its caches full 
and newSearcher is getting filled up. Decreasing sizes might help.

> 
> I checked two different heapdumps, the first one autogenerated
> (by -XX:+HeapDumpOnOutOfMemoryError) the second one generated manually via
> jmap.
> These show the following distribution of used memory - the autogenerated
> dump:
>  - documentCache: 56% (size ~ 195.000)
> - filterCache: 15% (size ~ 60.000)
> - queryResultCache: 8% (size ~ 61.000)
> - fieldCache: 6% (fieldCache referenced  by WebappClassLoader)
> - SolrIndexSearcher: 2%
> 
> The manually generated dump:
> - documentCache: 48% (size ~ 195.000)
> - filterCache: 20% (size ~ 60.000)
> - fieldCache: 11% (fieldCache referenced by WebappClassLoader)
> - queryResultCache: 7% (size ~ 61.000)
> - fieldValueCache: 3%
> 
> We are also running two search engines with 17GB heap, these don't run into
> OOMEs. Though, with these bigger heap sizes the longest requests are even
> longer due to longer stop-the-world gc cycles.
> Therefore my goal is to run with a smaller heap, IMHO even smaller than 8GB
> would be good to reduce the time needed for full gc.
> 
> So what's the right path to follow now? What would you recommend to change
> on the configuration (solr/jvm)?

Try tuning the GC
http://java.sun.com/performance/reference/whitepapers/tuning.html
http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html

> 
> Would you say it is ok to reduce the cache sizes? Would this increase disk
> i/o, or would the index be hold in the OS's disk cache?

Yes! If you also allocate less RAM to the JVM then there is more for the OS to 
cache.

> 
> Do have other recommendations to follow / questions?
> 
> Thanx && cheers,
> Martin

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: List of indexed or stored fields

2011-01-25 Thread Markus Jelsma
The index version. Can be used in replication to determine whether to 
replicate or not.

On Tuesday 25 January 2011 20:30:21 kenf_nc wrote:
> refers to under the  section? I have 2 cores on one Tomcat instance,
> and 1 on a second instance (different server) and all 3 have different
> numbers for "version", so I don't think it's the version of Luke.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Markus Jelsma
Then you don't need NGrams at all. A wildcard will suffice or you can use the 
TermsComponent.

If these strings are indexed as single tokens (KeywordTokenizer with 
LowercaseFilter) you can simply do field:app* to retrieve the "apple milk 
shake". You can also use the string field type but then you must make sure the 
values are already lowercased before indexing.

Be careful though, there is no query-time analysis for wildcard (and fuzzy) 
queries, so make sure the prefix you send is already lowercased.
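A sketch of such a field type (the name is made up):

  <fieldType name="prefix_string" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>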

> Hi Eric,
> 
> What I want here is, lets say I have 3 documents like
> 
> ["pineapple vers apple", "milk with apple", "apple milk shake" ]
> 
> and If i search for "apple", it should return only "apple milk shake"
> because that term alone starts with the letter "apple" which I typed in. It
> should not bring others and if I type "milk" it should return only "milk
> with apple"
> 
> I want an output Similar like a Google auto suggest.
> 
> Is there a way to achieve  this without encapsulating with double quotes.
> 
> Thanks,
> 
> Johnny


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Markus Jelsma
Oh, I should perhaps mention that EdgeNGrams will yield results a lot quicker 
than wildcards, at the cost of a larger index. You should, of course, use 
EdgeNGrams if you worry about performance and have a huge index and a high 
number of queries per second.
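For completeness, an index-time analyzer with edge n-grams would look roughly 
like this (a sketch; the gram sizes are examples):

  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
  </analyzer>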

> Then you don't need NGrams at all. A wildcard will suffice or you can use
> the TermsComponent.
> 
> If these strings are indexed as single tokens (KeywordTokenizer with
> LowercaseFilter) you can simply do field:app* to retrieve the "apple milk
> shake". You can also use the string field type but then you must make sure
> the values are already lowercased before indexing.
> 
> Be careful though, there is no query time analysis for wildcard (and fuzzy)
> queries so make sure
> 
> > Hi Eric,
> > 
> > What I want here is, lets say I have 3 documents like
> > 
> > ["pineapple vers apple", "milk with apple", "apple milk shake" ]
> > 
> > and If i search for "apple", it should return only "apple milk shake"
> > because that term alone starts with the letter "apple" which I typed in.
> > It should not bring others and if I type "milk" it should return only
> > "milk with apple"
> > 
> > I want an output Similar like a Google auto suggest.
> > 
> > Is there a way to achieve  this without encapsulating with double quotes.
> > 
> > Thanks,
> > 
> > Johnny


Re: in-index representaton of tokens

2011-01-25 Thread Markus Jelsma
This should shed some light on the matter
http://lucene.apache.org/java/2_9_0/fileformats.html

> I am saying there is a list of tokens that have been parsed (a table of
> them) for each column? Or one for the whole index?
> 
>  Dennis Gearon
> 
> 
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
> better idea to learn from others’ mistakes, so you do not have to make
> them yourself. from
> 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> 
> 
> EARTH has a Right To Life,
> otherwise we all die.
> 
> 
> 
> - Original Message 
> From: Jonathan Rochkind 
> To: "solr-user@lucene.apache.org" 
> Sent: Tue, January 25, 2011 9:29:36 AM
> Subject: Re: in-index representaton of tokens
> 
> Why does it matter?  You can't really get at them unless you store them.
> 
> I don't know what "table per column" means, there's nothing in Solr
> architecture called a "table" or a "column". Although by column you
> probably mean more or less Solr "field".  There is nothing like a
> "table" in Solr.
> 
> Solr is still not an rdbms.
> 
> On 1/25/2011 12:26 PM, Dennis Gearon wrote:
> > So, the index is a list of tokens per column, right?
> > 
> > There's a table per column that lists the analyzed tokens?
> > 
> > And the tokens per column are represented as what, system integers? 32/64
> > bit unsigned ints?
> > 
> >   Dennis Gearon
> > 
> > Signature Warning
> > 
> > It is always a good idea to learn from your own mistakes. It is usually a
> >
> >better
> >
> > idea to learn from others’ mistakes, so you do not have to make them
> > yourself. from
> > 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> > 
> > 
> > EARTH has a Right To Life,
> > otherwise we all die.


Re: SOLR deduplication

2011-01-26 Thread Markus Jelsma
Not right now:
https://issues.apache.org/jira/browse/SOLR-1909

> Hi - I have the SOLR deduplication configured and working well.
> 
> Is there any way I can tell which documents have been not added to the
> index as a result of the deduplication rejecting subsequent identical
> documents?
> 
> Many Thanks
> 
> Jason Brown.
> 
> If you wish to view the St. James's Place email disclaimer, please use the
> link below
> 
> http://www.sjp.co.uk/portal/internet/SJPemaildisclaimer


Re: SolrDocumentList Size vs NumFound

2011-01-26 Thread Markus Jelsma
Hi,

If your query yields 1000 documents and the rows parameter is 10 then you'll 
get only 10 documents.  Consult the wiki on the start and rows parameters:

http://wiki.apache.org/solr/CommonQueryParameters
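For example, to page through the full result set (a sketch):

  ...&q=your query&start=0&rows=100     (documents 1-100)
  ...&q=your query&start=100&rows=100   (documents 101-200)

With SolrJ you can do the same through SolrQuery's setStart() and setRows().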

Cheers.

> Dear all,
> 
> I got a weird problem. The number of searched documents is much more than
> 10. However, the size of SolrDocumentList is 10 and the getNumFound() is
> the exact count of results. When I need to iterate the results as follows,
> only 10 are displayed. How to get the rest ones?
> 
> ..
> for (SolrDocument doc : docs)
> {
> 
> System.out.println(doc.getFieldValue(Fields.CATEGORIZED_HUB_TITLE_FIELD) +
> ": " + doc.getFieldValue(Fields.CATEGORIZED_HUB_URL_FIELD) + "; " +
> doc.getFieldValue(Fields.HUB_CATEGORY_NAME_FIELD) + "/" +
> doc.getFieldValue(Fields.HUB_PARENT_CATEGORY_NAME_FIELD));
> }
> ..
> 
> Could you give me a hand?
> 
> Thanks,
> LB


Re: How to group result when search on multiple fields

2011-01-26 Thread Markus Jelsma
http://wiki.apache.org/solr/ClusteringComponent
http://wiki.apache.org/solr/FieldCollapsing


Re: DIH and duplicate content

2011-01-27 Thread Markus Jelsma
http://wiki.apache.org/solr/Deduplication


On Thursday 27 January 2011 12:32:29 Rosa (Anuncios) wrote:
> Is there a way to avoid duplicate content in a index at the moment i'm 
> uploading my xml feed via DIH?
> 
> I would like to have only one entry for a given description. I mean if 
> the desciption of one product already exist in index not import this new 
> product.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Malformed XML with exotic characters

2011-02-01 Thread Markus Jelsma
\u0430\u0434  
\u2022 Chamoru  \u2022 Chichewa  \u2022 Cuengh  \u2022 Dolnoserbski  \u2022 
E\u028begbe  \u2022 Frasch  \u2022 Fulfulde  \u2022 Gagauz  \u2022 
G\u0129k\u0169y\u0169  \u2022 
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd  
\u2022 Hausa / 
\u0647\u064e\u0648\u064f\u0633\u064e\u0627  \u2022 Igbo  \u2022 
\u1403\u14c4\u1483\u144e\u1450\u1466 / Inuktitut  \u2022 Iñupiak  \u2022 
Kalaallisut  \u2022 \u0915\u0936\u094d\u092e\u0940\u0930\u0940 / 
\u0643\u0634\u0645\u064a\u0631\u064a  \u2022 Kongo  \u2022 
\u041a\u044b\u0440\u044b\u043a \u041c\u0430\u0440\u044b  \u2022 
\u0e9e\u0eb2\u0eaa\u0eb2\u0ea5\u0eb2\u0ea7  \u2022 
\u041b\u0430\u043a\u043a\u0443  \u2022 Luganda  \u2022 Mìng-d\u0115\u0324ng-
ng\u1e73\u0304  \u2022 Mirandés  \u2022 
\u041c\u043e\u043a\u0448\u0435\u043d\u044c  \u2022 
\u041c\u043e\u043b\u0434\u043e\u0432\u0435\u043d\u044f\u0441\u043a\u044d  
\u2022 Na Vosa Vaka-Viti  \u2022 Dorerin Naoero  \u2022 N\u0113hiyaw\u0113win 
/ \u14c0\u1426\u1403\u152d\u140d\u140f\u1423  \u2022 Norfuk / Pitkern  \u2022 
\u0b13\u0b21\u0b3f\u0b3c\u0b06  \u2022 Afaan Oromoo  \u2022 
\u0985\u09b8\u09ae\u09c0\u09af\u09be\u09bc  \u2022 
\u041f\u0435\u0440\u0435\u043c \u041a\u043e\u043c\u0438  \u2022 Pfälzisch  
\u2022 \u03a0\u03bf\u03bd\u03c4\u03b9\u03b1\u03ba\u03ac  \u2022 Qaraqalpaqsha  
\u2022 \u0f62\u0fab\u0f7c\u0f44\u0f0b\u0f41  \u2022 Romani / 
\u0930\u094b\u092e\u093e\u0928\u0940  \u2022 Kirundi  \u2022 Gagana S\u0101moa  
\u2022 Sängö  \u2022 Sesotho  \u2022 Setswana  \u2022 \u0633\u0646\u068c\u064a  
\u2022 \u0421\u043b\u043e\u0432\u0463\u0301\u043d\u044c\u0441\u043a\u044a / 
\u2c14\u2c0e\u2c11\u2c02\u2c21\u2c10\u2c20\u2c14\u2c0d\u2c1f  \u2022 SiSwati  
\u2022 Sranantongo  \u2022 Reo Tahiti  \u2022 Taqbaylit  \u2022 Tetun  \u2022 
\u1275\u130d\u122d\u129b  \u2022 Tok Pisin  \u2022 \u13e3\u13b3\u13a9  \u2022 
chiTumbuka  \u2022 Xitsonga  \u2022 Tshiven\u1e13a  \u2022 isiXhosa  \u2022 
Zeêuws  \u2022 isiZulu Other languages  \u2022 Weitere Sprachen  \u2022 Autres 
langues  \u2022 Kompletna lista j\u0119zyków  \u2022 \u4ed6\u306e\u8a00\u8a9e  
\u2022 Otros idiomas  \u2022 \u5176\u4ed6\u8a9e\u8a00  \u2022 
\u0414\u0440\u0443\u0433\u0438\u0435 \u044f\u0437\u044b\u043a\u0438  \u2022 
Aliaj lingvoj  \u2022 \ub2e4\ub978 \uc5b8\uc5b4  \u2022 Ngôn ng\u1eef khác  
Wiktionary  Wikinews  Wikiquote  Wikibooks  Wikispecies  
Wikisource  Wikiversity  Commons  Meta-Wiki
--------
^

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Malformed XML with exotic characters

2011-02-01 Thread Markus Jelsma
It's throwing out a lot of disturbing messages:

select.xml:17: parser error : Char 0xD800 out of allowed range
ki  • Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • 
   ^
select.xml:17: parser error : PCDATA invalid Char value 55296
ki  • Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • 
   ^
select.xml:17: parser error : Char 0xDF32 out of allowed range
 • Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • �
   ^
select.xml:17: parser error : PCDATA invalid Char value 57138
 • Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • �
   ^
select.xml:17: parser error : Char 0xD800 out of allowed range
�� Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ��
   ^
select.xml:17: parser error : PCDATA invalid Char value 55296
�� Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ��
   ^
select.xml:17: parser error : Char 0xDF3F out of allowed range
Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ���
   ^
select.xml:17: parser error : PCDATA invalid Char value 57151
Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ���
   ^
select.xml:17: parser error : Char 0xD800 out of allowed range
egbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • 
   ^
select.xml:17: parser error : PCDATA invalid Char value 55296
egbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • 
   ^
select.xml:17: parser error : Char 0xDF44 out of allowed range
e  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • �
   ^
select.xml:17: parser error : PCDATA invalid Char value 57156
e  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • �
   ^
select.xml:17: parser error : Char 0xD800 out of allowed range
�• Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ��
   ^
select.xml:17: parser error : PCDATA invalid Char value 55296
�• Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ��
   ^
select.xml:17: parser error : Char 0xDF39 out of allowed range
� Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ���
   ^
select.xml:17: parser error : PCDATA invalid Char value 57145
� Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ���
   ^
select.xml:17: parser error : Char 0xD800 out of allowed range
rasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • 
   ^
select.xml:17: parser error : PCDATA invalid Char value 55296
rasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • 
   ^
select.xml:17: parser error : Char 0xDF43 out of allowed range
ch  • Fulfulde  • Gagauz  • Gĩkũyũ  • �
   ^
select.xml:17: parser error : PCDATA invalid Char value 57155
ch  • Fulfulde  • Gagauz  • Gĩkũyũ  • �
   ^
select.xml:17: parser error : Char 0xD800 out of allowed range
 • Fulfulde  • Gagauz  • Gĩkũyũ  • ��
   ^
select.xml:17: parser error : PCDATA invalid Char value 55296
 • Fulfulde  • Gagauz  • Gĩkũyũ  • ��
   ^
select.xml:17: parser error : Char 0xDF3A out of allowed range
�� Fulfulde  • Gagauz  • Gĩkũyũ  • ���
   ^
select.xml:17: parser error : PCDATA invalid Char value 57146
�� Fulfulde  • Gagauz  • Gĩkũyũ  • ���


On Tuesday 01 February 2011 17:00:19 Stefan Matheis wrote:
> Hi Markus,
> 
> to verify that it's not an Firefox-Issue, try xmllint on your shell to
> check the given xml?
> 
> Regards
> Stefan
> 
> On Tue, Feb 1, 2011 at 4:43 PM, Markus Jelsma
> 
>  wrote:
> > There is

Re: Malformed XML with exotic characters

2011-02-01 Thread Markus Jelsma
Hi,

There are no typical encoding issues on my system. I can index, query and 
display English, German, Chinese, Vietnamese, etc.

Cheers

On Tuesday 01 February 2011 17:23:49 François Schiettecatte wrote:
> Markus
> 
> A few things to check, make sure whatever SOLR is hosted on is outputting
> utf-8 ( URIEncoding="UTF-8" in the Connector section in server.xml on
> Tomcat for example), which it looks like here, also make sure that
> whatever http header there is tells firefox that it is getting utf-8
> (otherwise it defaults to iso-8859-1/latin-1), finally make sure that
> whatever font you use in firefox has the 'exotic' characters you are
> expecting. There might also be some issues on your platform with mixing
> script direction but that is probably not likely.
> 
> Cheers
> 
> François
> 
> On Feb 1, 2011, at 10:43 AM, Markus Jelsma wrote:
> > There is an issue with the XML response writer. It cannot cope with some
> > very exotic characters or possibly the right-to-left writing systems.
> > The issue can be reproduced by indexing the content of the home page of
> > wikipedia as it contains a lot of exotic matter. The problem does not
> > affect the JSON response writer.
> > 
> > The problem is, i am unsure whether this is a bug in Solr or that perhaps
> > Firefox itself trips over.
> > 
> > 
> > Here's the output of the JSONResponeWriter for a query returning the home
> > page:
> > {
> > "responseHeader":{
> > 
> >  "status":0,
> >  "QTime":1,
> >  "params":{
> >  
> > "fl":"url,content",
> > "indent":"true",
> > "wt":"json",
> > "q":"*:*",
> > "rows":"1"}},
> > 
> > "response":{"numFound":6744,"start":0,"docs":[
> > 
> > {
> > 
> >  "url":"http://www.wikipedia.org/";,
> >  "content":"Wikipedia English The Free Encyclopedia 3 543 000+ articles
> >  日
> > 
> > 本語 フリー百科事典 730 000+ 記事 Deutsch Die freie Enzyklopädie 1 181 000+ Artikel
> > Español La enciclopedia libre 710 000+ artículos Français L’encyclopédie
> > libre 1 061 000+ articles Русский Свободная энциклопедия 654 000+ статей
> > Italiano L’enciclopedia libera 768 000+ voci Português A enciclopédia
> > livre 669 000+ artigos Polski Wolna encyklopedia 769 000+ haseł
> > Nederlands De vrije encyclopedie 668 000+ artikelen Search  • Suchen  •
> > Rechercher  • Szukaj  • Ricerca  • 検索  • Buscar  • Busca  • Zoeken  •
> > Поиск  • Sök  • 搜尋  • Cerca  • Søk  • Haku  • Пошук  • Hledání  •
> > Keresés  • Căutare  • 찾기  • Tìm kiếm  • Ara • Cari  • Søg  • بحث  •
> > Serĉu  • Претрага  • Paieška  • Hľadať  • Suk  • جستجو • חיפוש  •
> > Търсене  • Poišči  • Cari  • Bilnga العربية Български Català Česky Dansk
> > Deutsch English Español Esperanto فارسی Français 한국어 Bahasa Indonesia
> > Italiano עברית Lietuvių Magyar Bahasa Melayu Nederlands 日本語 Norsk
> > (bokmål) Polski Português Română Русский Slovenčina Slovenščina Српски /
> > Srpski Suomi Svenska Türkçe Українська Tiếng Việt Volapük Winaray 中文  
> > 100 000+   العربية • Български  • Català  • Česky  • Dansk  • Deutsch  •
> > English  • Español  • Esperanto  • فارسی  • Français  • 한국어  • Bahasa
> > Indonesia  • Italiano  • עברית • Lietuvių  • Magyar  • Bahasa Melayu  •
> > Nederlands  • 日本語  • Norsk (bokmål) • Polski  • Português  • Русский  •
> > Română  • Slovenčina  • Slovenščina  • Српски / Srpski  • Suomi  •
> > Svenska  • Türkçe  • Українська  • Tiếng Việt  • Volapük  • Winaray  •
> > 中文   10 000+   Afrikaans  • Aragonés  • Armãneashce  • Asturianu  •
> > Kreyòl Ayisyen  • Azərbaycan / آذربايجان ديلی  • বাংলা  • Беларуская (
> > Акадэмічная  • Тарашкевiца )  • বিষ্ণুপ্রিযা় মণিপুরী  • Bosanski  •
> > Brezhoneg  • Чăваш • Cymraeg  • Eesti  • Ελληνικά  • Euskara  • Frysk  •
> > Gaeilge  • Galego  • ગુજરાતી  • Հայերեն  • हिन्दी  • Hrvatski  • Ido  •
> > Íslenska  • Basa Jawa  • ಕನ್ನಡ  • ქართული  • Kurdî / كوردی  • Latina  •
> > Latviešu  • Lëtzebuergesch  • Lumbaart • Македонски  • മലയാളം  • मराठी 
> > • नेपाल भाषा  • नेपाली  • Norsk (nynorsk)  • Nnapulitano • Occitan  •
> > Piemontèis  • Plattdüütsch  • Ripoarisch  • Runa Simi  • شاہ مکھی پنجابی
> >  • Shqip  • Sicilianu  • Simple English  • Sinugboanon  • Srpskohrvatski
> > / Српскохрватски  • Basa Sunda  • Kiswahili  • Tagalog  • தமிழ் • తెలుగు
> >  • ไทย  • اردو  • Walon  • Yorùbá  • 粵語  • Žemaitėška   1 000+   Bahsa
> > Acè

Re: Malformed XML with exotic characters

2011-02-01 Thread Markus Jelsma
You can rule out the input by checking whether other response writers work. For 
me, the JSONResponseWriter works perfectly with the same returned data in an 
AJAX environment.

On Tuesday 01 February 2011 18:29:06 Sascha Szott wrote:
> Hi folks,
> 
> I've made the same observation when working with Solr's
> ExtractingRequestHandler on the command line (no browser interaction).
> 
> When issuing the following curl command
> 
> curl
> 'http://mysolrhost/solr/update/extract?extractOnly=true&extractFormat=text&;
> wt=xml&resource.name=foo.pdf' --data-binary @foo.pdf -H
> 'Content-type:text/xml; charset=utf-8' > foo.xml
> 
> Solr's XML response writer returns malformed xml, e.g., xmllint gives me:
> 
> foo.xml:21: parser error : Char 0xD835 out of allowed range
> foo.xml:21: parser error : PCDATA invalid Char value 55349
> 
> I'm not totally sure, if this is an Tika/PDFBox issue. However, I would
> expect in every case that the XML output produced by Solr is well-formed
> even if the libraries used under the hood return "garbage".
> 
> 
> -Sascha
> 
> p.s. I can provide the pdf file in question, if anybody would like to
> see it in action.
> 
> On 01.02.2011 16:43, Markus Jelsma wrote:
> > There is an issue with the XML response writer. It cannot cope with some
> > very exotic characters or possibly the right-to-left writing systems.
> > The issue can be reproduced by indexing the content of the home page of
> > wikipedia as it contains a lot of exotic matter. The problem does not
> > affect the JSON response writer.
> > 
> > The problem is, i am unsure whether this is a bug in Solr or that perhaps
> > Firefox itself trips over.
> > 
> > 
> > Here's the output of the JSONResponeWriter for a query returning the home
> > page:
> > {
> > 
> >   "responseHeader":{
> >   
> >"status":0,
> >"QTime":1,
> >"params":{
> > 
> > "fl":"url,content",
> > "indent":"true",
> > "wt":"json",
> > "q":"*:*",
> > "rows":"1"}},
> > 
> >   "response":{"numFound":6744,"start":0,"docs":[
> > 
> > {
> > 
> >  "url":"http://www.wikipedia.org/";,
> >  "content":"Wikipedia English The Free Encyclopedia 3 543 000+ articles
> >  日
> > 
> > 本語 フリー百科事典 730 000+ 記事 Deutsch Die freie Enzyklopädie 1 181 000+ Artikel
> > Español La enciclopedia libre 710 000+ artículos Français L’encyclopédie
> > libre 1 061 000+ articles Русский Свободная энциклопедия 654 000+ статей
> > Italiano L’enciclopedia libera 768 000+ voci Português A enciclopédia
> > livre 669 000+ artigos Polski Wolna encyklopedia 769 000+ haseł
> > Nederlands De vrije encyclopedie 668 000+ artikelen Search  • Suchen  •
> > Rechercher  • Szukaj  • Ricerca  • 検索  • Buscar  • Busca  • Zoeken  •
> > Поиск  • Sök  • 搜尋  • Cerca  • Søk  • Haku  • Пошук  • Hledání  •
> > Keresés  • Căutare  • 찾기  • Tìm kiếm  • Ara • Cari  • Søg  • بحث  •
> > Serĉu  • Претрага  • Paieška  • Hľadať  • Suk  • جستجو • חיפוש  •
> > Търсене  • Poišči  • Cari  • Bilnga العربية Български Català Česky Dansk
> > Deutsch English Español Esperanto فارسی Français 한국어 Bahasa Indonesia
> > Italiano עברית Lietuvių Magyar Bahasa Melayu Nederlands 日本語 Norsk
> > (bokmål) Polski Português Română Русский Slovenčina Slovenščina Српски /
> > Srpski Suomi Svenska Türkçe Українська Tiếng Việt Volapük Winaray 中文  
> > 100 000+   العربية • Български  • Català  • Česky  • Dansk  • Deutsch  •
> > English  • Español  • Esperanto  • فارسی  • Français  • 한국어  • Bahasa
> > Indonesia  • Italiano  • עברית • Lietuvių  • Magyar  • Bahasa Melayu  •
> > Nederlands  • 日本語  • Norsk (bokmål) • Polski  • Português  • Русский  •
> > Română  • Slovenčina  • Slovenščina  • Српски / Srpski  • Suomi  •
> > Svenska  • Türkçe  • Українська  • Tiếng Việt  • Volapük  • Winaray  •
> > 中文   10 000+   Afrikaans  • Aragonés  • Armãneashce  • Asturianu  •
> > Kreyòl Ayisyen  • Azərbaycan / آذربايجان ديلی  • বাংলা  • Беларуская (
> > Акадэмічная  • Тарашкевiца )  • বিষ্ণুপ্রিযা় মণিপুরী  • Bosanski  •
> > Brezhoneg  • Чăваш • Cymraeg  • Eesti  • Ελληνικά  • Euskara  • Frysk  •
> > Gaeilge  • Galego  • ગુજરાતી  • Հայերեն  • हिन्दी  • Hrvatski  • Ido  •
> > Íslenska  • Basa Jawa  • ಕನ್ನಡ  • ქართული  • Kurdî / كوردی  • Latina  •
> > Latviešu  • Lëtzeb

Re: Index MS office

2011-02-02 Thread Markus Jelsma
http://wiki.apache.org/solr/ExtractingRequestHandler

On Wednesday 02 February 2011 16:49:12 Thumuluri, Sai wrote:
> Good Morning,
> 
>  I am planning to get started on indexing MS office using ApacheSolr -
> can someone please direct me where I should start?
> 
> Thanks,
> Sai Thumuluri

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Open Too Many Files

2011-02-03 Thread Markus Jelsma
Or decrease the mergeFactor.

> or change the index to a compound-index
> 
> solrconfig.xml: <useCompoundFile>true</useCompoundFile>
> 
> so solr creates one index file and not thousands.
> 
> -
> --- System
> 
> 
> One Server, 12 GB RAM, 2 Solr Instances, 7 Cores,
> 1 Core with 31 Million Documents other Cores < 100.000
> 
> - Solr1 for Search-Requests - commit every Minute  - 4GB Xmx
> - Solr2 for Update-Request  - delta every 2 Minutes - 4GB Xmx


Re: Malformed XML with exotic characters

2011-02-03 Thread Markus Jelsma
Hi

I've seen almost all funky charsets but Gothic is always trouble. I'm also 
unsure if it's really a bug in Solr; it could well be Xerces being unable 
to cope. Besides, most systems indeed don't handle Gothic well. This mail 
client does, but my terminal can't find its cursor after (properly) displaying 
such text.
 
http://got.wikipedia.org/wiki/%F0%90%8C%B7%F0%90%8C%B0%F0%90%8C%BF%F0%90%8C%B1%F0%90%8C%B9%F0%90%8C%B3%F0%90%8C%B0%F0%90%8C%B1%F0%90%8C%B0%F0%90%8C%BF%F0%90%8D%82%F0%90%8C%B2%F0%90%8D%83/Haubidabaurgs

Thanks for the input.

Cheers,

On Tuesday 01 February 2011 19:59:33 Robert Muir wrote:
> Hi, it might only be a problem with your xml tools (e.g. firefox).
> the problem here is characters outside of the basic multilingual plane
> (in this case Gothic).
> XML tools typically fall apart on these portions of unicode (in lucene
> we recently reverted to a patched/hacked copy of xerces specifically
> for this reason).
> 
> If you care about characters outside of the basic multilingual plane
> actually working, unfortunately you have to start being very very very
> particular about what software you use... you can assume most
> software/setups WON'T work.
> For example, if you were to use mysql's "utf8" character set you would
> find it doesn't actually support all of UTF-8! in this case you would
> need to use the recent 'utf8mb4' or something instead, that is
> actually utf-8!
> Thats just one example of a well-used piece of software that suffers
> from issues like this, there are others.
> 
> Its for reasons like these that if support for these languages is
> important to you, I would stick with the most simple/textual methods
> for input and output: e.g. using things like CSV and JSON if you can.
> I would also fully test every component/jar in your application
> individually and once you get it working, don't ever upgrade.
> 
> In any case, if you are having problems with characters outside of the
> basic multilingual plane, and you suspect its actually a bug in Solr,
> please open a JIRA issue, especially if you can provide some way to
> reproduce it
> 


Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-02-07 Thread Markus Jelsma
Heap usage can spike after a commit. Existing caches are still in use and new 
caches are being generated and/or auto warmed. Can you confirm this is the 
case?

On Friday 28 January 2011 00:34:42 Simon Wistow wrote:
> On Tue, Jan 25, 2011 at 01:28:16PM +0100, Markus Jelsma said:
> > Are you sure you need CMS incremental mode? It's only adviced when
> > running on a machine with one or two processors. If you have more you
> > should consider disabling the incremental flags.
> 
> I'll test agin but we added those to get better performance - not much
> but there did seem to be an improvement.
> 
> The problem seems to not be in average use but that occasionally there's
> huge spike in load (there doesn't seem to be a particular "killer
> query") and Solr just never recovers.
> 
> Thanks,
> 
> Simon

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: dynamic fields revisited

2011-02-07 Thread Markus Jelsma
It would be quite annoying if it behaved the way you were hoping for. This way it 
is possible to use different field types (and analyzers) for the same field 
value. In faceting, for example, this can be important because you should use 
analyzed fields for q and fq but unanalyzed fields for facet.field.

The same goes for sorting and range queries where you can use the same field 
value to end up in different field types, one for sorting and one for a range 
query.

Without the prefix or suffix of the dynamic field, one must statically declare the 
fields beforehand and lose the dynamic advantage.
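
For illustration only (the field type names assume the stock example schema), such 
dynamic fields are declared in schema.xml like this:

  <dynamicField name="*_i" type="int"    indexed="true" stored="true"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>

A document field named my_integer_i then matches *_i and keeps its full name in the 
index, which is exactly the behaviour described above.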

> Just so anyone else can know and save themselves 1/2 hour if they spend 4
> minutes searching.
> 
> When putting a dynamic field into a document into an index, the name of the
> field RETAINS the 'constant' part of the dynamic field name.
> 
> Example
> -
> If a dynamic integer field is named '*_i' in the schema.xml file,
>   __and__
> you insert a field names 'my_integer_i', which matches the globbed field
> name '*_i',
>   __then__
> the name of the field will be 'my_integer_i' in the index
> and in your GETs/(updating)POSTs to the index on that document and
>   __NOT__
> 'my_integer' like I was kind of hoping that it would be :-(
> 
> I.E., the suffix (or prefix if you set it up that way,) will NOT be
> dropped. I was hoping that everything except the globbing character, '*',
> would just be a flag to the query processor and disappear after being
> 'noticed'.
> 
> Not so :-)


Re: q.alt=*:* for every request?

2011-02-07 Thread Markus Jelsma
There is no measurable performance penalty when setting the parameter, except 
maybe the execution of the query with a high value for rows. To make things 
easy, you can define q.alt=*:* as a default in your request handler; no need to 
specify it in the URL.
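
A sketch of such a default in solrconfig.xml (handler name and other defaults are 
just examples):

  <requestHandler name="/search" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="q.alt">*:*</str>
    </lst>
  </requestHandler>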


> Hi,
> 
> I use dismax handler with solr 1.4.
> Sometimes, my request comes with q and fq, and others doesn't come with q
> (only fq and q.alt=*:*). It's quite ok if I send q.alt=*:* for every
> request? Does it have side effects on performance?


Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-02-07 Thread Markus Jelsma
Do you have GC logging enabled? Tail -f the log file and you'll see what CMS is 
telling you. Tuning the occupation fraction of the tenured generation to a 
lower value than default and telling the JVM to only use your value to 
initiate a collection can help a lot. The same goes for sizing the young 
generation and sometimes the survivor ratio.

Consult the HotSpot CMS settings and young generation (or new) sizes. They are 
very important.

If you have multiple slaves under the same load you can easily try different 
configurations. Keeping an eye on the nodes with a tool like JConsole and at 
the same time tailing the GC log will help a lot. Don't forget to send updates 
and frequent commits or you won't be able to replay. I've never seen a Solr 
instance go down under heavy load and without commits but they tend to behave 
badly when commits occur while under heavy load with long cache warming times 
(and heap consumption).

You might also be suffering from memory fragmentation; this is bad and can lead 
to failure. You can configure the JVM to force a compaction before a GC, which is 
nice, but it does consume CPU time.

A query of death can, in theory, also happen when you sort on a very large 
dataset that isn't optimized, in this case the maxDoc value is too high.

Anyway, try some settings and monitor the nodes and please report your 
findings.
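
As a starting point only (the numbers are illustrative and must be tested against 
your own load), a set of JVM options along these lines enables GC logging and pins 
the CMS occupancy fraction and young generation size:

  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/tomcat6/gc.log
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
  -XX:NewSize=512m -XX:MaxNewSize=512m -XX:SurvivorRatio=6

Tail the gc.log while running your load and adjust from there.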

> On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
> > Heap usage can spike after a commit. Existing caches are still in use and
> > new caches are being generated and/or auto warmed. Can you confirm this
> > is the case?
> 
> We see spikes after replication which I suspect is, as you say, because
> of the ensuing commit.
> 
> What we seem to have found is that when we weren't using the Concurrent
> GC stop-the-world gc runs would kill the app. Now that we're using CMS
> we occasionally find ourselves in situations where the app still has
> memory "left over" but the load on the machine spikes, the GC duty cycle
> goes to 100 and the app never recovers.> 
> Restarting usually helps but sometimes we have to take the machine out
> of the laod balancer, wait for a number of minutes and then out it back
> in.
> 
> We're working on two hypotheses
> 
> Firstly - we're CPU bound somehow and that at some point we cross some
> threshhold and GC or something else is just unable to to keep up. So
> whilst it looks like instantaneous death of the app it's actually
> gradual resource exhaustion where the definition of 'gradual' is 'a very
> short period of time' (as opposed to some cataclysmic infinite loop bug
> somewhere).
> 
> Either that or ... Secondly - there's some sort of Query Of Death that
> kills machines. We just haven't found it yet, even when replaying logs.
> 
> Or some combination of both. Or other things. It's maddeningly
> frustrating.
> 
> We're also got to try deploying a custom solr.war and try using the
> MMapDirectory to see if that helps with anything.


Re: does copyField recurse?

2011-02-08 Thread Markus Jelsma
Field values are copied before being analyzed. There is no cascading of 
analyzers.
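
In schema.xml terms (field names as in your question), that means copyField 
directives are not chained:

  <copyField source="title" dest="text"/>
  <copyField source="text"  dest="text.stemmed"/>

With only these two lines the title value does not end up in text.stemmed; if you 
want it there, include the explicit <copyField source="title" dest="text.stemmed"/> 
yourself.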

> Hello list,
> 
> if I have a field title which copied to text and a field text that is
> copied to text.stemmed. Am I going to get the copy from the field title to
> the field text.stemmed or should I include it?
> 
> thanks in advance
> 
> paul


Re: q.alt=*:* for every request?

2011-02-08 Thread Markus Jelsma
I'm not sure what you mean, but you may be looking for debugQuery=true?
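
For example (host, core and fields are illustrative), appending it to a dismax 
request shows the expanded query in the debug section of the response:

  http://localhost:8983/solr/select?defType=dismax&q=foo&qf=title^2+text&debugQuery=true

Look at the parsedquery and parsedquery_toString entries.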

On Tuesday 08 February 2011 08:28:12 Paul Libbrecht wrote:
> To be able to "see" this well, it would be lovely to have a switch that
> would activate a logging of the query expansion result. The Dismax
> QParserPlugin is particularly powerful in there so it'd be nice to see
> what's happening.
> 
> Any logging category I need to activate?
> 
> paul
> 
> Le 8 févr. 2011 à 03:22, Markus Jelsma a écrit :
> > There is no measurable performance penalty when setting the parameter,
> > except maybe the execution of the query with a high value for rows. To
> > make things easy, you can define q.alt=*:* as default in your request
> > handler. No need to specifiy it in the URL.
> > 
> >> Hi,
> >> 
> >> I use dismax handler with solr 1.4.
> >> Sometimes, my request comes with q and fq, and others doesn't come with
> >> q (only fq and q.alt=*:*). It's quite ok if I send q.alt=*:* for every
> >> request? Does it have side effects on performance?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: difference between filter_queries and parsed_filter_queries

2011-02-08 Thread Markus Jelsma
Hi,

The parsed_filter_queries contains the value after it passed through the 
analyzer. In this case it remains the same because it was already lowercased 
and no synonyms were used.

You're also using single quotes; these have no special meaning, so you're 
searching for the literal 'noida' (quotes included) in the first fq and for noida 
in the second.

Cheers,

On Tuesday 08 February 2011 15:52:23 Bagesh Sharma wrote:
> Hi everybody, please suggest me what's the difference between these two
> things. After what processing on filter_queries the parsed_filter_queries
> are generated.
> 
> Basically ... when i am searching city as fq=city:'noida'
> 
> then filter_queries and parsed_filter_queries both are same as 'noida'.  In
> this case i do not get any result.
> 
> But when i do query like this  fq=city:"noida" then filter_queries is
> "noida" but parsed_filter_queries is noida and it matches with the city and
> i am getting correct results.
> 
> what processing is going on from filter_queries to parsed_filter_queries.
> 
> my schema for city is : -
> 
>   sortMissingLast="true" >
>   
>   
>    synonyms="synonyms_city_facet.txt" ignoreCase="true" expand="false" />
>   
>   
> 
> 
> please suggest me please.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Cache size

2011-02-08 Thread Markus Jelsma
You can dump the heap and analyze it with a tool like jhat. IBM's heap 
analyzer is also a very good tool, and if I'm not mistaken people also use the one 
that comes with Eclipse.
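
A minimal sketch, assuming a Sun JDK and that you know the pid of the Solr JVM 
(12345 here is made up):

  # write a binary heap dump of the running JVM
  jmap -dump:format=b,file=/tmp/solr-heap.bin 12345
  # browse the dump at http://localhost:7000/
  jhat -port 7000 /tmp/solr-heap.bin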

On Tuesday 08 February 2011 16:35:35 Mehdi Ben Haj Abbes wrote:
> Hi folks,
> 
> Is there any way to know the size *in bytes* occupied by a cache (filter
> cache, doc cache ...)? I don't find such information within the stats page.
> 
> Regards

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Nutch and Solr search on the fly

2011-02-09 Thread Markus Jelsma
The parsed data is only sent to the Solr index if you tell a segment to be 
indexed: solrindex <solr url> <crawldb> <segment ...>

If you did this only once after injecting, followed by the consequent 
fetch, parse, update and index sequence, then you, of course, only see those URLs. 
If you don't index a segment after it has been parsed, you need to do it later 
on.
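
A rough sketch of one full cycle with the separate Nutch 1.x commands, assuming a 
crawl/ directory layout and a local Solr; exact arguments differ a bit between 
Nutch versions:

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  SEGMENT=crawl/segments/`ls crawl/segments | tail -1`
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
  bin/nutch invertlinks crawl/linkdb $SEGMENT
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb $SEGMENT

Repeat everything from generate to solrindex for every round you want to be able 
to search on.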

On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
> Hi all,
> 
>  I am a newbie to nutch and solr. Well relatively much newer to Solr than
> Nutch :)
> 
>  I have been using nutch for past two weeks, and I wanted to know if I can
> query or search on my nutch crawls on the fly(before it completes). I am
> asking this because the websites I am crawling are really huge and it takes
> around 3-4 days for a crawl to complete. I want to analyze some quick
> results while the nutch crawler is still crawling the URLs. Some one
> suggested me that Solr would make it possible.
> 
>  I followed the steps in
> http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. By
> this process, I see only the injected URLs are shown in the Solr search. I
> know I did something really foolish and the crawl never happened, I feel I
> am missing some information here. I think somewhere in the process there
> should be a crawling happening and I missed it out.
> 
>  Just wanted to see if some one could help me pointing this out and where I
> went wrong in the process. Forgive my foolishness and thanks for your
> patience.
> 
> Cheers,
> Abi

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Nutch and Solr search on the fly

2011-02-09 Thread Markus Jelsma
Are you using the depth parameter with the crawl command or are you using the 
separate generate, fetch etc. commands?

What's $ nutch readdb <crawldb> -stats returning?

On Wednesday 09 February 2011 15:06:40 .: Abhishek :. wrote:
> Hi Markus,
> 
>  I am sorry for not being clear, I meant to say that...
> 
>  Suppose if a url namely www.somehost.com/gifts/greetingcard.html(which in
> turn contain links to a.html, b.html, c.html, d.html) is injected into the
> seed.txt, after the whole process I was expecting a bunch of other pages
> which crawled from this seed url. However, at the end of it all I see is
> the contents from only this page namely
> www.somehost.com/gifts/greetingcard.htmland I do not see any other
> pages(here a.html, b.html, c.html, d.html)
> crawled from this one.
> 
>  The crawling happens only for the URLs mentioned in the seed.txt and does
> not proceed further from there. So I am just bit confused. Why is it not
> crawling the linked pages(a.html, b.html, c.html and d.html). I get a
> feeling that I am missing something that the author of the blog(
> http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
> everyone would know.
> 
> Thanks,
> Abi
> 
> On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma 
wrote:
> > The parsed data is only sent to the Solr index of you tell a segment to
> > be indexed; solrindex   
> > 
> > If you did this only once after injecting  and then the consequent
> > fetch,parse,update,index sequence then you, of course, only see those
> > URL's.
> > If you don't index a segment after it's being parsed, you need to do it
> > later
> > on.
> > 
> > On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
> > > Hi all,
> > > 
> > >  I am a newbie to nutch and solr. Well relatively much newer to Solr
> > >  than
> > > 
> > > Nutch :)
> > > 
> > >  I have been using nutch for past two weeks, and I wanted to know if I
> > 
> > can
> > 
> > > query or search on my nutch crawls on the fly(before it completes). I
> > > am asking this because the websites I am crawling are really huge and
> > > it
> > 
> > takes
> > 
> > > around 3-4 days for a crawl to complete. I want to analyze some quick
> > > results while the nutch crawler is still crawling the URLs. Some one
> > > suggested me that Solr would make it possible.
> > > 
> > >  I followed the steps in
> > > 
> > > http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this.
> > > By this process, I see only the injected URLs are shown in the Solr
> > > search.
> > 
> > I
> > 
> > > know I did something really foolish and the crawl never happened, I
> > > feel
> > 
> > I
> > 
> > > am missing some information here. I think somewhere in the process
> > > there should be a crawling happening and I missed it out.
> > > 
> > >  Just wanted to see if some one could help me pointing this out and
> > >  where
> > 
> > I
> > 
> > > went wrong in the process. Forgive my foolishness and thanks for your
> > > patience.
> > > 
> > > Cheers,
> > > Abi
> > 
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Solr 1.4.1 using more memory than Solr 1.3

2011-02-09 Thread Markus Jelsma
Searching and sorting is now done on a per-segment basis, meaning that
the FieldCache entries used for sorting and for function queries are
created and used per-segment and can be reused for segments that don't
change between index updates.  While generally beneficial, this can lead
to increased memory usage over 1.3 in certain scenarios: 
  1) A single valued field that was used for both sorting and faceting
in 1.3 would have used the same top level FieldCache entry.  In 1.4, 
sorting will use entries at the segment level while faceting will still
use entries at the top reader level, leading to increased memory usage.
  2) Certain function queries such as ord() and rord() require a top level
FieldCache instance and can thus lead to increased memory usage.  Consider
replacing ord() and rord() with alternatives, such as function queries
based on ms() for date boosting.


http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4/CHANGES.txt
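
As a hedged illustration of point 2 (the field name manufacturedate_dt is only an 
example): instead of boosting with ord(manufacturedate_dt), a dismax boost function 
such as

  bf=recip(ms(NOW,manufacturedate_dt),3.16e-11,1,1)

gives a similar freshness boost while only needing per-segment FieldCache entries.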



On Wednesday 09 February 2011 16:07:01 Rachita Choudhary wrote:
> Hi Solr Users,
> 
> We are in the process of upgrading from Solr 1.3 to Solr 1.4.1.
> While performing stress test on Solr 1.4.1 to measure the performance
> improvement in Query times (QTime) and no more blocked threads, we ran into
> memory issues with Solr 1.4.1.
> 
> Test Setup details:
> - 2 identical hosts running Solr 1.3 and Solr 1.4.1 individually.
> - 3 cores with index sizes : 10 GB, 2 GB, 1 GB.
> - JVM Max RAM : 3GB ( Xmx3072m) , Total RAM : 4GB
> - No other application/service running on the servers.
> - For querying solr servers, we are using wget queries from a standalone
> host.
> 
> For the same index data and same set of queries, Solr 1.3 is hovering
> between 1.5 to 2.2 GB, whereas with about 20K requests Solr 1.4.1 is
> reaching its 3 GB limit and performing FULL GC after almost every query.
> The Full GC is also not freeing up any memory.
> 
> Has anyone also faced similar issues with Solr 1.4.1 ?
> 
> Also why is Solr 1.4.1 using more memory for the same amount of processing
> compared to Solr 1.3 ?
> 
> Is there any particular configuration that needs to be done to avoid this
> high memory usage ?
> 
> Thanks,
> Rachita

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Solr Out of Memory Error

2011-02-09 Thread Markus Jelsma
Bing Li,

One should be conservative when setting Xmx. Also, just setting Xmx might not 
do the trick at all because the garbage collector might also be the issue 
here. Configure the JVM to output debug logs of the garbage collector and 
monitor the heap usage (especially the tenured generation) with a good tool 
like JConsole.

You might also want to take a look at your cache settings and autowarm 
parameters. In some scenarios with very frequent updates, a large corpus and 
a high load of heterogeneous queries, you might want to dump the documentCache 
and queryResultCache; the cache hit ratio tends to be very low and the caches 
will just consume a lot of memory and CPU time.

In one of my projects I finally decided to only use the filterCache. The 
other caches took too much RAM and CPU while running, had a lot of 
evictions and still a low hit ratio. I could, of course, make the caches a lot 
bigger and increase autowarming, but that would take a long time before a 
cache is autowarmed and a very, very large amount of RAM. I chose to rely on 
the OS cache instead.
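
Purely as an illustration (sizes are made up and need tuning for your corpus), that 
boils down to a solrconfig.xml that keeps a modest filterCache and simply leaves the 
other caches out:

  <filterCache class="solr.FastLRUCache" size="4096" initialSize="1024" autowarmCount="512"/>
  <!-- queryResultCache and documentCache removed/commented out to disable them -->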

Cheers,

> Dear Adam,
> 
> I also got the OutOfMemory exception. I changed the JAVA_OPTS in
> catalina.sh as follows.
> 
>...
>if [ -z "$LOGGING_MANAGER" ]; then
>  JAVA_OPTS="$JAVA_OPTS
> -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager"
>else
> JAVA_OPTS="$JAVA_OPTS -server -Xms8096m -Xmx8096m"
>fi
>...
> 
> Is this change correct? After that, I still got the same exception. The
> index is updated and searched frequently. I am trying to change the code to
> avoid the frequent updates. I guess only changing JAVA_OPTS does not work.
> 
> Could you give me some help?
> 
> Thanks,
> LB
> 
> 
> On Wed, Jan 19, 2011 at 10:05 PM, Adam Estrada <
> 
> estrada.adam.gro...@gmail.com> wrote:
> > Is anyone familiar with the environment variable, JAVA_OPTS? I set
> > mine to a much larger heap size and never had any of these issues
> > again.
> > 
> > JAVA_OPTS = -server -Xms4048m -Xmx4048m
> > 
> > Adam
> > 
> > On Wed, Jan 19, 2011 at 3:29 AM, Isan Fulia 
> > 
> > wrote:
> > > Hi all,
> > > By adding more servers do u mean sharding of index.And after sharding ,
> > 
> > how
> > 
> > > my query performance will be affected .
> > > Will the query execution time increase.
> > > 
> > > Thanks,
> > > Isan Fulia.
> > > 
> > > On 19 January 2011 12:52, Grijesh  wrote:
> > >> Hi Isan,
> > >> 
> > >> It seems your index size 25GB si much more compared to you have total
> > 
> > Ram
> > 
> > >> size is 4GB.
> > >> You have to do 2 things to avoid Out Of Memory Problem.
> > >> 1-Buy more Ram ,add at least 12 GB of more ram.
> > >> 2-Increase the Memory allocated to solr by setting XMX values.at least
> > 
> > 12
> > 
> > >> GB
> > >> allocate to solr.
> > >> 
> > >> But if your all index will fit into the Cache memory it will give you
> > 
> > the
> > 
> > >> better result.
> > >> 
> > >> Also add more servers to load balance as your QPS is high.
> > >> Your 7 Laks data makes 25 GB of index its looking quite high.Try to
> > 
> > lower
> > 
> > >> the index size
> > >> What are you indexing in your 25GB of index?
> > >> 
> > >> -
> > >> Thanx:
> > >> Grijesh
> > >> --
> > 
> > >> View this message in context:
> > http://lucene.472066.n3.nabble.com/Solr-Out-of-Memory-Error-tp2280037p228
> > 5779.html
> > 
> > >> Sent from the Solr - User mailing list archive at Nabble.com.
> > > 
> > > --
> > > Thanks & Regards,
> > > Isan Fulia.


Re: Solr Out of Memory Error

2011-02-09 Thread Markus Jelsma
I should also add that reducing the cache and autowarm sizes (or not using 
them at all) drastically reduces memory consumption when a new searcher is 
being prepared after a commit. The memory usage will spike at these events. 
Again, use a monitoring tool to get more information on your specific scenario.

> Bing Li,
> 
> One should be conservative when setting Xmx. Also, just setting Xmx might
> not do the trick at all because the garbage collector might also be the
> issue here. Configure the JVM to output debug logs of the garbage
> collector and monitor the heap usage (especially the tenured generation)
> with a good tool like JConsole.
> 
> You might also want to take a look at your cache settings and autowarm
> parameters. In some scenario's with very frequent updates, a large corpus
> and a high load of heterogenous queries you might want to dump the
> documentCache and queryResultCache, the cache hitratio tends to be very
> low and the caches will just consume a lot of memory and CPU time.
> 
> One of my projects i finally decided to only use the filterCache. Using the
> other caches took too much RAM and CPU while running and had a lot of
> evictions and still a lot hitratio. I could, of course, make the caches a
> lot bigger and increase autowarming but that would take a lot of time
> before a cache is autowarmed and a very, very, large amount of RAM. I
> choose to rely on the OS-cache instead.
> 
> Cheers,
> 
> > Dear Adam,
> > 
> > I also got the OutOfMemory exception. I changed the JAVA_OPTS in
> > catalina.sh as follows.
> > 
> >...
> >if [ -z "$LOGGING_MANAGER" ]; then
> >
> >  JAVA_OPTS="$JAVA_OPTS
> > 
> > -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager"
> > 
> >else
> >
> > JAVA_OPTS="$JAVA_OPTS -server -Xms8096m -Xmx8096m"
> >
> >fi
> >...
> > 
> > Is this change correct? After that, I still got the same exception. The
> > index is updated and searched frequently. I am trying to change the code
> > to avoid the frequent updates. I guess only changing JAVA_OPTS does not
> > work.
> > 
> > Could you give me some help?
> > 
> > Thanks,
> > LB
> > 
> > 
> > On Wed, Jan 19, 2011 at 10:05 PM, Adam Estrada <
> > 
> > estrada.adam.gro...@gmail.com> wrote:
> > > Is anyone familiar with the environment variable, JAVA_OPTS? I set
> > > mine to a much larger heap size and never had any of these issues
> > > again.
> > > 
> > > JAVA_OPTS = -server -Xms4048m -Xmx4048m
> > > 
> > > Adam
> > > 
> > > On Wed, Jan 19, 2011 at 3:29 AM, Isan Fulia 
> > > 
> > > wrote:
> > > > Hi all,
> > > > By adding more servers do u mean sharding of index.And after sharding
> > > > ,
> > > 
> > > how
> > > 
> > > > my query performance will be affected .
> > > > Will the query execution time increase.
> > > > 
> > > > Thanks,
> > > > Isan Fulia.
> > > > 
> > > > On 19 January 2011 12:52, Grijesh  wrote:
> > > >> Hi Isan,
> > > >> 
> > > >> It seems your index size 25GB si much more compared to you have
> > > >> total
> > > 
> > > Ram
> > > 
> > > >> size is 4GB.
> > > >> You have to do 2 things to avoid Out Of Memory Problem.
> > > >> 1-Buy more Ram ,add at least 12 GB of more ram.
> > > >> 2-Increase the Memory allocated to solr by setting XMX values.at
> > > >> least
> > > 
> > > 12
> > > 
> > > >> GB
> > > >> allocate to solr.
> > > >> 
> > > >> But if your all index will fit into the Cache memory it will give
> > > >> you
> > > 
> > > the
> > > 
> > > >> better result.
> > > >> 
> > > >> Also add more servers to load balance as your QPS is high.
> > > >> Your 7 Laks data makes 25 GB of index its looking quite high.Try to
> > > 
> > > lower
> > > 
> > > >> the index size
> > > >> What are you indexing in your 25GB of index?
> > > >> 
> > > >> -
> > > >> Thanx:
> > > >> Grijesh
> > > >> --
> > > 
> > > >> View this message in context:
> > > http://lucene.472066.n3.nabble.com/Solr-Out-of-Memory-Error-tp2280037p2
> > > 28 5779.html
> > > 
> > > >> Sent from the Solr - User mailing list archive at Nabble.com.
> > > > 
> > > > --
> > > > Thanks & Regards,
> > > > Isan Fulia.


Re: Tomcat6 and Log4j

2011-02-10 Thread Markus Jelsma
Add it to CATALINA_OPTS; on Debian systems you could edit 
/etc/default/tomcat
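
A minimal sketch, reusing the path from your message; whether the variable is 
CATALINA_OPTS or JAVA_OPTS depends on how your Tomcat init script is set up:

  CATALINA_OPTS="$CATALINA_OPTS -Dlog4j.configuration=file:$CATALINA_HOME/webapps/solr/WEB-INF/classes/log4j.properties"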

On Thursday 10 February 2011 12:27:59 Xavier SCHEPLER wrote:
>  -Dlog4j.configuration=$CATALINA_HOME/webapps/solr/WEB-INF/classes/log4j.pr
> operties

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Tomcat6 and Log4j

2011-02-10 Thread Markus Jelsma
Oh, now looking at your log4j.properties, I believe it's wrong. You declared 
the rootLogger with only the INFO level but never attached your SOLR appender to it.

-log4j.rootLogger=INFO
+log4j.rootLogger=INFO, SOLR

try again




On Thursday 10 February 2011 09:41:29 Xavier Schepler wrote:
> Hi,
> 
> I added “slf4j-log4j12-1.5.5.jar” and “log4j-1.2.15.jar” to
> $CATALINA_HOME/webapps/solr/WEB-INF/lib ,
> then deleted the library “slf4j-jdk14-1.5.5.jar” from
> $CATALINA_HOME/webapps/solr/WEB-INF/lib,
> then created a directory $CATALINA_HOME/webapps/solr/WEB-INF/classes.
> and created $CATALINA_HOME/webapps/solr/WEB-INF/classes/log4j.properties
> with the following contents :
> 
> log4j.rootLogger=INFO
> log4j.appender.SOLR.logfile=org.apache.log4j.DailyRollingFileAppender
> log4j.appender.SOLR.logfile.file=/home/quetelet_bdq/logs/bdq.log
> log4j.appender.SOLR.logfile.DatePattern='.'-MM-dd
> log4j.appender.SOLR.logfile.layout=org.apache.log4j.PatternLayout
> log4j.appender.SOLR.logfile.layout.conversionPattern=%d %p [%c{3}] -
> [%t] - %X{ip}: %m%n
> log4j.appender.SOLR.logfile = true
> 
> I restarted solr and I got the following message in the catalina.out log :
> 
> log4j:WARN No appenders could be found for logger
> (org.apache.solr.core.SolrResourceLoader).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
> 
> What is told on this page is that this error occurs what the
> log4j.properties isn't found.
> 
> Could someone help me to have it working ?
> 
> Thanks in advance,
> 
> Xavier

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Tomcat6 and Log4j

2011-02-10 Thread Markus Jelsma
Oh, and for sharing purposes: we use a configuration like this one. It has 
an info and an error log and stores them next to Tomcat's own logs in 
/var/log/tomcat on Debian systems (or wherever catalina.base points on other 
distros).

log4j.rootLogger=DEBUG, info, error
 
log4j.appender.info=org.apache.log4j.DailyRollingFileAppender
log4j.appender.info.Threshold=INFO
log4j.appender.info.MaxFileSize=500KB
log4j.appender.info.MaxBackupIndex=20
log4j.appender.info.Append=true
log4j.appender.info.File=${catalina.base}/logs/info.log
log4j.appender.info.DatePattern='.'-MM-dd'.log'
log4j.appender.info.layout=org.apache.log4j.PatternLayout
log4j.appender.info.layout.ConversionPattern=%d %p [%c{3}] - [%t] - %X{ip}: 
%m%n
 
 
log4j.appender.error=org.apache.log4j.DailyRollingFileAppender
log4j.appender.error.Threshold=ERROR
log4j.appender.error.File=${catalina.base}/logs/error.log
log4j.appender.error.DatePattern='.'-MM-dd'.log'
log4j.appender.error.layout=org.apache.log4j.PatternLayout
log4j.appender.error.layout.ConversionPattern=%d %p [%c{3}] - [%t] - %X{ip}: 
%m%n
log4j.appender.error.MaxFileSize=500KB
log4j.appender.error.MaxBackupIndex=20



On Thursday 10 February 2011 12:51:13 Markus Jelsma wrote:
> Oh, now looking at your log4j.properties, i believe it's wrong. You
> declared INFO as rootLogger but you use SOLR.
> 
> -log4j.rootLogger=INFO
> +log4j.rootLogger=SOLR
> 
> try again
> 
> On Thursday 10 February 2011 09:41:29 Xavier Schepler wrote:
> > Hi,
> > 
> > I added “slf4j-log4j12-1.5.5.jar” and “log4j-1.2.15.jar” to
> > $CATALINA_HOME/webapps/solr/WEB-INF/lib ,
> > then deleted the library “slf4j-jdk14-1.5.5.jar” from
> > $CATALINA_HOME/webapps/solr/WEB-INF/lib,
> > then created a directory $CATALINA_HOME/webapps/solr/WEB-INF/classes.
> > and created $CATALINA_HOME/webapps/solr/WEB-INF/classes/log4j.properties
> > with the following contents :
> > 
> > log4j.rootLogger=INFO
> > log4j.appender.SOLR.logfile=org.apache.log4j.DailyRollingFileAppender
> > log4j.appender.SOLR.logfile.file=/home/quetelet_bdq/logs/bdq.log
> > log4j.appender.SOLR.logfile.DatePattern='.'-MM-dd
> > log4j.appender.SOLR.logfile.layout=org.apache.log4j.PatternLayout
> > log4j.appender.SOLR.logfile.layout.conversionPattern=%d %p [%c{3}] -
> > [%t] - %X{ip}: %m%n
> > log4j.appender.SOLR.logfile = true
> > 
> > I restarted solr and I got the following message in the catalina.out log
> > :
> > 
> > log4j:WARN No appenders could be found for logger
> > (org.apache.solr.core.SolrResourceLoader).
> > log4j:WARN Please initialize the log4j system properly.
> > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> > more info.
> > 
> > What is told on this page is that this error occurs what the
> > log4j.properties isn't found.
> > 
> > Could someone help me to have it working ?
> > 
> > Thanks in advance,
> > 
> > Xavier

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Wikipedia table of contents.

2011-02-10 Thread Markus Jelsma
Yes but it's not very useful:
http://wiki.apache.org/solr/TitleIndex


On Thursday 10 February 2011 16:14:40 Dennis Gearon wrote:
> Is there a detailed, perhaps alphabetical & hierarchical table of
> contents for all ether wikis on the sole site? Sent
> from Yahoo! Mail on Android

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: solr admin result page error

2011-02-11 Thread Markus Jelsma
It looks like you hit the same issue as i did a while ago:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg46510.html


On Friday 11 February 2011 08:59:27 Bernd Fehling wrote:
> Dear list,
> 
> after loading some documents via DIH which also include urls
> I get this yellow XML error page as search result from solr admin GUI
> after a search.
> It says XML processing error "not well-formed".
> The code it argues about is:
> 
> 
> http://eprints.soton.ac.uk/43350/
> http://dx.doi.org/doi:10.1112/S0024610706023143
> Martinez-Perez, Conchita and Nucinkis, Brita E.A. (2006) Cohomological
> dimension of Mackey functors for infinite groups. Journal of the London
> Mathematical Society, 74, (2), 379-396. (doi:10.1112/S0024610706023143
> <http://dx.doi.org/10.1112/S002461070602314\u>;)
> 
> See the \u utf8-code in the last line.
> 
> 1. the loaded data is valid, well-formed and checked with xmllint. No
> errors. 2. there is no \u utf8-code in the source data.
> 3. the data is loaded via DIH without any errors.
> 4. if opening the source-view of the result page with firefox there is also
> no \u utf8-code.
> 
> Only idea I have is solr itself or the result page generation.
> 
> How to proceed, what else to check?
> 
> Regards,
> Bernd

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: solr admin result page error

2011-02-11 Thread Markus Jelsma
No, I haven't located the issue. It might be Solr, but it could also be Xerces 
having trouble with it. You can possibly work around the problem by using the 
JSONResponseWriter.
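
For example, the same request with the JSON writer (host and core path are 
illustrative):

  http://localhost:8983/solr/select?q=*:*&rows=1&wt=json&indent=true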

On Friday 11 February 2011 15:45:23 Bernd Fehling wrote:
> Hi Markus,
> 
> yes it looks like the same issue. There is also a \u utf8-code in your
> dump. Till now I followed it into XMLResponseWriter.
> Some steps before the result in a buffer looks good and the utf8-code is
> correct. Really hard to debug this freaky problem.
> 
> Have you looked deeper into this and located the bug?
> 
> It is definately a bug and has nothing to do with firefox.
> 
> Regards,
> Bernd
> 
> Am 11.02.2011 13:48, schrieb Markus Jelsma:
> > It looks like you hit the same issue as i did a while ago:
> > http://www.mail-archive.com/solr-user@lucene.apache.org/msg46510.html
> > 
> > On Friday 11 February 2011 08:59:27 Bernd Fehling wrote:
> >> Dear list,
> >> 
> >> after loading some documents via DIH which also include urls
> >> I get this yellow XML error page as search result from solr admin GUI
> >> after a search.
> >> It says XML processing error "not well-formed".
> >> The code it argues about is:
> >> 
> >> 
> >> http://eprints.soton.ac.uk/43350/
> >> http://dx.doi.org/doi:10.1112/S0024610706023143
> >> Martinez-Perez, Conchita and Nucinkis, Brita E.A. (2006)
> >> Cohomological dimension of Mackey functors for infinite groups. Journal
> >> of the London Mathematical Society, 74, (2), 379-396.
> >> (doi:10.1112/S0024610706023143
> >> <http://dx.doi.org/10.1112/S002461070602314\u>;)
> >> 
> >> See the \u utf8-code in the last line.
> >> 
> >> 1. the loaded data is valid, well-formed and checked with xmllint. No
> >> errors. 2. there is no \u utf8-code in the source data.
> >> 3. the data is loaded via DIH without any errors.
> >> 4. if opening the source-view of the result page with firefox there is
> >> also no \u utf8-code.
> >> 
> >> Only idea I have is solr itself or the result page generation.
> >> 
> >> How to proceed, what else to check?
> >> 
> >> Regards,
> >> Bernd

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Title index to wiki

2011-02-11 Thread Markus Jelsma
What do you mean? There are two links to the FrontPage on each page.

On Friday 11 February 2011 16:56:41 Dennis Gearon wrote:
> I think it would be an improvement to the wikis if the link to the title
> index were at the top of the index page of the wikis :-) I looked on that
> index page & did not see that link on that page. Who's got
> write access to wikis pages?
> Sent from Yahoo! Mail on Android

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Difference between Solr and Lucidworks distribution

2011-02-11 Thread Markus Jelsma
It is not free for production environments.
http://www.lucidimagination.com/lwe/subscriptions-and-pricing

On Friday 11 February 2011 17:31:22 Greg Georges wrote:
> Hello all,
> 
> I just started watching the webinars from Lucidworks, and they mention
> their distribution which has an installer, etc.. Is there any other
> differences? Is it a good idea to use this free distribution?
> 
> Greg

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: carrot2 clustering component error

2011-02-15 Thread Markus Jelsma
I've seen that before on a 3.1 checkout after I compiled the clustering 
component, copied the jars and started Solr. For some reason, recompiling 
didn't work and doing an ant clean first didn't fix it either. Updating to a 
revision I knew did work also failed.

I just removed the entire checkout, checked it out again, repeated my steps and 
it works fine now.

> help me out of this error:
> 
>   java.lang.NoClassDefFoundError: org/apache/solr/util/plugin/SolrCoreAware


Re: clustering with tomcat

2011-02-16 Thread Markus Jelsma
On Debian you can edit /etc/default/tomcat6

> hi,
>  i am  using  solr1.4  with apache tomcat. to enable the
> clustering feature
> i follow the link
> http://wiki.apache.org/solr/ClusteringComponent
> Plz help me how to   add-Dsolr.clustering.enabled=true to $CATALINA_OPTS.
> after that which steps be will required.


Re: clustering with tomcat

2011-02-16 Thread Markus Jelsma
What distro are you using? On at least Debian systems you can put the 
-Dsolr.clustering.enabled=true system property in /etc/default/tomcat6.

You can also, of course, remove all occurrences of ${solr.clustering.enabled} 
from your solrconfig.xml.
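
For example, a sketch of the line you would add to /etc/default/tomcat6:

  JAVA_OPTS="$JAVA_OPTS -Dsolr.clustering.enabled=true"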

On Wednesday 16 February 2011 10:52:35 Isha Garg wrote:
> On Wednesday 16 February 2011 02:41 PM, Markus Jelsma wrote:
> > On Debian you can edit /etc/default/tomcat6
> > 
> >> hi,
> >> 
> >>   i am  using  solr1.4  with apache tomcat. to enable the
> >> 
> >> clustering feature
> >> i follow the link
> >> http://wiki.apache.org/solr/ClusteringComponent
> >> Plz help me how to   add-Dsolr.clustering.enabled=true to
> >> $CATALINA_OPTS. after that which steps be will required.
> 
> i did nt understand  can u plz elaborate how to do this

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Snappull failed

2011-02-16 Thread Markus Jelsma
Hi,

There are a couple of Solr 1.4.1 slaves, all doing the same thing: pulling some 
snaps, handling some queries, nothing exciting. But can anyone explain a 
sudden nightly occurrence of this error?

2011-02-16 01:23:04,527 ERROR [solr.handler.ReplicationHandler] - [pool-238-
thread-1] - : SnapPull failed 
org.apache.solr.common.SolrException: Unable to download _gv.frq completely. 
Downloaded 209715200!=583644834
at 
org.apache.solr.handler.SnapPuller$FileFetcher.cleanup(SnapPuller.java:1026)
at 
org.apache.solr.handler.SnapPuller$FileFetcher.fetchFile(SnapPuller.java:906)
at 
org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:541)
at 
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:294)
at 
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:264)
at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at 
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)

All I know is that it was unable to download, but the reason eludes me. 
Sometimes a machine rolls out many of these errors, increasing the index 
size because it can't handle the already downloaded data.

Cheers,
-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: clustering with tomcat

2011-02-16 Thread Markus Jelsma
I have no idea; it seems you haven't compiled Carrot2 or haven't included all 
jars.

On Wednesday 16 February 2011 11:29:30 Isha Garg wrote:
> On Wednesday 16 February 2011 03:32 PM, Markus Jelsma wrote:
> > What distro are you using? On at least Debian systems you can put the -
> > Dsolr.clustering.enabled=true environment variable in
> > /etc/default/tomcat6.
> > 
> > You can also, of course, remove all occurences of
> > ${solr.clustering.enabled} from you solrconfig.xml
> > 
> > On Wednesday 16 February 2011 10:52:35 Isha Garg wrote:
> >> On Wednesday 16 February 2011 02:41 PM, Markus Jelsma wrote:
> >>> On Debian you can edit /etc/default/tomcat6
> >>> 
> >>>> hi,
> >>>> 
> >>>>i am  using  solr1.4  with apache tomcat. to enable the
> >>>> 
> >>>> clustering feature
> >>>> i follow the link
> >>>> http://wiki.apache.org/solr/ClusteringComponent
> >>>> Plz help me how to   add-Dsolr.clustering.enabled=true to
> >>>> $CATALINA_OPTS. after that which steps be will required.
> >> 
> >> i did nt understand  can u plz elaborate how to do this
> 
> I have embed solr with  apache-tomcat5.5 on linux i am getting error
> 
> HTTP Status 500 - Severe errors in solr configuration. Check your log
> files for more detailed information on what may be wrong. If you want
> solr to continue after configuration errors, change:
> false in null
> -
> java.lang.NoSuchMethodError:
> org.carrot2.util.pool.SoftUnboundedPool.(Lorg/carrot2/util/pool/IInst
> antiationListener;Lorg/carrot2/util/pool/IActivationListener;Lorg/carrot2/u
> til/pool/IPassivationListener;Lorg/carrot2/util/pool/IDisposalListener;)V
> at org.carrot2.core.CachingController.init(CachingController.java:189) at
> org.carrot2.core.CachingController.init(CachingController.java:115) at
> org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.init(Carr
> otClusteringEngine.java:94) at
> org.apache.solr.handler.clustering.ClusteringComponent.inform(ClusteringCom
> ponent.java:123) at
> org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:486)
> at org.apache.solr.core.SolrCore.(SolrCore.java:588) at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.jav
> a:137) at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
> at
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilte
> rConfig.java:221) at
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFi
> lterConfig.java:302) at
> org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterCo
> nfig.java:78) at
> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3
> 666) at
> org.apache.catalina.core.StandardContext.start(StandardContext.java:4258)
> at
> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java
> :760) at
> org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:740)
> at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:544)
> at
> org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:980)
> at
> org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:94
> 3) at
> org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:500)
> at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1203) at
> org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:319)
> at
> org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSuppo
> rt.java:120) at
> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1022) at
> org.apache.catalina.core.StandardHost.start(StandardHost.java:736) at
> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1014) at
> org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443) at
> org.apache.catalina.core.StandardService.start(StandardService.java:448)
> at
> org.apache.catalina.core.StandardServer.start(StandardServer.java:700)
> at org.apache.catalina.startup.Catalina.start(Catalina.java:552) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:3
> 9) at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImp
> l.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at
> org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:295) at
> org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:433)
> 
> Now can u tell me what to do I am not familiar with distro and Debian
> systems

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Term Vector Query on Single Document

2011-02-16 Thread Markus Jelsma
On Wednesday 16 February 2011 16:49:51 Tod wrote:
> I have a couple of semi-related questions regarding the use of the Term
> Vector Component:
> 
> 
> - Using curl is there a way to query a specific document (maybe using
> Tika when required?) to get a distribution of the terms it contains?

No Tika involved here. You can just query a document q=id:whatever and enable 
the TVComponent. Make sure you list your fields in the tv.fl parameter. Those 
fields, of course, need TermVectors enabled.
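
A sketch of such a request with curl, assuming the tvrh handler from the example 
solrconfig.xml and a stored field named content with termVectors="true" (both are 
assumptions):

  curl 'http://localhost:8983/solr/select?qt=tvrh&q=id:whatever&tv=true&tv.tf=true&tv.df=true&tv.fl=content'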
> 
> - When I set the termVector on a field do I need to reindex?  I'm
> thinking 'yes'

Yes.

> 
> - How expensive is setting the termVector on a field?

Takes up additional disk space and RAM. Can be a lot.

> 
> 
> Thanks - Tod

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: optimize and mergeFactor

2011-02-16 Thread Markus Jelsma
> In my own Solr 1.4, I am pretty sure that running an index optimize does
> give me significant better performance. Perhaps because I use some
> largeish (not huge, maybe as large as 200k) stored fields.

200.000 stored fields? I assume that number includes your number of documents? 
Sounds crazy =)

> 
> So I'm interested in always keeping my index optimized.
> 
> Am I right that if I set mergeFactor to '1', essentially my index will
> always be optimized after every commit, and actually running 'optimize'
> will be redundant?

You can set mergeFactor to 2, not lower. 

> 
> What are the possible negative repurcussions of setting mergeFactor to
> 1? Is this a really bad idea?  If not 1, what about some other
> lower-than-usually-recommended value like 2 or 3?  Anyone done this?
> I imagine it will slow down my commits, but if the alternative is
> running optimize a lot anyway I wonder at what point I get 'break
> even' (if I optimize after every single commit, clearly might as well
> just set the mergeFactor low, right? But if I optimize after every X
> documents or Y commits don't know what X/Y are break-even).

This depends on the commit rate and on whether there are a lot of updates and deletes 
instead of adds. Setting it very low will indeed cause a lot of merging and 
slow commits. It will also be very slow in replication because merged files are 
copied over again and again, causing high I/O on your slaves.

There is always a `break even` point, but it depends (as usual) on your scenario and 
business demands.
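
For reference, the knob being discussed sits in solrconfig.xml; the value here is 
purely illustrative:

  <mainIndex>
    <mergeFactor>2</mergeFactor>
  </mainIndex>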

> 
> Jonathan


Re: Shutdown hook executing for a long time

2011-02-16 Thread Markus Jelsma
Closing a core will shut down almost everything related to the workings of a 
core: update and search handlers, possible warming searchers, etc.

Check the implementation of the close method:
http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/solr/src/java/org/apache/solr/core/SolrCore.java?view=markup

> 2011-02-16 11:32:45.489::INFO:  Shutdown hook executing
> 2011-02-16 11:35:36.002::INFO:  Shutdown hook complete
> 
> The shutdown time seems to be proportional to the amount of time that Solr
> has been running.  If I immediately restart and shut down again, it takes
> a fraction of a second.  What causes it to take so long to shut down and
> is there anything I can do to make it happen quicker?


Re: optimize and mergeFactor

2011-02-16 Thread Markus Jelsma

> Thanks for the answers, more questions below.
> 
> On 2/16/2011 3:37 PM, Markus Jelsma wrote:
> > 200.000 stored fields? I asume that number includes your number of
> > documents? Sounds crazy =)
> 
> Nope, I wasn't clear. I have less than a dozen stored field, but the
> value of a stored field can sometimes be as large as 200kb.
> 
> > You can set mergeFactor to 2, not lower.
> 
> Am I right though that manually running an 'optimize' is the equivalent
> of a mergeFactor=1?  So there's no way to get Solr to keep the index in
> an 'always optimized' state, if I'm understanding correctly? Cool. Just
> want to understand what's going on.

That should be it. If I remember correctly a second segment is always written; 
new updates aren't merged immediately. 

> 
> > This depends on commit rate and if there are a lot of updates and deletes
> > instead of adds. Setting it very low will indeed cause a lot of merging
> > and slow commits. It will also be very slow in replication because
> > merged files are copied over again and again, causing high I/O on your
> > slaves.
> > 
> > There is always a `break even` but it depends (as usual) on your scenario
> > and business demands.
> 
> There are indeed sadly lots of updates and deletes, which is why I need
> to run optimize periodically. I am aware that this will cause more work
> for replication -- I think this is true whether I manually issue an
> optimize before replication _or_ whether I just keep the mergeFactor
> very low, right? Same issue either way.

Yes. But having several segments shouldn't make that much of a difference. If 
search latency is just a few additional milliseconds then I'd rather have a 
few more segments being copied over more quickly.

> 
> So... if I'm going to do lots of updates and deletes, and my other
> option is running an optimize before replication anyway   is there
> any reason it's going to be completely stupid to set the mergeFactor to
> 2 on the master?  I realize it'll mean all index files are going to have
> to be replicated, but that would be the case if I ran a manual optimize
> in the same situation before replication too, I think.

No, it's not stupid if you allow for slow indexing and slow copying of files 
but want a very quick search.

> 
> Jonathan


Re: Solr multi cores or not

2011-02-16 Thread Markus Jelsma
Hi,

That depends (as usual) on your scenario. Let me ask some questions:

1. what is the sum of documents for your applications?
2. what is the expected load in queries/minute
3. what is the update frequency in documents/minute and how many documents per 
commit?
4. how many different applications do you have?
5. are the query demands for the business the same (or very similar) for all 
applications?
6. can you easily upgrade hardware or demand more machines?
7. must you enforce security between applications and are the clients not 
under your control?

I'm puzzled though, you have so much memory but so little CPU. What about the 
disks? Size? Spinning or SSD?

Cheers,

> Hi,
> 
> I have a need to index multiple applications using Solr, I also have the
> need to share indexes or run a search query across these application
> indexes. Is solr multi-core - the way to go?  My server config is
> 2virtual CPUs @ 1.8 GHz and has about 32GB of memory. What is the
> recommendation?
> 
> Thanks,
> Sai Thumuluri


Re: Solr multi cores or not

2011-02-16 Thread Markus Jelsma
You can also easily abuse shards to query multiple cores that share parts of 
the schema. This way you have isolation with the ability to query them all. 
The same can, of course, also be achieved using a single index with a simple 
field identifying the application and using fq on that one. Examples of both 
are below.
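
For example (host, core and field names are just placeholders):

http://localhost:8983/solr/app1/select?q=foo&shards=localhost:8983/solr/app1,localhost:8983/solr/app2

or, with the single-index approach:

http://localhost:8983/solr/select?q=foo&fq=application:app1

Note that for the shards approach the unique key field must be the same in all 
cores and documents must not overlap between them.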

> Yes, you're right, from now on when I say that, I'll say "except
> shards". It is true.
> 
> My understanding is that shards functionality's intended use case is for
> when your index is so large that you want to split it up for
> performance. I think it works pretty well for that, with some
> limitations as you mention.
> 
>  From reading the list, my impression is that when people try to use
> shards to solve some _other_ problem, they generally run into problems.
> But maybe that's just because the people with the problems are the ones
> who appear on the list?
> 
> My personal advice is still to try and put everything together in one
> big index, Solr will give you the least trouble with that, it's what
> Solr "likes" to do best;  move to shards certainly if your index is so
> large that moving to shards will give you performance advantage you
> need, that's what they're for; be very cautious moving to shards for
> other challenges that 'one big index' is giving you that you're thinking
> shards will solve. Shards is, as I understand it, _not_ intended as a
> general purpose "federation" function, it's specifically intended to
> split an index accross multiple hosts for performance.
> 
> Jonathan
> 
> On 2/16/2011 4:37 PM, Bob Sandiford wrote:
> > Hmmm.  Maybe I'm not understanding what you're getting at, Jonathan, when
> > you say 'There is no good way in Solr to run a query across multiple
> > Solr indexes'.
> > 
> > What about the 'shards' parameter?  That allows searching across multiple
> > cores in the same instance, or shards across multiple instances.
> > 
> > There are certainly implications here (like Relevance not being
> > consistent across cores / shards), but it works pretty well for us...
> > 
> > Thanks!
> > 
> > Bob Sandiford | Lead Software Engineer | SirsiDynix
> > P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
> > www.sirsidynix.com
> > 
> >> -Original Message-
> >> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> >> Sent: Wednesday, February 16, 2011 4:09 PM
> >> To: solr-user@lucene.apache.org
> >> Cc: Thumuluri, Sai
> >> Subject: Re: Solr multi cores or not
> >> 
> >> Solr multi-core essentially just lets you run multiple seperate
> >> distinct
> >> Solr indexes in the same running Solr instance.
> >> 
> >> It does NOT let you run queries accross multiple cores at once. The
> >> cores are just like completely seperate Solr indexes, they are just
> >> conveniently running in the same Solr instance. (Which can be easier
> >> and
> >> more compact to set up than actually setting up seperate Solr
> >> instances.
> >> And they can share some config more easily. And it _may_ have
> >> implications on JVM usage, not sure).
> >> 
> >> There is no good way in Solr to run a query accross multiple Solr
> >> indexes, whether they are multi-core or single cores in seperate Solr
> >> doesn't matter.
> >> 
> >> Your first approach should be to try and put all the data in one Solr
> >> index. (one Solr 'core').
> >> 
> >> Jonathan
> >> 
> >> On 2/16/2011 3:45 PM, Thumuluri, Sai wrote:
> >>> Hi,
> >>> 
> >>> I have a need to index multiple applications using Solr, I also have
> >> 
> >> the
> >> 
> >>> need to share indexes or run a search query across these application
> >>> indexes. Is solr multi-core - the way to go?  My server config is
> >>> 2virtual CPUs @ 1.8 GHz and has about 32GB of memory. What is the
> >>> recommendation?
> >>> 
> >>> Thanks,
> >>> Sai Thumuluri


Re: Replication and newSearcher registerd > poll interval

2011-02-17 Thread Markus Jelsma
If you set maxWarmingSearchers to 1 then you cannot issue an overlapping 
commit. Slaves won't poll for a new index version while replication is in 
progress.

It works well in my environment where there is a high update/commit frequency, 
about a thousand documents per minute. The system even behaves well at a 
thousand updates per second with a commit per minute and a poll interval of 2 
seconds.
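
A sketch of the relevant settings (the values are simply the ones from my
setup, not a recommendation):

<maxWarmingSearchers>1</maxWarmingSearchers>

in solrconfig.xml, and on the slave:

<str name="pollInterval">00:00:02</str>

inside the slave section of the replication handler.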

On Thursday 17 February 2011 11:54:32 dan sutton wrote:
> Hi,
> 
> Keeping the thread alive, any thought on only doing replication if
> there is no warming currently going on?
> 
> Cheers,
> Dan
> 
> On Thu, Feb 10, 2011 at 11:09 AM, dan sutton  wrote:
> > Hi,
> > 
> > If the replication window is too small to allow a new searcher to warm
> > and close the current searcher before the new one needs to be in
> > place, then the slaves continuously has a high load, and potentially
> > an OOM error. we've noticed this in our environment where we have
> > several facets on large multivalued fields.
> > 
> > I was wondering what the list though about modifying the replication
> > process to skip polls (though warning to logs) when there is a
> > searcher in the process of warming? Else as in our case it brings the
> > slave to it's knees, workaround was to extend the poll interval,
> > though not ideal.
> > 
> > Cheers,
> > Dan

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: My Plan to Scale Solr

2011-02-17 Thread Markus Jelsma
Hi Bing Li,

On Thursday 17 February 2011 10:32:11 Bing Li wrote:
> Dear all,
> 
> I started to learn how to use Solr three months ago. My experiences are
> still limited.
> 
> Now I crawl Web pages with my crawler and send the data to a single Solr
> server. It runs fine.
> 
> Since the potential users are large, I decide to scale Solr. After
> configuring replication, a single index can be replicated to multiple
> servers.
> 
> For shards, I think it is also required. I attempt to split the index
> according to the data categories and priorities. After that, I will use the
> above replication techniques and get high performance. The following work
> is not so difficult.

It's better to use a consistent hashing algorithm to decide which server takes 
which documents if you want good relevancy. Using a modulo with the number of 
servers will return the shard per document. If you have integers as unique key 
then just a modulo will suffice.
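
As a quick example: with 3 shards, a document with integer key 12346 goes to 
shard 12346 % 3 = 1, and the same document always ends up on the same shard as 
long as the number of shards doesn't change.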

> 
> I noticed some new terms, such as SolrClould, Katta and ZooKeeper.
> According to my current understandings, it seems that I can ignore them.
> Am I right? What benefits can I get if using them?

SolrCloud comes with ZooKeeper. It's designed to provide a fail-over cluster 
and more useful features. I haven't tried Katta.

> 
> Thanks so much!
> LB

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Validate Query Syntax of Solr Request Before Sending

2011-02-17 Thread Markus Jelsma
Uh, how about the LuceneQParser? It does some checks and can return 
appropriate error messages.

On Thursday 17 February 2011 06:44:16 csj wrote:
> Hi,
> 
> I wonder if it is possible to let the user build up a Solr Query and have
> it validated by some java API before sending it to Solr.
> 
> Is there a parser that could help with that? I would like to help the user
> building a valid query as she types by showing messages like "The query is
> not valid" or purhaps even more advanced: "The parentheses are not
> balanced".
> 
> Maybe one day it would also be possible to analyse the semantics of the
> query like: "This query has a build-in inconsistency because the two dates
> you have specified requires documents to be before AND after these date".
> But this is far future...
> 
> Regards,
> 
> Christian Sonne Jensen

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: fine tuning the solr search

2011-02-17 Thread Markus Jelsma
Have a read:
http://lucene.apache.org/java/2_9_1/scoring.html
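
A dismax configuration along these lines usually gives that ordering (field
names are placeholders for your own schema):

q=account+manager+files&defType=dismax&qf=content&pf=content^10&mm=1

The pf phrase boost pushes documents containing the whole phrase to the top, 
documents matching more of the terms score higher than those matching fewer, 
and mm=1 keeps single-term matches in the result set.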

On Thursday 17 February 2011 12:50:08 Churchill Nanje Mambe wrote:
> Hi
>  I would love to know how to do this with solr
>  say a user inputs "Account manager files",
>  I wish that solr puts priority on the documents it finds as follows
>  1) documents containing account manager files gets a greater score
>  2) then documents with account manager come next
>  3) then documents with account can come
>  before the other words are used to get documents in the search
> 
> right now I think it works different as it finds documents with
> accounts and puts them like in first position or documents with word
> manager in second position or so
> 
> thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: is solr dynamic calculation??

2011-02-17 Thread Markus Jelsma
Both, you should also read about scoring.
http://lucene.apache.org/java/2_9_1/scoring.html
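
If you want to see exactly how the score of each document is computed, add
debugQuery=true to the request, e.g.:

http://localhost:8983/solr/select?q=solr&debugQuery=true

The explain section of the response breaks the score down per document.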

On Thursday 17 February 2011 13:39:05 satya swaroop wrote:
> Hi All,
>  I have a query whether the solr shows the results of documents by
> calculating the score on dynamic or is it pre calculating and supplying??..
> 
> for example:
> if a query is made on q=solr in my index... i get a results of 25
> documents... what is it calculating?? i am very keen to know its way of
> calculation of score and ordering of results
> 
> 
> Regards,
> satya

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: last item in results page is always the same

2011-02-17 Thread Markus Jelsma
It's fixed in 1.4.1.
https://issues.apache.org/jira/browse/SOLR-1777

On Thursday 17 February 2011 16:04:18 Paul wrote:
> Thanks, going to update now. This is a system that is currently
> deployed. Should I just update to 1.4.1, or should I go straight to
> 3.0? Does 1.4 => 3.0 require reindexing?
> 
> On Wed, Feb 16, 2011 at 5:37 PM, Yonik Seeley
> 
>  wrote:
> > On Wed, Feb 16, 2011 at 5:08 PM, Paul  wrote:
> >> Is this a known solr bug or is there something subtle going on?
> > 
> > Yes, I think it's the following bug, fixed in 1.4.1:
> > 
> > * SOLR-1777: fieldTypes with sortMissingLast=true or
> > sortMissingFirst=true can result in incorrectly sorted results.
> > 
> > -Yonik
> > http://lucidimagination.com

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: GET or POST for large queries?

2011-02-18 Thread Markus Jelsma
Increase the maxBooleanClauses setting in solrconfig, see the example below.
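
For example (the value is just an illustration, the default is 1024):

<maxBooleanClauses>4096</maxBooleanClauses>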

On Friday 18 February 2011 15:30:11 mrw wrote:
> Thanks for the response.
> 
> POSTing the data appears to avoid the header threshold issue, but it breaks
> because of the "too many boolean clauses" error.
> 
> gearond wrote:
> > Probably you could do it, and solving a problem in business supersedes
> > 'rightness' concerns, much to the dismay of geeks and 'those who like
> > rightness
> > and say the word "Neemph!" '.
> > 
> > 
> > the not rightness about this is that:
> > POST, PUT, DELETE are assumed to make changes to the URL's backend.
> > GET is assumed NOT to make changes.
> > 
> > So if your POST does not make a change . . . it breaks convention. But if
> > it
> > solves the problem . . . :-)
> > 
> > Another way would be to GET with a 'query file' location, and then have
> > the
> > server fetch that query and execute it.
> > 
> > Boy!!! I'd love to see one of your queries!!! You must have a few
> > ANDs/ORs in
> > them :-)
> > 
> >  Dennis Gearon

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: [Solr] and CouchDB

2011-02-19 Thread Markus Jelsma
CouchDB is a good piece of software for some scenarios and easy to use. It 
has update handlers to which you could attach a small program that takes the 
input, transforms it to Solr XML and sends it over.

CouchDB-Lucene is a bit different. It lacks the power of Solr, and you need to 
write a custom view to send data over. But it works and brings full-text 
search to CouchDB.

> I am curious to see if anyone has messed around with Solr and the
> Couch-Lucene incarnation that is out there...I was passed this article
> this morning and it really opened my eyes about CouchDB
> http://m.readwriteweb.com/hack/2011/02/hacker-chat-max-ogden.php
> 
> Thoughts,
> Adam


Re: Delete doc in index by date range?

2011-02-19 Thread Markus Jelsma
Sure
http://wiki.apache.org/solr/UpdateXmlMessages#A.22delete.22_by_ID_and_by_Query
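
For your case that would be something like (assuming indexed_date is a date
field):

<delete><query>indexed_date:[* TO NOW-90DAYS]</query></delete>

posted to the update handler, followed by a commit.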

> Hi,
> 
> I'm wondering if it's possible to delete documents in m'y index by date
> range?
> 
> I've got a field in my schema: indexed_date in date type and i would like
> to remove docs older than 90 days.
> 
> Thanks for your help
> 
> Marc


Re: How to get a field that starts with a minus?

2011-02-20 Thread Markus Jelsma
He could also just escape it, or am I missing something?
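
E.g. q=id:\-3f66fdfb1ef5f8719f65a7403e93cc9d with the standard query parser, or 
quote the value: q=id:"-3f66fdfb1ef5f8719f65a7403e93cc9d".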

> --- On Sun, 2/20/11, Paul Tomblin  wrote:
> > From: Paul Tomblin 
> > Subject: Re: How to get a field that starts with a minus?
> > To: solr-user@lucene.apache.org
> > Date: Sunday, February 20, 2011, 5:53 PM
> > Yes, it's string:
> > > class="solr.StrField"
> >
> > sortMissingLast="true" omitNorms="true"/>
> >  > type="string" stored="true" indexed="true"/>
> 
> No, string is OK. In this case it is better to use raw or field query
> parser.
> 
> SolrQuery.setQuery("{!raw f=id}-3f66fdfb1ef5f8719f65a7403e93cc9d");


Re: UpdateProcessor and copyField

2011-02-22 Thread Markus Jelsma
Yes. But did you actually search the mailing list or Solr's wiki? I guess not.

Here it is:
http://wiki.apache.org/solr/UpdateRequestProcessor

> Can fields created by copyField instructions be processed by
> UpdateProcessors?
> Or only raw input fields can?
> 
> So far my experiment is suggesting the latter.
> 
> 
> T. "Kuro" Kurosaka


Re: Sort Stability With Date Boosting and Rounding

2011-02-22 Thread Markus Jelsma
Hi,

You're right, it's illegal syntax to use other functions in the ms function, 
which is a pity indeed.

However, you reduce the score by 50% for each year. Therefore paging through 
the results shouldn't make that much of a difference because the difference in 
score with NOW+2 minutes has a negligible impact on the total score.

I had some thoughts on this issue as well but I decided the impact was too 
little to bother about.

Cheers,

> I'm trying to use
> http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_n
> ewer_documents as
> a bf parameter to my dismax handler.  The problem is, the value of NOW can
> cause documents in a similar range (date value within a few seconds of each
> other) to sometimes round to be equal, and sometimes not, changing their
> sort order (when equal, falling back to a secondary sort).  This, in turn,
> screws up paging.
> 
> The problem is that score is rounded to a lower level of precision than
> what the suggested formula produces as a difference between two values
> within seconds of each other.  It seems to me if I could round the value
> to minutes or hours, where the difference will be large enough to not be
> rounded-out, then I wouldn't have problems with order changing on me.  But
> it's not legal syntax to specify something like:
> recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1)
> 
> Is this a problem anyone has faced and solved?  Anyone have suggested
> solutions, other than indexing a copy of the date field that's rounded to
> the hour?
> 
> --
> Stephen Duncan Jr
> www.stephenduncanjr.com


Re: UpdateProcessor and copyField

2011-02-23 Thread Markus Jelsma
You are right, I misread. Fields are copied first, then analyzed, and then they 
behave like other fields and pass the same way through the update processor.

Cheers,

> Markus,
> 
> I searched but I couldn't find a definite answer, so I posted this
> question.
> The article you quoted talks about implementing a copyField-like operation
> using UpdateProcessor.  It doesn't talk about relationship between
> the copyField operation proper and UpdateProcessors.
> 
> Kuro
> 
> On 2/22/11 3:00 PM, "Markus Jelsma"  wrote:
> >Yes. But did you actually search the mailing list or Solr's wiki? I guess
> >not.
> >
> >Here it is:
> >http://wiki.apache.org/solr/UpdateRequestProcessor
> >
> >> Can fields created by copyField instructions be processed by
> >> UpdateProcessors?
> >> Or only raw input fields can?
> >> 
> >> So far my experiment is suggesting the latter.
> >> 
> >> 
> >> T. "Kuro" Kurosaka


Re: Sort Stability With Date Boosting and Rounding

2011-02-23 Thread Markus Jelsma
Hi,

This seems to be a tricky issue judging from the other replies. I'm just 
thinking out of the box now and the following options come to mind:

1) can you store the timestamp in the session in your middleware for each 
user? This way it stays fixed and doesn't change the order between requests. Of 
course, the order can still change when new documents are committed but this 
cannot be avoided. 

2) if you have frequent commits, you might find a way to modify Solr's 
RandomSortField to create a NOW for each commit. The timestamp remains fixed 
for all consecutive requests if you use the same field for the timestamp 
every time. So instead of generating a random value, you'd just compute the 
current timestamp and the behavior will stay the same as RandomSortField.
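
For 1), the fixed timestamp can then simply replace NOW in the boost function, 
e.g. (the date value is only an illustration):

recip(ms(2011-02-24T12:00:00Z,manufacturedate_dt),3.16e-11,1,1)

so the same user keeps getting the same scores while paging.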

Cheers

> The problem comes when you have results that are all the same natural score
> (because you've filtered them, with no primary search, for instance), and
> are very close together in time.  Then, as you page through, the order
> changes.  So the user experience is that they see duplicate documents, and
> miss out on some of the docs in the overall set.  It's not something
> negligible that I can ignore.  I either have to come up with a fix for
> this, or get rid of the boost function altogether.
> 
> Stephen Duncan Jr
> www.stephenduncanjr.com
> 
> 
> On Tue, Feb 22, 2011 at 6:09 PM, Markus Jelsma
> 
> wrote:
> > Hi,
> > 
> > You're right, it's illegal syntax to use other functions in the ms
> > function,
> > which is a pity indeed.
> > 
> > However, you reduce the score by 50% for each year. Therefore paging
> > through
> > the results shouldn't make that much of a difference because the
> > difference in
> > score with NOW+2 minutes has a negligable impact on the total score.
> > 
> > I had some thoughts on this issue as well but i decided the impact was
> > too little to bother about.
> > 
> > Cheers,
> > 
> > > I'm trying to use
> > 
> > http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of
> > _n
> > 
> > > ewer_documents as
> > > a bf parameter to my dismax handler.  The problem is, the value of NOW
> > 
> > can
> > 
> > > cause documents in a similar range (date value within a few seconds of
> > 
> > each
> > 
> > > other) to sometimes round to be equal, and sometimes not, changing
> > > their sort order (when equal, falling back to a secondary sort). 
> > > This, in
> > 
> > turn,
> > 
> > > screws up paging.
> > > 
> > > The problem is that score is rounded to a lower level of precision than
> > > what the suggested formula produces as a difference between two values
> > > within seconds of each other.  It seems to me if I could round the
> > > value to minutes or hours, where the difference will be large enough
> > > to not be rounded-out, then I wouldn't have problems with order
> > > changing on me.
> >  
> >  But
> >  
> > > it's not legal syntax to specify something like:
> > > recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1)
> > > 
> > > Is this a problem anyone has faced and solved?  Anyone have suggested
> > > solutions, other than indexing a copy of the date field that's rounded
> > > to the hour?
> > > 
> > > --
> > > Stephen Duncan Jr
> > > www.stephenduncanjr.com


Re: Solr Ajax

2011-02-23 Thread Markus Jelsma
Hi,

I may have misread it all but SolrJ is the Java client and you don't need it 
for a pretty AJAX interface.

Cheers,

> Hello list,
> 
> I'm in the process of trying to implement Ajax within my Solr-backed webapp
> I have been reading both the Solrj wiki as well as the tutorial provided
> via the google group and various info from the wiki page
> https://github.com/evolvingweb/ajax-solr/wiki
> 
> I have all solrj jar libraries available in my webapp /lib but I am
> unsure as to what steps I take to configure the Solrj client. What do I
> need to configure to begin working with Solrj? I am unsure as to where to
> go and finding information on the wiki seems to be a non trivial task.
> 
> Any help would be great. Thanks
> 
> Lewis
> 
> Glasgow Caledonian University is a registered Scottish charity, number
> SC021474
> 
> Winner: Times Higher Education’s Widening Participation Initiative of the
> Year 2009 and Herald Society’s Education Initiative of the Year 2009.
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,
> en.html
> 
> Winner: Times Higher Education’s Outstanding Support for Early Career
> Researchers of the Year 2010, GCU as a lead with Universities Scotland
> partners.
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691
> ,en.html


Re: Help with parsing configuration using SolrParams/NamedList

2011-02-23 Thread Markus Jelsma
Hi,

The params you have suggest you're planning to use SweetSpotSimilarity. There 
already is a factory you can use in Jira.

https://issues.apache.org/jira/browse/SOLR-1365

Cheers,
> Hi,
> 
> I'm trying to use a CustomSimilarityFactory and pass in per-field
> options from the schema.xml, like so:
> 
>  
> 
> 500
> 1
> 0.5
> 
> 
> 500
> 2
> 0.5
> 
>  
> 
> My problem is I am utterly failing to figure out how to parse this
> nested option structure within my CustomSimilarityFactory class. I
> know that the settings are available as a SolrParams object within the
> getSimilarity() method. I'm convinced I need to convert to a NamedList
> using params.toNamedList(), but my java fu is too feeble to code the
> dang thing. The closest I seem to get is the top level as a NamedList
> where the keys are "field_a" and "field_b", but then my values are
> strings, e.g., "{min=500,max=1,steepness=0.5}".
> 
> Anyone who could dash off a quick example of how to do this?
> 
> Thanks,
> --jay


Re: Detailed Steps for Scaling Solr

2011-02-23 Thread Markus Jelsma
Hi,

Scaling might be required. How large is the index going to be in number of 
documents, fields and bytes, and what hardware do you have? Powerful CPUs and a 
lot of RAM will help. And, how many queries per second do you expect? And how 
many updates per minute?

Depending on average document size you can have up to tens of millions 
documents on a single box. If read load is high, you can then easily replicate 
the data to slaves to balance load.

If the data outgrows a single box then sharding is the way to go. But first I'd 
try to see if a single box will do the trick and perhaps replace spinning 
disks with SSDs and stick more RAM in it.

Cheers,
> Dear all,
> 
> I need to construct a site which supports searching for a large index. I
> think scaling Solr is required. However, I didn't get a tutorial which
> helps me do that step by step. I only have two resources as references.
> But both of them do not tell me the exact operations.
> 
> 1)
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Sc
> aling-Lucene-and-Solr
> 
> 2) David Smiley, Eric Pugh; Solr 1.4 Enterprise Search Server
> 
> If you have experiences to scale Solr, could you give me such tutorials?
> 
> Thanks so much!
> LB


Re: 1.4.1 replication failure is still 200 OK

2011-02-23 Thread Markus Jelsma
Hi,

I'd guess a non-200 HTTP response code would be more appropriate indeed but 
it's just a detail.

A successful replication will change a few things on the slave:
- increment of generation value
- updated indexVersion value
- lastReplication will have a new timestamp

You can also check for a replication.properties in your slave's data 
directory.
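
The replication details command on the slave shows most of this in one go,
e.g.:

http://slave:8983/solr/replication?command=details

which reports the index version, generation and the status of the recent
replication runs.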

Cheers,

> Should this be considered a bug, or is there something i"m missing?
> 
> Let's say I have a replication slave set up, but without polling. So I'm
> going to manually trigger replication.
> 
> So I do that:  http://example.org/solr/core/replication?command=fetchIndex
> 
> I get a 200 OK _even if_ the masterUrl configured is wrong, has a typo,
> is unreachable, doesn't point at Solr, whatever.  No replication
> actually happened.
> 
> So is it a bug that I get a 200 OK anyway?
> 
> Alternately, where should I look to see if a replication actually
> succeeded?  Just the main log file?


Re: Order Facet on ranking score

2011-02-24 Thread Markus Jelsma
No, Solr returns facets ordered alphabetically or count.
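
You can only switch between those two orderings with the facet.sort parameter, 
e.g. facet.sort=count (the default when facet.limit is greater than zero) or 
facet.sort=index for the lexicographic order.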

> Hello everybody,
> Is it possibile to order the facet results on some ranking score?
> 
> I was doing a query with "or" operator and sometimes the first facet
> have inside of them only result with small rank and not important.
> This cause that users are led to other reasearch not important.
> 
> //


Re: upgrading to Tika 0.9 on Solr 1.4.1

2011-02-25 Thread Markus Jelsma
You don't want to use 0.8 if you're parsing PDF.

> Your best bet is perhaps upgrading to latest 1.4 branch, i.e. 1.4.2-dev
> (http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4/) It
> includes Tika 0.8-SNAPSHOT and is a compatible drop-in (war/jar)
> replacement with lots of other bug fixes you'd also like (check
> changes.txt).
> 
> svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4
> cd branch-1.4
> ant dist
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
> On 24. feb. 2011, at 21.42, jo wrote:
> > I have tried the steps indicated here:
> > http://wiki.apache.org/solr/ExtractingRequestHandler
> > http://wiki.apache.org/solr/ExtractingRequestHandler
> > 
> > and when I try to parse a document nothing would happen, no error.. I
> > have copied the jar files everywhere, and nothing.. can anyone give me
> > the steps on how to upgrade just tika, btw, currently on 1.4.1 has tika
> > 0.4
> > 
> > thank you


Re: Studying all files of Solr SRC

2011-02-26 Thread Markus Jelsma
DismaxQParser's mm parameter might help you out:
http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
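
For "at least half", mm=50% on a dismax handler does the trick. The spec can be 
more elaborate, e.g. mm=2<75%, which requires all terms for queries of up to 
two terms and 75% of the terms above that.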

> Is there any place where a detailed tutorial about all the Java files of
> Apache Solr(under Src folder) is available.?
> I want to study them as my purpose is to either write codes for my
> implementation or modify the existing files to fulfill my purpose.
> 
> Actually i want to add Advance Search in my Solr based search engine. This
> advance search will include options like ..."at least half" , "as many as
> possible" , "most" etc which are linguistic operators. We can say that
> these options will help the user in finding fuzziness in their search
> results.
> 
> The user wants "show me all the documents which contains "at least half "
> of terms like t1,t2,t3 or show me all the documents which contains "most "
> of the terms like t1,t5,t7 "etc...These "at least half" and "most" have
> been given some weight . These advance search is different from normal
> boolean search.
> 
> Thanks
> 
> 
> -
> Kumar Anurag


Re: Text field not defined in Solr Schema?

2011-02-26 Thread Markus Jelsma
Yes, you need to add a field named text (of a text field type), or use content 
instead of text in your JS. A sketch of the former follows below.
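
A sketch, assuming you want to keep using doc.text on the client and your crawl 
puts the page body in content:

<field name="text" type="text" stored="true" indexed="true"/>
<copyField source="content" dest="text"/>

The field must be stored or it won't show up in the response for the JS to 
read, and yes, you need to recrawl/reindex afterwards.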

> Hello list,
> 
> I have recently been working on some JS (ajax solr) and when using Firebug
> I am alerted to an error within the JS file as below. It immediately
> breaks on line 12 stating that 'doc.text' is undefined! Here is the code
> snippet.
> 
> 10 AjaxSolr.theme.prototype.snippet = function (doc) {
> 11   var output = '';
> 12   if (doc.text.length > 300) {
> 13 output += doc.dateline + ' ' + doc.text.substring(0, 300);
> 14 output += '' +
> doc.text.substring(300);
> 15 output += ' more';
> 16   }
> 17   else {
> 18 output += doc.dateline + ' ' + doc.text;
> 19   }
> 20   return output;
> 21 };
> 
> I have been advised that the problem might stem from my schema not defining
> a text field, however as my implementation of Solr is currently geared to
> index docs from a Nutch web crawl I am using the Nutch schema. A snippet
> of the schema is below
> 
> 
> 
>...
>  positionIncrementGap="100">
> 
> ...
> 
> 
> ...
> 
> 
> 
> 
> Can someone confirm if I require to add something similar to the following
> 
> 
> ...
> 
> 
> 
> Then perform a fresh crawl and reindex so that the schema field is
> recognised by the JS snippet?
> 
> Also (sorry I apologise) from my reading on the Solr schema, I became
> intrigued in options for TextField... namely compressed and
> compressThreshold. I understand that they are used hand in glove, however
> can anyone please explain what benefits compression adds and what integer
> value should be appropriate for the latter option.
> 
> Any help would be great
> Thank you Lewis
> 
> Glasgow Caledonian University is a registered Scottish charity, number
> SC021474
> 
> Winner: Times Higher Education’s Widening Participation Initiative of the
> Year 2009 and Herald Society’s Education Initiative of the Year 2009.
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,
> en.html
> 
> Winner: Times Higher Education’s Outstanding Support for Early Career
> Researchers of the Year 2010, GCU as a lead with Universities Scotland
> partners.
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691
> ,en.html


Re: Disabling caching for fq param?

2011-03-01 Thread Markus Jelsma
If the filterCache hit ratio is low then just disable it in solrconfig by 
deleting the section or setting its values to 0.
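
E.g. comment the filterCache element out of solrconfig.xml, or set it to
something like:

<filterCache class="solr.FastLRUCache" size="0" initialSize="0" autowarmCount="0"/>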

> Based on what I've read here and what I could find on the web, it seems
> that each fq clause essentially gets its own results cache.  Is that
> correct?
> 
> We have a corporate policy of passing the user's Oracle OLS labels into the
> index in order to be matched against the labels field.  I currently
> separate this from the user's query text by sticking it into an fq
> param...
> 
> ?q=
> &fq=labels:
> &qf= 
> &tie=0.1
> &defType=dismax
> 
> ...but since its value (a collection of hundreds of label values) only
> apply to that user, the accompanying result set won't be reusable by other
> users:
> 
> My understanding is that this query will result in two result sets (q and
> fq) being cached separately, with the union of the two sets being returned
> to the user.  (Is that correct?)
> 
> There are thousands of users, each with a unique combination of labels, so
> there seems to be little value in caching the result set created from the
> fq labels param.  It would be beneficial if there were some kind of fq
> parameter override to indicate to Solr to not cache the results?
> 
> 
> Thanks!


Error during auto-warming of key

2011-03-01 Thread Markus Jelsma
Hi,

Yesterday's error log contains something peculiar: 

 ERROR [solr.search.SolrCache] - [pool-29-thread-1] - : Error during auto-
warming of key:+*:* 
(1.0/(7.71E-8*float(ms(const(1298682616680),date(sort_date)))+1.0))^20.0:java.lang.NullPointerException
at org.apache.lucene.util.StringHelper.intern(StringHelper.java:36)
at 
org.apache.lucene.search.FieldCacheImpl$Entry.(FieldCacheImpl.java:275)
at 
org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:525)
at 
org.apache.solr.search.function.LongFieldSource.getValues(LongFieldSource.java:57)
at 
org.apache.solr.search.function.DualFloatFunction.getValues(DualFloatFunction.java:48)
at 
org.apache.solr.search.function.ReciprocalFloatFunction.getValues(ReciprocalFloatFunction.java:61)
at 
org.apache.solr.search.function.FunctionQuery$AllScorer.(FunctionQuery.java:123)
at 
org.apache.solr.search.function.FunctionQuery$FunctionWeight.scorer(FunctionQuery.java:93)
at 
org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:297)
at 
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:246)
at org.apache.lucene.search.Searcher.search(Searcher.java:171)
at 
org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:651)
at 
org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545)
at 
org.apache.solr.search.SolrIndexSearcher.cacheDocSet(SolrIndexSearcher.java:520)
at 
org.apache.solr.search.SolrIndexSearcher$2.regenerateItem(SolrIndexSearcher.java:296)
at org.apache.solr.search.FastLRUCache.warm(FastLRUCache.java:168)
at 
org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1481)
at org.apache.solr.core.SolrCore$2.call(SolrCore.java:1131)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)


Well, I use dismax's bf parameter to boost very recent documents. I'm not using 
the queryResultCache or documentCache, only the filterCache and Lucene FieldCache. 
I've checked LUCENE-1890 but am unsure if that's the issue. Any thoughts on 
this one?

https://issues.apache.org/jira/browse/LUCENE-1890

Cheers,

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Indexed, but cannot search

2011-03-01 Thread Markus Jelsma
Traditionally, people forget to reindex ;)
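
For the catch-all search to work, every field you want findable through plain 
q=Mammal has to be copied into the default text field, e.g.:

<copyField source="type" dest="text"/>

(one copyField per source field), followed by a full reindex.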

> Hi all,
> 
> The problem was that my fields were defined as type="string" instead of
> type="text". Once I corrected that, it seems to be fixed. The only part
> that still is not working though is the search across all fields.
> 
> For example:
> 
> http://localhost:8983/solr/select/?q=type%3AMammal
> 
> Now correctly returns the records matching mammal. But if I try to do a
> global search across all fields:
> 
> http://localhost:8983/solr/select/?q=Mammal
> http://localhost:8983/solr/select/?q=text%3AMammal
> 
> I get no results returned. Here is how the schema is set up:
> 
>  multiValued="true"/>
> text
> 
> 
> Thanks to everyone for your help so far. I think this is the last hurdle I
> have to jump over.
> 
> On Tue, Mar 1, 2011 at 12:34 PM, Upayavira  wrote:
> > Next question, do you have your "type" field set to index="true" in your
> > schema?
> > 
> > Upayavira
> > 
> > On Tue, 01 Mar 2011 11:06 -0500, "Brian Lamb"
> > 
> >  wrote:
> > > Thank you for your reply but the searching is still not working out.
> > > For example, when I go to:
> > > 
> > > http://localhost:8983/solr/select/?q=*%3A*<
> > 
> > http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&in
> > dent=on
> > 
> > > I get the following as a response:
> > > 
> > > 
> > > 
> > >   
> > >   
> > > Mammal
> > > 1
> > > Canis
> > >   
> > >   
> > > 
> > > 
> > > 
> > > (plus some other docs but one is enough for this example)
> > > 
> > > But if I go to
> > > http://localhost:8983/solr/select/?q=type%3A<
> > 
> > http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&in
> > dent=on
> > 
> > > Mammal
> > > 
> > > I only get:
> > > 
> > > 
> > > 
> > > But it seems that should return at least the result I have listed
> > > above. What am I doing incorrectly?
> > > 
> > > On Mon, Feb 28, 2011 at 6:57 PM, Upayavira  wrote:
> > > > q=dog is equivalent to q=text:dog (where the default search field is
> > > > defined as text at the bottom of schema.xml).
> > > > 
> > > > If you want to specify a different field, well, you need to tell it
> > > > :-)
> > > > 
> > > > Is that it?
> > > > 
> > > > Upayavira
> > > > 
> > > > On Mon, 28 Feb 2011 15:38 -0500, "Brian Lamb"
> > > > 
> > > >  wrote:
> > > > > Hi all,
> > > > > 
> > > > > I was able to get my installation of Solr indexed using dataimport.
> > > > > However,
> > > > > I cannot seem to get search working. I can verify that the data is
> > 
> > there
> > 
> > > > > by
> > 
> > > > > going to:
> > http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&in
> > dent=on
> > 
> > > > > This gives me the response:  > > > > numFound="234961" start="0">
> > > > > 
> > > > > But when I go to
> > 
> > http://localhost:8983/solr/select/?q=dog&version=2.2&start=0&rows=10&inde
> > nt=on
> > 
> > > > > I get the response: 
> > > > > 
> > > > > I know that dog should return some results because it is the first
> > 
> > result
> > 
> > > > > when I select all the records. So what am I doing incorrectly that
> > 
> > would
> > 
> > > > > prevent me from seeing results?
> > > > 
> > > > ---
> > > > Enterprise Search Consultant at Sourcesense UK,
> > > > Making Sense of Open Source
> > 
> > ---
> > Enterprise Search Consultant at Sourcesense UK,
> > Making Sense of Open Source


Re: solr different sizes on master and slave

2011-03-01 Thread Markus Jelsma
Are there pending commits on the master?

> I was curious why would the size be dramatically different even though
> the index versions are the same?
> 
> One is 1.2 Gb, and on the slave it is 512 MB
> 
> I would think they should both be the same size no?
> 
> Thanks


<    5   6   7   8   9   10   11   12   13   14   >