date:20090204

Maybe I am not clear, but I am not able to find anything on the net.
Basically, if I had in my index millions of names starting with A* I would
like to know how many distinct surnames are present in the resultset
(similar to a distinct SQL query).
I will attempt to have a look at the SOLR sources to try to see if this is
possible to implement. Any hints where to look at would be great!

Thanks,

Bruno

2009/2/3 Bruno Aranda 

> But as far as I understand the total number of constraints is limited
> (there is a default value), so I cannot know the total if I don't set the
> facet.limit to a really big number and then the request takes a long time. I
> was wondering if there was a way to get the total number (e.g. 100.000
> constraints) to show it to the user, and then paginate using facet.offset
> and facet.limit until I reach that total.
> Does this make sense?
>
> Thanks!
>
> Bruno
>
> 2009/2/3 Markus Jelsma - Buyways B.V. 
>
> Hello,
>>
>>
>> Searching for ?q=*:* with facetting turned on gives me the total number
>> of available constraints, if that is what you mean.
>>
>>
>> Cheers,
>>
>>
>>
>> On Tue, 2009-02-03 at 16:03 +, Bruno Aranda wrote:
>>
>> > Hi,
>> >
>> > I would like to know if is there a way to get the total number of
>> different
>> > facets returned by a faceted search? I see already that I can paginate
>> > through the facets with the facet.offset and facet.limit, but there is a
>> way
>> > to know how many facets are found in total?
>> >
>> > For instance,
>> >
>> > NameSurname
>> >
>> > Peter Smith
>> > John  Smith
>> > Anne Baker
>> > Mary York
>> > ... 1 million records more with 100.000 distinct surnames
>> >
>> > For instance, now I search for people with names starting with A, and I
>> > retrieve 5000 results. I would like to know the distinct number of
>> surnames
>> > (facets) for the result set if possible, so I could show in my app
>> something
>> > like this:
>> >
>> > 5000 people found with 1440 distinct surnames.
>> >
>> > Any ideas? Is this possible to implement? Any pointers would be greatly
>> > appreciated,
>> >
>> > Thanks!
>> >
>> > Bruno
>>
>
>

Re: Total count of facets

2009-02-04 Thread Shalin Shekhar Mangar

On Wed, Feb 4, 2009 at 2:14 PM, Bruno Aranda  wrote:

> Maybe I am not clear, but I am not able to find anything on the net.
> Basically, if I had in my index millions of names starting with A* I would
> like to know how many distinct surnames are present in the resultset
> (similar to a distinct SQL query).
> I will attempt to have a look at the SOLR sources to try to see if this is
> possible to implement. Any hints where to look at would be great!
>

You can use facet.query=name:A* to get the count of names starting with A.

-- 
Regards,
Shalin Shekhar Mangar.

Re: New wiki pages

2009-02-04 Thread Lance Norskog

I've added them to http://wiki.apache.org/solr/FrontPage under "Search and
Indexing". I declare open season on them. That is, anyone can edit them for
any reason. I'm sure I got some things wrong in memory sizing and sorting.

These tips and opinions came from my experience on an index with hundreds of
millions of small records. These are not the final word on how to do
production Solr.

Enjoy,

Lance Norskog

On Mon, Feb 2, 2009 at 10:25 PM, Lance Norskog  wrote:

> http://wiki.apache.org/solr/SchemaDesign
> http://wiki.apache.org/solr/LargeIndexes
> http://wiki.apache.org/solr/UniqueKey
>
> These pages are based on my recent experience and some generalizations.
> They are intended for new users who want to use Solr for a major project.
> Please review them and send me comments.
>
> For example: "they are stupid",  "the wiki has no links to them and those
> links should be here", etc.
>
> --
> Lance Norskog
> goks...@gmail.com
> 650-922-8831 (US)
>
>

-- 
Lance Norskog
goks...@gmail.com
650-922-8831 (US)

Re: Total count of facets

Mmh, thanks for your answer but with that I get the count of names starting
with A*, but I would like to get the count of distinct surnames (or town
names, or any other field that is not the name...) for the people with name
starting with A*. Is that possible?

Thanks!

Bruno

2009/2/4 Shalin Shekhar Mangar 

> On Wed, Feb 4, 2009 at 2:14 PM, Bruno Aranda 
> wrote:
>
> > Maybe I am not clear, but I am not able to find anything on the net.
> > Basically, if I had in my index millions of names starting with A* I
> would
> > like to know how many distinct surnames are present in the resultset
> > (similar to a distinct SQL query).
> > I will attempt to have a look at the SOLR sources to try to see if this
> is
> > possible to implement. Any hints where to look at would be great!
> >
>
> You can use facet.query=name:A* to get the count of names starting with A.
>
> --
> Regards,
> Shalin Shekhar Mangar.
>

Re: Total count of facets

2009-02-04 Thread Shalin Shekhar Mangar

On Wed, Feb 4, 2009 at 2:53 PM, Bruno Aranda  wrote:

> Mmh, thanks for your answer but with that I get the count of names starting
> with A*, but I would like to get the count of distinct surnames (or town
> names, or any other field that is not the name...) for the people with name
> starting with A*. Is that possible?
>

It is possible. You can use fq=name:A* to filter people whose names start
with 'A'. Then you can use facet.field=surnames or facet.field=town or
whatever you want with facet.limit=-1 and count the number of results for
each facet. It may be slow for the first query but it is cached so
subsequent queries should be faster (make sure you size filterCache
appropriately).

-- 
Regards,
Shalin Shekhar Mangar.

Re: DIH using values from solrconfig.xml inside data-config.xml

2009-02-04 Thread Fergus McMenemie

>: > The solr data field is populated properly. So I guess that bit works.
>: > I really wish I could use xpath="//para"
>
>: The limitation comes from streaming the XML instead of creating a DOM.
>: XPathRecordReader is a custom streaming XPath parser implementation and
>: streaming is easy only because we limit the syntax. You can use
>: PlainTextEntityProcessor which gives the XML as a string to a  custom
>: Transformer. This Transformer can create a DOM, run your XPath query and
>: populate the fields. It's more expensive but it is an option.
>
>Maybe it's just me, but it seems like i'm noticing that as DIH gets used 
>more, many people are noting that the XPath processing in DIH doesn't work 
>the way they expect because it's a custom XPath parser/engine designed for 
>streaming.  
>
>It seems like it would be helpful to have an alternate processor for 
>people who don't need the streaming support (ie: are dealing with small 
>enough docs that they can load the full DOM tree into memory) that would 
>use the default Java XPath engine (and have less caveats/suprises) ... i 
>wou think it would probably even make sense for this new XPath processor 
>to be the one we suggest for new users, and only suggest the existing 
>(stream based) processor if they have really big xml docs to deal with.
>
>(In hindsight XPathEntityProcessor and XPathRecordReader should probably 
>have been named StreamingXPathEntityProcessor and 
>StreamingXPathRecordReader)
>
Four thoughts!

1) My use case involves a few million XML documents ranging in size
   from a few K to 500K. 95% of the documents are under 25KBytes, 
   5 of the documents are around 0.5Mbytes. So.. sod it, I think I
   need a streaming parser.

2) "streaming XPath parser"? I only half understand all this stuff,
   but, and this is based on the little bit of SAX stuff I have written,
   I would have thought that //para was trivial for any kind of
   streaming XML parser.

3) Much of the confusion may be arising because the DIH wiki page is
   not to clear on what is and is not allowed. We need better,
   more explicit examples. What seems to be allowed is:-
 


   I will add these to the wiki. Just to be sure, I tested 
   xpath="//para". It does not work!

4) XML documents are ether well structured with good separation of 
   data and presentation in which case absolute xpaths work fine.
   Or older, in my case text documents, which have been forced into
   XML format with poor structure where the data and presentation 
   is all mixed up. I suspect that the addition of //para would
   cover many of the use cases, and what was left could be covered
   by a preceding XSLT transform. 
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===

Re: Total count of facets

Unfortunately, after some tests listing all the distinct surnames or other
fields is too slow and too memory consuming with our current infrastructure.
Could someone confirm that if I wanted to add this functionality (just count
the total of different facets) what I should do is to subclass the
SimpleFacets class and create an extended FacetComponent that returns the
size of the term counts list instead of the list itself?
I see that the FacetComponent is registered by default. Is it possible to
register an extended FacetComponent instead? Or just creating a new one is
enough?

Sorry for asking so many questions today. I am new to SOLR and I was very
excited until I found that I could not comply with one of our requirements:
"counting the distinct surnames for names starting with A*", which is
possible with SQL but no with SOLR out of the box...

Thanks!

Bruno

2009/2/4 Bruno Aranda 

> Thanks, I will try that though I am talking in my case about 100,000+
> distinct surnames/towns maximum per query and I just needed the count and
> not the whole list. In any case, this brute-force approach is still
> something I can try but I wonder how this will behave speed and memory wise
> when there are many different concurrent queries and so on...
>
> Cheers,
>
> Bruno
>
> 2009/2/4 Shalin Shekhar Mangar 
>
>> On Wed, Feb 4, 2009 at 2:53 PM, Bruno Aranda 
>> wrote:
>>
>>
>> > Mmh, thanks for your answer but with that I get the count of names
>> starting
>> > with A*, but I would like to get the count of distinct surnames (or town
>> > names, or any other field that is not the name...) for the people with
>> name
>> > starting with A*. Is that possible?
>> >
>>
>> It is possible. You can use fq=name:A* to filter people whose names start
>> with 'A'. Then you can use facet.field=surnames or facet.field=town or
>> whatever you want with facet.limit=-1 and count the number of results for
>> each facet. It may be slow for the first query but it is cached so
>> subsequent queries should be faster (make sure you size filterCache
>> appropriately).
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>
>

Severe errors in solr configuration

Hi,
I am trying to configure solr on ubuntu server and I am getting the following 
exception. I can able work it on windows box.


message Severe errors in solr configuration. Check your log files for more 
detailed information on what may be wrong. If you want solr to continue after 
configuration errors, change: 
false in null 
- 
java.security.AccessControlException: access denied 
(java.util.PropertyPermission user.dir read) at 
java.security.AccessControlContext.checkPermission(AccessControlContext.java:342)
 at java.security.AccessController.checkPermission(AccessController.java:553) 
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549) at 
java.lang.SecurityManager.checkPropertyAccess(SecurityManager.java:1302) at 
java.lang.System.getProperty(System.java:669) at 
java.io.UnixFileSystem.resolve(UnixFileSystem.java:133) at 
java.io.File.getAbsolutePath(File.java:518) at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:101)
 at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69) 
at 
org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
 at 
org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
 at 
org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:108)
 at 
org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709) 
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4363) at 
org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791) 
at org.apache.catalina.core.ContainerBase.access$000(ContainerBase.java:123) at 
org.apache.catalina.core.ContainerBase$PrivilegedAddChild.run(ContainerBase.java:145)
 at java.security.AccessController.doPrivileged(Native Method) at 
org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:769) at 
org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525) at 
org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:627) at 
org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:553) 
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:488) at 
org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149) at 
org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311) at 
org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
 at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053) at 
org.apache.catalina.core.StandardHost.start(StandardHost.java:719) at 
org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045) at 
org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443) at 
org.apache.catalina.core.StandardService.start(StandardService.java:516) at 
org.apache.catalina.core.StandardServer.start(StandardServer.java:710) at 
org.apache.catalina.startup.Catalina.start(Catalina.java:578) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:616) at 
org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:616) at 
org.apache.commons.daemon.support.DaemonLoader.start(DaemonLoader.java:177)

Please help me to fix this problem.

Thanks,
Anto Binish Kaspar,
Acting Team Lead,
E.C software Pvt. Ltd.

Re: Severe errors in solr configuration



Am 04.02.2009 um 13:33 schrieb Anto Binish Kaspar:


Hi,
I am trying to configure solr on ubuntu server and I am getting the  
following exception. I can able work it on windows box.



Hi Anto.

Have you installed the solr package 1.2 from ubuntu?
Or the release 1.3 as war file?

Olivier

--
Olivier Dobberkau

Je TYPO3, desto d.k.d

d.k.d Internet Service GmbH
Kaiserstr. 79
D 60329 Frankfurt/Main

Registergericht: Amtsgericht Frankfurt am Main
Registernummer: HRB 45590
Geschäftsführer:
Olivier Dobberkau, Søren Schaffstein, Götz Wegenast

fon:  +49 (0)69 - 43 05 61-70
fax:  +49 (0)69 - 43 05 61-90
mail: olivier.dobber...@dkd.de
home: http://www.dkd.de

aktuelle TYPO3-Projekte:
www.licht.de - Relaunch (TYPO3)
www.lahmeyer.de - Launch (TYPO3)
www.seb-assetmanagement.de - Relaunch (TYPO3)

RE: Severe errors in solr configuration

Hi Olivier

Thanks for your quick reply. I am using the release 1.3 as war file.

- Anto Binish Kaspar


-Original Message-
From: Olivier Dobberkau [mailto:olivier.dobber...@dkd.de] 
Sent: Wednesday, February 04, 2009 6:20 PM
To: solr-user@lucene.apache.org
Subject: Re: Severe errors in solr configuration


Am 04.02.2009 um 13:33 schrieb Anto Binish Kaspar:

> Hi,
> I am trying to configure solr on ubuntu server and I am getting the  
> following exception. I can able work it on windows box.


Hi Anto.

Have you installed the solr package 1.2 from ubuntu?
Or the release 1.3 as war file?

Olivier

--
Olivier Dobberkau

Je TYPO3, desto d.k.d

d.k.d Internet Service GmbH
Kaiserstr. 79
D 60329 Frankfurt/Main

Registergericht: Amtsgericht Frankfurt am Main
Registernummer: HRB 45590
Geschäftsführer:
Olivier Dobberkau, Søren Schaffstein, Götz Wegenast

fon:  +49 (0)69 - 43 05 61-70
fax:  +49 (0)69 - 43 05 61-90
mail: olivier.dobber...@dkd.de
home: http://www.dkd.de

aktuelle TYPO3-Projekte:
www.licht.de - Relaunch (TYPO3)
www.lahmeyer.de - Launch (TYPO3)
www.seb-assetmanagement.de - Relaunch (TYPO3)

Re: Severe errors in solr configuration



Am 04.02.2009 um 13:54 schrieb Anto Binish Kaspar:


Hi Olivier

Thanks for your quick reply. I am using the release 1.3 as war file.

- Anto Binish Kaspar


OK.
As far a i understood you need to make sure that your solr home is set.
this needs to be done in

Quting:

http://wiki.apache.org/solr/SolrTomcat

In addition to using the default behavior of relying on the Solr Home  
being in the current working directory (./solr) you can alternately  
add the solr.solr.home system property to your JVM settings before  
starting Tomcat...


export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/my/custom/solr/home/dir/"

...or use a Context file to configure the Solr Home using JNDI

A Tomcat context fragments can be used to configure the JNDI property  
needed to specify your Solr Home directory.


Just put a context fragment file under $CATALINA_HOME/conf/Catalina/ 
localhost that looks something like this...


$ cat /tomcat55/conf/Catalina/localhost/solr.xml


   



Greetings,

Olivier

PS: May be it would be great if we could provide an ubuntu dpkg with  
1.3 ? Any takers?


--
Olivier Dobberkau

Je TYPO3, desto d.k.d

d.k.d Internet Service GmbH
Kaiserstr. 79
D 60329 Frankfurt/Main

Registergericht: Amtsgericht Frankfurt am Main
Registernummer: HRB 45590
Geschäftsführer:
Olivier Dobberkau, Søren Schaffstein, Götz Wegenast

fon:  +49 (0)69 - 43 05 61-70
fax:  +49 (0)69 - 43 05 61-90
mail: olivier.dobber...@dkd.de
home: http://www.dkd.de

aktuelle TYPO3-Projekte:
www.licht.de - Relaunch (TYPO3)
www.lahmeyer.de - Launch (TYPO3)
www.seb-assetmanagement.de - Relaunch (TYPO3)

RE: Severe errors in solr configuration

I am using Context file, here is my solr.xml

$ cat /var/lib/tomcat6/conf/Catalina/localhost/solr.xml 





I change the ownership of the folder (usr/local/solr/solr-1.3/solr) to 
tomcat6:tomcat6 from root:root

Anything I am missing? 

- Anto Binish Kaspar


-Original Message-
From: Olivier Dobberkau [mailto:olivier.dobber...@dkd.de] 
Sent: Wednesday, February 04, 2009 6:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Severe errors in solr configuration


Am 04.02.2009 um 13:54 schrieb Anto Binish Kaspar:

> Hi Olivier
>
> Thanks for your quick reply. I am using the release 1.3 as war file.
>
> - Anto Binish Kaspar

OK.
As far a i understood you need to make sure that your solr home is set.
this needs to be done in

Quting:

http://wiki.apache.org/solr/SolrTomcat

In addition to using the default behavior of relying on the Solr Home  
being in the current working directory (./solr) you can alternately  
add the solr.solr.home system property to your JVM settings before  
starting Tomcat...

export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/my/custom/solr/home/dir/"

...or use a Context file to configure the Solr Home using JNDI

A Tomcat context fragments can be used to configure the JNDI property  
needed to specify your Solr Home directory.

Just put a context fragment file under $CATALINA_HOME/conf/Catalina/ 
localhost that looks something like this...

$ cat /tomcat55/conf/Catalina/localhost/solr.xml





Greetings,

Olivier

PS: May be it would be great if we could provide an ubuntu dpkg with  
1.3 ? Any takers?

--
Olivier Dobberkau

Je TYPO3, desto d.k.d

d.k.d Internet Service GmbH
Kaiserstr. 79
D 60329 Frankfurt/Main

Registergericht: Amtsgericht Frankfurt am Main
Registernummer: HRB 45590
Geschäftsführer:
Olivier Dobberkau, Søren Schaffstein, Götz Wegenast

fon:  +49 (0)69 - 43 05 61-70
fax:  +49 (0)69 - 43 05 61-90
mail: olivier.dobber...@dkd.de
home: http://www.dkd.de

aktuelle TYPO3-Projekte:
www.licht.de - Relaunch (TYPO3)
www.lahmeyer.de - Launch (TYPO3)
www.seb-assetmanagement.de - Relaunch (TYPO3)

Re: Severe errors in solr configuration


A slash?

Olivier

Von meinem iPhone gesendet


Am 04.02.2009 um 14:06 schrieb Anto Binish Kaspar :


I am using Context file, here is my solr.xml

$ cat /var/lib/tomcat6/conf/Catalina/localhost/solr.xml






I change the ownership of the folder (usr/local/solr/solr-1.3/solr)  
to tomcat6:tomcat6 from root:root


Anything I am missing?

- Anto Binish Kaspar


-Original Message-
From: Olivier Dobberkau [mailto:olivier.dobber...@dkd.de]
Sent: Wednesday, February 04, 2009 6:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Severe errors in solr configuration


Am 04.02.2009 um 13:54 schrieb Anto Binish Kaspar:


Hi Olivier

Thanks for your quick reply. I am using the release 1.3 as war file.

- Anto Binish Kaspar


OK.
As far a i understood you need to make sure that your solr home is  
set.

this needs to be done in

Quting:

http://wiki.apache.org/solr/SolrTomcat

In addition to using the default behavior of relying on the Solr Home
being in the current working directory (./solr) you can alternately
add the solr.solr.home system property to your JVM settings before
starting Tomcat...

export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/my/custom/solr/home/ 
dir/"


...or use a Context file to configure the Solr Home using JNDI

A Tomcat context fragments can be used to configure the JNDI property
needed to specify your Solr Home directory.

Just put a context fragment file under $CATALINA_HOME/conf/Catalina/
localhost that looks something like this...

$ cat /tomcat55/conf/Catalina/localhost/solr.xml


   


Greetings,

Olivier

PS: May be it would be great if we could provide an ubuntu dpkg with
1.3 ? Any takers?

--
Olivier Dobberkau

Je TYPO3, desto d.k.d

d.k.d Internet Service GmbH
Kaiserstr. 79
D 60329 Frankfurt/Main

Registergericht: Amtsgericht Frankfurt am Main
Registernummer: HRB 45590
Geschäftsführer:
Olivier Dobberkau, Søren Schaffstein, Götz Wegenast

fon:  +49 (0)69 - 43 05 61-70
fax:  +49 (0)69 - 43 05 61-90
mail: olivier.dobber...@dkd.de
home: http://www.dkd.de

aktuelle TYPO3-Projekte:
www.licht.de - Relaunch (TYPO3)
www.lahmeyer.de - Launch (TYPO3)
www.seb-assetmanagement.de - Relaunch (TYPO3)

RE: Severe errors in solr configuration

Now it’s a giving a different message

Severe errors in solr configuration. Check your log files for more detailed 
information on what may be wrong. If you want solr to continue after 
configuration errors, change: 
false in null 
- 
java.security.AccessControlException: access denied (java.io.FilePermission 
/usr/local/solr/solr-1.3/solr/solr.xml read) at 
java.security.AccessControlContext.checkPermission(AccessControlContext.java:342)
 at java.security.AccessController.checkPermission(AccessController.java:553) 
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549) at 
java.lang.SecurityManager.checkRead(SecurityManager.java:888) at 
java.io.File.exists(File.java:748) at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:103)
 at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69) 
at 
org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
 at 
org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
 at 
org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:108)
 at 
org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709) 
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4363) at 
org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791) 
at org.apache.catalina.core.ContainerBase.access$000(ContainerBase.java:123) at 
org.apache.catalina.core.ContainerBase$PrivilegedAddChild.run(ContainerBase.java:145)
 at java.security.AccessController.doPrivileged(Native Method) at 
org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:769) at 
org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525) at 
org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:627) at 
org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:553) 
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:488) at 
org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149) at 
org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311) at 
org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
 at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053) at 
org.apache.catalina.core.StandardHost.start(StandardHost.java:719) at 
org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045) at 
org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443) at 
org.apache.catalina.core.StandardService.start(StandardService.java:516) at 
org.apache.catalina.core.StandardServer.start(StandardServer.java:710) at 
org.apache.catalina.startup.Catalina.start(Catalina.java:578) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:616) at 
org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:616) at 
org.apache.commons.daemon.support.DaemonLoader.start(DaemonLoader.java:177)

Why its trying to read the solr.xml from /usr/local/solr/solr-1.3/solr/ ?

- Anto Binish Kaspar


-Original Message-
From: Olivier Dobberkau [mailto:olivier.dobber...@dkd.de] 
Sent: Wednesday, February 04, 2009 6:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Severe errors in solr configuration

A slash?

Olivier

Von meinem iPhone gesendet


Am 04.02.2009 um 14:06 schrieb Anto Binish Kaspar :

> I am using Context file, here is my solr.xml
>
> $ cat /var/lib/tomcat6/conf/Catalina/localhost/solr.xml
>
>  debug="0" crossContext="true" >
> 
> 
>
> I change the ownership of the folder (usr/local/solr/solr-1.3/solr)  
> to tomcat6:tomcat6 from root:root
>
> Anything I am missing?
>
> - Anto Binish Kaspar
>
>
> -Original Message-
> From: Olivier Dobberkau [mailto:olivier.dobber...@dkd.de]
> Sent: Wednesday, February 04, 2009 6:30 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Severe errors in solr configuration
>
>
> Am 04.02.2009 um 13:54 schrieb Anto Binish Kaspar:
>
>> Hi Olivier
>>
>> Thanks for your quick reply. I am using the release 1.3 as war file.
>>
>> - Anto Binish Kaspar
>
> OK.
> As far a i understood you need to make sure that your solr home is  
> set.
> this needs to be done in
>
> Quting:
>
> http://wiki.apache.org/solr/SolrTomcat
>
> In addition to using the default behavior of relying on the Solr Home
> being in the current working directory (./solr) you can alternately
> add

Re: Boost function

2009-02-04 Thread Erick Erickson

>From Hossman...

<<>>

Search time boosts, as the name implies, factor into the scoring of
documents, increasing the score assigned to documents that match on the
boosted term, thus tending to score the entire document higher. So these
documents tend to be returned earlier in the results when sorting by score
(the default).

See "Lucene in Action"

Best
Erick

On Wed, Feb 4, 2009 at 8:12 AM, Tushar_Gandhi <
tushar_gan...@neovasolutions.com> wrote:

>
> Hi,
>   I want to know about boosting. What is the use ?
> How we can implement that? and How it will affect my search results?
>
> Thanks,
> Tushar
> --
> View this message in context:
> http://www.nabble.com/Boost-function-tp21829651p21829651.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>

Re: Severe errors in solr configuration

2009-02-04 Thread Shalin Shekhar Mangar

According to http://wiki.apache.org/solr/SolrTomcat, the JNDI context should
be:


   



Notice that in the snippet you posted, the name was "/solr/home" (an extra
leading '/')

http://wiki.apache.org/solr/SolrTomcat#head-7036378fa48b79c0797cc8230a8aa0965412fb2e

On Wed, Feb 4, 2009 at 6:59 PM, Anto Binish Kaspar  wrote:

> Now it's a giving a different message
>
> Severe errors in solr configuration. Check your log files for more detailed
> information on what may be wrong. If you want solr to continue after
> configuration errors, change:
> false in null
> -
> java.security.AccessControlException: access denied (java.io.FilePermission
> /usr/local/solr/solr-1.3/solr/solr.xml read) at
> java.security.AccessControlContext.checkPermission(AccessControlContext.java:342)
> at java.security.AccessController.checkPermission(AccessController.java:553)
> at java.lang.SecurityManager.checkPermission(SecurityManager.java:549) at
> java.lang.SecurityManager.checkRead(SecurityManager.java:888) at
> java.io.File.exists(File.java:748) at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:103)
> at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
> at
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
> at
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
> at
> org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:108)
> at
> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709)
> at org.apache.catalina.core.StandardContext.start(StandardContext.java:4363)
> at
> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
> at org.apache.catalina.core.ContainerBase.access$000(ContainerBase.java:123)
> at
> org.apache.catalina.core.ContainerBase$PrivilegedAddChild.run(ContainerBase.java:145)
> at java.security.AccessController.doPrivileged(Native Method) at
> org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:769) at
> org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525) at
> org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:627)
> at
> org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:553)
> at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:488) at
> org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149) at
> org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
> at
> org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
> at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053) at
> org.apache.catalina.core.StandardHost.start(StandardHost.java:719) at
> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045) at
> org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443) at
> org.apache.catalina.core.StandardService.start(StandardService.java:516) at
> org.apache.catalina.core.StandardServer.start(StandardServer.java:710) at
> org.apache.catalina.startup.Catalina.start(Catalina.java:578) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:616) at
> org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:616) at
> org.apache.commons.daemon.support.DaemonLoader.start(DaemonLoader.java:177)
>
> Why its trying to read the solr.xml from /usr/local/solr/solr-1.3/solr/ ?
>
> - Anto Binish Kaspar
>
>
> -Original Message-
> From: Olivier Dobberkau [mailto:olivier.dobber...@dkd.de]
> Sent: Wednesday, February 04, 2009 6:50 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Severe errors in solr configuration
>
> A slash?
>
> Olivier
>
> Von meinem iPhone gesendet
>
>
> Am 04.02.2009 um 14:06 schrieb Anto Binish Kaspar :
>
> > I am using Context file, here is my solr.xml
> >
> > $ cat /var/lib/tomcat6/conf/Catalina/localhost/solr.xml
> >
> >  > debug="0" crossContext="true" >
> > 
> > 
> >
> > I change the ownership of the folder (usr/local/solr/solr-1.3/solr)
> > to tomcat6:tomcat6 from root:root
> >
> > Anything I am missing?
> >
> > - Anto Binish Kaspar
> >
> >
> > -Original Message-
> > From: Olivier Dobberkau [mailto:olivier.dobber...@dkd.de]
> > Sent: Wednesday, February 04, 2009 6:30 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Severe errors in solr configuration
> >
> >
> > Am 04.02.2009 u

RE: Severe errors in solr configuration

Yes I removed, still I have the same issue. Any idea what may be cause of this 
issue?

- Anto Binish Kaspar


-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] 
Sent: Wednesday, February 04, 2009 7:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Severe errors in solr configuration

According to http://wiki.apache.org/solr/SolrTomcat, the JNDI context should
be:


   



Notice that in the snippet you posted, the name was "/solr/home" (an extra
leading '/')

http://wiki.apache.org/solr/SolrTomcat#head-7036378fa48b79c0797cc8230a8aa0965412fb2e

On Wed, Feb 4, 2009 at 6:59 PM, Anto Binish Kaspar  wrote:

> Now it's a giving a different message
>
> Severe errors in solr configuration. Check your log files for more detailed
> information on what may be wrong. If you want solr to continue after
> configuration errors, change:
> false in null
> -
> java.security.AccessControlException: access denied (java.io.FilePermission
> /usr/local/solr/solr-1.3/solr/solr.xml read) at
> java.security.AccessControlContext.checkPermission(AccessControlContext.java:342)
> at java.security.AccessController.checkPermission(AccessController.java:553)
> at java.lang.SecurityManager.checkPermission(SecurityManager.java:549) at
> java.lang.SecurityManager.checkRead(SecurityManager.java:888) at
> java.io.File.exists(File.java:748) at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:103)
> at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
> at
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
> at
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
> at
> org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:108)
> at
> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709)
> at org.apache.catalina.core.StandardContext.start(StandardContext.java:4363)
> at
> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
> at org.apache.catalina.core.ContainerBase.access$000(ContainerBase.java:123)
> at
> org.apache.catalina.core.ContainerBase$PrivilegedAddChild.run(ContainerBase.java:145)
> at java.security.AccessController.doPrivileged(Native Method) at
> org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:769) at
> org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525) at
> org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:627)
> at
> org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:553)
> at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:488) at
> org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149) at
> org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
> at
> org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
> at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053) at
> org.apache.catalina.core.StandardHost.start(StandardHost.java:719) at
> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045) at
> org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443) at
> org.apache.catalina.core.StandardService.start(StandardService.java:516) at
> org.apache.catalina.core.StandardServer.start(StandardServer.java:710) at
> org.apache.catalina.startup.Catalina.start(Catalina.java:578) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:616) at
> org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:616) at
> org.apache.commons.daemon.support.DaemonLoader.start(DaemonLoader.java:177)
>
> Why its trying to read the solr.xml from /usr/local/solr/solr-1.3/solr/ ?
>
> - Anto Binish Kaspar
>
>
> -Original Message-
> From: Olivier Dobberkau [mailto:olivier.dobber...@dkd.de]
> Sent: Wednesday, February 04, 2009 6:50 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Severe errors in solr configuration
>
> A slash?
>
> Olivier
>
> Von meinem iPhone gesendet
>
>
> Am 04.02.2009 um 14:06 schrieb Anto Binish Kaspar :
>
> > I am using Context file, here is my solr.xml
> >
> > $ cat /var/lib/tomcat6/conf/Catalina/localhost/solr.xml
> >
> >  > debug="0" crossContext="true" >
> > 
> > 
> >
> > I change the ownership of the folder (usr/local/solr/solr-1.3/solr)
> > to tomcat6:tomcat6 from root:root

Highlighting on Prefix-Search Bug/Workaround (Re: query with stemming, prefix and fuzzy?)

2009-02-04 Thread Gert Brinkmann

Mark Miller wrote:

>> Currently I think about dropping the stemming and only use
>> prefix-search. But as highlighting does not work with a prefix "house*"
>> this is a problem for me. The hint to use "house?*" instead does not
>> work here.
>>   
> Thats because wildcard queries are also not highlightable now. I
> actually have somewhat of a solution to this that I'll work on soon
> (I've gotten the ground work for it in or ready to be in Lucene). No
> guarantee on when or if it will be accepted in solr though.

As I am writing in perl (using WebService::Solr) I found the workaround
to use the Search::Tools module for highlighting "manually" in those
cases if Solr does not return snippets. This seems to work fine, but the
drawback is, that I need Solr to return the full data field in a query.
This can be expensive on larger documents. But I hope this is just a
temporal workaround until Solr 1.4...

Thanks,
Gert

Differences in output of spell checkers

2009-02-04 Thread Marcus Stratmann


Hello,

I'm trying to learn how to use the spell checkers of solr (1.3). I found 
out that FileBasedSpellChecker and IndexBasedSpellChecker produce 
different outputs.


IndexBasedSpellChecker says




1
0
4
0

85
game


false



whereas FileBasedSpellChecker returns




1
0
4

game





The differences are the usage of  respectively  for markup of 
the suggestions, missing frequences and missing "correctlySpelled" in 
FileBasedSpellChecker. Is that a bug or a feature? Or are there simply 
no universal rules for the format of the ouput? The differences make 
parsing more difficult if you use IndexBasedSpellChecker and 
FileBasedSpellChecker.


Thanks,
Marcus

Boost function

2009-02-04 Thread Tushar_Gandhi


Hi,
   I want to know about boosting. What is the use ?
How we can implement that? and How it will affect my search results?

Thanks,
Tushar
-- 
View this message in context: 
http://www.nabble.com/Boost-function-tp21829651p21829651.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Total count of facets

Thanks, I will try that though I am talking in my case about 100,000+
distinct surnames/towns maximum per query and I just needed the count and
not the whole list. In any case, this brute-force approach is still
something I can try but I wonder how this will behave speed and memory wise
when there are many different concurrent queries and so on...

Cheers,

Bruno

2009/2/4 Shalin Shekhar Mangar 

> On Wed, Feb 4, 2009 at 2:53 PM, Bruno Aranda 
> wrote:
>
> > Mmh, thanks for your answer but with that I get the count of names
> starting
> > with A*, but I would like to get the count of distinct surnames (or town
> > names, or any other field that is not the name...) for the people with
> name
> > starting with A*. Is that possible?
> >
>
> It is possible. You can use fq=name:A* to filter people whose names start
> with 'A'. Then you can use facet.field=surnames or facet.field=town or
> whatever you want with facet.limit=-1 and count the number of results for
> each facet. It may be slow for the first query but it is cached so
> subsequent queries should be faster (make sure you size filterCache
> appropriately).
>
> --
> Regards,
> Shalin Shekhar Mangar.
>

Re: DIH, assigning multiple xpaths to the same solr field: solved

2009-02-04 Thread Fergus McMenemie

Thanks Shalin,

Using the following appears to work properly!
   
   
   
   

Regards Fergus

>On Wed, Feb 4, 2009 at 1:35 AM, Fergus McMenemie  wrote:
>
>>   >  dataSource="myfilereader"
>>  processor="XPathEntityProcessor"
>>  url="${jc.fileAbsolutePath}"
>>  stream="false"
>>  forEach="/record">
>>   
>>   
>>   
>>   
>>
>> Below is the line from my schema.xml
>>
>>   >  multiValued="true"/>
>>
>> Now a given document will only have one style of layout, and of course
>> the /a/b/c /d/e/f/g  stuff is made up. For a document that has a single
>> Hello world element I see search results as follows, the
>> one  string seems to have been entered into the index four times.
>> I only saw duplicate results before adding the extra made-up stuff.
>>
>>
>I think there is something fishy with the XPathEntityProcessor. For now, I
>think you can work around by giving each field a different 'column' and
>attribute 'name=para' on each of them.
>
>-- 
>Regards,
>Shalin Shekhar Mangar.

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===

Re: Total count of facets

On Wed, Feb 4, 2009 at 5:42 AM, Bruno Aranda  wrote:
> Unfortunately, after some tests listing all the distinct surnames or other
> fields is too slow and too memory consuming with our current infrastructure.
> Could someone confirm that if I wanted to add this functionality (just count
> the total of different facets) what I should do is to subclass the
> SimpleFacets class and create an extended FacetComponent that returns the
> size of the term counts list instead of the list itself?

This wouldn't be too hard to do... and I think it's been requested in
the past at least a few times:
http://www.lucidimagination.com/search/document/7ab1d7fff1fb556e/numfound_for_facet_results

The slightly harder part is changing the response format in a backward
compatible way.

-Yonik

Re: Severe errors in solr configuration

2009-02-04 Thread Noble Paul നോബിള്‍ नोब्ळ्



Am 04.02.2009 um 15:50 schrieb Anto Binish Kaspar:

Yes I removed, still I have the same issue. Any idea what may be  
cause of this issue?



Have you solved your problem?

Olivier
--
Olivier Dobberkau

Je TYPO3, desto d.k.d

d.k.d Internet Service GmbH
Kaiserstr. 79
D 60329 Frankfurt/Main

Registergericht: Amtsgericht Frankfurt am Main
Registernummer: HRB 45590
Geschäftsführer:
Olivier Dobberkau, Søren Schaffstein, Götz Wegenast

fon:  +49 (0)69 - 43 05 61-70
fax:  +49 (0)69 - 43 05 61-90
mail: olivier.dobber...@dkd.de
home: http://www.dkd.de

aktuelle TYPO3-Projekte:
www.licht.de - Relaunch (TYPO3)
www.lahmeyer.de - Launch (TYPO3)
www.seb-assetmanagement.de - Relaunch (TYPO3)

Re: exceeded limit of maxWarmingSearchers

2009-02-04 Thread Jon Drukman


Otis Gospodnetic wrote:

That should be fine (but apparently isn't), as long as you don't have some very 
slow machine or if your caches are are large and configured to copy a lot of 
data on commit.



this is becoming more and more problematic.  we have periods where we 
get 10 of these exceptions in a 4 second period.  how do i diagnose what 
the cause is, or alternatively work around it?


when you say "copy" are you talking about copyFields or something else?

we commit on every update, but each update is very small... just a few 
hundred bytes on average.

Re: DIH using values from solrconfig.xml inside data-config.xml

The implementation assumed that most of the users have xml with a
fixed schema. . In that case giving absolute path is not hard. This
helps us deal with a large subset of usecases rather easily.

We have not added all the features which are possible with a
streaming parser. It is wiser to piggyback on some real XPath engine
for because the demand for full xpath support will always be there.
--Noble

On Wed, Feb 4, 2009 at 5:15 PM, Fergus McMenemie  wrote:
>>: > The solr data field is populated properly. So I guess that bit works.
>>: > I really wish I could use xpath="//para"
>>
>>: The limitation comes from streaming the XML instead of creating a DOM.
>>: XPathRecordReader is a custom streaming XPath parser implementation and
>>: streaming is easy only because we limit the syntax. You can use
>>: PlainTextEntityProcessor which gives the XML as a string to a  custom
>>: Transformer. This Transformer can create a DOM, run your XPath query and
>>: populate the fields. It's more expensive but it is an option.
>>
>>Maybe it's just me, but it seems like i'm noticing that as DIH gets used
>>more, many people are noting that the XPath processing in DIH doesn't work
>>the way they expect because it's a custom XPath parser/engine designed for
>>streaming.
>>
>>It seems like it would be helpful to have an alternate processor for
>>people who don't need the streaming support (ie: are dealing with small
>>enough docs that they can load the full DOM tree into memory) that would
>>use the default Java XPath engine (and have less caveats/suprises) ... i
>>wou think it would probably even make sense for this new XPath processor
>>to be the one we suggest for new users, and only suggest the existing
>>(stream based) processor if they have really big xml docs to deal with.
>>
>>(In hindsight XPathEntityProcessor and XPathRecordReader should probably
>>have been named StreamingXPathEntityProcessor and
>>StreamingXPathRecordReader)
>>
> Four thoughts!
>
> 1) My use case involves a few million XML documents ranging in size
>   from a few K to 500K. 95% of the documents are under 25KBytes,
>   5 of the documents are around 0.5Mbytes. So.. sod it, I think I
>   need a streaming parser.
>
> 2) "streaming XPath parser"? I only half understand all this stuff,
>   but, and this is based on the little bit of SAX stuff I have written,
>   I would have thought that //para was trivial for any kind of
>   streaming XML parser.
>
> 3) Much of the confusion may be arising because the DIH wiki page is
>   not to clear on what is and is not allowed. We need better,
>   more explicit examples. What seems to be allowed is:-
>
>
>
>   I will add these to the wiki. Just to be sure, I tested
>   xpath="//para". It does not work!
>
> 4) XML documents are ether well structured with good separation of
>   data and presentation in which case absolute xpaths work fine.
>   Or older, in my case text documents, which have been forced into
>   XML format with poor structure where the data and presentation
>   is all mixed up. I suspect that the addition of //para would
>   cover many of the use cases, and what was left could be covered
>   by a preceding XSLT transform.
> --
>
> ===
> Fergus McMenemie   Email:fer...@twig.me.uk
> Techmore Ltd   Phone:(UK) 07721 376021
>
> Unix/Mac/Intranets Analyst Programmer
> ===
>



-- 
--Noble Paul

Multiple uniqueKey problems

2009-02-04 Thread Bruno Mateus

Hello,

I'm facing some problems in generating a compound unique key. I'm
indexing some database tables not related with each other. In my
data-config.xml I have the following












Column "alias" and "id" don't exist on the database. In my schema.xml
I have the following:

  
  

   
   

   id

When I do a full import I get the following error:

18:47:40,530 ERROR [STDERR] 4/Fev/2009 18:47:40
org.apache.solr.handler.dataimport.SolrWriter upload
WARNING: Error creating document :
SolrInputDocumnt[{node_nodeid=node_nodeid(1.0)={6706},
node_name=node_name(1.0)={CPE_106122644}}]
org.apache.solr.common.SolrException: Document [null] missing required field: id
at 
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:289)
at 
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:58)
at 
org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:69)
at 
org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:288)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:178)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:136)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:386)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:377)


I suppose I'm missing some configuration. Is the way I'm generating
the id correct?

Thks

Custom Sorting Algorithm


Is an easy way to choose/create an alternate sorting algorithm? I'm
frequently dealing with large result sets (a few million results) and I
might be able to benefit domain knowledge in my sort.
-- 
View this message in context: 
http://www.nabble.com/Custom-Sorting-Algorithm-tp21837721p21837721.html
Sent from the Solr - User mailing list archive at Nabble.com.

Spell checking not returning "full" terms

2009-02-04 Thread Rupert Fiasco

We are using Solr 1.3 and trying to get spell checking functionality.

FYI, our index contains a lot of medical terms (which might or might
not make a difference as they are not English-y words, if that makes
any sense?)

If I specify a spellcheck query of "spellcheck.q=diabtes"

I get suggestions of:

diabet
diabetogen
dilat
diamet
diatom
diastol
diactin
dialect

If I re-mis-spell Diabetes to "q=diabets" then I go no suggestions.

So first off two things:

1) Why would leaving out one "e" over the other affect the spelling
suggestions so substantially?
2) In the former list of suggestions, notice the first suggestion is
"diabet", which isnt all that helpful, it should return something like
"diabetes" or maybe even "diabetic".

Note that if I do a normal search against "diabetes" then I get a ton
of results, in other words, our index is filled with terms of
"diabetes".

My relevant solrconfig is:


text


  default
  text_t
  ./spellchecker1
  0.1



  jarowinkler
  text_t
  
  org.apache.lucene.search.spell.JaroWinklerDistance
  ./spellchecker2
  0.1



and I have

spellcheck.count = 8

Notice that I severely bumped down the "accuracy" setting to get more
results. Bumping it up higher yields less results (not sure what
setting really meant so I dont know in what direction I want to change
that value - I am guessing that a lower value allows for more
mis-spellings, e.g. its more promiscuous).

Our "text" and "text_t" fields are defined in schema.xml as:


and


Any help would be appreciated.

Thanks
-Rupert

Queued Requests during GC


During full garbage collection, Solr doesn't acknowledge incoming requests.
Any requests that were received during the GC are timestamped the moment GC
finishes (at least that's what my logs show). Is there a limit to how many
requests can queue up during a full GC? This doesn't seem like a Solr
setting, but rather a container/OS setting (I'm using Tomcat on Linux).

Thanks.

Wojtek
-- 
View this message in context: 
http://www.nabble.com/Queued-Requests-during-GC-tp21837898p21837898.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Spell checking not returning "full" terms

2009-02-04 Thread Grant Ingersoll

I'm guessing the field you are checking against is being stemmed. The
field you spell check against should have minimal analysis done to it,
i.e. tokenization and probably downcasing. See http://wiki.apache.org/solr/SpellCheckComponent
and http://wiki.apache.org/solr/SpellCheckerRequestHandler for tips
on how to handle analysis for spelling.

On Feb 4, 2009, at 2:33 PM, Rupert Fiasco wrote:

We are using Solr 1.3 and trying to get spell checking functionality.

FYI, our index contains a lot of medical terms (which might or might
not make a difference as they are not English-y words, if that makes
any sense?)

If I specify a spellcheck query of "spellcheck.q=diabtes"

I get suggestions of:

diabet
diabetogen
dilat
diamet
diatom
diastol
diactin
dialect

If I re-mis-spell Diabetes to "q=diabets" then I go no suggestions.

So first off two things:

1) Why would leaving out one "e" over the other affect the spelling
suggestions so substantially?
2) In the former list of suggestions, notice the first suggestion is
"diabet", which isnt all that helpful, it should return something like
"diabetes" or maybe even "diabetic".

Note that if I do a normal search against "diabetes" then I get a ton
of results, in other words, our index is filled with terms of
"diabetes".

My relevant solrconfig is:

text

default
text_t
./spellchecker1
0.1

jarowinkler
text_t

name
=
"distanceMeasure
">org.apache.lucene.search.spell.JaroWinklerDistance

./spellchecker2
0.1

and I have

spellcheck.count = 8

Notice that I severely bumped down the "accuracy" setting to get more
results. Bumping it up higher yields less results (not sure what
setting really meant so I dont know in what direction I want to change
that value - I am guessing that a lower value allows for more
mis-spellings, e.g. its more promiscuous).

Our "text" and "text_t" fields are defined in schema.xml as:

and

Any help would be appreciated.

Thanks
-Rupert

--
Grant Ingersoll
http://www.lucidimagination.com/

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Queued Requests during GC

2009-02-04 Thread Sridhar Basam



That is the expected behaviour, all application threads are paused 
during GC (CMS collector being an exception, there are smaller pauses 
but the application threads continue to mostly run). The number of 
connections that could end up being queued would depend on your 
acceptCount setting in the server.xml file and also the inbound request 
rate and the time the GC takes to complete.


The OS will queue upto acceptCount requests before it begins to ignore 
incoming tcp connection requests. So if your inbound request rate is 2 
per second and a full GC takes 6 seconds to complete, you should have 12 
(2x6) new requests waiting for you when GC completes.


 Sridhar


wojtekpia wrote:

During full garbage collection, Solr doesn't acknowledge incoming requests.
Any requests that were received during the GC are timestamped the moment GC
finishes (at least that's what my logs show). Is there a limit to how many
requests can queue up during a full GC? This doesn't seem like a Solr
setting, but rather a container/OS setting (I'm using Tomcat on Linux).

Thanks.

Wojtek

Re: exceeded limit of maxWarmingSearchers

2009-02-04 Thread Otis Gospodnetic

Jon,

If you can, don't commit on every update and that should help or fully solve 
your problem.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Jon Drukman 
> To: solr-user@lucene.apache.org
> Sent: Wednesday, February 4, 2009 1:09:00 PM
> Subject: Re: exceeded limit of maxWarmingSearchers
> 
> Otis Gospodnetic wrote:
> > That should be fine (but apparently isn't), as long as you don't have some 
> very slow machine or if your caches are are large and configured to copy a 
> lot 
> of data on commit.
> 
> 
> this is becoming more and more problematic.  we have periods where we get 10 
> of 
> these exceptions in a 4 second period.  how do i diagnose what the cause is, 
> or 
> alternatively work around it?
> 
> when you say "copy" are you talking about copyFields or something else?
> 
> we commit on every update, but each update is very small... just a few 
> hundred 
> bytes on average.

Re: Differences in output of spell checkers

2009-02-04 Thread Grant Ingersoll



On Feb 4, 2009, at 11:02 AM, Marcus Stratmann wrote:


Hello,

I'm trying to learn how to use the spell checkers of solr (1.3). I  
found out that FileBasedSpellChecker and IndexBasedSpellChecker  
produce different outputs.


IndexBasedSpellChecker says




1
0
4
0

85
game


false



whereas FileBasedSpellChecker returns




1
0
4

game





The differences are the usage of  respectively  for markup  
of the suggestions, missing frequences and missing  
"correctlySpelled" in FileBasedSpellChecker. Is that a bug or a  
feature? Or are there simply no universal rules for the format of  
the ouput? The differences make parsing more difficult if you use  
IndexBasedSpellChecker and FileBasedSpellChecker.


Are you sending in the same query to both?  Frequency and word only  
get printed when extendedResults == true.  correctlySpelled only gets  
printed when there is Index frequency information.  For the  
FileBasedSpellChecker, there is no Frequency information, so it isn't  
returned.


The logic for constructing this is all handled in the  
SpellCheckComponent.toNamedList() method and is completely separated  
from the individual SpellChecker implementations.


HTH,
Grant


--
Grant Ingersoll
http://www.lucidimagination.com/

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Custom Sorting Algorithm

2009-02-04 Thread Otis Gospodnetic

Hi,

You can use one of the exiting function queries (if they fit your need) or 
write a custom function query to reorder the results of a query.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: wojtekpia 
> To: solr-user@lucene.apache.org
> Sent: Wednesday, February 4, 2009 2:28:56 PM
> Subject: Custom Sorting Algorithm
> 
> 
> Is an easy way to choose/create an alternate sorting algorithm? I'm
> frequently dealing with large result sets (a few million results) and I
> might be able to benefit domain knowledge in my sort.
> -- 
> View this message in context: 
> http://www.nabble.com/Custom-Sorting-Algorithm-tp21837721p21837721.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Queued Requests during GC

2009-02-04 Thread Otis Gospodnetic

Wojtek,

I'm not familiar with the details of Tomcat configuration, but this definitely 
sounds like a container issue, closely related to the JVM.

Doing a thread dump for the Java process (the JVM your TOmcat runs in) while 
the GC is running will show you which threads are blocked and in turn that 
should point you in the right direction as far as Tomcat setting is covered.  
Sorry for not being able to give you a more specific answer.

Is this happening with the latest JVM from Sun?

I'd be curious if you could reproduce this in Jetty

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: wojtekpia 
> To: solr-user@lucene.apache.org
> Sent: Wednesday, February 4, 2009 2:37:46 PM
> Subject: Queued Requests during GC
> 
> 
> During full garbage collection, Solr doesn't acknowledge incoming requests.
> Any requests that were received during the GC are timestamped the moment GC
> finishes (at least that's what my logs show). Is there a limit to how many
> requests can queue up during a full GC? This doesn't seem like a Solr
> setting, but rather a container/OS setting (I'm using Tomcat on Linux).
> 
> Thanks.
> 
> Wojtek
> -- 
> View this message in context: 
> http://www.nabble.com/Queued-Requests-during-GC-tp21837898p21837898.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Custom Sorting Algorithm

That's not quite what I meant. I'm not looking for a custom comparator, I'm
looking for a custom sorting algorithm. Is there a way to use quick sort or
merge sort or... rather than the current algorithm? Also, what is the
current algorithm?

Otis Gospodnetic wrote:
> 
> 
> You can use one of the exiting function queries (if they fit your need) or
> write a custom function query to reorder the results of a query.
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Custom-Sorting-Algorithm-tp21837721p21838804.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Total count of facets

2009-02-04 Thread Erik Hatcher

What about using the luke request handler to get the distinct values  
count?  Although it is pretty seriously heavy on a big index, so  
probably not quite workable in your case.


Erik

On Feb 4, 2009, at 12:54 PM, Yonik Seeley wrote:

On Wed, Feb 4, 2009 at 5:42 AM, Bruno Aranda   
wrote:
Unfortunately, after some tests listing all the distinct surnames  
or other
fields is too slow and too memory consuming with our current  
infrastructure.
Could someone confirm that if I wanted to add this functionality  
(just count

the total of different facets) what I should do is to subclass the
SimpleFacets class and create an extended FacetComponent that  
returns the

size of the term counts list instead of the list itself?


This wouldn't be too hard to do... and I think it's been requested in
the past at least a few times:
http://www.lucidimagination.com/search/document/7ab1d7fff1fb556e/numfound_for_facet_results

The slightly harder part is changing the response format in a backward
compatible way.

-Yonik

Re: Custom Sorting Algorithm

2009-02-04 Thread Mark Miller

It would not be simple to use a new algorithm. The current 
implementation takes place at the Lucene level and uses a priority 
queue. When you ask for the top n results, a priority queue of size n is 
filled with all of the matching documents. The ordering in the priority 
queue is the sort. The on Sort method orders by relevance score - the 
Sort method orders by field, relevance, or doc id.


- Mark

wojtekpia wrote:

That's not quite what I meant. I'm not looking for a custom comparator, I'm
looking for a custom sorting algorithm. Is there a way to use quick sort or
merge sort or... rather than the current algorithm? Also, what is the
current algorithm?


Otis Gospodnetic wrote:
  

You can use one of the exiting function queries (if they fit your need) or
write a custom function query to reorder the results of a query.

Re: Total count of facets

On Wed, Feb 4, 2009 at 3:47 PM, Erik Hatcher  wrote:
> What about using the luke request handler to get the distinct values count?

That wouldn't restrict results by the base query and filters.

-Yonik

Re: Custom Sorting Algorithm

Ok, so maybe a better question is: should I bother trying to change the
"sorting" algorithm? I'm concerned that with large data sets, sorting
becomes a severe bottleneck (this is an assumption, I haven't profiled
anything to verify). Does it become a severe bottleneck? Do you know if
alternate sort algorithms have been tried during Lucene development? 

markrmiller wrote:
> 
> It would not be simple to use a new algorithm. The current 
> implementation takes place at the Lucene level and uses a priority 
> queue. When you ask for the top n results, a priority queue of size n is 
> filled with all of the matching documents. The ordering in the priority 
> queue is the sort. The on Sort method orders by relevance score - the 
> Sort method orders by field, relevance, or doc id.
> 

-- 
View this message in context: 
http://www.nabble.com/Custom-Sorting-Algorithm-tp21837721p21840299.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Queued Requests during GC

On Wed, Feb 4, 2009 at 3:12 PM, Otis Gospodnetic
 wrote:
> I'd be curious if you could reproduce this in Jetty

All application threads are blocked... it's going to be the same in
Jetty or Tomcat or any other container that's pure Java.  There is an
OS level listening queue that has a certain depth (configurable in
both tomcat and jetty and passed down to the OS when listen() for the
socket is called).  If too many connections are initiated without
being accepted, they will start being rejected.

See UNIX man pages for listen() and connect() for more details.

For Tomcat, the config param you want is "acceptCount"
http://tomcat.apache.org/tomcat-6.0-doc/config/http.html

Increasing this will ensure that connections don't get rejected while
a long GC is going on.

-Yonik

Re: Custom Sorting Algorithm

On Wed, Feb 4, 2009 at 4:45 PM, wojtekpia  wrote:
> Ok, so maybe a better question is: should I bother trying to change the
> "sorting" algorithm? I'm concerned that with large data sets, sorting
> becomes a severe bottleneck (this is an assumption, I haven't profiled
> anything to verify).

No... Lucene/Solr never sorts the complete result set.
If you ask for the top 10 results, a priority queue (heap) of the
current top 10 results is maintained... far more efficient and
scalable than sorting all the hits at the end.

-Yonik

Re: Queued Requests during GC

This is when a load balancer helps. The requests sent around the
time that the GC starts will be stuck on that server, but later
ones can be sent to other servers.

We use a "least connections" load balancing strategy. Each connection
represents a request in progress, so this is the same as equalizing
the queue of requests for each server.

Also, only use as much heap as you really need. A larger heap
means longer GCs.

wunder

On 2/4/09 1:59 PM, "Yonik Seeley"  wrote:

> On Wed, Feb 4, 2009 at 3:12 PM, Otis Gospodnetic
>  wrote:
>> I'd be curious if you could reproduce this in Jetty
> 
> All application threads are blocked... it's going to be the same in
> Jetty or Tomcat or any other container that's pure Java.  There is an
> OS level listening queue that has a certain depth (configurable in
> both tomcat and jetty and passed down to the OS when listen() for the
> socket is called).  If too many connections are initiated without
> being accepted, they will start being rejected.
> 
> See UNIX man pages for listen() and connect() for more details.
> 
> For Tomcat, the config param you want is "acceptCount"
> http://tomcat.apache.org/tomcat-6.0-doc/config/http.html
> 
> Increasing this will ensure that connections don't get rejected while
> a long GC is going on.
> 
> -Yonik

Re: Queued Requests during GC

2009-02-04 Thread Mark Miller


Walter Underwood wrote:

Also, only use as much heap as you really need. A larger heap
means longer GCs.
  
Right. Ideally you want to figure out how to get longer pauses down. 
There is a lot of fiddling that you can do to improve gc times.


On a multiprocessor machine you can parallelize collection of both the 
new and tenured spaces for a nice boost. You can resize spaces within 
the heap as well. There is also a low pause incremental collector you 
can try. A lot of this type of tuning takes trial and error and 
experience though. A really helpful tool is visualgc, which lets you 
watch garbage collection for your app in realtime. You can also use 
jconsole and other tools like that, but visualgc actually renders a view 
of the heap and its easier to watch and get a feel for how garbage 
collection is working. If its hard to get a GUI up, all of those tools 
work remotely as well.


You can find a lot of good info on things to try here:

http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html

If there are spots in Lucene/Solr that are producing so much garbage 
that we can't keep up, perhaps work can be done to address this upon 
pinpointing the issues.


- Mark

Re: Queued Requests during GC

On 2/4/09 2:48 PM, "Mark Miller"  wrote:

> If there are spots in Lucene/Solr that are producing so much garbage
> that we can't keep up, perhaps work can be done to address this upon
> pinpointing the issues.
> 
> - Mark

I have not had the time to pin it down, but I suspect that items
evicted from the query result cache contain a lot of objects.
Are the keys a full parse tree? That could be big.

wunder

Re: Queued Requests during GC

On Wed, Feb 4, 2009 at 5:52 PM, Walter Underwood  wrote:
> I have not had the time to pin it down, but I suspect that items
> evicted from the query result cache contain a lot of objects.
> Are the keys a full parse tree? That could be big.

Yes, keys are full Query objects.
It would be non-trivial to switch to String given all of the things
that can affect how a Query object is built.

-Yonik

Re: Queued Requests during GC

Aha! I bet that the full Query object became a lot more complicated
between Solr 1.1 and 1.3. That would explain why we did 4X as much GC
after the upgrade.

Items evicted from cache are tenured, so they contribute to the full GC.
With an HTTP cache in front, there is hardly anything left to be
cached, so there are lots of evictions. We get a query result cache
hit rate around 0.12.

wunder

On 2/4/09 3:01 PM, "Yonik Seeley"  wrote:

> On Wed, Feb 4, 2009 at 5:52 PM, Walter Underwood 
> wrote:
>> I have not had the time to pin it down, but I suspect that items
>> evicted from the query result cache contain a lot of objects.
>> Are the keys a full parse tree? That could be big.
> 
> Yes, keys are full Query objects.
> It would be non-trivial to switch to String given all of the things
> that can affect how a Query object is built.
> 
> -Yonik

Re: Queued Requests during GC

2009-02-04 Thread Mark Miller


Walter Underwood wrote:

Aha! I bet that the full Query object became a lot more complicated
between Solr 1.1 and 1.3. That would explain why we did 4X as much GC
after the upgrade.

Items evicted from cache are tenured, so they contribute to the full GC.
With an HTTP cache in front, there is hardly anything left to be
cached, so there are lots of evictions. We get a query result cache
hit rate around 0.12.

wunder
  
At 10%, have you considered just not using the cache? Is that worth all 
the extra work? Or are you not paying as much as your losing in GC/cache 
time?


- Mark

Re: Highlighting Oddities

2009-02-04 Thread ashokc


I have seen some of these oddities that Chris is referring to. In my case,
terms that are NOT in the query get highlighted. For example searching for
'Intel' highlights 'Microsot Corp' as well. I do not have them as synonyms
either. Do these filter factories add some extra intelligence to the index
in that if you search for 'Samsung' even 'LG' is considered a highlightable
term?

I believe this was not the case when I was working with an earlier
development version (from Nov or early Dec). Right now I am using
solr-2008-12-29.war.

- ashok



ryguasu wrote:
> 
> I'm testing out the default (gap) fragmenter with some simple,
> single-word queries on a patched 1.3.0 release populated with some
> real-world data. (I think the primary quirk in my setup is that I'm
> using ShingleFilterFactory to put word bigrams (aka shingles) into my
> index. I was worried that this might mess up highlighting, but
> highlighting is *mostly* working.) There are some oddities here, and
> I'm wondering if people have any suggestions for debugging my setup
> and/or trying to make a good, reproducible test case.
> 
> 1. The main weird thing is that, the vast majority of the time, the
> highlighted term is the last term in the fragment. For example, if I
> search for "cat", then almost all my fragments look like this:
> 
> fragment 1: "to the *cat*"
> fragment 2: "with the *cat*"
> fragment 3: "it's what the *cat*"
> fragment 4: "Once upon a time the *cat*"
> 
> (My actual fragments are longer. The key to note is that all of these
> examples end in "cat".)
> 
> Sometimes "cat" will appear at somewhere other than the last position,
> but this is rare. My expectation, in contrast, is that "cat" would
> tend to be more or less evenly distributed throughout fragment
> positions.
> 
> Note: I tried to reproduce this on 1.3.0 with my patches applied but
> using the example dataset/schema from the Solr source tree rather than
> my own dataset/schema. With the example dataset this didn't seem to be
> an issue.
> 
> I've experienced three other highlighting issues, which may or may not
> be related:
> 
> 2. Sometimes, if a term appears multiple times in a fragment, not just
> the term but all the words in between the two appearances will get
> highlighted too. For example, I searched for "fear", and got this as
> one of the snippets:
> 
> SETTLEMENT AGREEMENT This Agreement ("the Agreement") is entered
> into this 18th day of August, 2008, by
> and between Cape Fear Bank Corporation, a North Carolina
> corporation (the "Company"), and Cape Fear
> 
> In contrast, I would have expected
> 
> SETTLEMENT AGREEMENT This Agreement ("the Agreement") is entered
> into this 18th day of August, 2008, by
> and between Cape Fear Bank Corporation, a North Carolina
> corporation (the "Company"), and Cape Fear
> 
> 3. My install seems to have a curiously liberal interpretation of
> hl.fragsize. Now if I put hl.fragsize=0, then things are as expected,
> i.e. it highlights the whole field. And it also seems more or less
> true (as it should) that as I increase hl.fragsize, the fragments get
> longer. However, I was surprised to see that when I put hl.fragsize=1
> or hl.fragsize=5, I can get fragments as long as this one:
> 
> addition, we believe the wireless feature for our controller will
> facilitate exceptional customer services and
> response time." About GpsLatitude GpsLatitude, a Montreal-based
> company, is a provider of security
> solutions and tracking for mobile assets. It is also a developer
> of advanced " Videlocalisation" , a cost-effective,
> integrated mobile digital video
> 
> That seems shockingly long for something of size "five".
> 
> 4. Very rarely I'll get a fragment that doesn't actually contain any
> of the search terms. For example, maybe I'll search for "cat", and
> I'll get back "three ounces of milk" as a snippet. I need to explore
> this more, though the last time this happened when I opened the
> document and found that when I located "three ounces of milk" in the
> document text, the word "cat" did appear nearby; so maybe the document
> did contain "three ounces of milk for the cat".
> 
> Obviously I'm not describing my setup in much detail. Let me know what
> you think would be helpful to know more about.
> 
> Thanks,
> Chris
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Highlighting-Oddities-tp20351015p21841992.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Queued Requests during GC

On 2/4/09 3:17 PM, "Mark Miller"  wrote:

> Walter Underwood wrote:
>> Aha! I bet that the full Query object became a lot more complicated
>> between Solr 1.1 and 1.3. That would explain why we did 4X as much GC
>> after the upgrade.
>> 
>> Items evicted from cache are tenured, so they contribute to the full GC.
>> With an HTTP cache in front, there is hardly anything left to be
>> cached, so there are lots of evictions. We get a query result cache
>> hit rate around 0.12.
>> 
>> wunder
>>   
> At 10%, have you considered just not using the cache? Is that worth all
> the extra work? Or are you not paying as much as your losing in GC/cache
> time?

I was going to verify the source of the tenured garbage before starting
another round of trial-and-error tuning. Now that I have a good hunch,
I might spend some time on that after the Oscars (our peak day for the
year at Netflix).

Another approach is to get fancy with the load balancing and always
send the same query back to the same server. That increases the
effective cache size by the number of servers, but it forces a
simplistic round-robin load balancing and you have to be careful
with down servers to avoid blowing all the caches simultaneously.

At Infoseek, we learned that blowing all the caches when one server
goes down is a very bad idea.

wunder

Re: Queued Requests during GC

2009-02-04 Thread Chris Hostetter


: >> Aha! I bet that the full Query object became a lot more complicated
: >> between Solr 1.1 and 1.3. That would explain why we did 4X as much GC
: >> after the upgrade.

I don't thinkg the Query class implementations themselves changed in 
anyway that would have made them larger -- but if you switched from the 
standard parser to dismax parser, or started using lots of boost 
queries, or started using prefix or wildcard queries, then yes: the Query 
objects used would have gotten bigger.

: Another approach is to get fancy with the load balancing and always
: send the same query back to the same server. That increases the
: effective cache size by the number of servers, but it forces a
: simplistic round-robin load balancing and you have to be careful
: with down servers to avoid blowing all the caches simultaneously.

at a certain point, if you have enough machines, a two tiered LB situation 
starts to be worth consideration.  tier#1 can use hashing on the 
querystring to pick which tier#2 cluster to send the query to.  each 
tier#2 cluster can be fronted by a load balancer that picks the server to 
use based on whatever "workload" metric you want.  a small percentage of 
machines in any given cluster (or in every cluster) can be down w/o 
worrying about screwing up the caches or adversly afecting traffic -- you 
just can't let an entire cluster be down at once.



-Hoss

Latest on DataImportHandler and Tika?

2009-02-04 Thread Chris Harris

Back in November, Shalin and Grant were discussing integrating
DataImportHandler and Tika. Shalin's estimation about the best way to
do this was as follows:

**

I think the best way would be a TikaEntityProcessor which knows how to
handle documents. I guess a typical use-case would be
FileListEntityProcessor->TikaEntityProcessor as parent-child entities.

Also see SOLR-833 which adds a FieldReaderDataSource using which you can
pass any field's content to an entity for processing. So you can have a
[SqlEntityProcessor, JdbcDataSource] producing a blob and a
[FieldReaderDataSource, TikaEntityProcessor] consuming it.

(http://www.nabble.com/DataImportHandler-and-Blobs-td20464891.html)

**

Has there been any work on something like this? Alternatively, is
anyone else put together an alternative way to get DataImportHandler
to extract body text from PDFs, Word files, etc.?

Thanks,
Chris

Re: Queued Requests during GC