Saravanan Chinnadurai/Actionimages is out of the office.

2011-12-04 Thread Saravanan . Chinnadurai
I will be out of the office starting 05/12/2011 and will not return until
05/01/2012.

Please email itsta...@actionimages.com for any urgent issues.




Re: Possible to facet across two indices, or document types in single index?

2011-12-04 Thread Jeff Schmidt
Well, the JoinQParserPlugin is definitely there.  Turning on debug reveals why 
I get zero results.  Given the URL:

http://localhost:8091/solr/ing-content/select/?qt=partner-tmo&fq=type:node&q={!join+from=conceptId+to=id+fromIndex=partner-tmo}brca1&debugQuery=true&rows=5&fl=id,n_type,n_name

I get:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="debugQuery">true</str>
      <str name="fl">id,n_type,n_name</str>
      <str name="q">{!join from=conceptId to=id fromIndex=partner-tmo}brca1</str>
      <str name="qt">partner-tmo</str>
      <str name="fq">type:node</str>
      <str name="rows">5</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="debug">
    <str name="rawquerystring">{!join from=conceptId to=id fromIndex=partner-tmo}brca1</str>
    <str name="querystring">{!join from=conceptId to=id fromIndex=partner-tmo}brca1</str>
    <str name="parsedquery">JoinQuery({!join from=conceptId to=id fromIndex=partner-tmo}n_text:brca)</str>
    <str name="parsedquery_toString">{!join from=conceptId to=id fromIndex=partner-tmo}n_text:brca</str>
    <arr name="filter_queries">
      <str>type:node</str>
    </arr>
    <arr name="parsed_filter_queries">
      <str>type:node</str>
    </arr>
    ...
  </lst>
</response>

It looks like despite qt=partner-tmo, my edismax-based search handler is being 
bypassed in favor of the default search handler, which is querying against the 
n_text field, the defaultSearchField for the ing-content core.  But I don't 
want to use the default handler; I want my configured edismax handler, and any 
specified filter queries, to determine the document set in the ing-content 
core, which is then joined with the partner-tmo core.  [Yes, the edismax 
handler in the ing-content core and the second core are both named partner-tmo.]

Can the JoinQParserPlugin work in conjunction with edismax?
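
(For reference, the approach I plan to try is parameter dereferencing, so the 
inner query gets its own parser; the qf fields here are hypothetical:

q={!join from=conceptId to=id fromIndex=partner-tmo v=$jq}&jq={!edismax qf='n_name n_text'}brca1

I don't know yet whether the join parser honors a nested edismax query this way.)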

Thanks,

Jeff

On Dec 4, 2011, at 4:12 PM, Jeff Schmidt wrote:

> Hello again:
> 
> I'm looking at the newer join functionality 
> (http://wiki.apache.org/solr/Join) to see if that will help me out.  While 
> there are signs it can go cross-index/core 
> (https://issues.apache.org/jira/browse/SOLR-2272), I doubt I can specify 
> facet.field params for fields in a couple of different indexes.  But perhaps 
> with a single combined index it might work.
> 
> Anyway, the above Jira item indicates status: resolved, resolution: fixed, 
> and Fix version/s: 4.0.  I've been working with 3.5.0, so I checked out 4.0 
> from svn today:
> 
> [imac:svn/dev/trunk] jas% svn info
> Path: .
> URL: http://svn.apache.org/repos/asf/lucene/dev/trunk
> Repository Root: http://svn.apache.org/repos/asf
> Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
> Revision: 1210126
> ...
> Last Changed Rev: 1210116
> Last Changed Date: 2011-12-04 07:35:46 -0700 (Sun, 04 Dec 2011)
> 
> Issuing a join query, it looks like the local params syntax is being ignored 
> and treated as part of the search terms? I get zero results, whereas without 
> the join I get 979.
> 
> 
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">1</int>
>     <lst name="params">
>       <str name="fl">id,n_type,n_name</str>
>       <str name="q">{!join from=conceptId to=id fromIndex=partner-tmo}brca1</str>
>       <str name="qt">partner-tmo</str>
>       <str name="fq">type:node</str>
>       <str name="rows">5</str>
>     </lst>
>   </lst>
>   <result name="response" numFound="0" start="0"/>
> </response>
> 
> 
> I've not fully explored this yet, and I'm not all that familiar with the 
> Solr codebase, but is this functionality in 4.x trunk or not? I can see there 
> is the package org.apache.lucene.search.join. Is this the implementation of 
> SOLR-2272?
> 
> I can see the commit was made earlier this year, and then it was reverted and 
> things went off the rails. I don't want to open any old wounds, but does the 
> join exist?  If not, I'll know not to pursue it any further. If so, is there 
> some solrconfig.xml configuration needed to enable it?  I don't see it in the 
> examples.
> 
> Thanks,
> 
> Jeff
> 
> On Dec 1, 2011, at 9:47 PM, Jeff Schmidt wrote:
> 
>> Hello:
>> 
>> I'm trying to relate together two different types of documents.  Currently I 
>> have 'node' documents that reside in one index (core), and 'product mapping' 
>> documents that are in another index.  The product mapping index is used to 
>> map tenant products to nodes. The nodes are canonical content that gets 
>> updated every quarter, whereas the product mappings can change at any time.
>> 
>> I put them in two indexes because (1) canonical content changes rarely, and 
>> I don't want product mapping changes to affect it (commit, re-open searchers 
>> etc.), and I would like to support multiple tenants mapping products to the 
>> same canonical content to avoid duplication (a few GB).
>> 
>> This arrangement has worked well thus far, but only in the sense that for each 
>> node result returned, I can query the product mapping index to determine the 
>> products mapped to the node.  I combine this information within my 
>> application and return it to the client.  This works okay in that there are 
>> only 5-20 results returned per page (start, rows).  But now I'm being asked 
>> to facet the product categories (multi-valued field within a product mapping 
>> document) along with other facets defined in the canonical content.
>> 
>> Can this be done with Solr 3.5.0?  I've been looking into sub-queries, 
>> function queries etc.  Also, I've seen various postings indicating that one 
>> nee

Re: Distributed Solr: different number of results each time

2011-12-04 Thread ffriend
It seems like the error was caused by a wrong list of shard URLs kept in
ZooKeeper. One possible workaround is to specify the list of shards manually
with

shards=slave-node1,slave-node2,slave-node3,...

(see the SolrCloud documentation for details)
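
For example (host names and ports hypothetical):

http://slave-node1:8983/solr/select?q=*:*&shards=slave-node1:8983/solr,slave-node2:8983/solr,slave-node3:8983/solr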



Re: Possible to facet across two indices, or document types in single index?

2011-12-04 Thread Jeff Schmidt
Hello again:

I'm looking at the newer join functionality (http://wiki.apache.org/solr/Join) 
to see if that will help me out.  While there are signs it can go cross 
index/core (https://issues.apache.org/jira/browse/SOLR-2272), I doubt I can 
specify facet.field params for fields in a couple of different indexes.  But 
perhaps with a single combined index it might work.

Anyway, the above Jira item indicates status: resolved, resolution: fixed, and 
Fix version/s: 4.0.  I've been working with 3.5.0, so I checked out 4.0 from 
svn today:

[imac:svn/dev/trunk] jas% svn info
Path: .
URL: http://svn.apache.org/repos/asf/lucene/dev/trunk
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 1210126
...
Last Changed Rev: 1210116
Last Changed Date: 2011-12-04 07:35:46 -0700 (Sun, 04 Dec 2011)

Issuing a join query, it looks like the local params syntax is being ignored 
and treated as part of the search terms? I get zero results, whereas without 
the join I get 979.



<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="fl">id,n_type,n_name</str>
      <str name="q">{!join from=conceptId to=id fromIndex=partner-tmo}brca1</str>
      <str name="qt">partner-tmo</str>
      <str name="fq">type:node</str>
      <str name="rows">5</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>





I've not fully explored this yet, and I'm not all that familiar with the 
Solr codebase, but is this functionality in 4.x trunk or not? I can see there 
is the package org.apache.lucene.search.join. Is this the implementation of 
SOLR-2272?

I can see the commit was made earlier this year, and then it was reverted and 
things went off the rails. I don't want to open any old wounds, but does the 
join exist?  If not, I'll know not to pursue it any further. If so, is there 
some solrconfig.xml configuration needed to enable it?  I don't see it in the 
examples.

Thanks,

Jeff

On Dec 1, 2011, at 9:47 PM, Jeff Schmidt wrote:

> Hello:
> 
> I'm trying to relate together two different types of documents.  Currently I 
> have 'node' documents that reside in one index (core), and 'product mapping' 
> documents that are in another index.  The product mapping index is used to 
> map tenant products to nodes. The nodes are canonical content that gets 
> updated every quarter, whereas the product mappings can change at any time.
> 
> I put them in two indexes because (1) canonical content changes rarely, and I 
> don't want product mapping changes to affect it (commit, re-open searchers 
> etc.), and I would like to support multiple tenants mapping products to the 
> same canonical content to avoid duplication (a few GB).
> 
> This arrangement has worked well thus far, but only in the sense that for each 
> node result returned, I can query the product mapping index to determine the 
> products mapped to the node.  I combine this information within my 
> application and return it to the client.  This works okay in that there are 
> only 5-20 results returned per page (start, rows).  But now I'm being asked 
> to facet the product categories (multi-valued field within a product mapping 
> document) along with other facets defined in the canonical content.
> 
> Can this be done with Solr 3.5.0?  I've been looking into sub-queries, 
> function queries etc.  Also, I've seen various postings indicating that one 
> needs to denormalize more.  I don't want to add product information as fields 
> to the canonical content. Not only does that defeat my objective (1) above, 
> but Solr does not support incremental updates of document fields.
> 
> So, one approach is to issue my query to the canonical index and get all of 
> the document IDs (could be 1000s), and then issue a filter query to the 
> product mapping index with all of these IDs and have Solr facet the product 
> categories.  Is that efficient?  I suppose I could use HTTP POST (via SolrJ) 
> to convey that payload of IDs?  I could then take the facet results of that 
> query and combine them with the canonical index results and return them to 
> the client.
> 
> That may be do-able, but then let's say the user clicks on a product category 
> facet value to narrow the node results to only those mapped to category XYZ. 
> This will not affect the query issued against the canonical content index.  
> Instead, I think I'd have to go through the canonical results and eliminate 
> the nodes that are not associated with product category XYZ.  Then, if the 
> current page of results is inadequate (rows=10, but 3 nodes were eliminated), 
> I'd have to go back to the canonical index to get more rows, eliminate some 
> again perhaps, get more etc.  That sounds unappealing and low-performing.
> 
> Is there a Solr way to do this?  My Packt "Apache Solr 3 Enterprise Search 
> Server" book (page 34) states regarding separate indices:
> 
>   "If you do develop separate schemas and if you need to search across 
> your indices in one search then you must perform a distributed search, 
> described in the last chapter. A distributed search is usually a feature 
> employed fo

Re: SolR for time-series data

2011-12-04 Thread Ted Dunning
SAX is attractive, but I have found it lacking in practice.  My primary
issue is that in order to get sufficient recall for practical matching
problems, I had to do enough query expansion that the speed advantage of
inverted indexes went away.

The OP was asking for blob storage, however, and I think that SolR is fine
for that.

There is also the question of access to time series based on annotations
produced by other programs.  If the annotations express your intent, then
SolR wins again.  If the annotations are SAX annotations and that works for
you, great, but I wouldn't be optimistic that this would handle a wide
range of time series problems.

On Sun, Dec 4, 2011 at 5:14 AM, Grant Ingersoll  wrote:

> Definitely should be possible.  As an aside, I've also thought one could
> do more time series stuff.  Have a look at the iSAX stuff by Shieh and
> Keogh: http://www.cs.ucr.edu/~eamonn/iSAX/iSAX.html
>
>
> On Dec 3, 2011, at 12:10 PM, Alan Miller wrote:
>
> > Hi,
> >
> > I have a webapp that plots a bunch of time series data which
> > is just a series of doubles coupled with a timestamp.
> >
> > Every chart in my webapp has a chart_id in my db and i am wondering if it
> > would be
> > effective to use solr to serve the data to my app instead of keeping the
> > data in my rdbms.
> >
> > Currently I'm using hadoop to calc and generate the report data and then
> > sticking it in my
> > rdbms but I could use solrj client to upload the data to a solr index
> > directly.
> >
> > I know solr is for indexing text documents but would it be effective to
> use
> > solr in this way?
> >
> > I want to query by chart_id and get back a series of timestamp:double
> pairs.
> >
> > Regards
> > Alan
>
> 
> Grant Ingersoll
> http://www.lucidimagination.com
>
>
>
>


quoted query issue

2011-12-04 Thread C Hagenmaier
My query for the terms road and show

body:(road show)

returns 4 documents.  The highlighting shows several instances where road 
immediately precedes show.  However, a query for the phrase "road show"

body:("road show")

returns no documents.  I have similar results with "floor show" and "road 
house."  I have verified that the indexed text field contains the phrases I'm 
searching. 
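
(For completeness, the full request is roughly this; host, port, and core are
hypothetical:

http://localhost:8983/solr/select?q=body:(%22road%20show%22)&debugQuery=true

Adding debugQuery=true shows how the phrase is actually being parsed, which is
the next thing I plan to check.)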


Here's the XML response:

0
1
on
2.2
identifier,title,year,volume
c7aab49e267
body
10
0
body:("road show")

What do I do now?

--
Carl



Re: Configuring the Distributed

2011-12-04 Thread Yonik Seeley
On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller  wrote:
> You always want to use the distrib-update-chain. Eventually it will
> probably be part of the default chain and auto turn on in zk mode.

I'm working on this now...

-Yonik
http://www.lucidimagination.com


Re: Configuring the Distributed

2011-12-04 Thread Yonik Seeley
On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller  wrote:
> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson  wrote:
>
>> I am currently looking at the latest solrcloud branch and was
>> wondering if there was any documentation on configuring the
>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>> to be added/modified to make distributed indexing work?
>>
>
>
> Hi Jaime - take a look at solrconfig-distrib-update.xml in
> solr/core/src/test-files
>
> You need to enable the update log, add an empty replication handler def,
> and an update chain with solr.DistributedUpdateProcessFactory in it.

One also needs an indexed _version_ field defined in schema.xml for
versioning to work.
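
Roughly, the pieces look like this (a sketch based on that test config; exact
class and element names may differ on the branch, so double-check
solrconfig-distrib-update.xml):

<!-- solrconfig.xml: enable the update log -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.data.dir:}</str>
  </updateLog>
</updateHandler>

<!-- an empty replication handler definition -->
<requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy"/>

<!-- an update chain containing the distributed update processor -->
<updateRequestProcessorChain name="distrib-update-chain">
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- schema.xml: the indexed _version_ field -->
<field name="_version_" type="long" indexed="true" stored="true"/>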

-Yonik
http://www.lucidimagination.com


Penalize certain keywords but not completely forbid them

2011-12-04 Thread tux
I have a situation where each doc is described by a tag field with multiple
tags. Tags come in pairs, so when one tag is added to the field, the opposite
tag in the pair is rejected for the document. Tags are also optional, so two
documents may be described by different sets of tags. When I match these
documents, I want documents sharing the same tags to rank higher and documents
with opposite tags to rank lower, even lower than documents that share only a
small number of common tags. An example of this:

Document 1: Red, Big, Heavy, ...
Document 2: Red, Heavy, ...
Document 3: Red, Small, ...

(Red/Green is a pair, Big/Small is a pair, Heavy/Light is a pair. There may
be many more pairs of tags. this is just an example.)

Then when I match a new Document with "Red, Big", Document 1 should be top,
Document 2 in the middle, and Document 3 in the bottom. But I still want
Document 3 to show up in result because it still matches on Red.

If I simply add the opposite tags to the query with a boost below 1 (e.g.
search for "Red Big Small^0.1"), they still contribute positively to the final
score, so document 3 will rank higher than document 2.

If I use "-" on the opposite terms (fieldName:(Red Big) -fieldName:Small)
I'll lose document 3 altogether.
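
(A middle ground I haven't tried yet, with the standard query parser and a
hypothetical boost, would be an optional clause that rewards the absence of the
opposite tag instead of punishing its presence:

fieldName:(Red Big) (*:* -fieldName:Small)^2

Documents without Small pick up the extra clause, documents with it merely miss
out, so Document 3 still matches on Red. The catch is that the *:* clause alone
also matches documents with no tag overlap at all.)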

What is the best strategy for implementing this? If there is nothing out of
box supporting this, where should I go to modify the server itself?

Thanks.





Re: SolR for time-series data

2011-12-04 Thread Grant Ingersoll
Definitely should be possible.  As an aside, I've also thought one could do 
more time series stuff.  Have a look at the iSAX stuff by Shieh and Keogh: 
http://www.cs.ucr.edu/~eamonn/iSAX/iSAX.html


On Dec 3, 2011, at 12:10 PM, Alan Miller wrote:

> Hi,
> 
> I have a webapp that plots a bunch of time series data which
> is just a series of doubles coupled with a timestamp.
> 
> Every chart in my webapp has a chart_id in my db and i am wondering if it
> would be
> effective to use solr to serve the data to my app instead of keeping the
> data in my rdbms.
> 
> Currently I'm using hadoop to calc and generate the report data and then
> sticking it in my
> rdbms but I could use solrj client to upload the data to a solr index
> directly.
> 
> I know solr is for indexing text documents but would it be effective to use
> solr in this way?
> 
> I want to query by chart_id and get back a series of timestamp:double pairs.
> 
> Regards
> Alan
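
Since you mentioned SolrJ: a rough sketch of the upload side, with field names
and the core URL purely hypothetical (3.x-era SolrJ API):

import java.util.Date;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ChartUploader {
    public static void main(String[] args) throws Exception {
        // Core URL is hypothetical; point it at your Solr instance.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/charts");

        // One document per data point: chart_id + timestamp + value.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "chart42-" + System.currentTimeMillis()); // unique key
        doc.addField("chart_id", 42);
        doc.addField("timestamp", new Date());
        doc.addField("value", 3.14); // the double being plotted
        server.add(doc);
        server.commit();

        // Query side: all points for one chart, sorted by time, e.g.
        // /select?q=chart_id:42&sort=timestamp+asc&fl=timestamp,value&rows=10000
    }
}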


Grant Ingersoll
http://www.lucidimagination.com





Re: Memory Leak in Solr?

2011-12-04 Thread Samarendra Pratap
Hi Chris,
 Thanks for your reply and sorry for the delay. Please find my replies below in
the mail.

On Sat, Dec 3, 2011 at 5:56 AM, Chris Hostetter wrote:

>
> : Till 3 days ago, we were running Solr 3.4 instance with following java
> : command line options
> : java -server -*Xms2048m* -*Xmx4096m* -Dsolr.solr.home=etc -jar start.jar
> :
> : Then we increased the memory with following options and restarted the
> : server
> : java -server *-**Xms4096m* -*Xmx10g* -Dsolr.solr.home=etc -jar start.jar
>...
> : Since we restarted Solr, the memory usage of application is continuously
> : increasing. The swap usage goes from almost zero to as high as 4GB in
> every
> : 6-8 hours. We kept restarting the Solr to push it down to ~zero but the
> : same memory usage trend kept repeating itself.
>
> do you really mean "swap" in that sentence, or do you mean the amount of
> memory your OS says java is using?  You said you have 16GB total
> physical ram, how big is the index itself? do you have any other processes
> running on that machine?  (You should ideally leave at least enough ram
> free to let the OS/filesystem cache the index in RAM)
>
Yes, by "swap" I mean "swap", which we can see with "free -m" on Linux and in
many other ways. So it is not the memory for java.
The index size is around 31G.
We have this machine dedicated to Solr, so no other significant processes run
here, except the incremental indexing script. I didn't think about the
filesystem cache in RAM earlier, but since we have 16G of RAM, in my opinion
that should be enough.

> Since you've not only changed the Xmx (max heap size) param but also the
> Xms param (min heap size) to 4GB, it doesn't seem out of the ordinary
> at all for the memory usage to jump up to 4GB quickly.  If the JVM did
> exactly what the docs say it should, then on startup it would
> *immediatley* allocated 4GB or ram, but i think in practice it allocates
> as needed, but doesn't do any garbage collection if the memory used is
> still below the "Xms" value.
>
> : Then finally I reverted the least expected change, the command line
> memory
> : options, back to min 2g, max 4g and I was surprised to see that the
> problem
> : vanished.
> : java -server *-Xms2g* *-Xmx4g* -Dsolr.solr.home=etc -jar start.jar
> :
> : Is this a memory leak or my lack of understanding of java/linux memory
> : allocation?
>
> I think you're just missunderstanding the allocation ... if you tell java
> to use at leaast 4GB, it's going to use at least 4GB w/o blinking.
>
I accept I wrote the confusing word "min" for -Xms, but I promise I really
do know its meaning. :-)

> did you try "-Xms2g -Xmx10g"?
>
> (again: don't set Xmx any higher then you actually have the RAM to
> support given the filesystem cache and any other stuff you have running,
> but you can increase mx w/o increasing ms if you are just worried about
> how fast the heap grows on startup ... not sure why that would be
> worrisome though.)
>
As I've written above, I really meant "swap"; I am not really concerned about
heap size at startup.


>
> -Hoss
>

My concern is: when a single machine was able to serve n1+n2 queries earlier
with -Xms2g -Xmx4g, why is the same machine not able to serve n2 queries with
-Xms4g -Xmx10g?

In fact I tried other combinations as well 2g-6g, 1g-6g, 2g-10g but nothing
replicated the issue.

Since yesterday I have been seeing another issue on the same machine: "Too many
open files" errors in the log, which are breaking incremental indexing.

A lot of lines of the lsof were like following -
java    1232  solr   52u  sock  0,5  1805813279  can't identify protocol
java    1232  solr   53u  sock  0,5  1805813282  can't identify protocol
java    1232  solr   54u  sock  0,5  1805813283  can't identify protocol
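
(To watch whether the count keeps growing, something like this works; 1232 is
the PID from the listing above:

lsof -p 1232 | grep -c "can't identify protocol"

)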

I searched for "can't identify protocol" and my case seemed related to the bug
http://bugs.sun.com/view_bug.do?bug_id=6745052, but my java version
("1.6.0_22") does not match the one in the bug description.

I am not sure if this problem and the memory problem could be related. I
did not check the lsof earlier. Could this be a reason of memory leak?

-- 
Regards,
Samar


Re: Solr cache size information

2011-12-04 Thread elisabeth benoit
Thanks a lot for these answers!

Elisabeth

2011/12/4 Erick Erickson 

> See below:
>
> On Thu, Dec 1, 2011 at 10:57 AM, elisabeth benoit
>  wrote:
> > Hello,
> >
> > If anybody can help, I'd like to confirm a few things about Solr's caches
> > configuration.
> >
> > If I want to calculate cache size in memory relativly to cache size in
> > solrconfig.xml
> >
> > For Document cache
> >
> > size in memory = size in solrconfig.xml * average size of all fields
> > defined in fl parameter   ???
>
> pretty much.
>
> >
> > For Filter cache
> >
> > size in memory = size in solrconfig.xml * WHAT (the size of an id) ??? (I
> > don't use facet.enum method)
> >
>
> It Depends(tm). Solr tries to do the best thing here, depending upon
> how many docs match the filter query. One method puts in a bitset for each
> entry, which is (maxDocs/8) bytes. maxDocs is reported on the admin/stats
> page.
>
> If the filter cache only hits a few documents, the size is smaller than
> that.
>
> You can think of this cache as a map where the key is the
> filter query (which is how they're re-used and how autowarm
> works) and the value for each key is the bitset or list. The
> size of the map is bounded by the size in solrconfig.xml.
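>
> As a worked example (numbers hypothetical): with maxDocs = 10,000,000, each
> full bitset is 10,000,000 / 8 = 1,250,000 bytes, about 1.25 MB, so a
> filterCache size of 512 could grow to roughly 512 * 1.25 MB = 640 MB.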
>
> > For Query result cache
> >
> > size in memory = size in solrconfig.xml * the size of an id ???
> >
> Pretty much. This is the maximum size, but each entry is
> the query plus a list of IDs that's up to <queryResultWindowSize>
> long. This cache is, by and large, the least of your worries.
>
>
> >
> > I would also like to know the relation between solr's cache sizes and JVM
> > max size?
>
> Don't quite know what you're asking for here. There's nothing automatic
> that's sensitive to whether the JVM memory limits are about to be exceeded.
> If the caches get too big, OOMs happen.
>
> >
> > If anyone has an answer or a link for further reading to suggest, it
> would
> > be greatly appreciated.
> >
> There's some information here: http://wiki.apache.org/solr/SolrCaching, but
> it often comes down to "try your app and monitor"
>
> Here's a work-in-progress that Grant is working on, be aware that it's
> for trunk, not 3x.
> http://java.dzone.com/news/estimating-memory-and-storage
>
>
> Best
> Erick
>
> > Thanks,
> > Elisabeth
>