CachedSqlEntityProcessor's purpose

2008-11-25 Thread Amit Nithian
I am starting to look at Solr's Data Import Handler framework and am quite
impressed with it so far. My question is in trying to reduce the number of
SQL queries issued to the database and saw this entity processor.

In the following example:
<entity name="x" query="select * from x">
    <entity name="y" query="select * from y where xid=${x.id}"
            processor="CachedSqlEntityProcessor">
    </entity>
</entity>

I like the concept of having multiple entity blocks for clarity, but why
wouldn't I have (for DB efficiency) the following as one entity's SQL
statement: "select * from X,Y where x.id=y.xid", with two fields pointing
at X and Y columns? My main question, though, is how the
CachedSqlEntityProcessor helps in this case, as I want to use the multiple
entity blocks for cleanliness. If I have 500,000 X records, how many SQL
queries in the second entity block (y) would get executed, 500,000?

If there is any more detailed information about the number of queries
executed in different circumstances, the memory overhead, or the way that the
data is brought from the database into Java, it would be much appreciated, as
it's important for my application.

Thanks in advance!
Amit


Unknown field error using JDBC

2008-11-25 Thread Joel Karlsson
Hello,

I get an "Unknown field" error when I'm indexing an Oracle DB. I've reduced
the number of fields/columns in order to troubleshoot. If I change the
uniqueKey to timestamp (for example) and create a dynamic field
<dynamicField name="*" type="text" indexed="true" stored="true"/>, the
indexing works fine, except that the id field is empty.

--data-config.xml---
...

<dataSource driver="oracle.jdbc.OracleDriver"
    url="jdbc:oracle:thin:@host:port/service-name"
    user="user"
    password="pw"
    name="ds1"/>

...

<entity name="document"
    pk="PUBID"
    query="SELECT PUBID FROM UPLMAIN"
    dataSource="ds1">
    <field column="PUBID" name="id"/>
</entity>

...

--

--schema.xml---
...

<field name="id" type="text" indexed="true" stored="true" required="true" />

...

<uniqueKey>id</uniqueKey>

...



--ERROR-message

2008-nov-25 12:25:25 org.apache.solr.handler.dataimport.SolrWriter upload
VARNING: Error creating document :
SolrInputDocument[{PUBID=PUBID(1.0)={43392}}]

org.apache.solr.common.SolrException: ERROR:unknown field 'PUBID'
at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:274)

...

---

Has anyone had similar problems, or does anyone know how to solve this? Any
help is truly appreciated!!

// Joel


Re: Using Solr for indexing emails

2008-11-25 Thread Norberto Meijome
On Tue, 25 Nov 2008 03:59:31 +0200
Timo Sirainen [EMAIL PROTECTED] wrote:

  would it be faster to say q=user:user AND highestuid:[* TO *] ?
 
 Now that I read again what fq really did, yes, sounds like you're right.

you may want to compare them both to see which one is better... I just went
from memory :P

  ( and i
  guess you'd sort DESC and return 1 record only).  
 
 No, I'd use the above for getting highestuid value for all mailboxes
 (there should be only one record per mailbox (each mailbox has separate
 uid values - separate highestuid value)) so I can look at the returned
 highestuid values to see what mailboxes aren't fully indexed yet.

gotcha. It is an interesting use of SOLR, I must say... I for one am not used
to having to deal with up-to-the-second update needs.

good luck,
B

_
{Beto|Norberto|Numard} Meijome

Never offend people with style when you can offend them with substance.
  Sam Brown

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


Re: solr internationalization support

2008-11-25 Thread Shalin Shekhar Mangar
On Mon, Nov 24, 2008 at 7:56 PM, rameshgalla [EMAIL PROTECTED] wrote:


 1) Which languages does Solr support out of the box, other than English?


Solr does not know about any languages. It will apply whatever analyzers you
specify in the schema.xml for that field type.


 2) What analyzers (stemmer, synonym, tokenizer, etc.) does it provide for
 each language?


Quite a few. The complete list is at
http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html
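
For illustration, hooking one of these up is just a matter of declaring a
field type in schema.xml. A rough sketch for German stemming (the type name
and the exact filter choice here are placeholders, not a recommendation):

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on whitespace, lowercase, then apply a Snowball stemmer -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>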


 3) Shall we create our own analyzers for any language? (If possible, explain
 how.)


If the existing analyzers do not work well, then yes, you would need to
create your own. I can't say how easy or difficult it will be because I've
never written one myself.

Some javadocs that may be of help:

http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/TokenFilter.html
http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Tokenizer.html
http://lucene.apache.org/solr/api/org/apache/solr/analysis/BaseTokenizerFactory.html
http://lucene.apache.org/solr/api/org/apache/solr/analysis/BaseTokenFilterFactory.html

-- 
Regards,
Shalin Shekhar Mangar.


Re: Unknown field error using JDBC

2008-11-25 Thread Noble Paul നോബിള്‍ नोब्ळ्
which version of DIH are you using?

On Tue, Nov 25, 2008 at 5:24 PM, Joel Karlsson [EMAIL PROTECTED] wrote:
 ...




-- 
--Noble Paul


Re: Unknown field error using JDBC

2008-11-25 Thread Joel Karlsson
I actually don't know which version I was using, but now I've upgraded to
1.3 and it works like a charm!! Thanks a lot!

2008/11/25 Noble Paul നോബിള്‍ नोब्ळ् [EMAIL PROTECTED]

 which version of DIH are you using?

 On Tue, Nov 25, 2008 at 5:24 PM, Joel Karlsson [EMAIL PROTECTED]
 wrote:
  ...


 --
 --Noble Paul



Re: Schema Design Guidance

2008-11-25 Thread Shalin Shekhar Mangar
Even if you go for the 400,000-documents approach, the size of the data and
the number of unique tokens would remain the same. With your data size, you
should think about sharding and distributed search.
Is the availability of a product a boolean value or the number of items? To
make sure that you don't need to do very frequent updates, it will be better
to use a boolean for availability. Even then, real-time updates in Solr are
not possible, and you will have to allow for a reasonable delay for changes
to take effect.
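
For reference, once an index is split, a distributed query in Solr 1.3 is an
ordinary request with a shards parameter listing the cores to aggregate over,
roughly like this (host names hypothetical; note the shard values omit the
http:// prefix):

http://host1:8983/solr/select?shards=host1:8983/solr,host2:8983/solr&q=name:foo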

On Tue, Nov 25, 2008 at 4:40 AM, Vimal Jobanputra [EMAIL PROTECTED] wrote:

 Hi, and apologies in advance for the lengthy question!

 I'm looking to use Solr to power searching & browsing over a large set of
 product data stored in a relational DB. I'm wondering what the most
 appropriate schema design strategy to use is. A simplified view of the
 relational data is:

 Shop (~1000 rows)
 -Id*
 -Name

 Product (~300,000 rows)
 -Id*
 -Name
 -Availability

 ProductFormat (~5 rows)
 -Id*
 -Name

 Component (part of a product that may be sold separately) (~4,000,000 rows)
 -Id*
 -Name

 ProductComponent (~4,000,000 rows)
 -ProductId*
 -ComponentId*

 ShopProduct (~6,000,000 rows)
 -ShopId*
 -ProductId*
 -ProductFormatId*
 -AvailableDate

 ShopProductPriceList (~15,000,000 rows)
 -ShopId*
 -ProductId*
 -ProductFormatId*
 -Applicability (Component/Product)*
 -Type (Regular/SalePrice)*
 -Amount

 * logical primary key

 This means:
 -availability of a product differs from shop to shop
 -the price of a product or component is dependent on the format, and also
 differs from shop to shop

 Textual searching is required over product & component names, and filtering
 is required over Shops, Product Availability, Formats, & Prices.

 The simplest approach would be to flatten out the data completely (1 Solr
 document per ShopProduct and ShopProductComponent). This would result in
 ~80 million documents, which I'm guessing would need some form of
 sharding/distribution.

 An alternate approach would be to construct one document per Product, and
 *nest* the relational data via dynamic fields (and possibly plugins?).
 E.g. one document per Product; multi-value fields for ProductComponent &
 Shop; dynamic fields for Availability/Format, using ShopId as part of the
 field name.
 This approach would result in far fewer documents (400,000), but more
 complex queries. It would also require extending Solr/Lucene to search over
 ProductComponents and filter by price, which I'm not quite clear on as
 yet...

 Any guidance on which of the two general approaches (or others) to explore
 further?

 Thanks!
 Vim




-- 
Regards,
Shalin Shekhar Mangar.


Re: [VOTE] Community Logo Preferences

2008-11-25 Thread Shalin Shekhar Mangar
https://issues.apache.org/jira/secure/attachment/12394266/apache_solr_b_red.jpg
https://issues.apache.org/jira/secure/attachment/12394070/sslogo-solr-finder2.0.png
https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394353/solr.s5.jpg
https://issues.apache.org/jira/secure/attachment/12394264/apache_solr_a_red.jpg

On Sun, Nov 23, 2008 at 10:29 PM, Ryan McKinley [EMAIL PROTECTED] wrote:

 Please submit your preferences for the solr logo.

 For full voting details, see:
  http://wiki.apache.org/solr/LogoContest#Voting

 The eligible logos are:
  http://people.apache.org/~ryan/solr-logo-options.html

 Any and all members of the Solr community are encouraged to reply to this
 thread and list (up to) 5 ranked choices by listing the Jira attachment
 URLs. Votes will be assigned a point value based on rank. For each vote, 1st
 choice has a point value of 5, 5th place has a point value of 1, and all
 others follow a similar pattern.

 https://issues.apache.org/jira/secure/attachment/12345/yourfrstchoice.jpg
 https://issues.apache.org/jira/secure/attachment/34567/yoursecondchoice.jpg
 ...

 This poll will be open until Wednesday November 26th, 2008 @ 11:59PM GMT

 When the poll is complete, the solr committers will tally the community
 preferences and take a final vote on the logo.

 A big thanks to everyone who submitted possible logos -- it's great to see
 so many good options.




-- 
Regards,
Shalin Shekhar Mangar.


Re: [VOTE] Community Logo Preferences

2008-11-25 Thread Marcus Stratmann

https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394376/solr_sp.png
https://issues.apache.org/jira/secure/attachment/12393936/logo_remake.jpg
https://issues.apache.org/jira/secure/attachment/12394264/apache_solr_a_red.jpg
https://issues.apache.org/jira/secure/attachment/12394353/solr.s5.jpg


Re: Sorting and JVM heap size ....

2008-11-25 Thread Shalin Shekhar Mangar
On Tue, Nov 25, 2008 at 7:49 AM, souravm [EMAIL PROTECTED] wrote:


 3. Another case is - if there are 2 search requests concurrently hitting
 the server, each with sorting on the same 20 character date field, then also
 it would need 2x2GB memory. So if I know that I need to support at least 4
 concurrent search requests, I need to start the JVM at least with 8 GB heap
 size.


This is a misunderstanding. Yonik said searchers, not searches. A single
searcher handles all live search requests. When a commit/optimize happens, a
new searcher is created, its caches are auto-warmed, and then it is swapped
with the live searcher. It may be a bit more complicated under the hood, but
that's pretty much how it works.

Considering that after a commit and during auto-warming, another searcher
might have been created, which will have another field cache for each field
you are sorting on, you'll need double the memory. The number of searchers
can be controlled through the maxWarmingSearchers parameter in
solrconfig.xml; see the example below.
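
For example, in solrconfig.xml:

<!-- cap the number of searchers that may warm in the background at once -->
<maxWarmingSearchers>2</maxWarmingSearchers>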

-- 
Regards,
Shalin Shekhar Mangar.


Re: Analyzing CSV phrase fields

2008-11-25 Thread Yonik Seeley
The easiest solution would be to create the documents you send to Solr
with multiple keywords fields... they will be separated by a
positionIncrement, so a phrase query won't see "yankees" adjacent to
"cleveland".
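
A sketch of that approach, assuming the keywords field is declared
multiValued on a type with a positionIncrementGap (field and type names are
taken from the mail below, so treat them as placeholders):

<field name="keywords" type="textCSV" indexed="true" stored="true"
       multiValued="true"/>

<!-- each comma-separated value becomes its own field instance -->
<add>
  <doc>
    <field name="id">2</field>
    <field name="title">Baseball Teams</field>
    <field name="keywords">philadelphia phillies</field>
    <field name="keywords">new york yankees</field>
    <field name="keywords">cleveland indians</field>
  </doc>
</add>

With a gap of 100 between values, the phrase "yankees cleveland" can no
longer match across two adjacent values.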

If you can't do that, then perhaps patch the PatternTokenizer to
put a larger positionIncrement between groups. Then you would need to
follow it with another filter that tokenizes on whitespace or some other
regex (which we currently don't have).

-Yonik

On Tue, Nov 25, 2008 at 2:10 AM, Neal Richter [EMAIL PROTECTED] wrote:
 Hey all,

 Very basic question... I want to index fields of comma-separated values:

 Example document:
 id: 1
 title: Football Teams
 keywords: philadelphia eagles, cleveland browns, new york jets

 id: 2
 title: Baseball Teams
 keywords:philadelphia phillies, new york yankees, cleveland indians

 A query of 'new york' should return the obvious documents, but a quoted
 phrase query of "yankees cleveland" should return nothing... meaning that a
 comma breaks phrases without fail.

 I've created a "textCSV" type in the schema.xml file and used the
 PatternTokenizerFactory to split on commas; from there, analysis can
 proceed as normal via StopFilterFactory, LowerCaseFilter, and
 RemoveDuplicatesTokenFilter:

 <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"
 group="-1"/>

 Has anyone done this before? Can I somehow use an existing analyzer (or a
 combination of analyzers)? It seems as though I need to create a
 PhraseDelimiterFilter modeled on the WordDelimiterFilter... though I am sure
 there is a way to make an existing analyzer break things up the way I want.

 Thanks - Neal Richter



Re: CachedSqlEntityProcessor's purpose

2008-11-25 Thread Shalin Shekhar Mangar
On Tue, Nov 25, 2008 at 1:52 PM, Amit Nithian [EMAIL PROTECTED] wrote:


 I like the concept of having multiple entity blocks for clarity, but why
 wouldn't I have (for DB efficiency) the following as one entity's SQL
 statement: "select * from X,Y where x.id=y.xid", with two fields pointing
 at X and Y columns?


You can certainly do that. However, it is a problem when you need field X or
Y to be multi-valued. You'd get repeated rows for that query, and
DataImportHandler will have no way to figure out what to put where. In the
nested-entities approach, multiple values come from a nested entity, which
DataImportHandler can easily represent as a list. If you do not have
multi-valued fields then you can go for that approach.


 My main question, though, is how the
 CachedSqlEntityProcessor helps in this case, as I want to use the multiple
 entity blocks for cleanliness. If I have 500,000 X records, how many SQL
 queries in the second entity block (y) would get executed, 500,000?


For each row fetched from the parent entity, the query for its nested entity
is executed after replacing the variables with known values. When the nested
entity has few records in the database, it is more efficient to use
CachedSqlEntityProcessor, which executes the query only once and keeps all
the returned rows in memory. After that, for each row returned by the parent
entity, the nested entity does a lookup in the cache, which is quite fast.
Since all rows are stored in memory, you trade memory for the number of
queries to the DB when you use CachedSqlEntityProcessor.

http://wiki.apache.org/solr/DataImportHandler#head-4465e39677ec06e4b14fd6a574434bac6e4d01e1


-- 
Regards,
Shalin Shekhar Mangar.


Re: Using Solr for indexing emails

2008-11-25 Thread Shalin Shekhar Mangar
On Mon, Nov 24, 2008 at 11:51 PM, Timo Sirainen [EMAIL PROTECTED] wrote:


 DIH seems to be about Solr pulling data into it from an external source.
 That's not really practical with Dovecot since there's no central
 repository of any kind of data, so there's no way to know what has
 changed since last pull.


Isn't your IMAP server the external data source? DIH can consume from any
data store. Tools for consuming from databases and files have been written.
I think it is possible to write one which consumes from IMAP.

-- 
Regards,
Shalin Shekhar Mangar.


Re: port of Nutch CommonGrams to Solr for help with slow phrase queries

2008-11-25 Thread Shalin Shekhar Mangar
Hi Tom,
I don't think anybody has worked on adding this to Solr yet. Do you mind
opening a jira issue?

On Tue, Nov 25, 2008 at 12:01 AM, Burton-West, Tom [EMAIL PROTECTED] wrote:

 Hello all,

 We are having problems with extremely slow phrase queries when the
 phrase query contains common words. We are reluctant to just use stop
 words due to various problems with false hits and some things becoming
 impossible to search with stop words turned on. (For example: "to be or
 not to be", "the who", "man in the moon" vs. "man on the moon", etc.)

 The approach to this problem used by Nutch looks promising. Has anyone
 ported the Nutch CommonGrams filter to Solr?

 "Construct n-grams for frequently occurring terms and phrases while
 indexing. Optimize phrase queries to use the n-grams. Single terms are
 still indexed too, with n-grams overlaid."
 http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html


 Tom

 Tom Burton-West
 Information Retrieval Programmer
 Digital Library Production Services
 University of Michigan Library




-- 
Regards,
Shalin Shekhar Mangar.


Re: CachedSqlEntityProcessor's purpose

2008-11-25 Thread Noble Paul നോബിള്‍ नोब्ळ्
Every row emitted by an outer entity results in a new SQL query on the
inner entity (yes, 500,000 queries on the inner entity). So, if you wish to
join multiple tables, then nested entities are the way to go.

CachedSqlEntityProcessor is meant to help you reduce the number of
queries fired on sub-entities.

If you get the entire table in one query (by using "select * from y")
and use a separate where attribute, the entire set of rows in y gets
loaded into RAM.

If you use it without the where attribute, it still ends up loading the
entire table into memory (it is an unbounded cache). It can easily
give you an OOM.

Do not use CachedSqlEntityProcessor just for tidying up. Use it if you
wish to save time and you have a lot of RAM.
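
For example, a sketch following the DataImportHandler wiki (table and column
names from this thread):

<entity name="y" query="select * from y"
        where="xid=x.id"
        processor="CachedSqlEntityProcessor"/>

Here "select * from y" runs exactly once; each outer row's x.id is then
looked up against the cached xid keys instead of hitting the database.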


On Tue, Nov 25, 2008 at 1:52 PM, Amit Nithian [EMAIL PROTECTED] wrote:
  ...




-- 
--Noble Paul


RE: Sorting and JVM heap size ....

2008-11-25 Thread souravm
Hi Shalin,

Thanks for the clarifications.

Could you please explain a bit more on how the new searcher can double the
memory?

Based on your explanation, when a new set of documents gets committed, a new
searcher is created. So what I understand is that this situation can occur
only when an update/delete query and a search query run in parallel.

Also, I am assuming that, like commit, optimization also happens only during
update/delete queries.

Regards,
Sourav


From: Shalin Shekhar Mangar [EMAIL PROTECTED]
Sent: Tuesday, November 25, 2008 6:40 AM
To: solr-user@lucene.apache.org
Cc: souravm
Subject: Re: Sorting and JVM heap size 

...



matching exact terms

2008-11-25 Thread Brian Whitman
This is probably severe user error, but I am curious about how to index docs
to make this query work:
"happy birthday"

to return the doc with n_name:"Happy Birthday" before the doc with
n_name:"Happy Birthday, Happy Birthday". As it is now, the latter appears
first for a query of n_name:"happy birthday", the former second.

It would be great to do this at query time instead of having to re-index,
but I will if I have to!

The n_* type is defined as:

<fieldtype name="name" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>


Re: Sorting and JVM heap size ....

2008-11-25 Thread Shalin Shekhar Mangar
On Tue, Nov 25, 2008 at 9:37 PM, souravm [EMAIL PROTECTED] wrote:


 Could you please explain a bit more on how the new searcher can double the
 memory ?


Take a look at slide 13 of Yonik's presentation available at
http://people.apache.org/~yonik/ApacheConEU2006/Solr.ppt

Each searcher in Solr maintains various caches for performance reasons. When
a new one is created, its caches are empty. If one exposed this searcher to
live requests, response times could be very long because a lot of disk
accesses may be needed. Therefore, Solr warms the new searcher's caches by
re-executing queries whose results had been cached on the old searcher's
cache. If you sort on fields, then the new searcher will create its own
FieldCache for each field you sort on. At that time, both the old and the new
searcher will have their field caches; see the example below.
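
For reference, one way to pre-populate the FieldCache for a sort field is a
static warming query registered for the newSearcher event in solrconfig.xml
(the sort field name here is hypothetical):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- executing a sorted query forces the new searcher to build its FieldCache -->
    <lst>
      <str name="q">*:*</str>
      <str name="sort">some_date_field desc</str>
    </lst>
  </arr>
</listener>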


 Based on your explanation, when a new set of documents gets committed a new
 searcher is created. So what I understand is whenever a update/delete query
 and search query run in parallel then only this type of situation may occur.


Not during updates/deletes, but when you issue an commit or optimize
command.



 Also, I am assuming that, like commit, optimization also happens only during
 update/delete queries.


Commit and optimize have to be called by you explicitly.

-- 
Regards,
Shalin Shekhar Mangar.


Re: matching exact terms

2008-11-25 Thread Ryan McKinley


On Nov 25, 2008, at 11:40 AM, Brian Whitman wrote:

...


Hi Brian!

what is the explain text when you turn on debugQuery=true?

With the indexing scheme you have, "happy birthday, happy birthday"
will match four terms while "happy birthday" matches only two.


Two options come to mind (sorry, both require reindexing):

1. Add the remove-duplicates filter. This would have both documents
match only two terms, and the fieldNorm should boost the shorter field
above the longer one. However, removing the duplicates may make some
other queries less relevant.


2. Add a copyField and index the name as a string or something without
tokenization (use the KeywordTokenizerFactory), then query on both
fields (dismax) and boost an exact match over a text match:

  name_with_tokens^1 name_no_tokens^3 (or something like that)
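
A rough sketch of option 2, with hypothetical names (the KeywordTokenizer
keeps the whole name as a single token; the LowerCaseFilter keeps the exact
match case-insensitive):

<fieldtype name="name_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

<field name="n_name_exact" type="name_exact" indexed="true" stored="false"/>
<copyField source="n_name" dest="n_name_exact"/>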


ryan





Re: CachedSqlEntityProcessor's purpose

2008-11-25 Thread Amit Nithian
Thanks for the responses. A few follow-ups:
1) It seems that the CachedSqlEntityProcessor performs the where clause in
memory on the cache. Is this cache an in-memory RDBMS, or maps?
2) In the example, there were two use cases: one that is like query="select
* from Y where xid=${x.id}" and another where it's query="select * from Y"
with where="xid=x.id". Is there any difference in how CachedSqlEntityProcessor
behaves? Does it know to strip off the WHERE clause and simply cache the
"select * from Y"?

What are some dataset sizes that have been tested using this framework and
what are some performance metrics?

Thanks again
Amit

On Tue, Nov 25, 2008 at 7:32 AM, Noble Paul നോബിള്‍ नोब्ळ् 
[EMAIL PROTECTED] wrote:

 ...



 --
 --Noble Paul



Re: [VOTE] Community Logo Preferences

2008-11-25 Thread Chris Harris
https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394475/solr2_maho-vote.png


Re: [VOTE] Community Logo Preferences

2008-11-25 Thread Brendan Grainger

https://issues.apache.org/jira/secure/attachment/12394264/apache_solr_a_red.jpg
https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394266/apache_solr_b_red.jpg
https://issues.apache.org/jira/secure/attachment/12394353/solr.s5.jpg

On Nov 25, 2008, at 9:05 AM, Marcus Stratmann wrote:


https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394376/solr_sp.png
https://issues.apache.org/jira/secure/attachment/12393936/logo_remake.jpg
https://issues.apache.org/jira/secure/attachment/12394264/apache_solr_a_red.jpg
https://issues.apache.org/jira/secure/attachment/12394353/solr.s5.jpg




Re: [VOTE] Community Logo Preferences

2008-11-25 Thread Thomas Dowling
https://issues.apache.org/jira/secure/attachment/12394266/apache_solr_b_red.jpg
https://issues.apache.org/jira/secure/attachment/12394314/apache_soir_001.jpg
https://issues.apache.org/jira/secure/attachment/12394264/apache_solr_a_red.jpg


newbie question on SOLR distributed searches with many shards

2008-11-25 Thread Gerald De Conto
I wasn't able to find examples/anything via google so thought I'd ask:

 

Say I want to implement a solution using distributed searches with many
shards in SOLR 1.3.0. Also, say there are too many shards to pass in
via the URL (dozens, hundreds, whatever)

 

Is there a way to specify in solrconfig.xml (or elsewhere) a list of the
shard URLs to use?

 

I saw references to a shards.txt but no info on it.  I also saw bits of
info that suggested that there MIGHT be another way to do this.

 

Any info appreciated on doing this sort of distributed search.

 

thx



Keyword extraction

2008-11-25 Thread Plaatje, Patrick
Hi all,

Struggling with a question I recently got from a colleague: is it possible
to extract keywords from indexed content?

In my opinion it should be possible to find out on which words the
ranking of the indexed content is the highest (Lucene or Solr), but I have
no clue where to begin. Anyone have suggestions?

Best,

Patrick


Re: Keyword extraction

2008-11-25 Thread Ryan McKinley

There are lots of approaches out there...

The easiest off-the-shelf method would be to use the
MoreLikeThisHandler and get the top interesting terms:


http://wiki.apache.org/solr/MoreLikeThisHandler
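
For example, a request along these lines returns the terms the handler found
most interesting for a given document (this assumes the handler is mapped to
/mlt in solrconfig.xml, and the field names are placeholders):

http://localhost:8983/solr/mlt?q=id:1&mlt.fl=content&mlt.interestingTerms=details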

ryan


On Nov 25, 2008, at 2:09 PM, Plaatje, Patrick wrote:


 ...




Re: Using Solr for indexing emails

2008-11-25 Thread Timo Sirainen
On Tue, 2008-11-25 at 20:45 +0530, Shalin Shekhar Mangar wrote:
 On Mon, Nov 24, 2008 at 11:51 PM, Timo Sirainen [EMAIL PROTECTED] wrote:
 
 
  DIH seems to be about Solr pulling data into it from an external source.
  That's not really practical with Dovecot since there's no central
  repository of any kind of data, so there's no way to know what has
  changed since last pull.
 
 
 Isn't your IMAP server the external data source? DIH can consume from any
 data store. Tools for consuming from databases and files have been written.
 I think it is possible to write one which consumes from IMAP.

Yes, but that would require going through all users' mailboxes to
find out which ones have new non-indexed messages. The data isn't stored
in any centralized database that would allow quickly returning all
non-indexed messages. Instead, for each mailbox it would have to (at
minimum) open and read two files. That won't really scale for large
installations with a huge number of mailboxes.

(At some point I probably am going to implement something that allows
finding everyone's all new messages more easily so that I can
implement replication support, but for now that kind of a change would
be way too much work.)




Spellcheck for phrase queries

2008-11-25 Thread Manepalli, Kalyan
Hi,

I am trying to implement spell-check functionality on a
particular field. I need to do a complete phrase spell check when the user
enters multiple words.

For example: if the user enters "great Hyat", the current implementation
would suggest "great Hyatt", just correcting the word "Hyatt". But there will
not be any record for this suggestion.

How do I implement a complete phrase spell check, so that it suggests
"grand Hyatt" instead of "great Hyatt"?

 

Any suggestions in this regard will be helpful

 

Thanks,

Kalyan Manepalli

 



Stuck threads on Weblogic

2008-11-25 Thread Alexander Ramos Jardim
Hello guys,

I am getting some stuck threads on my application when it connects to Solr.
The stuck threads occur at regular intervals, in such a way that every 3 days
the app is online it hangs up the entire cluster.

I don't know if there's any direct relation to Solr, but I get the following
exception on some sparse connections the application makes to Solr.

Is there any known bug about Solr writing wrong responses?

<Nov 25, 2008 6:14:35 PM BRST> <Error> <HTTP> <localhost> <cluster0>
<[ACTIVE] ExecuteThread: '1' for queue: 'weblogic.kernel.Default
(self-tuning)'> <<WLS Kernel>> <> <> <1227644075142> <BEA-101083>
<Connection failure.>
java.net.ProtocolException: Didn't meet stated Content-Length, wrote: '259'
bytes instead of stated: '258' bytes.
at weblogic.servlet.internal.ServletOutputStreamImpl.ensureContentLength(ServletOutputStreamImpl.java:410)
at
weblogic.servlet.internal.ServletResponseImpl.ensureContentLength(ServletResponseImpl.java:1358)
at
weblogic.servlet.internal.ServletResponseImpl.send(ServletResponseImpl.java:1400)
at
weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1375)
at weblogic.work.ExecuteThread.execute(ExecuteThread.java:200)
at weblogic.work.ExecuteThread.run(ExecuteThread.java:172)


-- 
Alexander Ramos Jardim


Re: Unknown field error using JDBC

2008-11-25 Thread Jon Baer
This sounds exactly like the same issue I had when going from 1.3 to 1.4...
it sounds like DIH is trying to automagically figure out the columns :-\


- Jon

On Nov 25, 2008, at 6:37 AM, Joel Karlsson wrote:


 ...




Re: port of Nutch CommonGrams to Solr for help with slow phrase queries

2008-11-25 Thread Norberto Meijome
On Mon, 24 Nov 2008 13:31:39 -0500
Burton-West, Tom [EMAIL PROTECTED] wrote:

 The approach to this problem used by Nutch looks promising.  Has anyone
 ported the Nutch CommonGrams filter to Solr?
 
 Construct n-grams for frequently occuring terms and phrases while
 indexing. Optimize phrase queries to use the n-grams. Single terms are
 still indexed too, with n-grams overlaid.
 http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html

Tom,
I haven't used Nutch's implementation, but I used the current (1.3)
implementation of n-grams and shingles to address exactly the same issue (a
database of music albums and tracks).
We didn't notice any severe performance hit, but:
- the data set isn't huge (ca. 1 MM docs).
- we reindex nightly via DIH from MS-SQL, so we can use a separate cache layer
to lower the number of hits to SOLR.
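
For reference, the shingle side of that is just one more filter in the
analyzer chain, something like this (assuming solr.ShingleFilterFactory is
available in your build):

<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- emit two-word shingles alongside the single terms -->
  <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
          outputUnigrams="true"/>
</analyzer>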

B
_
{Beto|Norberto|Numard} Meijome

Truth has no special time of its own.  Its hour is now -- always.
   Albert Schweitzer

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


Re: port of Nutch CommonGrams to Solr for help with slow phrase queries

2008-11-25 Thread Norberto Meijome
On Wed, 26 Nov 2008 10:08:03 +1100
Norberto Meijome [EMAIL PROTECTED] wrote:

 We didn't notice any severe performance hit but :
 - data set isn't huge ( ca 1 MM docs).
 - reindexed nightly via DIH from MS-SQL, so we can use a separate cache layer
 to lower the number of hits to SOLR.

To make this clear: there was a noticeable hit when we removed stop words, but
the nature of the beast forced our hand.

b

_
{Beto|Norberto|Numard} Meijome

Peace can only be achieved by understanding.
   Albert Einstein

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


Increased garbage with Solr 1.3?

2008-11-25 Thread Walter Underwood
We are moving from Solr 1.1 to 1.3, and have noticed that 1.3 is working
the garbage collector a lot more. Has anyone else seen this?

wunder



Re: Increased garbage with Solr 1.3?

2008-11-25 Thread Yonik Seeley
On Tue, Nov 25, 2008 at 7:56 PM, Walter Underwood
[EMAIL PROTECTED] wrote:
 We are moving from Solr 1.1 to 1.3, and have noticed that 1.3 is working
 the garbage collector a lot more. Has anyone else seen this?

During indexing or searching?
Indexing uses the SolrDocument class as an intermediate form, so that
would cause some greater GC there (actually, there have been a ton of
indexing related changes in Lucene too).  Not too much comes to mind
for searching though.

-Yonik


Re: Increased garbage with Solr 1.3?

2008-11-25 Thread Walter Underwood
Searching. No facets, but fuzzy matching. --wunder

On 11/25/08 5:08 PM, Yonik Seeley [EMAIL PROTECTED] wrote:

 On Tue, Nov 25, 2008 at 7:56 PM, Walter Underwood
 [EMAIL PROTECTED] wrote:
 We are moving from Solr 1.1 to 1.3, and have noticed that 1.3 is working
 the garbage collector a lot more. Has anyone else seen this?
 
 During indexing or searching?
 Indexing uses the SolrDocument class as an intermediate form, so that
 would cause some greater GC there (actually, there have been a ton of
 indexing related changes in Lucene too).  Not too much comes to mind
 for searching though.
 
 -Yonik



copyField stored values question

2008-11-25 Thread Michael Henson
Hello,

 

I am using copyField to send the raw name of an entity into different
fields for indexing:

 

# schema.xml snippet

<field name="raw_name" type="string" indexed="false" stored="true" />
<field name="indexed_name" type="some_custom_type" indexed="true" stored="true" />
<field name="other_indexed_name" type="some_other_type" indexed="true" stored="true" />

<copyField source="raw_name" dest="indexed_name" />
<copyField source="raw_name" dest="other_indexed_name" />

 

 

I set the indexed fields to be stored so that I could see exactly what
my custom types' filters produce. The Analyzer utility in the Admin
webapp seems to apply the filters properly. However, query results
against this index return the original raw_name value for both of the
indexed fields.

 

Is it the expected behavior that copyField targets with stored=true
always store the source value they were given?

 

If so, is there any way to store the post-filtered target value instead?

 

Thanks,

Michael Henson

[EMAIL PROTECTED]

 
   
   
   


Re: copyField stored values question

2008-11-25 Thread Yonik Seeley
On Tue, Nov 25, 2008 at 9:24 PM, Michael Henson
[EMAIL PROTECTED] wrote:
 I set the indexed fields to be stored so that I could see what exactly
 my custom types' filters produce. In the Analyzer utility in the Admin
 webapp seems to apply the filters properly. However, query results
 against this index return the original raw_name value for both of the
 indexed fields.

Stored fields are never modified.  The output from analyzers is used
for indexing purposes only.

-Yonik


Re: CachedSqlEntityProcessor's purpose

2008-11-25 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Tue, Nov 25, 2008 at 11:35 PM, Amit Nithian [EMAIL PROTECTED] wrote:
 Thanks for the responses. A few follow-ups:
 1) It seems that the CachedSqlEntityProcessor performs the where clause in
 memory on the cache. Is this cache an in-memory RDBMS, or maps?
It is a HashMap in memory.
 2) In the example, there were two use cases: one that is like query="select
 * from Y where xid=${x.id}" and another where it's query="select * from Y"
 with where="xid=x.id". Is there any difference in how CachedSqlEntityProcessor
 behaves? Does it know to strip off the WHERE clause and simply cache the
 "select * from Y"?
It fetches all the rows using the 'query' first.

The where="xid=x.id" (note: no ${} here) is evaluated against the map: all
the xid values are kept as keys. Then, for each row emitted by the parent
entity, the value of '${x.id}' is evaluated and looked up in the map instead
of the database.

 What are some dataset sizes that have been tested using this framework and
 what are some performance metrics?

 Thanks again
 Amit

 On Tue, Nov 25, 2008 at 7:32 AM, Noble Paul നോബിള്‍ नोब्ळ् 
 [EMAIL PROTECTED] wrote:

 ...



 --
 --Noble Paul





-- 
--Noble Paul


Re: newbie question on SOLR distributed searches with many shards

2008-11-25 Thread Noble Paul നോബിള്‍ नोब्ळ्
Anything that is passed as a request parameter can be put into the
SearchHandler's defaults or invariants section; see the example below.

This is equivalent to passing the shards URL in the request.

However, note that you may need to set up a load balancer if a shard has
more than one host.

On Wed, Nov 26, 2008 at 12:25 AM, Gerald De Conto
[EMAIL PROTECTED] wrote:
 ...





-- 
--Noble Paul


Facet Query and Query

2008-11-25 Thread Jae Joo

 I am having some trouble utilizing the facet query (fq). As I understand it,
 the facet query has better performance than a simple query (q).
 Here is the example.


 http://localhost:8080/test_solr/select?q=*:*&facet=true&fq=state:CA&facet.mincount=1&facet.field=city&facet.field=sector&facet.limit=-1&sort=score+desc

 -- facet by sector and city for the state of CA.
 Any idea how to optimize this query to avoid q=*:*?

 Thanks,

 Jae