date:20110911

Re: Full-search index for the database

2011-09-11 Thread Jamie Johnson

You should create separate fields in your solr schema for each field
in your database that you want recognized separately.  You can use a
query parser like edismax to do a weighted query across all of your
fields and then provide highlighting on the specific field which
matched.
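
A minimal sketch of the request this describes, assuming two hypothetical schema fields named `title` and `description` (they stand in for the poster's real database columns) and arbitrary boosts. It only builds the parameters for a weighted edismax query with per-field highlighting; the handler path is also an assumption.

```python
from urllib.parse import urlencode


def build_edismax_query(user_text, weighted_fields):
    """Build Solr parameters for a weighted edismax search with highlighting."""
    # qf takes "field^boost" pairs separated by spaces.
    qf = " ".join(f"{name}^{boost}" for name, boost in weighted_fields.items())
    return {
        "q": user_text,
        "defType": "edismax",
        "qf": qf,
        "hl": "true",
        # Highlight each source field separately, so the response shows
        # which field actually contained the match.
        "hl.fl": ",".join(weighted_fields),
    }


# Hypothetical field names and boosts, for illustration only.
params = build_edismax_query("Test", {"title": 2, "description": 1})
print("/solr/select?" + urlencode(params))
```

With this shape, the highlighting section of the response is keyed by the individual field names rather than by a single catch-all field.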

2011/9/10 Eugeny Balakhonov c0f...@gmail.com:
 I want to create a full-text search for my database.

 This means the search engine should look up a string across all fields of my
 database.

 I have created a Solr configuration for extracting and indexing data from the
 database.





 According to the documentation in schema.xml, I have created a field for the
 full-text search index:

 <field name="TEXT" type="..." indexed="true" stored="true"
 multiValued="true"/>



 I have also added lines that copy the values of all fields into this
 full-text search field:

 ...

    <copyField source="..." dest="TEXT"/>

 ...



 As a result I can search across all fields in my database. But I
 can't tell which field in the found record contains the requested string.

 The highlighting functionality just marks the string in the TEXT field, like
 the following:



 <lst name="highlighting">
  <lst name="431046.431344...8473633">
   <arr name="TEXT">
    <str>Any text any text <em>Test</em></str>
   </arr>
  </lst>
  <lst name="431046.431231...8476393">
   <arr name="TEXT">
    <str>Any text any text <em>Test</em></str>
   </arr>
  </lst>
 </lst>



 How can I create a full-text search index that still lets me identify the
 source database field?



 Thx a lot.

 Eugeny

Re: indexing data from rich documents - Tika with solr3.1

2011-09-11 Thread scorpking

Oh, that works for me. Thanks very much, Erik Hatcher-4. I have managed to
index from https.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/indexing-data-from-rich-documents-Tika-with-solr3-1-tp3322555p3326971.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Using multivalued field in map function

2011-09-11 Thread Erick Erickson

Hmmm, would it be simpler to append a clause like this?
BloggerId:12304^10 OR CoBloggerId:12304^5
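
A rough sketch of how such a clause could be generated and attached, assuming the co-writer ids live in a field called `CoBloggerId` (the name used in the suggestion above; the original post calls it SubIds) and that the boosts 10 and 5 are just starting values. The main query is marked required with `+` and the boost clause is left optional, so non-matching blogs still appear, only lower down.

```python
def boost_clause(blogger_id, owner_field="BloggerId",
                 cowriter_field="CoBloggerId",
                 owner_boost=10, cowriter_boost=5):
    """Build an optional clause that favours one blogger as owner or co-writer."""
    return (f"{owner_field}:{blogger_id}^{owner_boost} OR "
            f"{cowriter_field}:{blogger_id}^{cowriter_boost}")


def with_boost(main_query, blogger_id):
    """Require the main query and add the boost clause as an optional part."""
    return f"+({main_query}) ({boost_clause(blogger_id)})"


print(with_boost("BlogCategory:internet OR BlogCategory:sports", 12304))
```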

Best
Erick

On Fri, Sep 9, 2011 at 2:14 AM, tkamphuis tom_m...@hotmail.com wrote:
Well, I'd like to do the following:

I've got a website full of blog posts, and every blog post has an owner, who
is referred to by his/her id. For example: BloggerId = 123.
It's also possible that the blog has multiple co-writers, who are also
referred to by their BloggerId, but these ids are stored in the multivalued
field, in my previous example SubIds.

When searching for a specific blogger one searches the BloggerId.
Search results are influenced by a number of variables: the
country/state/more specific geographical data, the blog category, etc. For this
I use a faceted query. Next I want to make some results more important,
depending on the BloggerId. I tried to do this with the following query:

?q={!func}map(sum(map(BloggerId,12304,12304,2,0),map(BloggerId,12304,12304,1,0)),3,3,2)&fl=*,score&facet.field=Country&f.Country.facet.limit=6&facet.field=State&fq=(BlogCategory:internet%20OR%20BlogCategory:sports&sort=score%20desc,Top%20desc,%20SortPriority%20asc&start=0&omitHeader=true

In the resulting list, blogs written by BloggerId 12304 should be on top of
the list, followed by the blogs where BloggerId 12304 was co-writer. After
that, all other blogs that follow the criteria but aren't written (or
co-written) by BloggerId 12304.

Any ideas? Thanks!

--
View this message in context:
http://lucene.472066.n3.nabble.com/Using-multivalued-field-in-map-function-tp3318843p3322023.html

Re: NRT and commit behavior

2011-09-11 Thread Erick Erickson

Hmm, OK. You might want to look at the non-cached filter query stuff,
it's quite recent. The point here is that it is a filter that is applied
only after all of the less expensive filter queries are run. One of its
uses is exactly ACL calculations. Rather than calculate the ACL for the
entire doc set, it only calculates access for docs that have made it past
all the other elements of the query. See SOLR-2429 and note that it is
3.4 only (currently being released).
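
As a hedged illustration of the non-cached filter syntax: the sketch below prefixes a filter query with the `cache=false` and `cost` local params. The field names `acl` and `type`, the group names, and the cost value are made up for the example. Note that running a filter strictly after the others, on surviving documents only, is my understanding of what happens for query types that support post-filtering with a cost of 100 or more; for an ordinary query the cost mainly affects ordering.

```python
from urllib.parse import urlencode


def non_cached_filter(filter_query, cost=None):
    """Prefix a filter query with local params telling Solr not to cache it."""
    local_params = "cache=false"
    if cost is not None:
        # A higher cost asks Solr to evaluate this filter later than cheaper ones.
        local_params += f" cost={cost}"
    return "{!" + local_params + "}" + filter_query


# Hypothetical fields: a cheap cached filter plus an expensive ACL filter.
params = [
    ("q", "report"),
    ("fq", "type:document"),
    ("fq", non_cached_filter("acl:(group1 OR group2)", cost=200)),
]
print(urlencode(params))
```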

As to why your commits are taking so long, I have no idea, given that
you really haven't given us much to work with.

How big is your index? Are you optimizing? Have you profiled the application to
see what the bottleneck is (I/O, CPU, etc.)? What else is running on your
machine? It's quite surprising that it takes that long. How much memory are you
giving the JVM? etc...

You might want to review: http://wiki.apache.org/solr/UsingMailingLists

Best
Erick


On Fri, Sep 9, 2011 at 9:41 AM, Tirthankar Chatterjee
tchatter...@commvault.com wrote:
 Erick,
 What you said is correct. For us, the searches are based on Active
 Directory permissions, which are populated in the filter query parameter. So
 we don't have any warming query concept, as we cannot fire one for every user
 ahead of time.

 What we do here is that when a user logs in we run an invalid query (one that
 returns no results, instead of '*') with the correct filter query (which is
 his permissions based on the login). This way the cache gets warmed up with
 valid docs.

 It works then.


 Also, can you please let me know why a commit is taking 45 minutes to 1 hour
 on well-resourced hardware with multiple processors, 16 GB RAM, a 64-bit VM,
 etc.? We tried passing waitSearcher as false and found that inside the code
 it is hard-coded to true. Is there any specific reason? Can we change that
 value to honor what is being passed?

 Thanks,
 Tirthankar

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Thursday, September 01, 2011 8:38 AM
 To: solr-user@lucene.apache.org
 Subject: Re: NRT and commit behavior

 Hmm, I'm guessing a bit here, but using an invalid query doesn't sound very 
 safe, but I suppose it *might* be OK.

 What does invalid mean? Syntax error? Not safe.

 A search that returns 0 results? I don't know, but I'd guess that filling your
 caches, which is the point of warming queries, might be short-circuited if
 the query returns 0 results, but I don't know for sure.

 But the fact that invalid queries return quicker does not inspire 
 confidence since the *point* of warming queries is to spend the time up front 
 so your users don't have to wait.

 So here's a test. Comment out your warming queries.
 Restart your server and fire the warming query from the browser
 with &debugQuery=on and look at the QTime parameter.

 Now fire the same form of the query (as in the same sort, facet, grouping, 
 etc, but presumably a valid term). See the QTime.

 Now fire the same form of the query with a *different* value in the query. 
 That is, it should search on different terms but with the same sort, facet, 
 etc. to avoid getting your data straight from the queryResultCache.

 My guess is that the last query will return much more quickly than the second 
 query. Which would indicate that the first form isn't doing you any good.

 But a test is worth a thousand opinions.

 Best
 Erick

 On Wed, Aug 31, 2011 at 11:04 AM, Tirthankar Chatterjee 
 tchatter...@commvault.com wrote:
 Also noticed that the waitSearcher parameter value is not honored inside
 commit. It always defaults to true, which makes indexing slow.

 What we are trying to do is use an invalid query (which won't return any
 results) as a warming query. This way the commit returns faster. Are we
 doing something wrong here?

 Thanks,
 Tirthankar

 -Original Message-
 From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
 Sent: Monday, July 18, 2011 11:38 AM
 To: solr-user@lucene.apache.org; yo...@lucidimagination.com
 Subject: Re: NRT and commit behavior

 In practice, in my experience at least, a very 'expensive' commit can
 still slow down searches significantly, I think just due to CPU (or
 i/o?) starvation. Not sure anything can be done about that.  That's my 
 experience in Solr 1.4.1, but since searches have always been async with 
 commits, it probably is the same situation even in more recent versions, I'd 
 guess.

 On 7/18/2011 11:07 AM, Yonik Seeley wrote:
 On Mon, Jul 18, 2011 at 10:53 AM, Nicholas Chase nch...@earthlink.net
 wrote:
 Very glad to hear that NRT is finally here!  But my question is this:
 will things still come to a standstill during a commit?
 New updates can now proceed in parallel with a commit, and searches
 have always been completely asynchronous w.r.t. commits.

 -Yonik
 http://www.lucidimagination.com

Solr messing up the UK GBP (pound) symbol in response, even though the Java file encoding is set to UTF-8

2011-09-11 Thread Ravish Bhagdev

Any idea why solr is unable to return the pound sign as-is?

I tried typing in £ 1 million in the Solr admin GUI and got the following response.

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">5</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">Â£ 1 million</str>
<str name="rows">10</str>
<str name="version">2.2</str>
</lst>
</lst>
<result name="response" numFound="0" start="0"/>
</response>
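
The extra character in the echoed query is the pattern produced when UTF-8 bytes are decoded as ISO-8859-1. The sketch below reproduces that effect in isolation. It does not show where in this particular setup the wrong decoding happens; a servlet container decoding the request URL as ISO-8859-1 is one plausible candidate (with Tomcat, the connector's URIEncoding attribute governs this), but that is an assumption, not something established in this thread.

```python
# The text as typed into the admin form.
pound = "£ 1 million"

# A browser would typically send it as UTF-8: the pound sign becomes two bytes.
utf8_bytes = pound.encode("utf-8")

# Decoding those bytes with the wrong charset turns each byte into one character.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)

# Reversing the wrong step recovers the original, which confirms the diagnosis.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired)
```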

Here are my Java properties, also taken from the admin interface:

java.runtime.name = Java(TM) SE Runtime Environment
sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64
java.vm.version = 20.1-b02
solr.data.dir = target/solr_data
shared.loader =
java.vm.vendor = Sun Microsystems Inc.
java.vendor.url = http://java.sun.com/
path.separator = :
java.vm.name = Java HotSpot(TM) 64-Bit Server VM
tomcat.util.buf.StringCache.byte.enabled = true
file.encoding.pkg = sun.io
user.country = GB
sun.java.launcher = SUN_STANDARD
sun.os.patch.level = unknown
java.vm.specification.name = Java Virtual Machine Specification
user.dir = /home/rbhagdev/SCCRepos/SCC_Platform/search/solr
java.runtime.version = 1.6.0_26-b03
java.awt.graphicsenv = sun.awt.X11GraphicsEnvironment
java.endorsed.dirs = /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/endorsed
os.arch = amd64
java.io.tmpdir = /tmp
line.separator =

java.vm.specification.vendor = Sun Microsystems Inc.
java.naming.factory.url.pkgs = org.apache.naming
os.name = Linux
classworlds.conf = /usr/share/maven2/bin/m2.conf
sun.jnu.encoding = UTF-8
java.library.path =
/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64/server:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
java.specification.name = Java Platform API Specification
java.class.version = 50.0
sun.management.compiler = HotSpot 64-Bit Tiered Compilers
os.version = 2.6.38-11-generic
user.home = /home/rbhagdev
user.timezone = Europe/London
catalina.useNaming = true
java.awt.printerjob = sun.print.PSPrinterJob
java.specification.version = 1.6
file.encoding = UTF-8
solr.solr.home = src/test/resources/solr_home
catalina.home =
/home/rbhagdev/SCCRepos/SCC_Platform/search/solr/target/tomcat
user.name = rbhagdev
java.class.path = /usr/share/maven2/boot/classworlds.jar
java.naming.factory.initial = org.apache.naming.java.javaURLContextFactory
package.definition =
sun.,java.,org.apache.catalina.,org.apache.coyote.,org.apache.tomcat.,org.apache.jasper.
java.vm.specification.version = 1.0
sun.arch.data.model = 64
java.home = /usr/lib/jvm/java-6-sun-1.6.0.26/jre
sun.java.command = org.codehaus.classworlds.Launcher tomcat:run-war
java.specification.vendor = Sun Microsystems Inc.
user.language = en
java.vm.info = mixed mode
java.version = 1.6.0_26
java.ext.dirs =
/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/ext:/usr/java/packages/lib/ext
securerandom.source = file:/dev/./urandom
sun.boot.class.path =
/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/modules/jdk.boot.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/classes
java.vendor = Sun Microsystems Inc.
server.loader =
maven.home = /usr/share/maven2
catalina.base = /home/rbhagdev/SCCRepos/SCC_Platform/search/solr/target/tomcat
file.separator = /
java.vendor.url.bug = http://java.sun.com/cgi-bin/bugreport.cgi
common.loader = ${catalina.home}/lib,${catalina.home}/lib/*.jar
sun.cpu.endian = little
sun.io.unicode.encoding = UnicodeLittle
package.access =
sun.,org.apache.catalina.,org.apache.coyote.,org.apache.tomcat.,org.apache.jasper.,sun.beans.
sun.desktop = gnome
sun.cpu.isalist =

Thanks,

Ravish

Re: Running solr on small amounts of RAM

2011-09-11 Thread Erick Erickson

Well, this answer isn't much more satisfactory than "get more memory",
but about all I can say is "try it and see".

Sure, make your caches very small and monitor memory and test it out.

You'll get a sense of how fast (or slow) the queries are pretty quickly. Or
you can get a ballpark estimate of what running without caches would
do performance wise by simply measuring the first query after a restart.

Because, unfortunately, "it depends" is the only accurate answer. It
depends on how much sorting, faceting etc. you do as well as the
queries themselves.

Best
Erick

On Fri, Sep 9, 2011 at 12:48 PM, Mike Austin mike.aus...@juggle.com wrote:
 I'm trying to push to get Solr used in our environment. I know I could get
 responses asking WHY we can't get more RAM etc., but let's just skip those
 and work with this situation.

 Our index is very small, with 100k documents and a light load at the moment.
 If I wanted to use the smallest possible amount of RAM on the server, how
 would I do this and what are the issues?

 I know that caching would be the biggest loss, but if Solr ran with little to
 no caching, would the performance still be OK? I know this is a relative
 question.
 This is the only application using Java on this machine; would tuning Java
 to use less cache help anything?
 Should I set the cache settings low in the config?
 Basically, what will having a very low cache hit rate do to search speed and
 server performance?  I know more is better and it depends on what I'm
 comparing it to, but could you just answer in some way saying that it's
 not going to cripple the machine or cause 5-second searches?

 It's on a windows server.


 Thanks,
 Mike

Re: solr equivalent of select distinct

2011-09-11 Thread Erick Erickson

This smells like an XY problem, can you back up and give a higher-level
reason *why* you want this behavior?

Because given your problem description, this seems like you are getting
correct behavior no matter how you define the problem. You're essentially
saying that you have two records with identical beginnings of your PK,
why is it incorrect to give you both records?

But, anyway, if you're searching on FLD1 and FLD2, then by definition
you're going to get both records back or the search would be failing!

Best
Erick

On Fri, Sep 9, 2011 at 8:08 PM, Mark juszczec mark.juszc...@gmail.com wrote:
 Hello everyone

 Let's say each record in my index contains fields named PK, FLD1, FLD2, FLD3
 ... FLD100

 PK is my solr primary key and I'm creating it by concatenating
 FLD1+FLD2+FLD3 and I'm guaranteed that combination will be unique

 Let's say 2 of these records have FLD1 = A and FLD2 = B.  I am unsure about
 the remaining fields

 Right now, if I do a query specifying FLD1 = A and FLD2 = B then I get both
 records.  I only want 1.

 Research says I should use faceting.  But this:

 q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2 &
 facet=true & facet_field=FLD1 & facet_field=FLD2

 gives me 2 records.

 In fact, it gives me the same results as:

 q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2

 I'm wrong somewhere, but I'm unsure where.

 Is faceting the right way to go or should I be using grouping?

 Curiously, when I use grouping like this:

 q=FLD1:A and FLD2:B & rows=500 & defType=edismax & indent=true & fl=FLD1, FLD2 &
 group=true & group.field=FLD1 & group.field=FLD2

 I get 2 records as well.

 Has anyone dealt with mimicking select distinct in Solr?

 Any advice would be very appreciated.

 Mark

Re: searching for terms containing embedded spaces

2011-09-11 Thread Erick Erickson

Try escaping it for a start.

But why do you want to? If it's a phrase query, enclose it in double quotes.
You really have to provide more details, because there are too many
possibilities to answer. For instance:

If you're entering field:a b then 'b' will be searched against your
default text field, and you should enter field:(a b) or field:a field:b

If you've tokenized the field, you shouldn't care.

If you're using KeywordAnalyzer, escaping should work.

Etc.
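
The two options mentioned above (escaping the space, or using a phrase query) can be sketched as small helpers. This is an illustrative approximation written for this note, not Solr's own code; SolrJ ships a comparable helper, `ClientUtils.escapeQueryChars`, for Java clients. The field name `field` and the value `a b` are the placeholders used in this thread.

```python
import re

# Characters with special meaning in the Lucene query syntax, plus whitespace.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/|&\s])')


def escape_term(value):
    """Backslash-escape query syntax characters so the value is one term."""
    return _SPECIAL.sub(r"\\\1", value)


def phrase(value):
    """Wrap a value in double quotes, escaping embedded quotes and backslashes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


print("field:" + escape_term("a b"))
print("field:" + phrase("a b"))
```

Which of the two forms matches depends on how the field is analyzed, as the cases listed above point out.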


Best
Erick

On Fri, Sep 9, 2011 at 8:11 PM, Mark juszczec mark.juszc...@gmail.com wrote:
 Hi folks

 I've got a field that contains 2 words separated by a single blank.

 What's the trick to creating a search string that contains the single blank?

 Mark

Re: How to write this query?

2011-09-11 Thread Erick Erickson

So are you still having a problem, and if so what?

Best
Erick

On Sat, Sep 10, 2011 at 5:48 AM, crisfromnova crisfromn...@gmail.com wrote:
 Hi,

 key:value1^8 key:value2^4 key:value3^2 is correct.

 Sorry for the badly written query.

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/How-to-write-this-query-tp3318577p3325033.html

Re: Nested documents

2011-09-11 Thread Erick Erickson

Does this JIRA apply?

https://issues.apache.org/jira/browse/LUCENE-3171

Best
Erick

On Sat, Sep 10, 2011 at 8:32 PM, Andy angelf...@yahoo.com wrote:
 Hi,

 Does Solr support nested documents? If not is there any plan to add such a 
 feature?

 Thanks.

Re: solr equivalent of select distinct

2011-09-11 Thread Mark juszczec

Erick

Thanks very much for the reply.

I typed this late Friday after work and tried to simplify the problem
description.  I got something wrong.  Hopefully this restatement is better:

My PK is FLD1, FLD2 and FLD3 concatenated together.

In some cases FLD1 and FLD2 can be the same.  The ONLY differing field being
FLD3.

Here's an example:

PK   FLD1  FLD2  FLD3  FLD4  FLD5
AB0  A     B     0     x     y
AB1  A     B     1     x     y
CD0  C     D     0     a     b
CD1  C     D     1     e     f

I want to write a query using only the terms FLD1 and FLD2 and ONLY get
back:

A B x y
C D a b
C D e f

Since FLD4 and FLD5 are the same for PK=AB0 and AB1, I only want one
occurrence of those records.

Since FLD4 and FLD5 are different for PK=CD0 and CD1, I want BOTH
occurrences of those records.
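
For reference, the result described above can be produced on the client side once the matching documents have been fetched. This is a sketch of plain post-processing in application code, not a Solr feature, and it assumes the result set is small enough to retrieve in full (rows=500 in the queries in this thread).

```python
def distinct_rows(docs, fields):
    """Return the unique combinations of the given fields, in first-seen order."""
    seen = set()
    unique = []
    for doc in docs:
        key = tuple(doc.get(name) for name in fields)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# The four example records from the table above.
docs = [
    {"PK": "AB0", "FLD1": "A", "FLD2": "B", "FLD3": "0", "FLD4": "x", "FLD5": "y"},
    {"PK": "AB1", "FLD1": "A", "FLD2": "B", "FLD3": "1", "FLD4": "x", "FLD5": "y"},
    {"PK": "CD0", "FLD1": "C", "FLD2": "D", "FLD3": "0", "FLD4": "a", "FLD5": "b"},
    {"PK": "CD1", "FLD1": "C", "FLD2": "D", "FLD3": "1", "FLD4": "e", "FLD5": "f"},
]

for row in distinct_rows(docs, ["FLD1", "FLD2", "FLD4", "FLD5"]):
    print(" ".join(row))
```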

I'm hoping I can use wildcards to get FLD4 and FLD5.  If not, I can use fl=

I'm using edismax.

We are also creating the query string on the fly.  I suspect using SolrJ and
plugging the values into a bean would be easier - or do I have that wrong?

I hope the tables of example data display properly.

Mark

Re: searching for terms containing embedded spaces

2011-09-11 Thread Mark juszczec

Erick

My field contains "a b" (without the quotes)

We are trying to assemble the query as a String by appending the various
values.  I think that is a large part of the problem and our lives would be
easier if we let the Solr api do this work.

We've experimented with our query assembler producing

field:a+b

We've also tried making it create

field:a\ b

The first case just does not work and I'm unsure why.

The second case ends up url encoding the \ and I'm unsure if that will cause
it to be used in the query or not.

Mark



Re: searching for terms containing embedded spaces

2011-09-11 Thread Yonik Seeley

On Sun, Sep 11, 2011 at 12:56 PM, Mark juszczec mark.juszc...@gmail.com wrote:
 We've also tried making it create

 field:a\ b

 The first case just does not work and I'm unsure why.

 The second case ends up url encoding the \ and I'm unsure if that will cause
 it to be used in the query or not.

URL encoding is just part of the transfer syntax for an HTTP GET/POST
- by the time the query makes it to the lucene/solr query parser, that
escaping will have been removed.

You can also use
http://lucene.apache.org/solr/api/org/apache/solr/search/TermQParserPlugin.html
and not worry about any escaping.
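
A small sketch of both points, using the placeholder field name `field` from this thread: the `{!term f=...}` local-params form hands the raw value to the term query parser, and a URL-encoding round trip shows that the transfer encoding is undone before any query parser sees the string.

```python
from urllib.parse import parse_qs, urlencode


def term_query(field, value):
    """Build a term-parser query; the value needs no query-syntax escaping."""
    return "{!term f=" + field + "}" + value


# URL-encode the parameters the way an HTTP client would.
encoded = urlencode({"q": term_query("field", "a b")})
print(encoded)

# Decoding on the server side yields exactly the original query string,
# including a backslash if the client had chosen to escape the space instead.
decoded = parse_qs(encoded)["q"][0]
escaped_round_trip = parse_qs(urlencode({"q": "field:a\\ b"}))["q"][0]
print(decoded)
print(escaped_round_trip)
```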

But as Erick says, it's not clear that's really what you want (to
search on a single term with a space in it).  If it's a normal text
field, each word will be indexed separately, so you really want a
phrase query or a boolean query:

field:"a b"
or
field:(a b)

-Yonik
http://www.lucene-eurocon.com - The Lucene/Solr User Conference

Re: searching for terms containing embedded spaces

2011-09-11 Thread Mark juszczec


 But as Erick says, it's not clear that's really what you want (to
 search on a single term with a space in it).  If it's a normal text
 field, each word will be indexed separately, so you really want a
 phrase query or a boolean query:

 field:"a b"
 or
 field:(a b)


I am looking for a text string with a single, embedded space.  For the
purposes of this example, it is "a b" and it's stored in the index in a field
called "field".

Am I incorrect in assuming the query field:"a b" will match the string "a"
followed by a single embedded space followed by a "b"?

I'm also wondering if this is already handled by the Solr/SolrJ API and if
we are making our lives more difficult by assembling the query strings
ourselves.

Mark


Re: searching for terms containing embedded spaces

2011-09-11 Thread Yonik Seeley

On Sun, Sep 11, 2011 at 1:15 PM, Mark juszczec mark.juszc...@gmail.com wrote:
 I am looking for a text string with a single, embedded space.  For the
 purposes of this example, it is "a b" and it's stored in the index in a field
 called "field".

 Am I incorrect in assuming the query field:"a b" will match the string "a"
 followed by a single embedded space followed by a "b"?

Yes, that should work regardless of how the field is indexed (as a big
single token, or as a normal text field that doesn't preserve spaces).

-Yonik
http://www.lucene-eurocon.com - The Lucene/Solr User Conference

Re: searching for terms containing embedded spaces

2011-09-11 Thread Mark juszczec

That's what I thought.  The problem is, it's not working and I am unsure what
is wrong.



Re: searching for terms containing embedded spaces

2011-09-11 Thread Yonik Seeley

On Sun, Sep 11, 2011 at 1:39 PM, Mark juszczec mark.juszc...@gmail.com wrote:
 That's what I thought.  The problem is, it's not working and I am unsure what
 is wrong.

What is the fieldType definition for that field?  Did you change it
without re-indexing?

-Yonik
http://www.lucene-eurocon.com - The Lucene/Solr User Conference

Re: searching for terms containing embedded spaces

2011-09-11 Thread Mark juszczec

The field's properties are:

<field name="CUSTOMER_TYPE_NM" type="string" indexed="true" stored="true"
required="true" default="CUSTOMER_TYPE_NM_MISSING"/>

There have been no changes since I last completely rebuilt the index.

Is re-indexing done when an index is completely rebuilt with a
dataimport=full?   How about if we've done dataimport=delta?

If it helps, this is what I get when I print out the ModifiableSolrParams
object I'm sending to the query method:

q=+*%3A*++AND+CUSTOMER_TYPE_NM%3ANetwork+Advertiser+AND+ACTIVE_IND%3A1&defType=edismax&rows=500&sort=ACCOUNT_CUSTOMER_ID+asc&start=0

Mark

Re: solr equivalent of select distinct

2011-09-11 Thread Erick Erickson

Hmmm, there's no good way I can think of off the top of my
head to do this. Whenever people find themselves thinking
in terms of RDBMSs, I have to ask whether the problem is
really appropriate for a search engine. And/or what the problem
you're trying to solve with this approach is from a higher level.
Perhaps there's another approach completely that would
serve

Best
Erick

Re: searching for terms containing embedded spaces

2011-09-11 Thread Erick Erickson

OK, there are several issues here:
q= *:* AND CUSTOMER_TYPE_NM:Network Advertiser AND
ACTIVE_IND:1defType=edismaxrows=500sort=ACCOUNT_CUSTOMER_ID
ascstart=0

the *:* is doing you no good, I'd just remove it.

defType=edismax probably isn't doing what you expect, you're not
specifying any fields
(no qf parameter).

This is going to your request handler that has ' default=true '
defined. If you're using a
stock example, you're probably searching against the default search
field defined in
schema.xml, probably a field named text.

If you have a request handler named edismax, you can use the qt=edismax
parameter. If your request handler is named /edismax, then use either
qt=/edismax or solr/edismax?q=

Attach the debugQuery=on and look at the parsed form of the
query.

But edismax plays nicer than dismax used to, it's probably searching
against your default
search field. Which is probably NOT CUSTOMER_TYPE_NM.

String types are completely unanalyzed, so they're case sensitive. If
you want a case-insensitive
version, use something like KeywordTokenizer followed by
LowerCaseFilter. The admin/analysis
page will help you a lot here.

I think you'll get a lot of insight into this if you attach
debugQuery=on and look at the
parsedquery and parsedquery_tostring sections (after the results list).
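
As a hedged illustration of the points above (not taken from the thread itself): a value with an embedded space in an unanalyzed string field is usually matched by quoting it as a phrase, and adding debugQuery=on lets you check how the query was parsed. The helper name phrase_clause is hypothetical; the field names are the ones from the question.

```python
from urllib.parse import urlencode


def phrase_clause(field, value):
    """Return field:"value", escaping backslashes and double quotes in value."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


# Build the request parameters; urlencode inserts the '&' separators
# and percent-encodes the colon, quotes and spaces.
params = {
    "q": phrase_clause("CUSTOMER_TYPE_NM", "Network Advertiser") + " AND ACTIVE_IND:1",
    "rows": 500,
    "sort": "ACCOUNT_CUSTOMER_ID asc",
    "start": 0,
    "debugQuery": "on",
}
query_string = urlencode(params)
print(query_string)
```

Because the field clause names CUSTOMER_TYPE_NM explicitly, the default search field should not come into play for that term; the parsedquery section of the debug output is the place to confirm this.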

Best
Erick

On Sun, Sep 11, 2011 at 2:25 PM, Mark juszczec mark.juszc...@gmail.com wrote:
The field's properties are:

<field name="CUSTOMER_TYPE_NM" type="string" indexed="true" stored="true"
required="true" default="CUSTOMER_TYPE_NM_MISSING"/>

There have been no changes since I last completely rebuilt the index.

Is re-indexing done when an index is completely rebuilt with a
dataimport=full? How about if we've done dataimport=delta?

If it helps, this is what I get when I print out the ModifiableSolrParams
object I'm sending to the query method:

q=+*%3A*++AND+CUSTOMER_TYPE_NM%3ANetwork+Advertiser+AND+ACTIVE_IND%3A1&defType=edismax&rows=500&sort=ACCOUNT_CUSTOMER_ID+asc&start=0

Mark

On Sun, Sep 11, 2011 at 2:05 PM, Yonik Seeley
yo...@lucidimagination.com wrote:

On Sun, Sep 11, 2011 at 1:39 PM, Mark juszczec mark.juszc...@gmail.com
wrote:
That's what I thought. The problem is, it's not, and I am unsure what is
wrong.

What is the fieldType definition for that field? Did you change it
without re-indexing?

-Yonik
http://www.lucene-eurocon.com - The Lucene/Solr User Conference

Re: solr equivalent of select distinct

2011-09-11 Thread Michael Sokolov

You can get what you want - unique lists of values from docs matching 
your query - for a single field (using facets), but not for the 
co-occurrence of two field values.  So you could combine the two fields 
together, if you know what they are going to be in advance.  Facets 
also give you counts, so in some special cases, you could get what you 
want - eg you can tell when there is only a single pair of values since 
their counts will be the same and the same as the total.  But that's all 
I can think of.


-Mike

On 9/11/2011 12:39 PM, Mark juszczec wrote:

Here's an example:

PK   FLD1  FLD2  FLD3  FLD4  FLD5
AB0  A     B     0     x     y
AB1  A     B     1     x     y
CD0  C     D     0     a     b
CD1  C     D     1     e     f

I want to write a query using only the terms FLD1 and FLD2 and ONLY get
back:

A B x y
C D a b
C D e f

Since FLD4 and FLD5 are the same for PK=AB0 and AB1, I only want one
occurrence of those records.

Since FLD4 and FLD5 are different for PK=CD0 and CD1, I want BOTH
occurrences of those records.
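
A minimal sketch of one way to get the rows described above, assuming the matching documents are fetched from Solr and de-duplicated on the client. This is plain post-processing, not a Solr feature, and the function name distinct_rows is hypothetical. The sample data is the table from the example.

```python
def distinct_rows(docs, fields):
    """Return unique tuples of the given fields, keeping first-seen order."""
    seen = set()
    rows = []
    for doc in docs:
        key = tuple(doc[name] for name in fields)
        if key not in seen:
            seen.add(key)
            rows.append(key)
    return rows


# The four records from the example table.
docs = [
    {"PK": "AB0", "FLD1": "A", "FLD2": "B", "FLD3": "0", "FLD4": "x", "FLD5": "y"},
    {"PK": "AB1", "FLD1": "A", "FLD2": "B", "FLD3": "1", "FLD4": "x", "FLD5": "y"},
    {"PK": "CD0", "FLD1": "C", "FLD2": "D", "FLD3": "0", "FLD4": "a", "FLD5": "b"},
    {"PK": "CD1", "FLD1": "C", "FLD2": "D", "FLD3": "1", "FLD4": "e", "FLD5": "f"},
]

# Ignore PK and FLD3 so that AB0 and AB1 collapse into one row.
result = distinct_rows(docs, ["FLD1", "FLD2", "FLD4", "FLD5"])
for row in result:
    print(" ".join(row))
```

The same idea can be moved to index time by storing the combined value in one extra field and faceting on it, as suggested in the reply above.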

Re: SolrCloud Feedback

2011-09-11 Thread Mark Miller


On Sep 9, 2011, at 1:09 PM, Pulkit Singhal wrote:

 I think I understand it a bit better now but wouldn't mind some validation.
 
 1) solr.xml does not become part of ZooKeeper

Right - currently it does not. Info is put there to tell Solr how to connect to 
zookeeper and register the cores.

 2) The default looks like this out-of-box:
  <cores adminPath="/admin/cores" defaultCoreName="collection1">
    <core name="collection1" instanceDir="." shard="shard1"/>
  </cores>
 so that may leave one wondering where the core's association to a
 collection name is made?
 
 It can be made like so:
 a) statically in a file:
 <core name="collection1" instanceDir="." shard="shard1" collection="myconf"/>
 b) at start time via java:
 java ... -Dcollection.configName=myconf ... -jar start.jar

These are two different things. First, just to make the bootstrap case simple, 
if you don't specify a collection name, it defaults to the SolrCore name. That 
is why we make a default SolrCore name of collection1. In the simple wiki 
SolrCloud example, you can avoid naming the collection on each shard and simply 
have things come up under collection1 by default.

a) shows how to override using the SolrCore name for the collection name.

b) shows how to set the configuration set name for the config files that you 
upload with -Dbootstrap_confdir=. If you specify nothing for 
collection.configName, it defaults to configuration1.
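
The two defaults described above can be summarized in a small sketch. This only illustrates the stated rules; it is not Solr's actual code, and the function name resolve_names is hypothetical.

```python
# Default config set name, as stated in the message above.
DEFAULT_CONFIG_NAME = "configuration1"


def resolve_names(core_name, collection=None, config_name=None):
    """Apply the two defaults: collection falls back to the SolrCore name,
    and the config set name falls back to DEFAULT_CONFIG_NAME."""
    return {
        "collection": collection or core_name,
        "configName": config_name or DEFAULT_CONFIG_NAME,
    }


# Nothing specified: both defaults apply.
print(resolve_names("collection1"))
# Only -Dcollection.configName=myconf given: the collection name is unchanged.
print(resolve_names("collection1", config_name="myconf"))
```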

 
 And I'm guessing that since the core's name (collection1) for shard1
 has already been associated with -Dcollection.configName=myconf in
 http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster
 once already, adding an additional shard2 with the same core name
 (collection1), automatically throws it in with the collection name
 (myconf) without any need to specify anything at startup via -D or
 statically in solr.xml file.

myconf is not the collection name - it's the name of a collection of 
configuration files. If only one such set exists, you don't have to specify 
which to use (which you would do by changing the value at a given node in the 
zookeeper layout). If you wanted multiple named collection file sets, you would 
have to explicitly map each collection to its named configuration file set.

 
 Validate away otherwise I'll just accept any hate mail after making
 edits to the Solr wiki directly.
 
 - Pulkit
 
 On Fri, Sep 9, 2011 at 11:38 AM, Pulkit Singhal pulkitsing...@gmail.com 
 wrote:
 Hello Jan,
 
 You've made a very good point in (b). I would be happy to make the
 edit to the wiki if I understood your explanation completely.
 
 When you say that it is looking up what collection that core is part
 of ... I'm curious how a core is being put under a particular
 collection in the first place? And what that collection is named?
 Obviously you've made it clear that collection1 is really the name of
 the core itself. And where this association is being stored for the
 code to look it up?
 
 If not Jan, then perhaps the gurus who wrote Solr Cloud could answer :)
 
 Thanks!
 - Pulkit
 
 On Thu, Feb 10, 2011 at 9:10 AM, Jan Høydahl jan@cominvent.com wrote:
 Hi,
 
 I have so far just tested the examples and got a N by M cluster running. My 
 feedback:
 
 a) First of all, a major update of the SolrCloud Wiki is needed, to clearly 
 state what is in which version, what are current improvement plans and get 
 rid of outdated stuff. That said I think there are many good ideas there.
 
 b) The collection terminology is too much confused with core, and 
 should probably be made more distinct. I just tried to configure two cores 
 on the same Solr instance into the same collection, and that worked fine, 
 both as distinct shards and as same shard (replica). The wiki examples give 
 the impression that collection1 in 
 localhost:8983/solr/collection1/select?distrib=true is some magic 
 collection identifier, but what it really does is doing the query on the 
 *core* named collection1, looking up what collection that core is part of 
 and distributing the query to all shards in that collection.
 
 c) ZK is not designed to store large files. While the files in conf are 
 normally well below the 1M limit ZK imposes, we should perhaps consider 
 using a lightweight distributed object or k/v store for holding the 
 /CONFIGS and let ZK store a reference only
 
 d) How are admins supposed to update configs in ZK? Install their favourite 
 ZK editor?
 
 e) We should perhaps not be so afraid to make ZK a requirement for Solr in 
 v4. Ideally you should interact with a 1-node Solr in the same manner as 
 you do with a 100-node Solr. An example is the Admin GUI where the schema 
 and solrconfig links assume local file. This requires decent tool support 
 to make ZK interaction intuitive, such as import and export commands.
 
 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 
 On 19. jan. 2011, at 21.07, Mark Miller wrote:
 
 Hello Users,
 
 About a little over a year ago, a few of us started

Solr and DateTimes - bug?

2011-09-11 Thread Nicklas Overgaard


Hi everybody,

I just started playing around with solr, however i'm facing some 
trouble. The test data i'm indexing with solr is, amongst other things, 
containing date and times.


By the way, I'm using mono and i'm talking to solr through the SolrNet 
library.


The issue i'm facing:

Some of the dates correspond to the DateTime.MinValue of .NET, which is 
0001-01-01 00:00:00. When this date is returned from Solr, it's 
returned as 1-01-01T00:00:00Z. Now, I figured out that Solr 
supposedly should return dates according to the ISO 8601 standard - but 
the above output is not in that format.


This basically leads to mono breaking down because it's not able to 
parse the above date. If i add three leading zeroes, it parses just fine 
(so it becomes 0001-01-01T00:00:00Z, the correct ISO 8601 format).
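
A possible client-side workaround, sketched in Python rather than .NET and assuming the only problem is the missing leading zeroes in the year: pad the year to four digits before handing the string to the date parser. The function names pad_year and parse_solr_date are hypothetical.

```python
import re
from datetime import datetime

# Matches a 1- to 4-digit year followed by the rest of the timestamp.
_YEAR_PREFIX = re.compile(r"^(\d{1,4})(-.*)$")


def pad_year(value):
    """Left-pad the year with zeroes so it is always four digits long."""
    match = _YEAR_PREFIX.match(value)
    if match is None:
        return value
    return match.group(1).zfill(4) + match.group(2)


def parse_solr_date(value):
    """Parse a UTC timestamp such as 1-01-01T00:00:00Z after padding the year."""
    return datetime.strptime(pad_year(value), "%Y-%m-%dT%H:%M:%SZ")


print(pad_year("1-01-01T00:00:00Z"))
print(parse_solr_date("2011-09-11T00:00:00Z"))
```

Values that already have a four-digit year pass through unchanged, so the padding step can be applied to every date field.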


So my question is: Is this a bug in the solr output engine, or should 
mono be able to parse the date as given from solr? I have not yet tried 
it out on .net as I do not have access to a windows machine at the moment.


Best regards,

Nicklas

Re: Example Solr Config on EC2

2011-09-11 Thread Pulkit Singhal

Just to clarify, that link doesn't do anything to promote an already running
slave into a master. One would have to bounce the Solr node which has that
slave and then make the shift. It is not something that happens at runtime
live.

On Wed, Aug 10, 2011 at 4:04 PM, Akshay akm...@gmail.com wrote:

Yes, you can promote a slave to be master; refer to:

http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node

In AWS one can use an elastic IP (http://aws.amazon.com/articles/1346) to
refer to the master and this can be assigned to slaves as they assume the
role of master (in case of failure). All slaves will then refer to this new
master and there will be no need to regenerate data.

Automation of this may be possible through CloudWatch alarm-actions. I don't
know of any available example automation scripts.

Cheers
Akshay.

On Wed, Aug 10, 2011 at 9:08 PM, Matt Shields m...@mattshields.org
wrote:

If I were to build a master with multiple slaves, is it possible to promote
a slave to be the new master if the original master fails? Will all the
slaves pick up right where they left off, or any time the master fails will
we need to completely regenerate all the data?

If this is possible, are there any examples of this being automated?
Especially on Win2k3.

Matthew Shields
Owner
BeanTown Host - Web Hosting, Domain Names, Dedicated Servers, Colocation,
Managed Services
www.beantownhost.com
www.sysadminvalley.com
www.jeeprally.com

On Mon, Aug 8, 2011 at 5:34 PM, mboh...@yahoo.com wrote:

Matthew,

Here's another resource:

http://www.lucidimagination.com/blog/2010/02/01/solr-shines-through-the-cloud-lucidworks-solr-on-ec2/

Michael Bohlig
Lucid Imagination

- Original Message
From: Matt Shields m...@mattshields.org
To: solr-user@lucene.apache.org
Sent: Mon, August 8, 2011 2:03:20 PM
Subject: Example Solr Config on EC2

I'm looking for some examples of how to set up Solr on EC2. The
configuration I'm looking for would have multiple nodes for redundancy.
I've tested in-house with a single master and slave with replication
running in Tomcat on Windows Server 2003, but even if I have multiple slaves
the single master is a single point of failure. Any suggestions or example
configurations? The project I'm working on is a .NET setup, so ideally I'd
like to keep this search cluster on Windows Server, even though I prefer
Linux.

Matthew Shields
Owner
BeanTown Host - Web Hosting, Domain Names, Dedicated Servers,
Colocation,
Managed Services
www.beantownhost.com
www.sysadminvalley.com
www.jeeprally.com

Re: Stemming and other tokenizers

2011-09-11 Thread Jan Høydahl

Hi,

You'll not be able to detect language and change stemmer on the same field in 
one go. You need to create one fieldType in your schema per language you want 
to use, and then use LanguageIdentification (SOLR-1979) to do the magic of 
detecting language and renaming the field. If you set langid.override=false, 
langid.map=true and populate your language field with the known language, 
you will probably get the desired effect.
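
If the language tag from the question has to be removed before the document reaches Solr, one option is to split it off on the client. The sketch below assumes a two-letter tag such as #en# at the very start of the value; the field names language and text_<lang>, the fallback language, and both function names are hypothetical.

```python
import re

# "#en# some text" -> language code "en" and the remaining text.
_TAG = re.compile(r"^#([a-z]{2})#\s*(.*)$", re.DOTALL)


def split_language_tag(raw, default_lang="en"):
    """Return (language, text) with the leading #xx# tag removed."""
    match = _TAG.match(raw)
    if match is None:
        return default_lang, raw
    return match.group(1), match.group(2)


def to_solr_fields(raw):
    """Route the text to a per-language field, for example text_en."""
    lang, text = split_language_tag(raw)
    return {"language": lang, f"text_{lang}": text}


print(to_solr_fields("#en# my field"))
```

With one fieldType per language in the schema, each text_<lang> field can then use its own stemmer, and the tag itself is never stored.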

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 10. sep. 2011, at 03:24, Patrick Sauts wrote:

 Hello,
 
 
 
 I want to implement some kind of AutoStemming that will detect the language
 of a field based on a tag at the start of this field, like #en#. My field is
 stored on disc, but I don't want this tag to be stored. Is there a way to
 keep it from being stored?
 
 To me all the filters and the tokenizers interact only with the indexed
 field and not the stored one.
 
 Am I wrong ?
 
 Is it possible to write such a filter?
 
 
 
 Patrick.

Re: Running solr on small amounts of RAM

2011-09-11 Thread Jan Høydahl

Hi,

Be aware that the Solr 4.0 branch has multiple RAM-conserving optimizations which may 
cause your index to take considerably less space, so try it out.
Also, of course, prune your schema to turn off everything you don't need, and 
also your OS to stop services you don't use.
Consider disallowing certain types of queries from the clients (such as 
wildcard, sorting, fuzzy etc.) to avoid getting into high-memory situations.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 11. sep. 2011, at 17:59, Erick Erickson wrote:

 Well, this answer isn't much more satisfactory than get more memory,
 but about all I can say is try it and see.
 
 Sure, make your caches very small and monitor memory and test it out.
 
 You'll get a sense of how fast (or slow) the queries are pretty quickly. Or
 you can get a ballpark estimate of what running without caches would
 do performance wise by simply measuring the first query after a restart.
 
 Because, unfortunately, it depends is the only accurate answer. It
 depends on how much sorting, faceting etc. you do as well as the
 queries themselves.
 
 Best
 Erick
 
 On Fri, Sep 9, 2011 at 12:48 PM, Mike Austin mike.aus...@juggle.com wrote:
 I'm trying to push to get solr used in our environment. I know I could have
 responses saying WHY can't you get more RAM etc., but let's just skip those
 and work with this situation.
 
 Our index is very small with 100k documents and a light load at the moment.
 If I wanted to use the smallest possible RAM on the server, how would I do
 this and what are the issues?
 
 I know that caching would be the biggest loss, but if Solr ran with no to
 little caching, would the performance still be ok? I know this is a relative
 question..
 This is the only application using java on this machine, would tuning java
 to use less cache help anything?
 I should set the cache settings low in the config?
 Basically, what will having a very low cache hit rate do to search speed and
 server performance?  I know more is better and it depends on what I'm
 comparing it to but if you could just answer in some way saying that it's
 not going to cripple the machine or cause 5 second searches?
 
 It's on a windows server.
 
 
 Thanks,
 Mike

Re: Full-search index for the database

2011-09-11 Thread Eugeny Balakhonov

Hello,

Thanks for answer!

I have created separate fields in my Solr schema for each field in the database
(more than 500!). How do I ask the parser to search across all these fields? By
default the Solr schema should contain an explicit declaration of the default
search field, like the following:

<defaultSearchField>TEXT</defaultSearchField>

I tried to use the following search query:

.?q=*:search text&hl=on&defType=edismax

In this case the search goes across the default search field.

I can't concatenate all 500 database field names in a big search expression.
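
One way around typing 500 names by hand is to generate the edismax parameters from a list of field names. The sketch below assumes that list is already available on the client (for example from the same metadata that drives the data import); the function name and the sample field names are hypothetical.

```python
from urllib.parse import urlencode


def build_edismax_params(user_text, field_names, rows=10):
    """Build edismax parameters that search and highlight the given fields."""
    fields = sorted(set(field_names))  # drop duplicates, keep a stable order
    return {
        "q": user_text,
        "defType": "edismax",
        "qf": " ".join(fields),      # qf takes a space-separated field list
        "hl": "on",
        "hl.fl": ",".join(fields),   # highlight each field separately
        "rows": rows,
    }


params = build_edismax_params("search text", ["NAME", "CITY", "NAME"])
body = urlencode(params)
print(body)
```

With several hundred fields the encoded string gets long, so sending it as a POST body rather than in the URL is probably the safer choice. The highlighting section of the response is then keyed by field name, which identifies where the match occurred.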


2011/9/11 Jamie Johnson jej2...@gmail.com

 You should create separate fields in your solr schema for each field
 in your database that you want recognized separately.  You can use a
 query parser like edismax to do a weighted query across all of your
 fields and then provide highlighting on the specific field which
 matched.

 2011/9/10 Eugeny Balakhonov c0f...@gmail.com:
  I want to create full-text search for my database.
 
  It means that search engine should look up some string for all fields of
 my
  database.
 
  I have created Solr configuration for extracting and indexing data from a
  database.
 
 
 
 
 
  According documentation in the file schema.xml I have created field for
  full-text search index:
 
 
 
  <field name="TEXT" type="..." indexed="true" stored="true"
  multiValued="true"/>
 
 
 
  Also I have added strings for copying all values of all fields into this
  full-search field:
 
 
 
  ...
 
 copyField source= dest=TEXT/
 
  ...
 
 
 
  In result I have possibility to search for all fields in my database. But
 I
  can't recognize which field in the found record contains requested
 string.
 
  Highlighting functionality just marks string in the TEXT field like
  following:
 
 
 
  <lst name="highlighting">
   <lst name="431046.431344...8473633">
    <arr name="TEXT">
     <str>Any text any text <em>Test</em></str>
    </arr>
   </lst>
   <lst name="431046.431231...8476393">
    <arr name="TEXT">
     <str>Any text any text <em>Test</em></str>
    </arr>
   </lst>
 
 
 
  How to create full-search index with possibility to recognize source
  database field?
 
 
 
  Thx a lot.
 
  Eugeny
 
 




-- 
Best regards,
Eugeny Balakhonov

Re: Solr and DateTimes - bug?

2011-09-11 Thread Jan Høydahl

Hi,

Can you try to make a plain HTTP query from the admin GUI on your index and 
tell us what the XML response is for that date field?
http://localhost:8983/solr/select?q=*:*
If that date output is wrong as well, there may be a bug with Solr. If it is 
correct, you have a problem in SolrNet.

Btw, which version of Solr do you use?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 12. sep. 2011, at 00:28, Nicklas Overgaard wrote:

 Hi everybody,
 
 I just started playing around with solr, however i'm facing some trouble. The 
 test data i'm indexing with solr is, amongst other things, containing date 
 and times.
 
 By the way, I'm using mono and i'm talking to solr through the SolrNet 
 library.
 
 The issue i'm facing:
 
 Some of the dates corresponds to the DateTime.MinValue of .net, which is 
 0001-01-01 00:00:00. When this date is returned from Solr, it's returned 
 like 1-01-01T00:00:00Z. Now, I figured out that solr supposedly should 
 return dates according to the ISO 8601 standard - but the above output is not 
 in that format.
 
 This basically leads to mono breaking down because it's not able to parse the 
 above date. If i add three leading zeroes, it parses just fine (so it becomes 
 0001-01-01T00:00:00Z, the correct ISO 8601 format).
 
 So my question is: Is this a bug in the solr output engine, or should mono be 
 able to parse the date as given from solr? I have not yet tried it out on 
 .net as I do not have access to a windows machine at the moment.
 
 Best regards,
 
 Nicklas

Re: Nested documents

2011-09-11 Thread Michael McCandless

Even if it applies, this is for Lucene.  I don't think we've added
Solr support for this yet... we should!

Mike McCandless

http://blog.mikemccandless.com

On Sun, Sep 11, 2011 at 12:16 PM, Erick Erickson
erickerick...@gmail.com wrote:
 Does this JIRA apply?

 https://issues.apache.org/jira/browse/LUCENE-3171

 Best
 Erick

 On Sat, Sep 10, 2011 at 8:32 PM, Andy angelf...@yahoo.com wrote:
 Hi,

 Does Solr support nested documents? If not is there any plan to add such a 
 feature?

 Thanks.

Re: Solr and DateTimes - bug?

2011-09-11 Thread Nicklas Overgaard


Hi,

The XML output when performing a query via the solr interface is like this:
<date name="endDate">1-01-01T00:00:00Z</date>

It's solr 3.3.0 on an ArchLinux desktop machine with OpenJDK 
6.b22_1.10.3-1 as my java runtime environment.


/Nicklas

On 2011-09-12 00:26, Jan Høydahl wrote:

Hi,

Can you try to make a plain HTTP query from the admin GUI on your index and 
tell us what the XML response is for that date field?
http://localhost:8983/solr/select?q=*:*
If that date output is wrong as well, there may be a bug with Solr. If it is 
correct, you have a problem in SolrNet.

Btw, which version of Solr do you use?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com


Re: Full-search index for the database

2011-09-11 Thread Eugeny Balakhonov

My task is very simple:

I have a big database with a lot of tables and fields. This database has a
dynamic structure and can be extended or changed at any time.
I need a tool for full-text search across all fields in all tables of my
database. The input of this tool is some text to search for. The output is
a unique key and the name of the field which contains this text.


Solr is a very good choice, but I have a serious problem with it: all Solr
query parsers (standard, dismax, edismax) require explicit declaration of
the fields to search. But the list of these fields in my case is very, very big!
And at search time I don't know all the field names in the database.

I think that my task is not unique. According to Google, a lot of people try
to solve the same problem with Solr.

Maybe it would be a good idea to add more flexible possibilities for searching
all indexed fields?


I see the following variants:

1. Add wildcards in the qf parameter for dismax/edismax query parsers.

2. Add the possibility to store the source field name in the copyField operator
in schema.xml. In this case the user can do the following:

a) create a field for default search:
<field name="TEXT" type="text_ALL" indexed="true" stored="true"
multiValued="true"/>
...
<defaultSearchField>TEXT</defaultSearchField>

b) copy all fields to the default search field:
<copyField source="*" dest="TEXT" storeSource="true"/>

c) In the query response the user can receive the needed source field name:

<lst name="highlighting">
 <lst name="..">
  <arr name="TEXT">
   <str source="SOURCE_FIELD_NAME">foo foo foo <em>test</em> foo foo</str>
  </arr>
 </lst>





-- 
Best regards,
Eugeny Balakhonov

Re: Full-search index for the database

2011-09-11 Thread Erick Erickson

How much search-specific stuff are we talking here? Do you want to
do stemming? Plurals? Or are you talking exact match? Phrases?
multi-word queries? If exact match on individual terms
is all you want, you could hack something together like this:

index each term into a catch-all field with the field appended, something
like
val1|field1 val2|field2
be sure you don't use an analysis chain that splits on non-letters. Then, for
each term, append |* to the term and your returned terms will have the
field they came from. Of course you'll have to do the right thing with the
results to show them correctly, but that'd work.
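
The hack above can be sketched as follows. The "|" separator and the trailing "|*" come from the description; the function names are hypothetical, and whether the separator needs escaping in a particular query parser is not covered here.

```python
def to_catch_all_tokens(record):
    """Turn {"field1": "val1", ...} into tokens like "val1|field1"."""
    tokens = []
    for field, value in record.items():
        for term in str(value).split():
            tokens.append(f"{term}|{field}")
    return tokens


def prefix_query(term):
    """Search term that matches the value regardless of its source field."""
    return f"{term}|*"


def split_token(token):
    """Recover (term, field) from a returned token such as "val2|field2"."""
    term, _, field = token.rpartition("|")
    return term, field


print(to_catch_all_tokens({"field1": "val1", "field2": "val2"}))
print(prefix_query("val1"))
print(split_token("val2|field2"))
```

Since the tokens are indexed as-is, this only supports exact matches on individual terms, which is the limitation noted above.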

But this is really abusing Solr <G>. I wonder if this is an XY problem, so
can you explain what it is you're trying to do at a higher level and maybe
we can suggest some other approach?

You could also have some kind of hybrid solution that searched with
Solr (not using the trick above) and just returned the PK from Solr,
then go to the DB to fill things out.

Best
Erick

On Sun, Sep 11, 2011 at 7:06 PM, Eugeny Balakhonov c0f...@gmail.com wrote:
 My task is very simple:

 I have a big database with a lot tables and fields. This database has
 dynamic structure and can be extended or changed in any time.
 I need a tool for full-search possibility via all fields in all tables of my
 database. On the input of this tool - some text for search. On the output -
 some unique key and the name of field which contains this text.


 Solr is very good selection, but I have serious problem with it: all Solr
 query parsers (standard, dismax, edismax) requires explicit declaration of
 fields for search. But list of these fields in my case is very and very big!
 And at search time I don't know all field names in  the database.

 I think that my task is not unique. According google a lot of people tries
 to solve same problems with Solr.

 May be good idea to add more flexible possibilities for search in all
 indexed fields?


 I see following variants:

 1. Add wildcards in the qf parameter for dismax/edismax query parsers.

 2. Add possibility to store source field name in copyField  operator in
 schema.xml. In this case user can do following:

 a) create field for default search:
 <field name="TEXT" type="text_ALL" indexed="true" stored="true"
 multiValued="true"/>
 ...
 <defaultSearchField>TEXT</defaultSearchField>

 b) copy all fields to default search field:
 <copyField source="*" dest="TEXT" storeSource="true"/>

 c) In query response user can receive needed source field name:

 <lst name="highlighting">
  <lst name="..">
   <arr name="TEXT">
    <str source="SOURCE_FIELD_NAME">foo foo foo <em>test</em> foo foo</str>
   </arr>
  </lst>



select query does not find indexed pdf document

2011-09-11 Thread Michael Dockery

I am new to solr.  

I tried to upload a pdf file via curl to my solr webapp (on tomcat)

curl "http://www/SearchApp/update/extract?stream.file=c:\dmvpn.pdf&stream.contentType=application/pdf&literal.id=pdf&commit=true"



<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">860</int></lst>
</response>


but

http://www/SearchApp/select/?q=vpn


does not find the document


<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="q">vpn</str>
</lst>
</lst>
<result name="response" numFound="0" start="0"/>
</response>


help is appreciated.

=
fyi
I point my test webapp to the index/solr home via mod meta-data/context.xml
<Context crossContext="true">
   <Environment name="solr/home" type="java.lang.String"
   value="c:/solr_home" override="true"/>

and I had to copy all these jars to my webapp lib dir: (to avoid the 
classnotfound)
Solr_download\contrib\extraction\lib
  ...in the future i plan to put them in the tomcat/lib dir.


Also, I have not modified conf\solrconfig.xml or schema.xml.

Will Solr/Lucene crawl multi websites (aka a mini google with faceted search)?

2011-09-11 Thread dpt9876

Hi all,
I am wondering if Solr will do the following for a project I am working on.
I want to create a search engine with facets for potentially hundreds of
websites.
Similar to say crawling amazon + buy.com + ebay and someone can search these
3 sites from my 1 website.
(I realise there are better ways of doing the above example, it's for
illustrative purposes).
Eventually I would build that search crawl to index say 200 or 1000
merchants.
Someone would come to my site and search for digital camera.

They would get results from all 3 indexes and hopefully dynamic facets eg
Price $100-200
Price 200-300
Resolution 1mp-2mp

etc etc

Can this be done on the fly?

I ask this because I am currently developing webscrapers to crawl these
websites, dump that data into a db, then was thinking of tacking on a solr
server to crawl my db.

Problem with that approach is that crawling the world's ecommerce sites will
take forever, when it seems solr might do that for me? (I have read about
multiple indexes etc).

Many thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Will-Solr-Lucene-crawl-multi-websites-aka-a-mini-google-with-faceted-search-tp3328314p3328314.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Will Solr/Lucene crawl multi websites (aka a mini google with faceted search)?

2011-09-11 Thread Erick Erickson

Nope, there's nothing in Solr that crawls anything, you have to feed
documents in yourself from the websites.

Or, look at the Nutch project, see: http://nutch.apache.org/about.html

which is designed for this kind of problem.

Best
Erick
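
A minimal sketch of what "feeding documents in yourself" can look like, assuming the scraper already yields plain dicts. It only serializes them into Solr's XML `<add>` update message; the field names (`id`, `title`, `price`) and the helper `build_add_xml` are hypothetical and would have to match your schema.xml. Posting the result to the /update handler is left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize dicts as a Solr XML <add> message, one <doc> per dict."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list or tuple becomes several <field> elements (multiValued).
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Hypothetical scraped record.
scraped = [{"id": "shop-a-123", "title": "Digital camera", "price": 149.99}]
payload = build_add_xml(scraped)
print(payload)
```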


Re: Will Solr/Lucene crawl multi websites (aka a mini google with faceted search)?

2011-09-11 Thread dpt9876

Hi thanks for the reply.

How does Nutch/Solr handle the scenario where one website calls the price
"price" and another website calls it "cost"? It is the same thing under a
different name, yet I would want the facet to handle that and not create a
different facet.

Is this combo of Nutch and Solr that intelligent and/or intuitive?

Thanks for the fast response.

--
View this message in context:
http://lucene.472066.n3.nabble.com/Will-Solr-Lucene-crawl-multi-websites-aka-a-mini-google-with-faceted-search-tp3328314p3328449.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Will Solr/Lucene crawl multi websites (aka a mini google with faceted search)?

2011-09-11 Thread Ken Krugler

On Sep 11, 2011, at 7:04pm, dpt9876 wrote:

> Hi thanks for the reply.
>
> Is this combo of nutch and Solr that intelligent and or intuitive?

What you're describing here is web mining, not web crawling.

You want to extract price data from web pages, and put that into a specific
field in Solr.

To do that using Nutch, you'd need to write custom plug-ins that know how to
extract the price from a page, and add that as a custom field to the crawl
results.

The above is a topic for the Nutch mailing list, since Solr is just a
downstream consumer of whatever Nutch provides.

-- Ken
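
As a rough illustration of the mapping step described above (done in a scraper or custom plug-in before anything reaches Solr), here is a sketch that folds site-specific labels such as "price" and "cost" into one canonical field. The alias table, the function names and the price-parsing rule are all assumptions for the example, not part of Nutch or Solr.

```python
import re

# Hypothetical per-site labels that all mean "price".
FIELD_ALIASES = {"price": "price", "cost": "price", "our price": "price"}


def parse_price(text):
    """Pull the first number out of strings like '$1,149.00'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", str(text))
    if match is None:
        raise ValueError("no numeric price in %r" % (text,))
    return float(match.group().replace(",", ""))


def normalize(record):
    """Rename aliased labels to canonical field names and clean price values."""
    out = {}
    for label, value in record.items():
        key = label.strip().lower()
        key = FIELD_ALIASES.get(key, key)
        out[key] = parse_price(value) if key == "price" else value
    return out


print(normalize({"Cost": "$1,149.00", "Title": "Digital camera"}))
```

With every merchant's value landing in the same numeric field, a single range facet on that field covers all sites.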


--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

Parameter not working for master/slave

2011-09-11 Thread William Bell

I am using Solr 3.3. I tried passing in -Denable.master=true and
-Denable.slave=true on the slave machine.
Then I changed solrconfig.xml to reference each as per:

http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node

But this is not working. The enable parameter does not appear to work in 3.3.

Is this supposed to be working? What else can I do to debug it? How
can I see other parameters working in solrconfig.xml?

-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076
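
The wiki page linked above configures the handler with placeholders of the form `${enable.master:false}`, which are filled from system properties such as -Denable.master=true. The sketch below only imitates that substitution rule (property value if set, otherwise the default after the colon) to show what the flag is expected to change; it is an illustration, not Solr's own code, and the `substitute` helper is hypothetical.

```python
import re

# ${name} or ${name:default}
_PLACEHOLDER = re.compile(r"\$\{([^:}]+)(?::([^}]*))?\}")


def substitute(text, props):
    """Replace ${name:default} with props[name], falling back to the default."""
    def repl(match):
        name, default = match.group(1), match.group(2)
        if name in props:
            return props[name]
        if default is not None:
            return default
        raise KeyError("no value or default for %s" % name)
    return _PLACEHOLDER.sub(repl, text)


line = '<str name="enable">${enable.master:false}</str>'
print(substitute(line, {}))                         # no -D flag given
print(substitute(line, {"enable.master": "true"}))  # -Denable.master=true
```

If the placeholder is missing from the element in solrconfig.xml, or the property name differs from the one passed with -D, the flag would have no visible effect under this rule.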

Re: Solr and DateTimes - bug?

2011-09-11 Thread Chris Hostetter


: The XML output when performing a query via the solr interface is like this:
: <datename="endDate">1-01-01T00:00:00Z</date>

i think you mean: <date name="endDate">1-01-01T00:00:00Z</date>

:   So my question is: Is this a bug in the solr output engine, or should mono
:   be able to parse the date as given from solr? I have not yet tried it out
:   on .net as I do not have access to a windows machine at the moment.

it is in fact a bug in Solr that not a lot of people have been overly 
concerned with, since most people don't deal with dates that far back

https://issues.apache.org/jira/browse/SOLR-1899

...I spent a little time working on it at one point but got sidetracked 
by other things, since there are a couple of related issues with the 
canonical iso8601 date format around year 0 that made it non-obvious 
what the ideal solution was.

-Hoss
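
Until that issue is resolved, a client can tolerate the short year itself. The original question was about mono/.NET; the sketch below shows the same idea in Python instead, under the assumption that the value has no fractional seconds and a positive year. The function name `parse_solr_date` is hypothetical.

```python
import re
from datetime import datetime, timezone

# Year may have fewer than four digits, e.g. "1-01-01T00:00:00Z".
_SOLR_DATE = re.compile(r"^(\d+)-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})Z$")


def parse_solr_date(text):
    """Parse a Solr UTC date string whose year is not zero-padded."""
    match = _SOLR_DATE.match(text)
    if match is None:
        raise ValueError("unsupported date format: %r" % (text,))
    year, month, day, hour, minute, second = (int(g) for g in match.groups())
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)


print(parse_solr_date("1-01-01T00:00:00Z").isoformat())
```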

Re: Using multivalued field in map function

2011-09-11 Thread Chris Hostetter


: Hmmm, would it be simpler to do something like append
: a clause like this?
: BloggerId:12304^10 OR CoBloggerId:123404^5?

Definitely, but that won't guarantee you a strict ordering if there is a 
particularly good relevancy match.

There's a bunch of ways to go about something like this, but trying to use 
the map function is definitely overkill (even if it could work on 
multivalued fields)

this kind of thing is particularly easy with the sort by function feature 
added in 3.2 -- because any query can be used as a function ...

q=your_query&sort=query(BloggerId:12304)+desc,+query(CoBloggerId:123404)+desc,+score+desc


-Hoss
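
A small sketch of assembling that request string in code, so the spaces and commas in the sort clause are encoded consistently. The sort expression is copied from the example above as written; depending on the Solr version, query() may need the nested query passed via local params or a `$param` reference, so treat the exact argument syntax as an assumption. The helper name is hypothetical.

```python
from urllib.parse import urlencode, parse_qs


def sorted_query_params(user_query, blogger_id, co_blogger_id):
    """Sort by blogger match, then co-blogger match, then relevancy score."""
    sort = ", ".join([
        "query(BloggerId:%s) desc" % blogger_id,
        "query(CoBloggerId:%s) desc" % co_blogger_id,
        "score desc",
    ])
    # urlencode turns spaces into '+' and escapes the commas and colons.
    return urlencode({"q": user_query, "sort": sort})


print(sorted_query_params("your_query", 12304, 123404))
```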

Re: Adding Query Filter custom implementation to Solr's pipeline

2011-09-11 Thread Chris Hostetter


: When I was using Lucene directly I used a custom implementation of query 
: filter to enforce entitlements of search results. Now, that I'm 
: switching my infrastructure from custom host to Solr, what is the best 
: way to configure Solr to use my custom query filter for every request?

It depends on how complex your custom Filter was.  

Many people find that when using Solr, they can reimplement 
basic Filter logic using fq params and the built-in QParsers provided by 
Solr.  

If you do need to implement something truly custom, writing it as your 
own QParser to trigger via an fq can be advantageous so it can be cached 
and re-used by many queries.

If that doesn't cut it for you, some people implement their own 
SearchComponents to manipulate the Queries.

And as a last resort: you can always implement your own RequestHandler and 
directly use the SolrIndexSearcher to execute the query any way you want -- 
but if you don't use the DocList/DocSet methods, other built-in features 
like faceting won't be very easy to use.

If you provide some more details on how your existing Filter works, people 
can provide more advice on what would make the most sense.

-Hoss
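
A minimal sketch of the simplest option above, an entitlement check expressed as an fq param. It assumes each document carries a field listing the groups allowed to see it; the field name `acl_group` and the helper are hypothetical, and a real entitlement model may well need the QParser route instead.

```python
from urllib.parse import urlencode, parse_qs


def entitlement_params(user_query, groups, acl_field="acl_group"):
    """Add an fq that restricts results to documents visible to the user's groups."""
    if not groups:
        raise ValueError("user has no groups; refusing to build an open query")
    # Quote each group and escape backslashes and quotes inside it.
    quoted = ['"%s"' % g.replace("\\", "\\\\").replace('"', '\\"') for g in groups]
    fq = "%s:(%s)" % (acl_field, " OR ".join(quoted))
    return urlencode([("q", user_query), ("fq", fq)])


print(entitlement_params("quarterly report", ["sales", "emea"]))
```

Because the fq is separate from q, the same group filter can be served from the filter cache across many different user queries.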
