Re: Reindex Solr Using Tomcat

2010-11-18 Thread Ken Stanley
On Thu, Nov 18, 2010 at 3:42 PM, Eric Martin  wrote:
> Ah, I am using an ApacheSolr module in Drupal and used nutch to insert the 
> data into the Solr index. When I was using Jetty I could just delete the data
> contents in sshd and then restart the service, forcing the reindex.
>
> Currently, the ApacheSolr module for Drupal allows for a 200 record re-index 
> every cron run, but that is too slow for me. During implementation and testing
> I would prefer to re-index the entire database as I have over 400k records.
>
> I appreciate your help. My mind was searching for a command on the CLI that 
> would just tell solr to reindex the entire database and be done with it.
>

Eric,

From what I could find, this looks to be your best bet:
http://drupal.org/node/267543.

- Ken


Re: Reindex Solr Using Tomcat

2010-11-18 Thread Ken Stanley
On Thu, Nov 18, 2010 at 3:33 PM, Eric Martin  wrote:
> Hi,
>
>
>
> I searched google and the wiki to find out how I can force a full re-index
> of all of my content and I came up with zilch. My goal is to be able to
> adjust the weight settings, re-index  my entire database and then search my
> site and view the results of my weight adjustments.
>
>
>
> I am using Tomcat 5.x and Solr 1.4.1. Weird how I couldn't find this info. I
> must have missed it. Anyone know where to find it?
>
>
>
> Eric
>

Eric,

How you loaded your data into Solr determines which method you will want
to use to re-index. You can either use the UpdateHandler by POSTing an
XML file [1], or you
can use the DataImportHandler (DIH) [2]. There exist other means, but
these two should be sufficient to get started. How did you import your
initial index in the first place?
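
If you go the XML route, the update message format is simple; a minimal
sketch (the field names are illustrative):

    <add>
        <doc>
            <field name="id">1</field>
            <field name="title">An example document</field>
        </doc>
    </add>

POST that to your /update handler, then send <commit/> to make the new
documents visible to searchers.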

[1] http://wiki.apache.org/solr/UpdateXmlMessages
[2] http://wiki.apache.org/solr/DataImportHandler


Re: WordDelimiterFilterFactory + CamelCase query

2010-11-18 Thread Ken Stanley
On Thu, Nov 18, 2010 at 3:22 PM, Peter Karich  wrote:
>
>> Hi,
>>
>> Please add preserveOriginal="1"  to your WDF [1] definition and reindex
>> (or
>> just try with the analysis page).
>
> but it is already there!?
>
> <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1"
> catenateAll="0" preserveOriginal="1"/>
>
>
> Regards,
> Peter.
>

Peter,

I recently had this issue, and I had to set splitOnCaseChange="0" to
keep the word delimiter filter from doing what you describe. Can you
try that and see if it helps?
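
For reference, the full definition with that change might look like this
(other attributes kept from your snippet):

    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateAll="0" preserveOriginal="1"
        splitOnCaseChange="0"/>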

- Ken


Re: ranged and boolean query

2010-11-17 Thread Ken Stanley
On Wed, Nov 17, 2010 at 11:00 AM, Peter Blokland  wrote:
> hi,
>
> On Wed, Nov 17, 2010 at 10:54:48AM -0500, Ken Stanley wrote:
>
>> > pubdate:([* TO NOW] OR (NOT *))
>
>> Instead of using NOT, try simply prefixing the field name with a minus
>> sign. This tells SOLR to exclude the field. Otherwise, the word NOT
>> would be treated as a term, and would be applied against your default
>> field (which may or may not affect your results). So instead of
>> (pubdate:[* TO NOW]) OR ( NOT pubdate:*), you would write (pubdate:[*
>> TO NOW]) OR ( -pubdate:*).
>
> tried that, it gives me exactly the same result... I can't really
> figure out what's going on.
>
> --
> CUL8R, Peter.
>
> www.desk.nl --- Sent from my NetBSD-powered Talkie Toaster™
>

If you append your URL with debugQuery=on, it will tell you how SOLR
parsed your query. What's your schema look like? And what does the
debug query look like?
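
If it helps while you debug: the Lucene query parser does not match a
purely negative clause inside a boolean group, so the exclusion usually
needs a match-all in front of it. A sketch of that form (untested against
your schema):

    (pubdate:[* TO NOW]) OR (*:* -pubdate:*)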


Re: ranged and boolean query

2010-11-17 Thread Ken Stanley
On Wed, Nov 17, 2010 at 10:39 AM, Peter Blokland  wrote:
> hi.
>
> i'm using solr and am trying to limit my resultset to documents
> that either have a publication date in the range * to now, or
> have no publication date set at all (field is not present).
> however, using this :
>
> (pubdate:[* TO NOW]) OR ( NOT pubdate:*)
>
> gives me only the documents in the range * to now (reversing the
> two clauses has no effect). using only
>
> NOT pubdate:*
>
> gives me the correct set of documents (those not having a pubdate).
> any reason the OR does not work in this case ?
>
> ps: also tried it like this :
>
> pubdate:([* TO NOW] OR (NOT *))
>
> which gives the same result.
>
>
> --
> CUL8R, Peter.
>
> www.desk.nl --- Sent from my NetBSD-powered Talkie Toaster™
>

Peter,

Instead of using NOT, try simply prefixing the field name with a minus
sign. This tells SOLR to exclude the field. Otherwise, the word NOT
would be treated as a term, and would be applied against your default
field (which may or may not affect your results). So instead of
(pubdate:[* TO NOW]) OR ( NOT pubdate:*), you would write (pubdate:[*
TO NOW]) OR ( -pubdate:*).

- Ken


Re: How do I format this query with 2 search terms?

2010-11-17 Thread Ken Stanley
2010/11/17 Jón Helgi Jónsson :
> I'm using index time boosting and need to specify every field I want
> to search (not use copy fields) or else the boosting wont work.
>
> This query with one search term works fine; boosts look good:
>
> http://localhost:8983/solr/select/?
> q=companyName:foo
> +descriptionTxt:verslun
> &fl=*%20score&rows=10&start=0
>
> However if I have 2 words in the query and do it like this boosting
> seems not to be working
>
> http://localhost:8983/solr/select/?
> q=companyName:foo+bar
> +descriptionTxt:foo+bar
> &fl=*%20score&rows=10&start=0
>
> Its probably using the default search field for the second word which
> has no boosting configured. How do I go about this?
>
> Thanks,
> Jon
>

Jon,

You have a few options here, depending on what you want to achieve
with your query:

1. If you're trying to do a phrase query, you simply need to ensure
that your phrases are quoted. The default behavior in SOLR is to split
the phrase into multiple chunks. If a word is not preceded with a
field definition, then SOLR will automatically apply the word(s) as if
you had specified the default field. So for your example, SOLR would
parse your query into companyName:foo defaultField:bar
descriptionTxt:foo defaultField:bar.
2. You can use the dismax query plugin instead of the standard query
plugin. You simply configure the dismax section of your solrconfig.xml
to your liking - you define which fields to search, apply any special
boosts for your needs, etc
(http://wiki.apache.org/solr/DisMaxQParserPlugin) - and then you
simply feed the query terms without naming your fields (i.e.,
q=foo+bar), along with telling SOLR to use dismax (i.e.,
qt=whatever_you_named_your_dismax_handler).
3. If phrase queries are not important to you, you can manually prefix
each term in your query with the field you wish to search; for
example, you would do companyName:foo companyName:bar
descriptionTxt:foo descriptionTxt:bar.
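
For illustration, the three options might look like this (URL-encoding
omitted; the dismax handler name is whatever you configured):

    1. q=companyName:"foo bar" OR descriptionTxt:"foo bar"
    2. q=foo bar&qt=dismax
    3. q=companyName:foo companyName:bar descriptionTxt:foo descriptionTxt:bar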

Whichever way you decide to go, the best thing that you can do to
understand SOLR and how it's working in your environment is to append
debugQuery=on to the end of your URL; this tells SOLR to output
information about how it parsed your query, how long each component
took to run, and some other useful debugging information. It's very
useful, and it has come in handy several times here at work when I
wanted to know why SOLR returned (or didn't return) the results that I
expected.

I hope this helps.

- Ken


Re: DIH for multilingual index & multiValued field?

2010-11-13 Thread Ken Stanley
On Sat, Nov 13, 2010 at 5:59 PM, Ken Stanley  wrote:
>   CREATE TABLE documents (
>       id INT NOT NULL AUTO_INCREMENT,
>       language_code CHAR(2),
>       tags CHAR(30),
>       text TEXT,
>       PRIMARY KEY (id)
>   );

I apologize, but I couldn't leave the typo in my last post without a
follow up; it might cause confusion. I copied the OP's original table
definition and forgot to remove the tags field. My proposed definition
for the documents table should be:

  CREATE TABLE documents (
  id INT NOT NULL AUTO_INCREMENT,
  language_code CHAR(2),
  text TEXT,
  PRIMARY KEY (id)
  );

- Ken


Re: DIH for multilingual index & multiValued field?

2010-11-13 Thread Ken Stanley
On Sat, Nov 13, 2010 at 4:56 PM, Ahmet Arslan  wrote:
> For (1) you probably need to write a custom transformer. Something like:
> public Object transformRow(Map row) {
> String language_code = (String) row.get("language_code");
> String text = (String) row.get("text");
> if ("en".equals(language_code))
>       row.put("text_en", text);
> else if ("fr".equals(language_code))
>       row.put("text_fr", text);
>
> return row;
> }
>
>
> For (2), it doable with regex transformer.
>
> "
> The 'emailids' field in the table can be a comma separated value. So it ends 
> up giving out one or more than one email ids and we expect the 'mailId' to be 
> a multivalued field in Solr." [1]
>
> [1]http://wiki.apache.org/solr/DataImportHandler#RegexTransformer
>

In my opinion, I think that this is a bit of overkill. Since the DIH
supports multiple entities, with no real limit on the SQL queries, I
think that the easiest (and less involved) approach would be to create
three entities for the languages the OP wishes to index:
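
A minimal sketch of what those entities could look like (the dataSource
name and the exact SQL are assumptions based on the table definition
below):

    <document>
        <entity name="docs_en" dataSource="db"
            query="SELECT id, text AS text_en FROM documents WHERE language_code = 'en'" />
        <entity name="docs_fr" dataSource="db"
            query="SELECT id, text AS text_fr FROM documents WHERE language_code = 'fr'" />
        <entity name="docs_de" dataSource="db"
            query="SELECT id, text AS text_de FROM documents WHERE language_code = 'de'" />
    </document>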
But, I admit that depending on future growth of languages, as well as
other factors (i.e., needing more specific logic, etc), a programmatic
approach might be warranted.

I would recommend, however, that the database table be a little more
normalized. Your definition for tags is quite limiting, and could be
better served using a many-to-many relationship. Something like the
following might serve you well:

   CREATE TABLE documents (
   id INT NOT NULL AUTO_INCREMENT,
   language_code CHAR(2),
   tags CHAR(30),
   text TEXT,
   PRIMARY KEY (id)
   );

   CREATE TABLE document_tags (
   id INT NOT NULL AUTO_INCREMENT,
   tag CHAR(30),
   PRIMARY KEY (id)
   );

   CREATE TABLE document_tag_lookup (
   document_id INT NOT NULL,
   tag_id INT NOT NULL,
   PRIMARY KEY (document_id, tag_id)
   );

Then in the DIH, you simply nest a second entity to look up the zero
or more tags that might be associated with your documents; take the
"english" entity from above:
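
Continuing the sketch from above (again, the dataSource name and the
exact SQL are assumptions):

    <entity name="docs_en" dataSource="db"
        query="SELECT id, text AS text_en FROM documents WHERE language_code = 'en'">
        <entity name="tags" dataSource="db"
            query="SELECT dt.tag FROM document_tags dt
                   JOIN document_tag_lookup dtl ON dtl.tag_id = dt.id
                   WHERE dtl.document_id = '${docs_en.id}'" />
    </entity>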
This would allow for growth, and is easy to maintain. Additionally, if
you wanted to implement a custom transformer of your own, you could.
As an aside, a sort of compromise, you could also use the
ScriptTransformer [1] to create a Javascript function that can do your
language logic and create the necessary fields, and not have to worry
about maintaining any custom Java code.

[1] http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer

- Ken


Re: Best practice for emailing this list?

2010-11-10 Thread Ken Stanley
On Wed, Nov 10, 2010 at 1:11 PM, robo -  wrote:
> How do people email this list without getting spam filter problems?
>

Depends on which side of the spam filter you're referring to. I've found
that the way to keep these emails from entering my spam filter is to add
a rule to Gmail that says "Never send to spam". As for when I send
emails, I make sure that I send them as plain text to avoid getting
bounce backs.

- Ken


Re: scheduling imports and heartbeats

2010-11-10 Thread Ken Stanley
On Tue, Nov 9, 2010 at 10:16 PM, Tri Nguyen  wrote:
> Hi,
>
> Can I configure solr to schedule imports at a specified time (say once a day,
> once an hour, etc)?
>
> Also, does solr have some sort of heartbeat mechanism?
>
> Thanks,
>
> Tri

Tri,

If you use the DataImportHandler (DIH), you can set up a
dataimport.properties file that can be configured to import on
intervals.

http://wiki.apache.org/solr/DataImportHandler#dataimport.properties_example

As for "heartbeat", you can use the ping handler (default is
/admin/ping) to check the status of the servlet.
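
For example, a GET to http://localhost:8983/solr/admin/ping (host and
port assumed) returns a small XML response whose status reads OK while
the servlet is healthy.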

- Ken


Re: spell check vs terms component

2010-11-09 Thread Ken Stanley
On Tue, Nov 9, 2010 at 1:02 PM, Shalin Shekhar Mangar
 wrote:
> On Tue, Nov 9, 2010 at 8:20 AM, bbarani  wrote:
>
>>
>> Hi,
>>
>> We are trying to implement auto suggest feature in our application.
>>
>> I would like to know the difference between terms vs spell check component.
>>
>> Both the handlers seems to display almost the same output, can anyone let
>> me
>> know the difference and also I would like to know when to go for spell
>> check
>> and when to go for terms component.
>>
>>
> SpellCheckComponent is designed to operate on whole words and not partial
> words so I don't know how well it will work for auto-suggest, if at all.
>
> As far as differences between SpellCheckComponent and Terms Component is
> concerned, TermsComponent is a straight prefix match whereas SCC takes edit
> distance into account. Also, SCC can deal with phrases composed of multiple
> words and also gives back a collated suggestion.
>
> --
> Regards,
> Shalin Shekhar Mangar.
>

An alternative to using the SpellCheckComponent and/or the
TermsComponent, would be the (Edge)NGrams filter. Basically, this
filter breaks words down into auto-suggest-friendly tokens (i.e.,
"Hello" => "H", "He", "Hel", "Hell", "Hello") that works great for
auto suggestion querying.

Here is an article from Lucid Imagination on using the ngram filter:
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
Here is the SOLR wiki entry for the filter:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
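
A sketch of an auto-suggest field type built around that filter (the
name and gram sizes are illustrative):

    <fieldType name="autosuggest" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>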

- Ken Stanley


Re: dynamically create unique key

2010-11-09 Thread Ken Stanley
On Tue, Nov 9, 2010 at 10:53 AM, Christopher Gross  wrote:
> Thanks Ken.
>
> I'm using a script with Java/SolrJ to copy documents from their original
> locations into the Solr Index.
>
> I wasn't sure if the copyField would help me, but from your answers it seems
> that I'll have to handle it on my own.  That's fine -- it is definitely not
> hard to pass a new field myself.  I was just thinking that there should be
> an "easy" way to have Solr build the unique field, since it was getting
> everything anyway.
>
> I was just confused as to why I was getting a multiValued error, since I was
> just trying to append to a field.  I wasn't sure if I was missing something.
>
> Thanks again!
>
> -- Chris
>

Chris,

I definitely understand your sentiment. The thing to keep in mind with
SOLR is that it really has limited logic mechanisms; in fact, unless
you're willing to use the DataImportHandler (dih) and the
ScriptTransformer, you really have no logic.

The copyField directive in schema.xml is mainly used to help you
easily copy the contents of one field into another so that it may be
indexed in multiple ways; for example, you can index a string so that
it is stored literally (i.e., "Hello World"), parsed using a
whitespace tokenizer (i.e., "Hello", "World"), parsed for an nGram
tokenizer (i.e., "H", "He", "Hel"... ). This is beneficial to you
because you wouldn't have to explicitly define each possible instance
in your data stream. You just define the field once, and SOLR is smart
enough to copy it where it needs to go.
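
As a sketch (field names are illustrative):

    <field name="title" type="string" indexed="true" stored="true"/>
    <field name="title_tokens" type="text" indexed="true" stored="false"/>
    <copyField source="title" dest="title_tokens"/>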

Glad to have helped. :)

- Ken


Re: dynamically create unique key

2010-11-09 Thread Ken Stanley
On Tue, Nov 9, 2010 at 10:39 AM, Christopher Gross  wrote:
> I'm trying to use Solr to store information from a few different sources in
> one large index.  I need to create a unique key for the Solr index that will
> be unique per document.  If I have 3 systems, and they all have a document
> with id=1, then I need to create a "uniqueId" field in my schema that
> contains both the system name and that id, along the lines of: "sysa1",
> "sysb1", and "sysc1".  That way, each document will have a unique id.
>
> I added this to my schema.xml:
>
>  
>  
>
>
> However, after trying to insert, I got this:
> java.lang.Exception: ERROR: multiple values encountered for non multiValued
> copy field uniqueId: sysa
>
> So instead of just appending to the uniqueId field, it tried to do a
> multiValued.  Does anyone have an idea on how I can make this work?
>
> Thanks!
>
> -- Chris
>

Chris,

Depending on how you insert your documents into SOLR will determine
how to create your unique field. If you are POST'ing the data via
HTTP, then you would be responsible for building your unique id (i.e.,
your program/language would use string concatenation to add the unique
id to the output before it gets to the update handler in SOLR). If
you're using the DataImportHandler, then you can use the
TemplateTransformer
(http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer) to
dynamically build your unique id at document insertion time.

For example, we here at bizjournals use SOLR and the DataImportHandler
to index our documents. Like you, we run the risk of two or more ids
clashing, and thus overwriting a different type of document. As such,
we take two or three different fields and combine them together using
the TemplateTransformer to generate a more unique id for each document
we index.
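
A sketch of that kind of dih.xml entry (the entity, column, and prefix
are illustrative):

    <entity name="docs" transformer="TemplateTransformer"
        query="SELECT id, title FROM documents">
        <field column="uniqueId" template="sysa${docs.id}" />
    </entity>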

With respect to the multiValued option, that is used more for an
array-like structure within a field. For example, if you have a blog
entry with multiple tag keywords, you would probably want a field in
SOLR that can contain the various tag keywords for each blog entry;
this is where multiValued comes in handy.

I hope that this helps to clarify things for you.

- Ken Stanley


Re: Fixed value in dataimporthandler

2010-11-08 Thread Ken Stanley
On Mon, Nov 8, 2010 at 3:50 PM, Renato Wesenauer
 wrote:
>
> Hi Ahmet Arslan,
>
> I'm using this in schema.xml:
> <field name="indativo" type="boolean" indexed="true" stored="true"/>
> 
>
> I'm using this in dataimporthandler:
> 
> 
>
> The indexing process work correctly, but it's happening something wrong with
> the results of queries.
>
> All queries with some field with 2 words or more, plus the field
> "indativo:true", it isn't returning any result.
>
> Example of queries:
>
> 1º) secao:"accessories for cars" AND indativo:true
> 2º) secao:"accessories for cars" AND indativo:false
>
> The first query returns 0 results, but there are 40.000 documents indexed
> with these fields.
> The second query returns 300.000 documents, but 300.000 is the total of
> documents for the query secao:"celular e telefonia"; the correct count
> would be 260.000.
>
> Another example:
> 1º) secao:"toys" AND indativo:true
> 2º) secao:"toys" AND indativo:false
>
> In this example, the two queries work correctly.
>
> The problem happens with values with 2 words or more, plus the "indativo"
> field.
>
> Do you know what can be happening?
>
> Thank you,
>
> Renato F. Wesenauer
>

Renato,

Correct me if I'm wrong, but you have an entity that you explicitly
set to a false value for the "indativo" field. And when you query, is
your intention to find the fields that were not indexed through that
entity? The way that I am reading your question is that you are
expecting the indativo field to be true by default, but I do not see
where you're explicitly stating that in your schema. The reason that I
bring this up is - and I could be wrong - I would think that if you do
not set a value in SOLR, then it doesn't exist (either in the schema,
or during indexing). If you are expecting the other entries where
indativo was explicitly set to false to be true, you might need to
tweak your schema so that the field definition is by default "true".
Is it possible to try adding the default attribute to your field
definition and reindexing to see if that gives you what you're looking
for?
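
For example, something along these lines (the type is an assumption):

    <field name="indativo" type="boolean" indexed="true" stored="true" default="true"/>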

- Ken Stanley

PS. If this came through twice, I apologize; I got a bounce-back
saying my original reply was blocked, so I'm trying to re-send as
plain text.


Re: Tomcat special character problem

2010-11-07 Thread Ken Stanley
On Sun, Nov 7, 2010 at 9:34 AM, Em  wrote:

>
> Hi Ken,
>
> thank you for your quick answer!
>
> To make sure that there occurs no mistakes at my application's side, I send
> my requests with the form that is available at solr/admin/form.jsp
>
> I changed almost nothing from the example-configurations within the
> example-package except some auto-commit params.
>
> All the special-characters within the results were displayed correctly, and
> so far they were also indexed correctly.
> The only problem is querying with special-characters.
>
> I can confirm that the page is encoded in UTF-8 within my browser.
>
> Is there a possibility that Tomcat did not use the UTF-8 URIEncoding?
> Maybe I should say that Tomcat is behind an Apache HttpdServer and is
> mounted by a jk_mount.
>
> Thank you!
>
>
I am not familiar with using your type of set up, but a quick Google search
suggested using a second connector on a different port. If you're using
mod_jk, you can try setting "JkOptions +ForwardURICompatUnparsed" to see if
that helps. (
http://markstechstuff.blogspot.com/2008/02/utf-8-problem-between-apache-and-tomcat.html).
Sorry I couldn't have been more help. :)

- Ken


Re: Tomcat special character problem

2010-11-07 Thread Ken Stanley
On Sun, Nov 7, 2010 at 9:11 AM, Em  wrote:

>
> Hi List,
>
> I got an issue with my Solr-environment in Tomcat.
> First: I am not very familiar with Tomcat, so it might be my fault and not
> Solr's.
>
> It can not be a solr-side configuration problem, since everything worked
> fine with my local Jetty-servlet container.
>
> However, when I deploy into Tomcat, several special characters were shown
> in
> their utf-8 representation.
>
> Example:
> göteburg will be displayed as gÃ¶teburg when it comes
> to
> search.
>
> I tried the following within my server.xml-file
>
> <Connector port="8080" protocol="HTTP/1.1"
>   connectionTimeout="2"
>   redirectPort="8443"
>   URIEncoding="UTF-8" />
>
> And restarted Tomcat afterwards.
>
> The problem only occurs when I try to search for something.
> It is no problem to index that data.
>
> Thank you for any help!
>
> Regards,
> Em
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Tomcat-special-character-problem-tp1857648p1857648.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

That is definitely odd. When I tried copying "göteburg" and doing a manual
query in my web browser, everything worked. How are you making the request
to SOLR? When I viewed the properties/info of the results, my returned
charset was in UTF-8. Can you confirm similar for you?

When I grepped for "UTF-8" in both my SOLR and Tomcat configs, nothing stood
out as a special configuration option.


Re: querying multiple fields as one

2010-11-04 Thread Ken Stanley
On Thu, Nov 4, 2010 at 8:21 AM, Tommaso Teofili
wrote:

> Hi all,
> having two fields named 'type' and 'cat' with identical type and options,
> but different values recorded, would it be possible to query them as they
> were one field?
> For instance
>  q=type:electronics cat:electronics
> should return same results as
>  q=common:electronics
> I know I could make it defining a third field 'common' with copyFields from
> 'type' and 'cat' to 'common' but this wouldn't be feasible if you've
> already
> lots of documents in your index and don't want to reindex everything, isn't
> it?
> Any suggestions?
> Thanks in advance,
> Tommaso
>

Tommaso,

If re-indexing is not feasible/preferred, you might try looking into
creating a dismax handler that should give you what you're looking for in
your query: http://wiki.apache.org/solr/DisMaxQParserPlugin. The same
solrconfig.xml that comes with SOLR has a dismax parser that you can modify
to your needs.
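
A sketch of such a handler (the name is up to you):

    <requestHandler name="common" class="solr.SearchHandler">
        <lst name="defaults">
            <str name="defType">dismax</str>
            <str name="qf">type cat</str>
        </lst>
    </requestHandler>

You would then query it with something like q=electronics&qt=common.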

- Ken Stanley


Re: Highlighting and maxBooleanClauses limit

2010-11-02 Thread Ken Stanley
On Tue, Nov 2, 2010 at 11:26 AM, Koji Sekiguchi  wrote:

> (10/11/02 23:14), Ken Stanley wrote:
>
>> I've noticed in the stack trace that this exception occurs when trying to
>> build the query for the highlighting; I've confirmed this by copying the
>> params and changing hl=true to hl=false. Unfortunately, when using
>> debugQuery=on, I do not see any details on what is going on with the
>> highlighting portion of the query (after artificially increasing the
>> maxBooleanClauses so the query will run).
>>
>> With all of that said, my question(s) to the list are: Is there a way to
>> determine how exactly the highlighter is building its query (i.e., some
>> sort
>> of highlighting debug setting)?
>>
>
> Basically I think highlighter uses main query, but try to rewrite it
> before highlighting.
>
>
>  Is the behavior of highlighting in SOLR
>> intended to be held to the same restrictions (maxBooleanClauses) as the
>> query parser (even though the highlighting query is built internally)?
>>
>
> I think so because maxBooleanClauses is a static variable.
>
> I saw your stack trace and glance at highlighter source,
> my assumption is - highlighter tried to rewrite (expand) your
> range queries to boolean query, even if you set requireFieldMatch to true.
>
> Can you try to query without the range query? If the problem goes away,
> I think it is highlighter bug. Highlighter should skip the range query
> when user set requireFieldMatch to true, because your range query is for
> another field. If so, please open a jira issue.
>
> Koji
> --
> http://www.rondhuit.com/en/
>

Koji, that is most excellent. Thank you for pointing out that the range
queries were causing the highlighter to exceed the maxBooleanClauses. Once I
removed them from my main query (and moved them into separate filter
queries), SOLR and highlighting worked as I expected them to work.

Per your suggestion, I have opened a JIRA ticket (SOLR-2216) for this
problem. I am somewhat a novice at Java, and I have not yet had the pleasure
of getting the SOLR sources in my working environment, but I would be more
than eager to potentially assist in finding a solution - with maybe some
mentoring from a more experienced developer.

Anyway, thank you again, I am very excited to have a suitable work around
for the time being.

- Ken Stanley


Highlighting and maxBooleanClauses limit

2010-11-02 Thread Ken Stanley
With all of that said, my question(s) to the list are: Is there a way to
determine how exactly the highlighter is building its query (i.e., some sort
of highlighting debug setting)? Is the behavior of highlighting in SOLR
intended to be held to the same restrictions (maxBooleanClauses) as the
query parser (even though the highlighting query is built internally)?

I am not a SOLR expert by any measure of the word, and as such, I just don't
understand how two words on one field (as noted by the use of
hl.fl=df_text_content + hl.requireFieldMatch=true +
hl.usePhraseHighlighter=true) could somehow exceed the limits of both 1024
and 2048. I am concerned that even if I continue increasing
maxBooleanClauses, I am not actually solving anything; in fact, my concern
is that if I were to keep increasing this limit, I am in fact begging for
problems later on down the road.

For the sake of completeness, here are the definitions of the field I'm
highlighting on (schema.xml):

    [the field and fieldType definitions were stripped from the archived message]
And here is my highlighter definition (solrconfig.xml):

<highlighting>
 <fragmenter name="gap" default="true" class="org.apache.solr.highlight.GapFragmenter">
  <lst name="defaults">
   <int name="hl.fragsize">255</int>
  </lst>
 </fragmenter>
 <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
  <lst name="defaults">
   <int name="hl.fragsize">70</int>
   <float name="hl.regex.slop">0.5</float>
   <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
  </lst>
 </fragmenter>
</highlighting>

It is worth noting that I have not done anything (except formatting) to the
highlighting configuration in solrconfig.xml. Any help, assistance, and/or
guidance that can be provided would be greatly appreciated.

Thank you,

Ken Stanley

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, "The Hitchhikers Guide to the Galaxy"


Re: Phrase Query Problem?

2010-11-02 Thread Ken Stanley
On Tue, Nov 2, 2010 at 8:19 AM, Erick Erickson wrote:

> That's not the response I get when I try your query, so I suspect
> something's not quite right with your test...
>
> But you could also try putting parentheses around the words, like
> mykeywords:(Compliance+With+Conduct+Standards)
>
> Best
> Erick
>
>
I agree with Erick, your query string showed quotes, but your parsed query
did not. Using quotes, or parenthesis, would pretty much leave your query
alone. There is one exception that I've found: if you use a stopword
analyzer, any stop words would be converted to ? in the parsed query. So if
you absolutely need every single word to match, regardless, you cannot use a
field type that uses the stop word analyzer.

For example, I have two dynamic field definitions: df_text_* that does the
default text transformations (including stop words), and df_text_exact_*
that does nothing (field type is string). When I run the
query df_text_exact_company_name:"Bank of America" OR
df_text_company_name:"Bank of America", the following is shown as my
query/parsed query when debugQuery is on:


<str name="rawquerystring">df_text_exact_company_name:"Bank of America" OR df_text_company_name:"Bank of America"</str>

<str name="querystring">df_text_exact_company_name:"Bank of America" OR df_text_company_name:"Bank of America"</str>

<str name="parsedquery">df_text_exact_company_name:Bank of America PhraseQuery(df_text_company_name:"bank ? america")</str>

<str name="parsedquery_toString">df_text_exact_company_name:Bank of America df_text_company_name:"bank ? america"</str>


The difference is subtle, but important. If I were to do
df_text_company_name:"Bank and America", I would still match "Bank of
America". These are things that you should keep in mind when you are
creating fields for your indices.

A useful tool for seeing what SOLR does to your query terms is the Analysis
tool found in the admin panel. You can do an analysis on either a specific
field, or by a field type, and you will see a breakdown by Analyzer for
either the index, query, or both of any query that you put in. This would
definitely be useful when trying to determine why SOLR might return what it
does.

- Ken


Re: Phrase Query Problem?

2010-11-01 Thread Ken Stanley
On Mon, Nov 1, 2010 at 10:26 PM, Tod  wrote:

> I have a number of fields I need to do an exact match on.  I've defined
> them as 'string' in my schema.xml.  I've noticed that I get back query
> results that don't have all of the words I'm using to search with.
>
> For example:
>
>
> q=(((mykeywords:Compliance+With+Conduct+Standards)OR(mykeywords:All)OR(mykeywords:ALL)))&start=0&indent=true&wt=json
>
> Should, with an exact match, return only one entry but it returns five some
> of which don't have any of the fields I've specified.  I've tried this both
> with and without quotes.
>
> What could I be doing wrong?
>
>
> Thanks - Tod
>
>

Tod,

Without knowing your exact field definition, my first guess would be your
first boolean query; because it is not quoted, what SOLR typically does is
to transform that type of query into something like (assuming your default
search field is "id"): (mykeywords:Compliance id:With id:Conduct
id:Standards). If you do
(mykeywords:"Compliance+With+Conduct+Standards") you might see different
(better?) results. Otherwise, append &debugQuery=on to your URL and you can
see exactly how SOLR is parsing your query. If none of that helps, what is
your field definition in your schema.xml?

- Ken


Re: indexing '-

2010-10-31 Thread Ken Stanley
On Sun, Oct 31, 2010 at 12:12 PM, PeterKerk  wrote:

>
> I have a city named 's-Hertogenbosch
>
> I want it to be indexed exactly like that, so "'s-Hertogenbosch" (without
> "")
>
> But now I get:
> [output stripped in archiving: three entries, each with a value of 1]
>
> What filter should I add/remove from my field definition?
>
> I already tried a new fieldtype with just this, but no luck:
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>  <analyzer>
>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="false"/>
>  </analyzer>
> </fieldType>
>
> My schema.xml
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>  <analyzer>
>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="false"/>
>   <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_dutch.txt" />
>   <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="0" generateNumberParts="0" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
>   <filter class="solr.SnowballPorterFilterFactory" language="Dutch"
> protected="protwords.txt"/>
>  </analyzer>
> </fieldType>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/indexing-tp1816969p1816969.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

For exact text, you should try using either the string type, or a type that
only uses the KeywordTokenizer. Other field types may perform
transformations on the text similar to what you are seeing.
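
A sketch of such a field type:

    <fieldType name="text_exact" class="solr.TextField">
        <analyzer>
            <tokenizer class="solr.KeywordTokenizerFactory"/>
        </analyzer>
    </fieldType>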

- Ken


Re: Looking for Developers

2010-10-28 Thread Ken Stanley
On Thu, Oct 28, 2010 at 2:57 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> I don't think we should do this until it becomes a "real" problem.
>
> The number of job offers is tiny compared to dev emails, so far, as
> far as I can tell.
>
> Mike
>
>
By the time that it becomes a real problem, it would be too late to get
people to stop spamming the -user mailing list; no?

- Ken


Re: If I want to move a core from one physical machine to another....

2010-10-28 Thread Ken Stanley
On Thu, Oct 28, 2010 at 8:07 AM, Ephraim Ofir  wrote:

> How is this better than replication?
>
> Ephraim Ofir
>
>
It's not; for our needs here, we have not set up replication through SOLR.
We are working through OOM problems/performance tuning first, then "best
practices" second. I just wanted the OP to know that it can be done, and how
we do it. :)


Re: If I want to move a core from one physical machine to another....

2010-10-28 Thread Ken Stanley
On Wed, Oct 27, 2010 at 6:12 PM, Ron Mayer  wrote:

> If I want to move a core from one physical machine to another,
> is it as simple as just
>   scp -r core5 otherserver:/path/on/other/server/
> and then adding
> <core name="core5" instanceDir="core5" />
> on that other server's solr.xml file and restarting the server there?
>
>
>
> PS: Should have I been able to figure the answer to that
>out by RTFM somewhere?
>

Ron,

In our current environment I index all of our data on one machine, and to
save time with "replication", I use scp to copy the data directory over to
our other servers. On the server that I copy from, I don't turn SOLR off,
but on the servers that I copy to, I shutdown tomcat; remove the data
directory; mv the data directory I scp'd from the source; turn tomcat back
on. I do it this way (especially with mv, versus cp) because it is the
fastest way to get the data on the other servers. And, as Gora pointed out,
you need to make sure that your configuration files match (specifically the
schema.xml) the source.

- Ken


Re: ClassCastException Issue

2010-10-26 Thread Ken Stanley
On Mon, Oct 25, 2010 at 2:45 AM, Alex Matviychuk  wrote:

> Getting this when deploying to tomcat:
>
> [INFO][http-4443-exec-3][solr.schema.IndexSchema] readSchema():394
> Reading Solr Schema
> [INFO][http-4443-exec-3][solr.schema.IndexSchema] readSchema():408
> Schema name=tsadmin
> [ERROR][http-4443-exec-3][util.plugin.AbstractPluginLoader] log():139
> java.lang.ClassCastException: org.apache.solr.schema.StrField cannot
> be cast to org.apache.solr.schema.FieldType
>at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:419)
>at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:447)
>at
> org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:141)
>at
> org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:456)
>at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:95)
>at org.apache.solr.core.SolrCore.<init>(SolrCore.java:520)
>at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
>
>
> solr schema:
>
> <schema name="tsadmin" version="1.2">
>  <types>
>   <fieldType name="string" class="solr.StrField"
> sortMissingLast="true" omitNorms="true"/>
>   ...
>  </types>
>  <fields>
>   ...
>  </fields>
> </schema>
>
>
> Any ideas?
>
> Thanks,
> Alex Matviychuk
>


Alex,

I've run into this issue myself, and it was because I tried to create a
fieldType called string (like you). Rename "string" to something else and
the exception should go away.

- Ken


Re: How do I this in Solr?

2010-10-26 Thread Ken Stanley
On Tue, Oct 26, 2010 at 9:15 AM, Savvas-Andreas Moysidis <
savvas.andreas.moysi...@googlemail.com> wrote:

> If I get your question right, you probably want to use the AND binary
> operator as in "samsung AND andriod AND GPS" or "+samsung +andriod +GPS"
>
>
N.b. For these queries you can also pass the q.op parameter in the request
to temporarily change the default operator to AND; this has the same effect
without having to build the query; i.e., you can just pass
http://host:port/solr/select?q=samsung+android+gps&q.op=AND
as the query string (along with any other params you need).


Re: DataImporter using pure solr XML

2010-10-25 Thread Ken Stanley
On Mon, Oct 25, 2010 at 10:12 AM, Dario Rigolin
wrote:

> Looking at DataImporter I'm not sure if it's possible to import using a
> standard <add>...</add> xml document representing a document add operation.
> Generating <add><doc> is quite expensive in my application and I have
> cached
> all those documents into a text column into MySQL database.
> It will be easier for me to "push" all updated documents directly from
> Database instead passing via multiple xml files posted in "stream" mode to
> Solr.
>
> Thank you.
>
> Dario.
>


Dario,

Technically nothing is stopping you from using the DIH to import your XML
document(s). However, note that the <add><doc> structure is not
required. In fact, you can make up your own structure for the documents, so
long as you configure the DIH to recognize them. At minimum, you should be
able to use something to the effect of:
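
A minimal sketch (the paths, names, and forEach/xpath values are
illustrative; adjust them to your document structure):

    <dataConfig>
        <dataSource name="files" type="FileDataSource" encoding="UTF-8" />
        <document>
            <entity name="xml_files" rootEntity="false" dataSource="null"
                processor="FileListEntityProcessor"
                fileName=".*\.xml$" baseDir="/path/to/xml/files">
                <entity name="docs" dataSource="files"
                    processor="XPathEntityProcessor"
                    url="${xml_files.fileAbsolutePath}"
                    forEach="/documents/document" stream="true">
                    <field column="id" xpath="/documents/document/id" />
                    <field column="title" xpath="/documents/document/title" />
                </entity>
            </entity>
        </document>
    </dataConfig>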
The break down is as follows:

The <dataSource> defines the document encoding that SOLR should use for
your XML files.

The top-level <entity> creates the list of files to parse (hence why the
fileName attribute supports regex expressions). The dataSource attribute
needs to be set null here (I'm using 1.4.1, and AFAIK this is the same as
1.3 as well). The rootEntity="false" is important to tell SOLR that it
should not try to define fields from this entity.

The second-level <entity> is where the documents found in the file list
are processed and parsed. The dataSource attribute needs to be the name of
the top-level <dataSource>. The url attribute is defined as the absolute path
to the file generated by the top-level entity. The forEach is the key
component here; this is the minimum xPath needed to iterate over your
document structure. So, if by example you had:



<documents>
 <document>
  <id>data</id>
  <title>more data</title>
  ...
 </document>
</documents>



Also note that, in my experience, case sensitivity matters when parsing your
xpath instructions.

I hope this helps!

- Ken Stanley


Re: xpath processing

2010-10-23 Thread Ken Stanley
On Fri, Oct 22, 2010 at 11:52 PM,  wrote:

>
>
> <dataConfig>
> <dataSource type="FileDataSource" encoding="UTF-8" />
> <document>
> <entity name="f"
> processor="FileListEntityProcessor" fileName=".*xml" recursive="true"
> baseDir="C:\data\sample_records\mods\starr">
> <entity name="mods" processor="XPathEntityProcessor"
> url="${f.fileAbsolutePath}" stream="false" forEach="/mods"
> transformer="DateFormatTransformer,RegexTransformer,TemplateTransformer">
> <field column="..." xpath="..." />
> ...
> </entity>
> </entity>
> </document>
> </dataConfig>


The documentation says you don't need a dataSource for your
XPathEntityProcessor entity; in my configuration, I have mine set to the
name of the top-level FileListEntityProcessor. Everything else looks fine.
Can you provide one record from your data? Also, are you getting any errors
in your log?

- Ken


Re: xpath processing

2010-10-22 Thread Ken Stanley
Parinita,

In its simplest form, what does your entity definition for DIH look like;
also, what does one record from your xml look like? We need more information
before we can really be of any help. :)

- Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, "The Hitchhikers Guide to the Galaxy"


On Fri, Oct 22, 2010 at 8:00 PM,  wrote:

> Quoting pghorp...@ucla.edu:
> Can someone help me please?
>
>
>> I am trying to import mods xml data in solr using  the xml/http datasource
>>
>> This does not work with XPathEntityProcessor of the data import handler
>> xpath="/mods/name/namePart[@type = 'date']"
>>
>> I actually have 143 records with type attribute as 'date' for element
>> namePart.
>>
>> Thank you
>> Parinita
>>
>>
>
>


Re: Documents and Cores, take 2

2010-10-19 Thread Ken Stanley
Ron,

In the past I've worked with SOLR for a product that required the ability to
search - separately - for companies, people, business lists, and a
combination of the previous three. In designing this in SOLR, I found that
using a combination of explicit field definitions and dynamic fields (
http://wiki.apache.org/solr/SchemaXml#Dynamic_fields) gave me the best
possible solution for the problem.

In essence, I created explicit fields that would be shared among all
document "types": a unique id, a document type, an indexed date, a modified
date, and maybe a couple of other fields that share traits with all document
types (i.e., name, a "market" specific to our business, etc). The unique id
was built as a string, and was prefixed with the document type, and it ended
with the unique id from the database.

The dynamic fields can be configured to be as flexible as you need, and in
my experience I would strongly recommend documenting each type of dynamic
field for each of your document types as a reference for your developers
(and yourself). :)
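
As a sketch of that kind of schema (the dynamic field names are
illustrative):

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="docType" type="string" indexed="true" stored="true" />
    <dynamicField name="df_*" type="text" indexed="true" stored="true" />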

This allows us to build queries that can be focused on specific document
types, or combining all of the types into a "super" search. For example, you
could something to the effect of: (docType: people) AND (df_firstName:John
AND df_lastName:Hancock), (docType:companies) AND
(df_BusinessName:Acme+Inc), or even ((df_firstName:John AND
df_lastName:Hancock) OR (df_BusinessName:Acme+Inc)).

I hope this helps!

- Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, "The Hitchhikers Guide to the Galaxy"


On Tue, Oct 19, 2010 at 4:57 PM, Olson, Ron  wrote:

> Hi all-
>
> I have a newbie design question about documents, especially with SQL
> databases. I am trying to set up Solr to go against a database that, for
> example, has "items" and "people". The way I see it, and I don't know if
> this is right or not (thus the question), is that I see both as separate
> documents as an item may contain a list of parts, which the user may want to
> search, and, as part of the "item", view the list of people who have ordered
> the item.
>
> Then there's the actual "people", who the user might want to search to find
> a name and, consequently, what items they ordered. To me they are both "top
> level" things, with some overlap of fields. If I'm searching for "people",
> I'm likely not going to be interested in the parts of the item, while if I'm
> searching for "items" the likelihood is that I may want to search for
> "42532" which is, in this instance, a SKU, and not get hits on the zip code
> section of the "people".
>
> Does it make sense, then, to separate these two out as separate documents?
> I believe so because the documentation I've read suggests that a document
> should be analogous to a row in a table (in this case, very de-normalized).
> What is tripping me up is, as far as I can tell, you can have only one
> document type per index, and thus one document per core. So in this example,
> I have two cores, "items" and "people". Is this correct? Should I embrace
> the idea of having many cores or am I supposed to have a single, unified
> index with all documents (which doesn't seem like Solr supports).
>
> The ultimate question comes down to the search interface. I don't
> necessarily want to have the user explicitly state which document they want
> to search; I'd like them to simply type "42532" and get documents from both
> cores, and then possibly allow for filtering results after the fact, not
> before. As I've only used the admin site so far (which is core-specific),
> does the client API allow for unified searching across all cores? Assuming
> it does, I'd think my idea of multiple-documents is okay, but I'd love to
> hear from people who actually know what they're doing. :)
>
> Thanks,
>
> Ron
>
> BTW: Sorry about the problem with the previous message; I didn't know about
> thread hijacking.
>
>


Re: **SPAM** Re: boosting injection

2010-10-19 Thread Ken Stanley
Andrea,

Another approach, aside of Markus' suggestion, would be to create your own
handler that could intercept the query and perform whatever necessary
transformations that you need at query time. However, that would require
having Java knowledge (which I make no assumption).

Regards,

Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, "The Hitchhikers Guide to the Galaxy"


On Tue, Oct 19, 2010 at 10:23 AM, Andrea Gazzarini <
andrea.gazzar...@atcult.it> wrote:

>  Hi Ken,
> thanks for your response...unfortunately it doesn't solve my problem.
>
> I cannot change the client behaviour, so the query must be a query and not
> only the query terms.
> In this scenario, it would be great, for example, if I could declare the
> boost in the schema field definition... but I think it's not possible,
> isn't it?
>
> Regards
> Andrea
>
> --
> *From:* Ken Stanley [mailto:doh...@gmail.com]
> *To:* solr-user@lucene.apache.org
> *Sent:* Tue, 19 Oct 2010 15:05:31 +0200
> *Subject:* **SPAM** Re: boosting injection
>
> Andrea,
>
> Using the SOLR dismax query handler, you could set up queries like this to
> boost on fields of your choice. Basically, the q parameter would be the
> query terms (without the field definitions, and a qf (Query Fields)
> parameter that you use to define your boost(s):
> http://wiki.apache.org/solr/DisMaxQParserPlugin. A non-SOLR alternative
> would be to parse the query in whatever application is sending the queries
> to the SOLR instance to make the necessary transformations.
>
> Regards,
>
> Ken
>
> It looked like something resembling white marble, which was
> probably what it was: something resembling white marble.
> -- Douglas Adams, "The Hitchhikers Guide to the Galaxy"
>
>
> On Tue, Oct 19, 2010 at 8:48 AM, Andrea Gazzarini <
> andrea.gazzar...@atcult.it> wrote:
>
> > Hi all,
> > I have a client that is sending this query
> >
> > q=title:history AND author:joyce
> >
> > is it possible to "transform" at runtime this query in this way:
> >
> > q=title:history^10 AND author:joyce^5
> >
> > ?
> >
> > Best regards,
> > Andrea
> >
> >
> >
>
>


Re: boosting injection

2010-10-19 Thread Ken Stanley
Andrea,

Using the SOLR dismax query handler, you could set up queries like this to
boost on fields of your choice. Basically, the q parameter would be the
query terms (without the field definitions, and a qf (Query Fields)
parameter that you use to define your boost(s):
http://wiki.apache.org/solr/DisMaxQParserPlugin. A non-SOLR alternative
would be to parse the query in whatever application is sending the queries
to the SOLR instance to make the necessary transformations.
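
For your example, the dismax setup might look like this (the handler name
is illustrative):

    <requestHandler name="boosted" class="solr.SearchHandler">
        <lst name="defaults">
            <str name="defType">dismax</str>
            <str name="qf">title^10 author^5</str>
        </lst>
    </requestHandler>

and the client would send q=history joyce&qt=boosted instead of the
fielded query.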

Regards,

Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, "The Hitchhikers Guide to the Galaxy"


On Tue, Oct 19, 2010 at 8:48 AM, Andrea Gazzarini <
andrea.gazzar...@atcult.it> wrote:

>  Hi all,
> I have a client that is sending this query
>
> q=title:history AND author:joyce
>
> is it possible to "transform" at runtime this query in this way:
>
> q=title:history^10 AND author:joyce^5
>
> ?
>
> Best regards,
> Andrea
>
>
>


Re: SOLR DateTime and SortableLongField field type problems

2010-10-18 Thread Ken Stanley
On Mon, Oct 18, 2010 at 7:52 AM, Michael Sokolov wrote:

> I think if you look closely you'll find the date quoted in the Exception
> report doesn't match any of the declared formats in the schema.  I would
> suggest, as a first step, hunting through your data to see where that date
> is coming from.
>
> -Mike
>
>
[Note: RE-sending this because apparently, in my sleepy stupor, I clicked
the wrong Reply button and never sent this to the list (It's a Monday) :)]

I've noticed that date anomaly as well, and I've discovered that is one of
the gotchas of DIH: it seems to modify my date to that format. All of the
dates in the data are in the correct "yyyy-MM-dd'T'hh:mm:ss'Z'" format. Once
it is run through dateTimeFormat, I assume it is converted into a date
object; trying to use that date object in any other form (i.e., using
template, or even another dateTimeFormat) results in the exception I've
described (displaying the date in the incorrect format).

Thanks,

Ken Stanley


Re: SOLR DateTime and SortableLongField field type problems

2010-10-18 Thread Ken Stanley
Just following up to see if anybody might have some words of wisdom on the
issue?

Thank you,

Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, "The Hitchhikers Guide to the Galaxy"


On Fri, Oct 15, 2010 at 6:42 PM, Ken Stanley  wrote:

> Hello all,
>
> I am using SOLR-1.4.1 with the DataImportHandler, and I am trying to follow
> the advice from
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg11887.html about
> converting date fields to SortableLong fields for better memory
> efficiency. However, whenever I try to do this using the DateFormater, I get
> exceptions when indexing for every row that tries to create my sortable
> fields.
>
> In my schema.xml, I have the following definitions for the fieldType and
> dynamicField:
>
> <fieldType name="slong" class="solr.SortableLongField"
> stored="false" sortMissingLast="true" omitNorms="true" />
> <dynamicField name="sort_*" type="slong" />
>
> In my dih.xml, I have the following definitions:
>
> <dataConfig>
> <document>
> <entity
> name="xml_stories"
> rootEntity="false"
> dataSource="null"
> processor="FileListEntityProcessor"
> fileName="legacy_stories.*\.xml$"
> recursive="false"
> baseDir="/usr/local/extracts"
> newerThan="${dataimporter.xml_stories.last_index_time}"
> >
> <entity
> name="stories"
> pk="id"
> dataSource="xml_stories"
> processor="XPathEntityProcessor"
> url="${xml_stories.fileAbsolutePath}"
> forEach="/RECORDS/RECORD"
> stream="true"
>
> transformer="DateFormatTransformer,HTMLStripTransformer,RegexTransformer,TemplateTransformer"
> onError="continue"
> >
> <field column="_modified_date" xpath="/RECORDS/RECORD/PROP[@name='R_ModifiedTime']/PVAL" />
> <field column="modified_date" sourceColName="_modified_date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
>
> <field column="_df_date_published" xpath="/RECORDS/RECORD/PROP[@name='R_StoryDate']/PVAL" />
> <field column="df_date_published" sourceColName="_df_date_published" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'"
> />
>
> <field column="sort_modified_date" sourceColName="modified_date" dateTimeFormat="yyyyMMddhhmmss" />
> <field column="sort_df_date_published" sourceColName="df_date_published" dateTimeFormat="yyyyMMddhhmmss" />
> </entity>
> </entity>
> </document>
> </dataConfig>
>
> The fields in question are in the formats:
>
> <RECORDS>
> <RECORD>
> <PROP NAME="R_StoryDate">
> <PVAL>2001-12-04T00:00:00Z</PVAL>
> </PROP>
> <PROP NAME="R_ModifiedTime">
> <PVAL>2001-12-04T19:38:01Z</PVAL>
> </PROP>
> </RECORD>
> </RECORDS>
>
> The exception that I am receiving is:
>
> Oct 15, 2010 6:23:24 PM
> org.apache.solr.handler.dataimport.DateFormatTransformer transformRow
> WARNING: Could not parse a Date field
> java.text.ParseException: Unparseable date: "Wed Nov 28 21:39:05 EST 2007"
> at java.text.DateFormat.parse(DateFormat.java:337)
> at
> org.apache.solr.handler.dataimport.DateFormatTransformer.process(DateFormatTransformer.java:89)
> at
> org.apache.solr.handler.dataimport.DateFormatTransformer.transformRow(DateFormatTransformer.java:69)
> at
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.applyTransformer(EntityProcessorWrapper.java:195)
> at
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:241)
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
> at
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
> at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
> at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
> at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
> at
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
>
> I know that it has to be the SortableLong fields, because if I remove just
> those two lines from my dih.xml, everything imports as I expect it to. Am I
> doing something wrong? Mis-using the SortableLong and/or DateTransformer? Is
> this not supported in my version of SOLR? I'm not very experienced with
> Java, so digging into the code would be a lost cause for me right now. I was
> hoping that somebody here might be able to help point me in the
> right/correct direction.
>
> It should be noted that the modified_date and df_date_published fields
> index just fine (so long as I do it as I've defined above).
>
> Thank you,
>
> - Ken
>
> It looked like something resembling white marble, which was
> probably what it was: something resembling white marble.
> -- Douglas Adams, "The Hitchhikers Guide to the Galaxy"
>


SOLR DateTime and SortableLongField field type problems

2010-10-15 Thread Ken Stanley
Hello all,

I am using SOLR-1.4.1 with the DataImportHandler, and I am trying to follow
the advice from
http://www.mail-archive.com/solr-user@lucene.apache.org/msg11887.html about
converting date fields to SortableLong fields for better memory efficiency.
However, whenever I try to do this using the DateFormater, I get exceptions
when indexing for every row that tries to create my sortable fields.

In my schema.xml, I have the following definitions for the fieldType and
dynamicField:

<fieldType name="slong" class="solr.SortableLongField"
    stored="false" sortMissingLast="true" omitNorms="true" />
<dynamicField name="sort_*" type="slong" />
In my dih.xml, I have the following definitions:

<dataConfig>
<document>
<entity name="xml_stories" rootEntity="false" dataSource="null"
    processor="FileListEntityProcessor" fileName="legacy_stories.*\.xml$"
    recursive="false" baseDir="/usr/local/extracts"
    newerThan="${dataimporter.xml_stories.last_index_time}">
<entity name="stories" pk="id" dataSource="xml_stories"
    processor="XPathEntityProcessor" url="${xml_stories.fileAbsolutePath}"
    forEach="/RECORDS/RECORD" stream="true"
    transformer="DateFormatTransformer,HTMLStripTransformer,RegexTransformer,TemplateTransformer"
    onError="continue">
<field column="_modified_date" xpath="/RECORDS/RECORD/PROP[@name='R_ModifiedTime']/PVAL" />
<field column="modified_date" sourceColName="_modified_date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
<field column="_df_date_published" xpath="/RECORDS/RECORD/PROP[@name='R_StoryDate']/PVAL" />
<field column="df_date_published" sourceColName="_df_date_published" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
<field column="sort_modified_date" sourceColName="modified_date" dateTimeFormat="yyyyMMddhhmmss" />
<field column="sort_df_date_published" sourceColName="df_date_published" dateTimeFormat="yyyyMMddhhmmss" />
</entity>
</entity>
</document>
</dataConfig>
The fields in question are in the formats:

<RECORDS>
<RECORD>
<PROP NAME="R_StoryDate">
<PVAL>2001-12-04T00:00:00Z</PVAL>
</PROP>
<PROP NAME="R_ModifiedTime">
<PVAL>2001-12-04T19:38:01Z</PVAL>
</PROP>
</RECORD>
</RECORDS>

The exception that I am receiving is:

Oct 15, 2010 6:23:24 PM
org.apache.solr.handler.dataimport.DateFormatTransformer transformRow
WARNING: Could not parse a Date field
java.text.ParseException: Unparseable date: "Wed Nov 28 21:39:05 EST 2007"
at java.text.DateFormat.parse(DateFormat.java:337)
at
org.apache.solr.handler.dataimport.DateFormatTransformer.process(DateFormatTransformer.java:89)
at
org.apache.solr.handler.dataimport.DateFormatTransformer.transformRow(DateFormatTransformer.java:69)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.applyTransformer(EntityProcessorWrapper.java:195)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:241)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)

I know that it has to be the SortableLong fields, because if I remove just
those two lines from my dih.xml, everything imports as I expect it to. Am I
doing something wrong? Mis-using the SortableLong and/or DateTransformer? Is
this not supported in my version of SOLR? I'm not very experienced with
Java, so digging into the code would be a lost cause for me right now. I was
hoping that somebody here might be able to help point me in the
right/correct direction.

It should be noted that the modified_date and df_date_published fields index
just fine (so long as I do it as I've defined above).

Thank you,

- Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, "The Hitchhikers Guide to the Galaxy"


Re: problem on running fullimport

2010-10-15 Thread Ken Stanley
On Fri, Oct 15, 2010 at 7:42 AM, swapnil dubey wrote:

> Hi,
>
> I am using the full import option with the data-config file as mentioned
> below
>
> <dataConfig>
> <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
>    url="jdbc:mysql:///xxx" user="xxx" password="xx"  />
> <document>
> <entity name="test1" query="select studentName from test1">
> <field column="studentName" />
> </entity>
> </document>
> </dataConfig>
>
>
> on running the full-import option I am getting the error mentioned below.I
> had already included the dataimport.properties file in my conf file.help me
> to get the issue resolved
>
> 0
> 334
> data-config.xml
> full-import
> debug
> select studentName from test1
> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
> execute query: select studentName from test1 Processing Document # 1
> ...
>
> --
> Regards
> Swapnil Dubey
>

Swapnil,

Everything looks fine, except that in your entity definition you forgot to
define which dataSource you wish to use. So if you add
'dataSource="JdbcDataSource"' to your entity (and give your <dataSource>
element that name), that should get rid of your exception. As a reminder,
the DataImportHandler wiki (
http://wiki.apache.org/solr/DataImportHandler) on Apache's website is very
helpful for learning how to use the DIH properly; I keep a printed copy
beside me for easy and quick reference.
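
For example, a corrected data-config along those lines might be (the
entity name and MySQL driver are assumptions):

    <dataConfig>
        <dataSource name="JdbcDataSource" type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql:///xxx" user="xxx" password="xx" />
        <document>
            <entity name="students" dataSource="JdbcDataSource"
                query="select studentName from test1">
                <field column="studentName" />
            </entity>
        </document>
    </dataConfig>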

- Ken


Re: Searching Across Multiple Cores

2010-10-14 Thread Ken Stanley
Steve,

Using shards is actually quite simple; it's just a matter of setting up your
shards (via multiple cores, or multiple instances of SOLR) and then passing
the shards parameter in the query string. The shards parameter is a
comma-separated list of the servers/cores you wish to use together.

So, let's try this using a fictitious example. You have two cores: one
called main for your main metadata set, and one called favorites for your
user favorites metadata. You set up each schema accordingly, and you've indexed
your data. When you want to do a query on both sets of data you would build
your query appropriately, and then use the following URL (the host is
assumed to be localhost for simplicity):

http://localhost/solr/main/select?q=id:[*+TO+*]&shards=localhost/solr/main,localhost/solr/favorites&rows=100&start=0

I am personally investigating using this technique to tie together two cores
that utilize different schemas; one schema will contain news articles,
blogs, and similar types of data, while another schema will contain
company-specific information, such as addresses, etc. If you're still having
trouble after trying this, let me know and I'd be more than happy to share
any findings that I come across.

I hope that this helps to clear things up for you. :)

- Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, "The Hitchhikers Guide to the Galaxy"


On Thu, Oct 14, 2010 at 4:25 AM, Lohrenz, Steven
wrote:

> Ken,
>
> I have been through that page many times. I could use Distributed search
> for what? The first scenario or the second?
>
> The question is: can I merge a set of results from the two cores/shards and
> only return results that exist in both (determined by the resourceId, which
> exists on both)?
>
> Cheers,
> Steve
>
> -Original Message-
> From: Ken Stanley [mailto:doh...@gmail.com]
> Sent: 13 October 2010 20:08
> To: solr-user@lucene.apache.org
> Subject: Re: Searching Across Multiple Cores
>
> On Wed, Oct 13, 2010 at 2:11 PM, Lohrenz, Steven
> wrote:
>
> > Hi,
> >
> > I am trying to figure out how I can accomplish the following:
> >
> > I have a fairly static and large set of resources I need to have indexed
> > and searchable. Solr seems to be a perfect fit for that. In addition I
> need
> > to have the ability for my users to add resources from the main data set
> to
> > a 'Favourites' folder (which can include a few more tags added by them).
> The
> > Favourites needs to be searchable in the same manner as the main data
> set,
> > across all the same fields.
> >
> > My first thought was to have two separate schemas
> > - the first  for the main data set and its metadata
> > - the second for the Favourites folder with all of the metadata from the
> > main set copied over and then adding the additional fields.
> >
> > Then I thought that would probably waste quite a bit of space (the number
> > of users is much larger than the number of main resources).
> >
> > So then I thought I could have the main data set with its metadata. Then
> > there would be second one for the Favourites folder with the unique id
> from
> > the first and the additional fields it needs (userId, grade, folder,
> tag).
> > In addition, I would create another schema/core with all the fields from
> the
> > other two and have a request handler defined on it that searches across
> the
> > other 2 cores and returns the results through this core.
> >
> > This third core would have searches run against it where the results
> would
> > expect to only be returned for a single user. For example, a user
> searches
> > their Favourites folder for all the items with Foo. The result is only
> those
> > items the user has added to their Favourites with Foo somewhere in their
> > main data set metadata.
> >
> > Could this be made to work? What would the consequences be? Any
> alternative
> > suggestions?
> >
> > Thanks,
> > Steve
> >
> >
> Steve,
>
> From your description, it really sounds like you could reap the benefits of
> using Distributed Search in SOLR:
>
> http://wiki.apache.org/solr/DistributedSearch
>
> I hope that this helps.
>
> - Ken
>


Re: searching while importing

2010-10-13 Thread Ken Stanley
On Wed, Oct 13, 2010 at 6:38 PM, Shawn Heisey  wrote:

>  If you are using the DataImportHandler, you will not be able to search new
> data until the full-import or delta-import is complete and the update is
> committed.  When I do a full reindex, it takes about 5 hours, and until it
> is finished, I cannot search it.
>
This is not true; when I use the DIH to do a full-import, my team and I are
still able to search the already-indexed data.


> I have not tried to issue a manual commit in the middle of an import to see
> whether that makes data inserted up to that point searchable, but I would
> not expect that to work.
>
If you set the autoCommit properties maxDocs and maxTime to reasonable
values, then once those limits are reached, I suspect that SOLR would commit
and continue indexing; however, I have not had the chance to use those
features in solrconfig.xml.
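
A rough sketch of the relevant solrconfig.xml section, with purely
illustrative values:

<updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
        <maxDocs>10000</maxDocs> <!-- commit after 10,000 pending documents -->
        <maxTime>60000</maxTime> <!-- or after 60 seconds, whichever comes first -->
    </autoCommit>
</updateHandler>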


> If you need this kind of functionality, you may need to change your build
> system so that a full import clears the index manually and then does a
> series of delta-import batches.
>
The only time I've had an issue with being able to search while indexing is
when a misconfiguration in my DIH caused the import to finish without
indexing anything, thus wiping out my data. Aside from that, I continually
index and search at the same time almost every day (using 1.4.1).


>
>
> On 10/13/2010 3:51 PM, Tri Nguyen wrote:
>
>> Hi,
>>  Can I perform searches against the index while it is being imported?
>>  Does importing add 1 document at a time or will solr make a temporary
>> index and switch to that index when indexing is done?
>>  Thanks,
>>  Tri
>>
>
>


Re: Searching Across Multiple Cores

2010-10-13 Thread Ken Stanley
On Wed, Oct 13, 2010 at 2:11 PM, Lohrenz, Steven
wrote:

> Hi,
>
> I am trying to figure out how I can accomplish the following:
>
> I have a fairly static and large set of resources I need to have indexed
> and searchable. Solr seems to be a perfect fit for that. In addition I need
> to have the ability for my users to add resources from the main data set to
> a 'Favourites' folder (which can include a few more tags added by them). The
> Favourites needs to be searchable in the same manner as the main data set,
> across all the same fields.
>
> My first thought was to have two separate schemas
> - the first  for the main data set and its metadata
> - the second for the Favourites folder with all of the metadata from the
> main set copied over and then adding the additional fields.
>
> Then I thought that would probably waste quite a bit of space (the number
> of users is much larger than the number of main resources).
>
> So then I thought I could have the main data set with its metadata. Then
> there would be second one for the Favourites folder with the unique id from
> the first and the additional fields it needs (userId, grade, folder, tag).
> In addition, I would create another schema/core with all the fields from the
> other two and have a request handler defined on it that searches across the
> other 2 cores and returns the results through this core.
>
> This third core would have searches run against it where the results would
> expect to only be returned for a single user. For example, a user searches
> their Favourites folder for all the items with Foo. The result is only those
> items the user has added to their Favourites with Foo somewhere in their
> main data set metadata.
>
> Could this be made to work? What would the consequences be? Any alternative
> suggestions?
>
> Thanks,
> Steve
>
>
Steve,

From your description, it really sounds like you could reap the benefits of
using Distributed Search in SOLR:

http://wiki.apache.org/solr/DistributedSearch

I hope that this helps.

- Ken


Re: Solr PHP PECL Extension going to Stable Release - Wishing for Any New Features?

2010-10-12 Thread Ken Stanley
> > > If you are using Solr via PHP and would like to see any new features
> > > in the extension, please feel free to send me a note.

I'm new to this list, but having seen this thread - and as a user of the
PHP SOLR extension - I wanted to make a suggestion that, while minor, I
think would greatly improve the quality of the extension.

(I'm basing this mostly off of SolrQuery since that's where I've encountered
the issue, but this might be true elsewhere)

Whenever a method is supposed to return an array (e.g.,
SolrQuery::getFields(), SolrQuery::getFacets(), etc.), a null is returned
if there is no data. I think that this should be normalized across the
board to return an empty array. First, the documentation is contradictory
(http://us.php.net/manual/en/solrquery.getfields.php): the method signature
says that it returns an array (not mixed), while the Return Values section
says that it returns either an array or null. Second, returning an array
under all circumstances provides more consistency and requires less logic
in calling code; for example, here is what fetching the fields looks like
in the extension's current state:

<?php
if ($solrquery->getFields() !== null) {
    foreach ($solrquery->getFields() as $field) {
        // Do something
    }
}
?>
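
By contrast, if an empty array were always returned, the null check could
be dropped entirely; a sketch of what the calling code could then look like:

<?php
// Safe only if getFields() is guaranteed to return an array
foreach ($solrquery->getFields() as $field) {
    // Do something
}
?>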

This is a minor request, I know, but I feel that it would go a long way
toward polishing the extension up for general consumption.

Thank you,

Ken Stanley

PS. I apologize if this request has come through the pipes already; as I've
stated, I am new to this list, and I have yet to find any prior reference
to this request. :)