Re: How to make Relationships work for Multi-valued Index Fields?

2009-01-23 Thread Gunaranjan Chandraraju


I thought 1.3 supported dynamic fields in schema.xml?

Guna

On Jan 22, 2009, at 11:54 PM, Shalin Shekhar Mangar wrote:


Oops, one more gotcha. The dynamic field support is only in 1.4 trunk.

On Fri, Jan 23, 2009 at 1:24 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:


On Fri, Jan 23, 2009 at 1:08 PM, Gunaranjan Chandraraju <
chandrar...@apple.com> wrote:

[sample record XML stripped by the mail archive]

I have set up my DIH to treat these as entities as below:

<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <entity name="f"
            processor="FileListEntityProcessor"
            baseDir="***"
            fileName=".*xml"
            rootEntity="false"
            dataSource="null">
      <entity name="record"
              processor="XPathEntityProcessor"
              stream="false"
              forEach="/record"
              url="${f.fileAbsolutePath}">
        <!-- [field definitions for the record entity stripped by the archive] -->
        <entity name="record_adr"
                processor="XPathEntityProcessor"
                stream="false"
                forEach="/record/address"
                url="${f.fileAbsolutePath}">
          <field column="street" xpath="/record/address/@street" />
          <field column="state" xpath="/record/address//@state" />
          <field column="type" xpath="/record/address//@type" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>




I think the only way is to create a dynamic field for each attribute
(street, state, etc.). Write a transformer to copy the fields from your
data config to appropriately named dynamic fields (e.g. street_1,
state_1, etc.).

To maintain this counter you will need to get/store it with
Context#getSessionAttribute(name, Context.SCOPE_DOC) and
Context#setSessionAttribute(name, val, Context.SCOPE_DOC).
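
A minimal sketch of such a transformer (not Shalin's actual code; the
class name, the "adr.count" key, and the column names are illustrative):

import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class AddressFieldTransformer extends Transformer {
  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    // Read and bump a per-document counter kept in DOC scope, so each
    // address row of the current record gets its own suffix.
    Integer count = (Integer) context.getSessionAttribute("adr.count", Context.SCOPE_DOC);
    int n = (count == null) ? 1 : count.intValue() + 1;
    context.setSessionAttribute("adr.count", Integer.valueOf(n), Context.SCOPE_DOC);

    // Move each attribute into a numbered dynamic field (street_1, street_2, ...).
    for (String col : new String[] {"street", "state", "type"}) {
      Object val = row.remove(col);
      if (val != null) {
        row.put(col + "_" + n, val);
      }
    }
    return row;
  }
}

The transformer would be attached to the address entity via its
transformer attribute, with matching dynamic fields (street_*, state_*,
type_*) declared in schema.xml.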

I can't think of an easier way.
--
Regards,
Shalin Shekhar Mangar.





--
Regards,
Shalin Shekhar Mangar.




Re: How to make Relationships work for Multi-valued Index Fields?

2009-01-23 Thread Shalin Shekhar Mangar
Yes, Solr does. But the DataImportHandler in the 1.3 release does not
support them.

However, you can use the trunk data import handler jar with Solr 1.3 if you
do not feel comfortable using Solr 1.4 trunk.

On Fri, Jan 23, 2009 at 1:36 PM, Gunaranjan Chandraraju <
chandrar...@apple.com> wrote:

>
> I thought 1.3 supported dynamic fields in schema.xml?
>
> Guna
>
>
> On Jan 22, 2009, at 11:54 PM, Shalin Shekhar Mangar wrote:
>
>  Oops, one more gotcha. The dynamic field support is only in 1.4 trunk.
>>
>> On Fri, Jan 23, 2009 at 1:24 PM, Shalin Shekhar Mangar <
>> shalinman...@gmail.com> wrote:
>>
>>  On Fri, Jan 23, 2009 at 1:08 PM, Gunaranjan Chandraraju <
>>> chandrar...@apple.com> wrote:
>>>
>>>
 [sample record XML stripped by the mail archive]

 I have set up my DIH to treat these as entities as below:

 <dataConfig>
   <dataSource type="FileDataSource" />
   <document>
     <entity name="f"
             processor="FileListEntityProcessor"
             baseDir="***"
             fileName=".*xml"
             rootEntity="false"
             dataSource="null">
       <entity name="record"
               processor="XPathEntityProcessor"
               stream="false"
               forEach="/record"
               url="${f.fileAbsolutePath}">
         <!-- [field definitions for the record entity stripped by the archive] -->
         <entity name="record_adr"
                 processor="XPathEntityProcessor"
                 stream="false"
                 forEach="/record/address"
                 url="${f.fileAbsolutePath}">
           <field column="street" xpath="/record/address/@street" />
           <field column="state" xpath="/record/address//@state" />
           <field column="type" xpath="/record/address//@type" />
         </entity>
       </entity>
     </entity>
   </document>
 </dataConfig>

>>> I think the only way is to create a dynamic field for each attribute
>>> (street, state, etc.). Write a transformer to copy the fields from your
>>> data config to appropriately named dynamic fields (e.g. street_1,
>>> state_1, etc.).
>>> To maintain this counter you will need to get/store it with
>>> Context#getSessionAttribute(name, Context.SCOPE_DOC) and
>>> Context#setSessionAttribute(name, val, Context.SCOPE_DOC).
>>>
>>> I can't think of an easier way.
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
>>>
>>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>
>


-- 
Regards,
Shalin Shekhar Mangar.


stats.jsp - maxDoc and numDoc-help

2009-01-23 Thread S.Selvam Siva
Hi all,

I am new to Solr. I have posted nearly 10 lakh (one million) XML docs over
the last few months.

Now I want to find out the total number of duplicate posts until now.

Is stats.jsp's maxDocs minus numDocs the appropriate way to find out the
total number of duplicate posts so far?
Please guide me to the solution.
-- 
Yours,
S.Selvam


Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-23 Thread Jaco
Hi,

I applied the patch and did some more tests - also adding some LOG.info()
calls in delTree to see if it actually gets invoked (LOG.info("START:
delTree: "+dir.getName()); at the start of that method). I don't see any
entries of this showing up in the log file at all, so it looks like delTree
doesn't get invoked at all.

To be sure, explaining the issue to prevent misunderstanding:
- The number of files in the index directory on the slave keeps increasing
(in my very small test core, there are now 128 files in the slave's index
directory, and only 73 files in the master's index directory)
- The directories index.x are still there after replication, but they
are empty

Are there any other things I can do to check, or more info that I can
provide to help fix this?

Thanks, bye,

Jaco.


2009/1/22 Shalin Shekhar Mangar 

> On Fri, Jan 23, 2009 at 12:15 AM, Noble Paul നോബിള്‍ नोब्ळ् <
> noble.p...@gmail.com> wrote:
>
> > I have attached a patch which logs the names of the files which could
> > not get deleted (which may help us diagnose the problem). If you are
> > comfortable applying a patch you may try it out.
> >
>
> I've committed this patch to trunk.
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: URL-import field type?

2009-01-23 Thread Paul Libbrecht

Well,

the idea is that the solr engine indexes the contents of a web platform.

Each document is a user-side-URL out of which several fields would be  
fetched through various URL-get-documents (e.g. the full-text-view,  
e.g. the future openmath representation, e.g. the topics (URIs in an  
ontology), ...).


Would the alternative (and maybe equivalent) way be to stream all documents
into one XML document and let the XPath triage act on all the fields? That
would also work, and would take advantage of the XPathEntityProcessor's
nice configuration.


What bothers me with the HttpDataSource example is that, for now at
least, it is configured to pull a single URL, while what is needed (and
would provide delta ability) is really to index a list of URLs (for which
one would pull the list of recently updated URLs regularly, or simply use
GET-if-modified-since on all of them).


I didn't think that modifying the XPathEntityProcessor was the right
thing, since it seems to be based on a single stream.


Hints for alternatives eagerly welcome.

paul


On 23-Jan-09, at 05:45, Noble Paul നോബിള്‍ नोब्ळ् wrote:



where is this url coming from? what is the content type of the stream?
is it plain text or html?

if yes, this is a possible enhancement to DIH



On Fri, Jan 23, 2009 at 4:39 AM, Paul Libbrecht  
 wrote:


Hello list,

after searching around for quite a while, including in the
DataImportHandler documentation on the wiki (which looks amazing), I
couldn't find a way to indicate to Solr that the tokens of that field
should be the result of analyzing the tokens of the stream at URL-xxx.

I know I was able to imitate that in plain Lucene by crafting a
particular analyzer-filter which was only given the URL as content and
which then gave the tokens of the stream.

Is this the right way in solr?

thanks in advance.

paul




--
--Noble Paul






Re: URL-import field type?

2009-01-23 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Fri, Jan 23, 2009 at 2:28 PM, Paul Libbrecht  wrote:
> Well,
>
> the idea is that the solr engine indexes the contents of a web platform.
>
> Each document is a user-side-URL out of which several fields would be
> fetched through various URL-get-documents (e.g. the full-text-view, e.g. the
> future openmath representation, e.g. the topics (URIs in an ontology), ...).

if the responses of these URLs are well-formed XML, they can be
channeled to an XPathEntityProcessor (one per field) and processed.

if the response is not XML, then there is no EntityProcessor that can
consume it. We may need to add one.
>
> Would the alternate (and maybe equivalent) way to stream all documents into
> one XML document and let the XPath triage act through all fields? That would
> also work would take advantage of the XPathEntityProcessor's nice
> configuration.

>
> What bothers me with the HttpDataSource example is that, for now, at least,
> it is configured to pull a single URL while what is needed (and would
> provide delta ability) is really to index a list of URLs (for which one
> would pull regularly the list of recently update URLs or simply use
> GET-if-modified-since on all of them).
If-Modified-Since is not supported by HttpDataSource. However, you
can write a transformer which pings the URL with an If-Modified-Since
header and skips the document using the $skipDoc option.
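
A rough sketch of such a transformer (illustrative only: the "url" column
name and the one-day cutoff are assumptions, not part of the original
mail; it relies on the $skipDoc flag mentioned above):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class IfModifiedSinceTransformer extends Transformer {
  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    String url = (String) row.get("url"); // assumed column holding the document URL
    // Illustrative cutoff: treat anything fetched within the last day as current.
    long lastFetched = System.currentTimeMillis() - 24L * 60 * 60 * 1000;
    try {
      HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
      conn.setRequestMethod("HEAD");
      conn.setIfModifiedSince(lastFetched);
      if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
        row.put("$skipDoc", "true"); // tells DIH to drop this document
      }
      conn.disconnect();
    } catch (IOException e) {
      // on any error, fall through and index the document anyway
    }
    return row;
  }
}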
>
> I didn't think that modifying the XPathEntityProcessor was the right thing
> since it seems based on a single stream.
>
> Hints for altnernative eagerly welcome.
>
> paul
>
>
> On 23-Jan-09, at 05:45, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>
>> where is this url coming from? what is the content type of the stream?
>> is it plain text or html?
>>
>> if yes, this is a possible enhancement to DIH
>>
>>
>>
>> On Fri, Jan 23, 2009 at 4:39 AM, Paul Libbrecht 
>> wrote:
>>>
>>> Hello list,
>>>
>>> after searching around for quite a while, including in the
>>> DataImportHandler
>>> documentation on the wiki (which looks amazing), I couldn't find a way to
>>> indicate to solr that the tokens of that field should be the result of
>>> analyzing the tokens of the stream at URL-xxx.
>>>
>>> I know I was able to imitate that in plain-lucene by crafting a
>>> particular
>>> analyzer-filter who was only given the URL as content and who gave
>>> further
>>> the tokens of the stream.
>>>
>>> Is this the right way in solr?
>>>
>>> thanks in advance.
>>>
>>> paul
>>
>>
>>
>> --
>> --Noble Paul
>
>



-- 
--Noble Paul


Re: DIH XPathEntityProcessor fails with docs containing DOCTYPE

2009-01-23 Thread Fergus McMenemie
Seems to work fine on this morning's 23-Jan-2009 nightly.

Thanks very much.



>On Wed, Jan 21, 2009 at 6:05 PM, Fergus McMenemie  wrote:
>
>>
>> After looking looking at http://issues.apache.org/jira/browse/SOLR-964,
>> where
>> it seems this issue has been addressed, I had another go at indexing
>> documents
>> containing DOCTYPE. It failed as follows.
>>
>>
>That patch has not been committed to the trunk yet. I'll take it up.
>
>-- 
>Regards,
>Shalin Shekhar Mangar.

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: URL-import field type?

2009-01-23 Thread Paul Libbrecht


On 23-Jan-09, at 10:10, Noble Paul നോബിള്‍ नोब्ळ् wrote:

if the response is not XML, then there is no EntityProcessor that can
consume it. We may need to add one.


well, even binary data such as Word documents (base64-encoded for
example) run the risk of appearing here. They sure need a pile of
filters!


What bothers me with the HttpDataSource example is that, for now at
least, it is configured to pull a single URL, while what is needed (and
would provide delta ability) is really to index a list of URLs (for which
one would pull the list of recently updated URLs regularly, or simply use
GET-if-modified-since on all of them).

If-Modified-Since is not supported by HttpDataSource. However, you
can write a transformer which pings the URL with an If-Modified-Since
header and skips the document using the $skipDoc option.


I still don't understand how you give several documents to the  
HttpDataSource.

The configuration seems only to allow a single URL.
Am I missing something?

paul

PS: would it be worth chatting about that on irc.freenode.net#solr ?



QTime in microsecond

2009-01-23 Thread AHMET ARSLAN
Is there a way to get QTime in microseconds from Solr?

I have a small collection and my response time (QTime) is 0 or 1
milliseconds. I am running benchmark tests and I need more precise running
times for comparison.

Thanks for your help.


  


Re: What can be the reason for stopping solr work after some time?

2009-01-23 Thread an...@iguanait.com
Hi, thanks for your reply.

Sorry for the limited information I gave in my first post; I just didn't
know what to share.

Yes, the Java process is still working, but search on the site does not
work and I cannot see any HTTP requests in the logs at this time. I have
not tested the admin page; this is something that I should test. How can
I enable debug mode in Solr?

I'm sending you this private message only because I have unsubscribed
from the Solr mailing list; I need to subscribe again.

On Wed, 2009-01-21 at 22:00 -0800, Chris Hostetter wrote:
> : i'm newbie with solr. We have installed with together with ezfind from
> : EZ Publish web sites and it is working. But in one of the servers we
> : have this kind of problem. It works for example for 3 hours, and then in
> : one moment it stop to work, searching and indexing does not work.
> 
> it's pretty hard to make any sort of guess as to what your problem might 
> be without more information.  is your java process still running? does it 
> respond to any HTTP requests (ie: do the admin pages work?) what do the 
> logs say?
> 
> 
> -Hoss
> 



facet dates and distributed search

2009-01-23 Thread Marc Sturlese

Hey there, I would like to understand why distributed search doesn't support
date faceting. As I understand it, there would be problems because if the
time of the servers is not synchronized the results would not be exact,
but in the case where I wouldn't mind the results not being completely
exact, would it be possible to use facet dates with distributed search?
In case I am completely wrong with this explanation, can someone explain
the reason why it's not supported? If I understand it, maybe I could try
to write a patch...
Thanks in advance.
-- 
View this message in context: 
http://www.nabble.com/facet-dates-and-distributed-search-tp21621576p21621576.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Intermittent high response times

2009-01-23 Thread hbi dev
Hi wojtekpia,

That's interesting, I shall be looking into this over the weekend, so I
shall look at the GC also. I was briefly reading about GC last night; am I
right in thinking it could be affected by which version of the JVM I'm
using (1.5.0.8), and also by what type of collector is set? What collector
is the default, and what would people recommend for an application like Solr?
Thanks
Waseem

On Thu, Jan 22, 2009 at 5:24 PM, wojtekpia  wrote:

>
> I'm experiencing similar issues. Mine seem to be related to old generation
> garbage collection. Can you monitor your garbage collection activity? (I'm
> using JConsole to monitor it:
> http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html).
>
> In my system, garbage collection usually doesn't cause any trouble. But
> once
> in a while, the size of the old generation flat-lines for some time
> (~dozens
> of seconds). When this happens, I see really bad response times from Solr
> (not quite as bad as you're seeing, but almost). The old-gen flat-lines
> always seem to be right before, or right after the old-gen is garbage
> collected.
> --
> View this message in context:
> http://www.nabble.com/Intermittent-high-response-times-tp21602475p21608986.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Any advice for facet.prefix for suggestions

2009-01-23 Thread Erik Hatcher

Ian,

A new field is indeed needed and warranted for this case. Facets only
work off indexed terms, not stored values.


Erik

On Jan 22, 2009, at 11:48 PM, Ian Connor wrote:

The facet prefix method to get suggestions for search terms really helps.

However, it seems to show the indexed rather than the stored terms.

For instance, if you have a "word-with-hyphen", it will show the
"wordwithhyphen" as a suggestion in fields where I have asked it to strip
out these characters (this is a valid facet based on the index but
confusing to the user).

Here is an example:

http://pubget.com/search?suggest=true

type "kirschne" and wait for the suggestion of kirschnerwir based on an
index of kirschner-wire.

Is there a way to have it show the stored version of the words, or do I
need to mirror a field that does the indexing but without the filters?

I am hoping there might be something I am missing here and a new field is
not needed.

--
Regards,

Ian Connor




Re: URL-import field type?

2009-01-23 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Fri, Jan 23, 2009 at 2:55 PM, Paul Libbrecht  wrote:
>
> On 23-Jan-09, at 10:10, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>>
>> if the response is not XML, then there is no EntityProcessor that can
>> consume it. We may need to add one.
>
> well, even binary data such as Word documents (base64-encoded for example)
> run the risk of appearing here. They sure need a pile of filters!
>
>>> What bothers me with the HttpDataSource example is that, for now, at
>>> least,
>>> it is configured to pull a single URL while what is needed (and would
>>> provide delta ability) is really to index a list of URLs (for which one
>>> would pull regularly the list of recently updated URLs or simply use
>>> GET-if-modified-since on all of them).
>>
>> If-Modified-Since is not supported by HttpDataSource. However you
>> can write a transformer which pings the URL with an If-Modified-Since
>> header and skips the document using the $skipDoc option
>
> I still don't understand how you give several documents to the
> HttpDataSource.
> The configuration seems only to allow a single URL.
> Am I missing something?
The DataSource is like a helper class. The only intelligent piece here
is an EntityProcessor.
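
For reference, the DataSource contract is small. A minimal sketch of one
that just opens whatever URL the EntityProcessor hands it (assuming the
trunk DIH API, with error handling reduced to a RuntimeException):

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.util.Properties;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataSource;

public class SimpleUrlDataSource extends DataSource<Reader> {
  @Override
  public void init(Context context, Properties initProps) {
    // nothing to configure in this sketch
  }

  // The EntityProcessor decides which "query" (here, a URL) to fetch and
  // when; the DataSource just opens it.
  @Override
  public Reader getData(String query) {
    try {
      return new InputStreamReader(new URL(query).openStream(), "UTF-8");
    } catch (Exception e) {
      throw new RuntimeException("could not fetch " + query, e);
    }
  }

  @Override
  public void close() {
  }
}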
>
> paul
>
> PS: would it be worth chatting about that on irc.freenode.net#solr ?



-- 
--Noble Paul


Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-23 Thread Shalin Shekhar Mangar
On Fri, Jan 23, 2009 at 2:12 PM, Jaco  wrote:

> Hi,
>
> I applied the patch and did some more tests - also adding some LOG.info()
> calls in delTree to see if it actually gets invoked (LOG.info("START:
> delTree: "+dir.getName()); at the start of that method). I don't see any
> entries of this showing up in the log file at all, so it looks like delTree
> doesn't get invoked at all.
>
> To be sure, explaining the issue to prevent misunderstanding:
> - The number of files in the index directory on the slave keeps increasing
> (in my very small test core, there are now 128 files in the slave's index
> directory, and only 73 files in the master's index directory)
> - The directories index.x are still there after replication, but they
> are empty
>
> Are there any other things I can do to check, or more info that I can
> provide to help fix this?
>

The problem is that we do a commit on the slave after replication is
done, but the commit does not re-open the IndexWriter. Therefore, the deletion
policy does not take effect and older files are left as is. This can keep on
building up. The only solution is to re-open the index writer.

I think the attached patch can solve this problem. Can you try this and let
us know? Thank you for your patience.

-- 
Regards,
Shalin Shekhar Mangar.
Index: src/java/org/apache/solr/handler/SnapPuller.java
===
--- src/java/org/apache/solr/handler/SnapPuller.java	(revision 736746)
+++ src/java/org/apache/solr/handler/SnapPuller.java	Fri Jan 23 16:47:41 IST 2009
@@ -27,6 +27,7 @@
 import static org.apache.solr.handler.ReplicationHandler.*;
 import org.apache.solr.search.SolrIndexSearcher;
 import org.apache.solr.update.CommitUpdateCommand;
+import org.apache.solr.update.DirectUpdateHandler2;
 import org.apache.solr.util.RefCounted;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -281,14 +282,14 @@
 replicationStartTime = 0;
 return successfulInstall;
   } catch (ReplicationHandlerException e) {
-delTree(tmpIndexDir);
 LOG.error("User aborted Replication");
   } catch (SolrException e) {
-delTree(tmpIndexDir);
 throw e;
   } catch (Exception e) {
 delTree(tmpIndexDir);
 throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Snappull failed : ", e);
+  } finally {
+delTree(tmpIndexDir);
   }
   return successfulInstall;
 } finally {
@@ -349,7 +350,15 @@
 cmd.waitFlush = true;
 cmd.waitSearcher = true;
 solrCore.getUpdateHandler().commit(cmd);
+if (solrCore.getUpdateHandler() instanceof DirectUpdateHandler2) {
+  LOG.info("Re-opening index writer to make sure older index files get deleted");
+  DirectUpdateHandler2 handler = (DirectUpdateHandler2) solrCore.getUpdateHandler();
+  handler.reOpenWriter();
+} else  {
+  LOG.warn("The update handler is not an instance or sub-class of DirectUpdateHandler2. " +
+  "ReplicationHandler may not be able to cleanup un-used index files.");
-  }
+}
+  }
 
 
   /**
Index: src/java/org/apache/solr/update/DirectUpdateHandler2.java
===
--- src/java/org/apache/solr/update/DirectUpdateHandler2.java	(revision 736614)
+++ src/java/org/apache/solr/update/DirectUpdateHandler2.java	Fri Jan 23 16:23:36 IST 2009
@@ -187,7 +187,7 @@
 addCommands.incrementAndGet();
 addCommandsCumulative.incrementAndGet();
 int rc=-1;
-
+
 // if there is no ID field, use allowDups
 if( idField == null ) {
   cmd.allowDups = true;
@@ -259,7 +259,7 @@
 } finally {
   iwCommit.unlock();
 }
-
+
 if( tracker.timeUpperBound > 0 ) {
   tracker.scheduleCommitWithin( tracker.timeUpperBound );
 }
@@ -294,7 +294,7 @@
  deleteAll();
} else {
 openWriter();
-writer.deleteDocuments(q); 
+writer.deleteDocuments(q);
}
  } finally {
iwCommit.unlock();
@@ -313,8 +313,15 @@
 }
   }
 
+  public void reOpenWriter() throws IOException  {
+iwCommit.lock();
+try {
+  openWriter();
+} finally {
+  iwCommit.unlock();
+}
+  }
 
-
   public void commit(CommitUpdateCommand cmd) throws IOException {
 
 if (cmd.optimize) {
@@ -419,14 +426,14 @@
 tracker.pending.cancel( true );
 tracker.pending = null;
   }
-  tracker.scheduler.shutdown(); 
+  tracker.scheduler.shutdown();
   closeWriter();
 } finally {
   iwCommit.unlock();
 }
 log.info("closed " + this);
   }
-
+
   /** Helper class for tracking autoCommit state.
*
* Note: This is purely an implementation detail of autoCommit and will
@@ -435,8 +442,8 @@
*
* Note: all access must be synchronized.
*/
-  class CommitTracker implements Runnable 
-  {  
+  class CommitTracker implements Runnable
+  {
 // scheduler delay for maxDoc-trigger

Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-23 Thread Noble Paul നോബിള്‍ नोब्ळ्
I tested with the patch; it has solved both the issues.

On Fri, Jan 23, 2009 at 5:00 PM, Shalin Shekhar Mangar
 wrote:
>
>
> On Fri, Jan 23, 2009 at 2:12 PM, Jaco  wrote:
>>
>> Hi,
>>
>> I applied the patch and did some more tests - also adding some LOG.info()
>> calls in delTree to see if it actually gets invoked (LOG.info("START:
>> delTree: "+dir.getName()); at the start of that method). I don't see any
>> entries of this showing up in the log file at all, so it looks like
>> delTree
>> doesn't get invoked at all.
>>
>> To be sure, explaining the issue to prevent misunderstanding:
>> - The number of files in the index directory on the slave keeps increasing
>> (in my very small test core, there are now 128 files in the slave's index
>> directory, and only 73 files in the master's index directory)
>> - The directories index.x are still there after replication, but they
>> are empty
>>
>> Are there any other things I can do to check, or more info that I can
>> provide to help fix this?
>
>> The problem is that we do a commit on the slave after replication is
>> done, but the commit does not re-open the IndexWriter. Therefore, the deletion
>> policy does not take effect and older files are left as is. This can keep on
>> building up. The only solution is to re-open the index writer.
>
> I think the attached patch can solve this problem. Can you try this and let
> us know? Thank you for your patience.
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
--Noble Paul


Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-23 Thread Noble Paul നോബിള്‍ नोब्ळ्
I have opened an issue to track this
https://issues.apache.org/jira/browse/SOLR-978

On Fri, Jan 23, 2009 at 5:22 PM, Noble Paul നോബിള്‍  नोब्ळ्
 wrote:
> I tested with the patch
> it has solved both the issues
>
> On Fri, Jan 23, 2009 at 5:00 PM, Shalin Shekhar Mangar
>  wrote:
>>
>>
>> On Fri, Jan 23, 2009 at 2:12 PM, Jaco  wrote:
>>>
>>> Hi,
>>>
>>> I applied the patch and did some more tests - also adding some LOG.info()
>>> calls in delTree to see if it actually gets invoked (LOG.info("START:
>>> delTree: "+dir.getName()); at the start of that method). I don't see any
>>> entries of this showing up in the log file at all, so it looks like
>>> delTree
>>> doesn't get invoked at all.
>>>
>>> To be sure, explaining the issue to prevent misunderstanding:
>>> - The number of files in the index directory on the slave keeps increasing
>>> (in my very small test core, there are now 128 files in the slave's index
>>> directory, and only 73 files in the master's index directory)
>>> - The directories index.x are still there after replication, but they
>>> are empty
>>>
>>> Are there any other things I can do to check, or more info that I can
>>> provide to help fix this?
>>
>> The problem is that we do a commit on the slave after replication is
>> done, but the commit does not re-open the IndexWriter. Therefore, the
>> deletion policy does not take effect and older files are left as is. This can
>> keep on building up. The only solution is to re-open the index writer.
>>
>> I think the attached patch can solve this problem. Can you try this and let
>> us know? Thank you for your patience.
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>
>
>
> --
> --Noble Paul
>



-- 
--Noble Paul


Maximum size of document indexed

2009-01-23 Thread Gargate, Siddharth
Hi,
I am trying to index a 25 MB Word document. I am not able to search all
the keywords. Looks like only a certain number of initial words are
getting indexed.
Is there any limit to the size of document getting indexed? Or is there
any word count limit per field?
 
Thanks,
Siddharth


Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-23 Thread Jaco
Hi,

I have tested this as well, looking fine! Both issues are indeed fixed, and
the index directory of the slaves gets cleaned up nicely. I will apply the
changes to all systems I've got running and report back in this thread in
case any issues are found.

Thanks for the very fast help! I usually need much, much more patience with
commercial software vendors...

Cheers,

Jaco.


2009/1/23 Noble Paul നോബിള്‍ नोब्ळ् 

> I have opened an issue to track this
> https://issues.apache.org/jira/browse/SOLR-978
>
> On Fri, Jan 23, 2009 at 5:22 PM, Noble Paul നോബിള്‍  नोब्ळ्
>  wrote:
> > I tested with the patch
> > it has solved both the issues
> >
> > On Fri, Jan 23, 2009 at 5:00 PM, Shalin Shekhar Mangar
> >  wrote:
> >>
> >>
> >> On Fri, Jan 23, 2009 at 2:12 PM, Jaco  wrote:
> >>>
> >>> Hi,
> >>>
> >>> I applied the patch and did some more tests - also adding some
> LOG.info()
> >>> calls in delTree to see if it actually gets invoked (LOG.info("START:
> >>> delTree: "+dir.getName()); at the start of that method). I don't see
> any
> >>> entries of this showing up in the log file at all, so it looks like
> >>> delTree
> >>> doesn't get invoked at all.
> >>>
> >>> To be sure, explaining the issue to prevent misunderstanding:
> >>> - The number of files in the index directory on the slave keeps
> increasing
> >>> (in my very small test core, there are now 128 files in the slave's
> index
> >>> directory, and only 73 files in the master's index directory)
> >>> - The directories index.x are still there after replication, but
> they
> >>> are empty
> >>>
> >>> Are there any other things I can do to check, or more info that I can
> >>> provide to help fix this?
> >>
> >> The problem is that we do a commit on the slave after replication is
> >> done, but the commit does not re-open the IndexWriter. Therefore, the
> >> deletion policy does not take effect and older files are left as is.
> >> This can keep on building up. The only solution is to re-open the
> >> index writer.
> >>
> >> I think the attached patch can solve this problem. Can you try this and
> let
> >> us know? Thank you for your patience.
> >>
> >> --
> >> Regards,
> >> Shalin Shekhar Mangar.
> >>
> >
> >
> >
> > --
> > --Noble Paul
> >
>
>
>
> --
> --Noble Paul
>


search/query issue. sorting, match exact, match first etc

2009-01-23 Thread Julian Davchev
Hi,
I am trying to utilize Solr in an autocomplete thingy.

Let's assume I query for 'foo'.
Assuming we work with case insensitive here.

I would like to have records returned in a specific order: first all that
have an exact match, then all that start with Foo in alphabetical order,
then all that contain the exact word (but not necessarily first), and
lastly all matches where foo is anywhere within words.
Any pointers are more than welcome. I am trying to find something in the
archives as well but no luck so far.

Example response when searching 'foo' or 'Foo':

Foo
Foo AAA
Foo BBB
Gooo Foo
Moo Foo
xxxfoox
Boo Foos


Re: Maximum size of document indexed

2009-01-23 Thread Erick Erickson
Try:
http://wiki.apache.org/solr/SolrConfigXml?highlight=(maxfieldlength)

(The <maxFieldLength> setting in solrconfig.xml caps the number of tokens
indexed per field; it defaults to 10,000, which would explain why only the
initial words of a large document are searchable.)

Best
Erick

On Fri, Jan 23, 2009 at 7:29 AM, Gargate, Siddharth wrote:

> Hi,
> I am trying to index a 25 MB Word document. I am not able to search all
> the keywords. Looks like only a certain number of initial words are
> getting indexed.
> Is there any limit to the size of document getting indexed? Or is there
> any word count limit per field?
>
> Thanks,
> Siddharth
>


Re: Master failover - seeking comments

2009-01-23 Thread edre...@ha

Thanks for the response. Let me clarify things a bit.

Regarding the Slaves:
Our project is a web application. It is our desire to embed Solr into the
web application. The web applications are configured with a local embedded
Solr instance configured as a slave, and a remote Solr instance configured
as a master.

We have a requirement for real-time updates to the Solr indexes.  Our
strategy is to use the local embedded Solr instance as a read-only
repository.  Any time a write is made, we will send it to the remote Master. 
Once a user pushes a write operation to the remote Master, all subsequent
read operations for this user now are made against the Master for the
duration of the session.  This approximates "realtime" updates and seems to
work for our purposes.  Writes to our system are a small percentage of Read
operations.

Now, back to the original question. We're simply looking for a failover
solution if the Master server goes down. Oh, and we are using the
replication scripts to sync the servers.



> It seems like you are trying to write to Solr directly from your front end
> application. This is why you are thinking of multiple masters. I'll let
> others comment on how easy/hard/correct the solution would be. 
> 

Well, yes.  We have business requirements that want updates to Solr to be
realtime, or as close to that as possible, so when a user changes something,
our strategy was to save it to the DB and push it to the Solr Master as
well.  Although, we will have a background application that will help ensure
that Solr is in sync with the DB for times that Solr is down and the DB is
not.



> But, do you really need to have live writes? Can they be channeled through
> a
> background process? Since you anyway cannot do a commit per-write, the
> advantage of live writes is minimal. Moreover you would need to invest a
> lot
> of time in handling availability concerns to avoid losing updates. If you
> log/record the write requests to an intermediate store (or queue), you can
> do with one master (with another host on standby acting as a slave).
> 

We do need to have live writes, as I mentioned above.  The concern you
mention about losing live writes is exactly why we are looking at a Master
Solr server failover strategy.  We thought about having a backup Solr server
that is a Slave to the Master and could be easily reconfigured as a new
Master in a pinch.  Our operations team has pushed us to come up with a
solution that would be more seamless.  This is why we came up with a
Master/Master solution where both Masters are also slaves to each other.



>>
>> To test this, I ran the following scenario.
>>
>> 1) Slave 1 (S1) is configured to use M2 as it's master.
>> 2) We push an update to M2.
>> 3) We restart S1, now pointing to M1.
>> 4) We wait for M1 to sync from M2
>> 5) We then sync S1 to M1.
>> 6) Success!
>>
> 
> How do you co-ordinate all this?
> 

This was just a test scenario I ran manually to see if the setup I described
above would even work.  

Is there a Wiki page that outlines typical web application Solr deployment
strategies?  There are a lot of questions on the forum about this type of
thing (including this one).  For those who have expertise in this area, I'm
sure there are many who could benefit from this (hint hint).

As before, any comments or suggestions on the above would be much
appreciated.

Thanks,
Erik
-- 
View this message in context: 
http://www.nabble.com/Master-failover---seeking-comments-tp21614750p21625324.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: how can solr search angainst group of field

2009-01-23 Thread Marc Sturlese

I think you could use dismax and restrict the result with a filter query.
Supposing you're using the dismax query parser, it should look like:
http://localhost:8080/solr/select?q=whatever&fq=category:3
(or fq=categoryID:(1 OR 2 OR 3) to allow several categories).
I think this would sort out your case.


surfer10 wrote:
> 
> definitely disMax does the thing by searching one term against multiple
> fields, but what if my index contains two additional multivalued fields,
> like category id?
> 
> i need to search against terms in particular fields of documents, and
> dismax does this well thru "qf=field1,field2".
> how can i filter results which have only "1" or "2" or "3" in the
> categoryID field?
> 
> could you please help me to figure this out?
> 
> update: i've found a discussion about that on
> http://www.nabble.com/using-dismax-with-additional-query--td18178512.html#a18178512
> there is a suggestion to use filterquery. i'll check it out
> 

-- 
View this message in context: 
http://www.nabble.com/how-can-solr-search-angainst-group-of-field-tp21557783p21625476.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: I get SEVERE: Lock obtain timed out

2009-01-23 Thread Jerome L Quinn


Julian Davchev  wrote on 01/20/2009 10:07:48 AM:

> Julian Davchev 
> 01/20/2009 10:07 AM
>
> I get SEVERE: Lock obtain timed out
>
> Hi,
> Any documents or something I can read on how locks work and how I can
> controll it. When do they occur etc.
> Cause only way I got out of this mess was restarting tomcat
>
> SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
> timed out: SingleInstanceLock: write.lock


I've seen this with my customized setup.  Before I saw the write.lock
messages, I had an OutOfMemoryError, but the container didn't shut down.
After that Solr spewed write lock messages and I had to restart.

So, you might want to search backwards in your logs and see if you can find
when the write lock problems began and whether there is some identifiable
problem preceding them.

Jerry Quinn



Fwd: [Travel Assistance] Applications for ApacheCon EU 2009 - Now Open

2009-01-23 Thread Erik Hatcher



Begin forwarded message:


From: Tony Stevenson 
Date: January 23, 2009 8:28:19 AM EST
To: travel-assista...@apache.org
Subject: [Travel Assistance] Applications for ApacheCon EU 2009 - Now Open




The Travel Assistance Committee is now accepting applications for those
wanting to attend ApacheCon EU 2009 between the 23rd and 27th March 2009
in Amsterdam.

The Travel Assistance Committee is looking for people who would like to
be able to attend ApacheCon EU 2009 but who need some financial support
in order to get there. There are very few places available and the
criteria are high; that aside, applications are open to all open source
developers who feel that their attendance would benefit themselves, their
project(s), the ASF or open source in general.

Financial assistance is available for travel, accommodation and entrance
fees, either in full or in part, depending on circumstances. It is
intended that all our ApacheCon events are covered, so it may be prudent
for those in the United States or Asia to wait until an event closer to
them comes up - you are all welcome to apply for ApacheCon EU of course,
but there must be compelling reasons for you to attend an event further
away than your home location for your application to be considered above
those closer to the event location.

More information can be found on the main Apache website at
http://www.apache.org/travel/index.html - where you will also find a
link to the online application form.

Time is very tight for this event, so applications are open now and will
end on the 4th February 2009 - to give enough time for travel
arrangements to be made.

Good luck to all those that apply.


Regards,
The Travel Assistance Committee
--




--
Tony Stevenson
t...@pc-tony.com  //  pct...@apache.org  // pct...@freenode.net
http://blog.pc-tony.com/

1024D/51047D66 ECAF DC55 C608 5E82 0B5E  3359 C9C7 924E 5104 7D66
--




Method toMultiMap(NamedList params) in SolrParams

2009-01-23 Thread Hana

Hi,

I'm getting confused about the method Map<String,String[]>
toMultiMap(NamedList params) in the SolrParams class.
When one of your parameters is an instanceof String[], it is converted to a
String using the toString() method, which seems wrong to me. It is
probably assuming that the values in the NamedList are all String, but
when you look at the method toNamedList() it is clearly adding a String[]
in the case where the parameter has more than one value.
So my question is whether it is a bug or I'm getting something wrong.


  public static Map<String,String[]> toMultiMap(NamedList params) {
    HashMap<String,String[]> map = new HashMap<String,String[]>();
    for (int i=0; i<params.size(); i++) {
      String name = params.getName(i);
      map.put(name, new String[]{params.getVal(i).toString()});
    }
    return map;
  }

  public NamedList<Object> toNamedList() {
    final SimpleOrderedMap<Object> result = new SimpleOrderedMap<Object>();
    for(Iterator<String> it=getParameterNamesIterator(); it.hasNext(); ) {
      final String name = it.next();
      final String [] values = getParams(name);
      if(values.length==1) {
        result.add(name,values[0]);
      } else {
        // currently no reason not to use the same array
        result.add(name,values);
      }
    }
    return result;
  }
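
If that reading is right, a quick way to see the suspected problem (a
sketch assuming ModifiableSolrParams from the same package):

import java.util.Map;

import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;

public class MultiMapCheck {
  public static void main(String[] args) {
    ModifiableSolrParams p = new ModifiableSolrParams();
    p.add("fq", "type:a", "type:b"); // one name, two values

    NamedList nl = p.toNamedList(); // stores the two values as a String[]
    Map<String, String[]> m = SolrParams.toMultiMap(nl);

    // If toMultiMap() really calls toString() on the String[], this prints
    // something like "[Ljava.lang.String;@1a2b3c" instead of the values.
    System.out.println(m.get("fq")[0]);
  }
}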


Cheers

Hana
-- 
View this message in context: 
http://www.nabble.com/Method-toMultiMap%28NamedList-params%29-in-SolrParams-tp21626588p21626588.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Performance "dead-zone" due to garbage collection

2009-01-23 Thread Feak, Todd
Can you share your experience with the IBM JDK once you've evaluated it?
You are working with a heavy load, I think many would benefit from the
feedback.

-Todd Feak

-Original Message-
From: wojtekpia [mailto:wojte...@hotmail.com] 
Sent: Thursday, January 22, 2009 3:46 PM
To: solr-user@lucene.apache.org
Subject: Re: Performance "dead-zone" due to garbage collection


I'm not sure if you suggested it, but I'd like to try the IBM JVM. Aside
from setting my JRE paths, is there anything else I need to do to run
inside the IBM JVM? (e.g. re-compiling?)


Walter Underwood wrote:
> 
> What JVM and garbage collector setting? We are using the IBM JVM with
> their concurrent generational collector. I would strongly recommend
> trying a similar collector on your JVM. Hint: how much memory is in
> use after a full GC? That is a good approximation to the working set.
> 
> 

-- 
View this message in context:
http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collect
ion-tp21588427p21616078.html
Sent from the Solr - User mailing list archive at Nabble.com.




RE: QTime in microsecond

2009-01-23 Thread Feak, Todd
The easiest way is to run maybe 100,000 or more queries and take an
average. A single microsecond value for a query would be incredibly
inaccurate.
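
For example, a rough client-side harness (a sketch assuming SolrJ and a
server at localhost:8983; the measured time includes HTTP and parsing
overhead, so it is an upper bound on QTime):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class QueryTimer {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery query = new SolrQuery("ipod");
    int runs = 100000;

    // A short warm-up pass keeps first-hit cache effects out of the numbers.
    for (int i = 0; i < 1000; i++) {
      server.query(query);
    }

    long start = System.nanoTime();
    for (int i = 0; i < runs; i++) {
      server.query(query);
    }
    long avgMicros = (System.nanoTime() - start) / runs / 1000L;
    System.out.println("average per query: " + avgMicros + " microseconds");
  }
}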

-ToddFeak



-Original Message-
From: AHMET ARSLAN [mailto:iori...@yahoo.com] 
Sent: Friday, January 23, 2009 1:33 AM
To: solr-user@lucene.apache.org
Subject: QTime in microsecond 

Is there a way to get QTime in microseconds from Solr?

I have a small collection and my response time (QTime) is 0 or 1
milliseconds. I am running benchmark tests and I need more precise
running times for comparison.

Thanks for your help.


  



Re: Intermittent high response times

2009-01-23 Thread wojtekpia

The type of garbage collector definitely affects performance, but there are
other settings as well. There's a related thread currently discussing this:
http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-td21588427.html



hbi dev wrote:
> 
> Hi wojtekpia,
> 
> That's interesting, I shall be looking into this over the weekend so I
> shall
> look at the GC also. I was briefly reading about GC last night, am I right
> in thinking it could be affected by what version of the jvm I'm using
> (1.5.0.8), and also what type of Collector is set? What collector is the
> default, and what would people recommend for an application like Solr?
> Thanks
> Waseem
> 

-- 
View this message in context: 
http://www.nabble.com/Intermittent-high-response-times-tp21602475p21628769.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: stats.jsp - maxDoc and numDoc-help

2009-01-23 Thread Otis Gospodnetic
Hello,

Those two numbers won't necessarily give you the number of duplicates, as they 
reflect the number of deletes in the index, and those deletes were not 
necessarily caused by Solr detecting a duplicate insert.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: S.Selvam Siva 
> To: solr-user@lucene.apache.org
> Sent: Friday, January 23, 2009 3:33:56 AM
> Subject: stats.jsp - maxDoc and numDoc-help
> 
> Hi all,
> 
> I am new to Solr. I have posted nearly 10 lakh (one million) XML docs over
> the last few months.
> 
> Now I want to find out the total number of duplicate posts until now.
> 
> Is stats.jsp's maxDocs minus numDocs the appropriate way to find out the
> total number of duplicate posts so far?
> Please guide me to the solution.
> -- 
> Yours,
> S.Selvam



Solr schema causing an error

2009-01-23 Thread Johnny X

Hi there,


I just configured my Solr schema file to support the data types I wish to
submit for indexing. However, as soon as I try to start the Solr server I get
an error trying to reach the admin page.

I know this only has something to do with my definitions in the schema,
because when I tried to revert back to the default schema it worked again.

In my new schema I took out only the example definitions I was told to and
input the ones below. Can someone tell me what's wrong?

[field definitions stripped by the mail archive]


Also, what's the difference between text and string (I tried with both)? And
am I right in thinking that I could set the type to "StrField" to prevent
any analysis pre-index?


Cheers for the help!


-- 
View this message in context: 
http://www.nabble.com/Solr-schema-causing-an-error-tp21629485p21629485.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr schema causing an error

2009-01-23 Thread Jeff Newburn
Are there any error log messages?

The difference between a string and text is that string is basically stored
with no modification (it is the solr.StrField).  The text type is actually
defined in the fieldtype section and usually contains a tokenizer and some
analyzers (usually stemming, lowercasing, deduping).


On 1/23/09 9:52 AM, "Johnny X"  wrote:

> 
> Hi there,
> 
> 
> I just configured my Solr schema file to support the data types I wish to
> submit for indexing. However, as soon as I try to start the Solr server I get
> an error trying to reach the admin page.
> 
> I know this only has something to do with my definitions in the schema,
> because when I tried to revert back to the default schema it worked again.
> 
> In my new schema I took out only the example definitions I was told to and
> input the ones below. Can someone tell me what's wrong?
> 
> [field definitions stripped by the mail archive]
> 
> Also, what's the difference between text and string (I tried with both)? And
> am I right in thinking that I could set the type to "StrField" to prevent
> any analysis pre-index?
> 
> 
> Cheers for the help!
> 



Re: stats.jsp - maxDoc and numDoc-help

2009-01-23 Thread S.Selvam Siva
On Fri, Jan 23, 2009 at 10:54 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hello,
>
> Those two numbers won't necessarily give you the number of duplicates, as
> they reflect the number of deletes in the index, and those deletes were not
> necessarily caused by Solr detecting a duplicate insert.
>
>
> Otis
>

Thank you Otis,

1) Then I can take "maxDocs - numDocs" as the maximum (upper bound)
duplicate post count so far, if I assume no deletion happened other than
duplicate deletion.

2) I also have another query: the deletion of an indexed document happens
when a duplicate is posted. My aim is to retrieve a particular field (not
the unique field) from the indexed document before it is deleted due to
duplication.






-- 
Yours,
S.Selvam


Re: Solr schema causing an error

2009-01-23 Thread Johnny X

Ah, gotcha.

Where do I go to find the log messages? Obviously it prints a lot of jargon
on the admin page reporting the error, but is that what you want?



Jeff Newburn wrote:
> 
> Are there any error log messages?
> 
> The difference between a string and text is that string is basically
> stored
> with no modification (it is the solr.StrField).  The text type is actually
> defined in the fieldtype section and usually contains a tokenizer and some
> analyzers (usually stemming, lowercasing, deduping).
> 
> 
> On 1/23/09 9:52 AM, "Johnny X"  wrote:
> 
>> 
>> Hi there,
>> 
>> 
>> I just configured my Solr schema file to support the data types I wish to
>> submit for indexing. However, as soon as I try to start the Solr server I
>> get an error trying to reach the admin page.
>> 
>> I know this only has something to do with my definitions in the schema,
>> because when I tried to revert back to the default schema it worked
>> again.
>> 
>> In my new schema I took out only the example definitions I was told to
>> and input the ones below. Can someone tell me what's wrong?
>> 
>> [field definitions stripped by the mail archive]
>> 
>> Also, what's the difference between text and string (I tried with both)?
>> And am I right in thinking that I could set the type to "StrField" to
>> prevent any analysis pre-index?
>> 
>> 
>> Cheers for the help!
>> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Solr-schema-causing-an-error-tp21629485p21630425.html
Sent from the Solr - User mailing list archive at Nabble.com.



Solr stemming -> preserve original words

2009-01-23 Thread Thushara Wijeratna
hello,

Is it possible to retrieve the original words once Solr (the Porter
algorithm) stems them?
I need to index a bunch of data, store it in Solr, and get back a list of
the most frequent terms out of Solr, and I want to see the non-stemmed
version of this data.

So basically, I want to enhance this:
http://localhost:8983/solr/admin/schema.jsp to see the "top terms" in
non-stemmed form.

thanks,
thushara


Re: Solr schema causing an error

2009-01-23 Thread Jeff Newburn
The first 10-15 lines of the jargon might help.  Additionally, the full
exceptions will be in the webserver logs (ie tomcat or jetty logs).


On 1/23/09 10:40 AM, "Johnny X"  wrote:

> 
> Ah, gotcha.
> 
> Where do I go to find the log messages? Obviously it prints a lot of jargon
> on the admin page reporting the error, but is that what you want?
> 
> 
> 
> Jeff Newburn wrote:
>> 
>> Are there any error log messages?
>> 
>> The difference between a string and text is that string is basically
>> stored
>> with no modification (it is the solr.StrField).  The text type is actually
>> defined in the fieldtype section and usually contains a tokenizer and some
>> analyzers (usually stemming, lowercasing, deduping).
>> 
>> 
>> On 1/23/09 9:52 AM, "Johnny X"  wrote:
>> 
>>> 
>>> Hi there,
>>> 
>>> 
>>> I just configured my Solr schema file to support the data types I wish to
>>> submit for indexing. However, as soon as try and start the Solr server I
>>> get
>>> an error trying to reach the admin page.
>>> 
>>> I know this only has something to do with my definitions in the schema,
>>> because when I tried to revert back to the default schema it worked
>>> again.
>>> 
>>> In my new schema I took out only the example definitions I was told to
>>> and
>>> input the below. Can someone tell me what's wrong?
>>> 
>>> [field definitions stripped by the mail archive]
>>> 
>>> 
>>> Also, what's the difference between text and string (I tried with both)?
>>> And am I right in thinking that I could set the type to "StrField" to
>>> prevent any analysis pre-index?
>>> 
>>> 
>>> Cheers for the help!
>>> 
>> 
>> 
>> 



Re: Solr schema causing an error

2009-01-23 Thread Jeff Newburn
The important info you are looking for is "undefined field sku at".  It
looks like there may be a copyfield in the schema looking for a field named
sku which does not exist.  Just search "sku" in the file and see what comes
up.


On 1/23/09 11:15 AM, "Johnny X"  wrote:

> 
> Well here are the first 10/15 lines:
> 
> HTTP Status 500 - Severe errors in solr configuration. Check your log files
> for more detailed information on what may be wrong. If you want solr to
> continue after configuration errors, change:
> <abortOnConfigurationError>false</abortOnConfigurationError> in null
> -
> org.apache.solr.common.SolrException: undefined field sku at
> org.apache.solr.schema.IndexSchema.getField(IndexSchema.java:994) at
> org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:652)
> at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:613) at
> org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:92) at
> org.apache.solr.core.SolrCore.<init>(SolrCore.java:412) at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:119)
> at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
> at
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterCo
> nfig.java:275)
> at
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilte
> rConfig.java:397)
> at
> org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:108)
> at
> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709)
> at org.apache.catalina.core.StandardContext.start(StandardContext.java:4363)
> at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
> at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
> at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525) at
> org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:830) at
> org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:719) at
> org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:490) at
> org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149) at
> org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
> at
> org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.
> java:117)
> at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053) at
> org.apache.catalina.core.StandardHost.start(StandardHost.java:719) at
> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045) at
> org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443) at
> org.apache.catalina.core.StandardService.start(StandardService.java:516) at
> org.apache.catalina.core.StandardServer.start(StandardServer.java:710) at
> org.apache.catalina.startup.Catalina.start(Catalina.java:578) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at
> java.lang.reflect.Method.invoke(Unknown Source) at
> org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288) at
> org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)
> 
> 
> 
> Jeff Newburn wrote:
>> 
>> The first 10-15 lines of the jargon might help.  Additionally, the full
>> exceptions will be in the webserver logs (ie tomcat or jetty logs).
>> 
>> 
>> On 1/23/09 10:40 AM, "Johnny X"  wrote:
>> 
>>> 
>>> Ah, gotcha.
>>> 
>>> Where do I go to find the log messages? Obviously it prints a lot of
>>> jargon
>>> on the admin page reporting the error, but is that what you want?
>>> 
>>> 
>>> 
>>> Jeff Newburn wrote:
 
 Are there any error log messages?
 
 The difference between a string and text is that string is basically
 stored
 with no modification (it is the solr.StrField).  The text type is
 actually
 defined in the fieldtype section and usually contains a tokenizer and
 some
 analyzers (usually stemming, lowercasing, deduping).
 
 
 On 1/23/09 9:52 AM, "Johnny X"  wrote:
 
> 
> Hi there,
> 
> 
> I just configured my Solr schema file to support the data types I wish
> to
> submit for indexing. However, as soon as try and start the Solr server
> I
> get
> an error trying to reach the admin page.
> 
> I know this only has something to do with my definitions in the schema,
> because when I tried to revert back to the default schema it worked
> again.
> 
> In my new schema I took out only the example definitions I was told to
> and input the ones below. Can someone tell me what's wrong?
> 
> [field definitions stripped by the mail archive]

Re: Solr schema causing an error

2009-01-23 Thread Johnny X

Well here are the first 10/15 lines:

HTTP Status 500 - Severe errors in solr configuration. Check your log files
for more detailed information on what may be wrong. If you want solr to
continue after configuration errors, change:
<abortOnConfigurationError>false</abortOnConfigurationError> in null
-
org.apache.solr.common.SolrException: undefined field sku at
org.apache.solr.schema.IndexSchema.getField(IndexSchema.java:994) at
org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:652)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:613) at
org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:92) at
org.apache.solr.core.SolrCore.<init>(SolrCore.java:412) at
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:119)
at
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
at
org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
at
org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
at
org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:108)
at
org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4363)
at
org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525) at
org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:830) at
org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:719) at
org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:490) at
org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149) at
org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
at
org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053) at
org.apache.catalina.core.StandardHost.start(StandardHost.java:719) at
org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045) at
org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443) at
org.apache.catalina.core.StandardService.start(StandardService.java:516) at
org.apache.catalina.core.StandardServer.start(StandardServer.java:710) at
org.apache.catalina.startup.Catalina.start(Catalina.java:578) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at
sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at
java.lang.reflect.Method.invoke(Unknown Source) at
org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288) at
org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)



Jeff Newburn wrote:
> 
> The first 10-15 lines of the jargon might help.  Additionally, the full
> exceptions will be in the webserver logs (ie tomcat or jetty logs).
> 
> 
> On 1/23/09 10:40 AM, "Johnny X"  wrote:
> 
>> 
>> Ah, gotcha.
>> 
>> Where do I go to find the log messages? Obviously it prints a lot of
>> jargon
>> on the admin page reporting the error, but is that what you want?
>> 
>> 
>> 
>> Jeff Newburn wrote:
>>> 
>>> Are there any error log messages?
>>> 
>>> The difference between a string and text is that string is basically
>>> stored
>>> with no modification (it is the solr.StrField).  The text type is
>>> actually
>>> defined in the fieldtype section and usually contains a tokenizer and
>>> some
>>> analyzers (usually stemming, lowercasing, deduping).
>>> 
>>> 
>>> On 1/23/09 9:52 AM, "Johnny X"  wrote:
>>> 
 
 Hi there,
 
 
 I just configured my Solr schema file to support the data types I wish
 to
 submit for indexing. However, as soon as I try to start the Solr server
 I
 get
 an error trying to reach the admin page.
 
 I know this only has something to do with my definitions in the schema,
 because when I tried to revert back to the default schema it worked
 again.
 
 In my new schema I took out only the example definitions I was told to
 and
 input the below. Can someone tell me what's wrong?
 
>>> [field definitions stripped by the list archive: a block of <field .../> declarations]

 
 
 Also, what's the difference between text/string (I tried with both).
 And
 am
 I right in thinking that I could set the type to "StrField" to prevent
 any
 analysis pre-index?
 
 
 Cheers for the help!
 
>>> 
>>> 
>>> 
> 
> 
> 


Re: Solr schema causing an error

2009-01-23 Thread Johnny X

Wicked...you fixed it!

Thanks very much.

Pretty simple in the end I guess...but I thought it might be.


Cheers.



Jeff Newburn wrote:
> 
> The important info you are looking for is "undefined field sku at".  It
> looks like there may be a copyfield in the schema looking for a field
> named
> sku which does not exist.  Just search "sku" in the file and see what
> comes
> up.
> 
> 
> On 1/23/09 11:15 AM, "Johnny X"  wrote:
> 
>> 
>> Well here are the first 10/15 lines:
>> 
>> HTTP Status 500 - Severe errors in solr configuration. Check your log
>> files
>> for more detailed information on what may be wrong. If you want solr to
>> continue after configuration errors, change:
>> <abortOnConfigurationError>false</abortOnConfigurationError> in null
>> -
>> org.apache.solr.common.SolrException: undefined field sku at
>> org.apache.solr.schema.IndexSchema.getField(IndexSchema.java:994) at
>> org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:652)
>> at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:613) at
>> org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:92) at
>> org.apache.solr.core.SolrCore.<init>(SolrCore.java:412) at
>> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:119)
>> at
>> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
>> at
>> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
>> at
>> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
>> at
>> org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:108)
>> at
>> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709)
>> at
>> org.apache.catalina.core.StandardContext.start(StandardContext.java:4363)
>> at
>> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
>> at
>> org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
>> at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525)
>> at
>> org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:830) at
>> org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:719) at
>> org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:490) at
>> org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149) at
>> org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
>> at
>> org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
>> at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
>> at
>> org.apache.catalina.core.StandardHost.start(StandardHost.java:719) at
>> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045) at
>> org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443) at
>> org.apache.catalina.core.StandardService.start(StandardService.java:516)
>> at
>> org.apache.catalina.core.StandardServer.start(StandardServer.java:710) at
>> org.apache.catalina.startup.Catalina.start(Catalina.java:578) at
>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>> sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at
>> java.lang.reflect.Method.invoke(Unknown Source) at
>> org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288) at
>> org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)
>> 
>> 
>> 
>> Jeff Newburn wrote:
>>> 
>>> The first 10-15 lines of the jargon might help.  Additionally, the full
>>> exceptions will be in the webserver logs (ie tomcat or jetty logs).
>>> 
>>> 
>>> On 1/23/09 10:40 AM, "Johnny X"  wrote:
>>> 
 
 Ah, gotcha.
 
 Where do I go to find the log messages? Obviously it prints a lot of
 jargon
 on the admin page reporting the error, but is that what you want?
 
 
 
 Jeff Newburn wrote:
> 
> Are there any error log messages?
> 
> The difference between a string and text is that string is basically
> stored
> with no modification (it is the solr.StrField).  The text type is
> actually
> defined in the fieldtype section and usually contains a tokenizer and
> some
> analyzers (usually stemming, lowercasing, deduping).
> 
> 
> On 1/23/09 9:52 AM, "Johnny X"  wrote:
> 
>> 
>> Hi there,
>> 
>> 
>> I just configured my Solr schema file to support the data types I
>> wish
>> to
>> submit for indexing. However, as soon as I try to start the Solr
>> server
>> I
>> get
>> an error trying to reach the admin page.
>> 
>> I know this only has something to do with my definitions in the
>> schema,
>> because when I tried to revert back to the default schema it worked
>> again.
>> 
>> In my new schema I took out only the example definitions I was told
>> to
>>
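
For anyone hitting the same "undefined field sku" exception: the stock example
schema.xml pairs its demo fields with copyField directives, so deleting a field
without deleting its copyField leaves a dangling reference. A hedged guess at
the offending line, since Johnny's exact file isn't shown:

<copyField source="sku" dest="text"/>

Removing that line (or re-adding the sku field it points at) lets the core start.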

Re: Solr stemming -> preserve original words

2009-01-23 Thread AHMET ARSLAN
I think the best way to get non-stemmed top terms is to index the field using a
fieldType that does not employ any stem filter. For example:

<fieldType name="non_stemmed_text" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</fieldType>

By using copyField you can store two (or more) versions of a field. Stemmed and
non-stemmed.

Just a new field:
<field name="non_stemmed_text" type="non_stemmed_text" indexed="true" stored="false"/>

And a copy field:
<copyField source="text" dest="non_stemmed_text"/>

Schema Browser (Field: text) will give you top terms.

> Is it possible to retrieve the original words once solr
> (Porter algorithm)
> stems them?
> I need to index a bunch of data, store it in solr, and get
> back a list of
> most frequent terms out of solr. and i want to see the
> non-stemmed version
> of this data.
> 
> so basically, i want to enhance this:
> http://localhost:8983/solr/admin/schema.jsp to see the
> "top terms" in
> non-stemmed form.
> 
> thanks,
> thushara
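
As an aside, the same top-terms numbers the Schema Browser shows can also be
pulled over HTTP from the Luke request handler; a sketch, assuming the stock
example host and port:

http://localhost:8983/solr/admin/luke?fl=non_stemmed_text&numTerms=20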


  


Re: Solr stemming -> preserve original words

2009-01-23 Thread Thushara Wijeratna
hi Ahmet,

thanks. when i look at the non_stemmed_text field to get the top terms, i
will not be getting the useful feature of aggregating many related words
into one (which is done by stemming).

for ex: if a document has run(10), running(20), runner(2), runners(8) - i
would like to see a "top term" of "run" here. i think with the
non-stemmed solution, i will see run, running, runner, runners as separate
top terms so if the term "weather" happens to occur 21 times in the
document, it will replace any version of "run" as the top term.

of course i could go back to the text field for top terms where i will see
"run", but some of the terms in the text field will be non-english (stemmed
beyond english, ex: archiv, perman). so how can i tell if a term i see in
the text field is a "badly stemmed" word or not?

maybe at this point i could use a dictionary? if a term in the text field is
not in the dictionary, i would try to find a prefix match from the
non-stemmed field? or maybe there's a better way?

thanks,
thushara

On Fri, Jan 23, 2009 at 11:37 AM, AHMET ARSLAN  wrote:

> I think the best way to get non-stemmed top terms is to index the field using a
> fieldType that does not employ any stem filter. For example:
>
> <fieldType name="non_stemmed_text" class="solr.TextField">
>   <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
> </fieldType>
>
> By using copyField you can store two (or more) versions of a field. Stemmed
> and non-stemmed.
>
> Just a new field:
> <field name="non_stemmed_text" type="non_stemmed_text" indexed="true" stored="false"/>
>
> And a copy field:
> <copyField source="text" dest="non_stemmed_text"/>
>
> Schema Browser (Field: text) will give you top terms.
>
> > Is it possible to retrieve the original words once solr
> > (Porter algorithm)
> > stems them?
> > I need to index a bunch of data, store it in solr, and get
> > back a list of
> > most frequent terms out of solr. and i want to see the
> > non-stemmed version
> > of this data.
> >
> > so basically, i want to enhance this:
> > http://localhost:8983/solr/admin/schema.jsp to see the
> > "top terms" in
> > non-stemmed form.
> >
> > thanks,
> > thushara
>
>
>
>


Issue indexing in Solr

2009-01-23 Thread Johnny X

I keep getting the error "FATAL: Solr returned an error: Bad Request"

Solr is running on a different port (8080) so I changed the command line
request to "java -Durl=http://localhost:8080/solr/update -jar post.jar
*.xml"

which seems to at least initiate.

"WARNING: Make sure your XML documents are encoded in UTF-8, other encodings
are not currently supported" appears, but I don't know if that's normal.
These XML files were generated using a library in .NET so I'm not sure,
but I'd guess they'd encode to UTF-8 by default?


My xml currently looks like this:

<add>
<doc>
<field name="Message-ID"><12929996.1075855668941.javamail.ev...@thyme></field>
<field name="Date">Mon, 31 Dec 1979 16:00:00 -0800 (PST)</field>
<field name="From">phillip.al...@enron.com</field>
<field name="To">mul...@thedoghousemail.com</field>
<field name="Subject">Re: (No Subject)</field>
<field name="Mime-Version">1.0</field>
<field name="Content-Type">text/plain; charset=us-ascii</field>
<field name="Content-Transfer-Encoding">7bit</field>
<field name="X-From">Phillip K Allen</field>
<field name="X-To">mul...@thedoghousemail.com</field>
<field name="X-cc"></field>
<field name="X-Folder">\Phillip_Allen_Dec2000\Notes Folders\All documents</field>
<field name="X-Origin">Allen-P</field>
<field name="X-FileName">pallen.nsf</field>
<field name="Content">How is your racing going?  What category are you up to?

I
</field>
</doc>
</add>
Would the "" cause a
problem?

Also, on a side note, do I need a <commit/> in the last XML document, or
will Solr automatically commit after a set period post-indexing?


Thanks very much.





Re: Issue indexing in Solr

2009-01-23 Thread Jeff Newburn
The best way to find out what was wrong with the request is going to be the
web server logs.  It should throw an exception that usually complains about
fields missing or incorrect.

As to committing, Solr has an autocommit option that will fire after a
designated number of changes have been entered.  If you are planning on
updating a record here or there, I would advise sending in the commit
yourself.  This process will ensure your data is what you intend it to be.
If you are planning on doing large numbers of commits, then autocommit is
probably a better bet.


On 1/23/09 12:53 PM, "Johnny X"  wrote:

> 
> I keep getting the error "FATAL: Solr returned an error: Bad Request"
> 
> Solr is running on a different port (8080) so I changed the command line
> request to "java -Durl=http://localhost:8080/solr/update -jar post.jar
> *.xml"
> 
> which seems to at least initiate.
> 
> "WARNING: Make sure your XML documents are encoded in UTF-8, other encodings
> are not currently supported" appears, but I don't know if that's normal.
> These XML files were generated using a library in dot net so I'm not sure,
> but I'd guess they'd encode to UTF-8 by default?
> 
> 
> My xml currently looks like this:
> 
> [XML document as reconstructed in the original message above]
> How is your racing going?  What category are you up to?
> 
> I
> 
> 
> 
> 
> Would the "" cause a
> problem?
> 
> Also, on a side note, do I need a <commit/> in the last XML document, or
> will Solr automatically commit after a set period post-indexing?
> 
> 
> Thanks very much.
> 
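
Both options Jeff mentions can be made concrete. An explicit commit is just one
more message to the update handler; a sketch, with the port matching Johnny's
Tomcat setup:

curl http://localhost:8080/solr/update -H "Content-Type: text/xml" --data-binary "<commit/>"

Autocommit is configured in solrconfig.xml and fires after a pending-document
count or a time window, whichever comes first; the thresholds here are
illustrative, not recommendations:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs> <!-- commit after this many pending documents -->
    <maxTime>60000</maxTime> <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>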



Should I extend DIH to handle POST too?

2009-01-23 Thread Gunaranjan Chandraraju

Hi
I had earlier described my requirement of needing to 'post XMLs as-is'  
to SOLR and have it handled just as the DIH would do on import using  
the mapping in data-config.xml.  I got multiple answers for the 'post  
approach' - the top two being


- Use SOLR CELL
- Use SOLRJ

In general I would like to keep all the 'data conversion' inside the
SOLR-powered search system rather than having clients apply the XSL and
transform the XML before sending it (the CELL approach).


My question is: how should I design this?
 - Tomcat Servlet that provides this 'post' endpoint.  Accepts the  
XML over HTTP, transforms it and calls SOLRJ to update.  This is the  
same TOMCAT that houses SOLR.

 - SOLR Handler (Is this the right way?)
 - Take this a step further and implement it as an extension to  
DIH - a handler that will refer to DIH data-config xml and use the  
same transformation.  This way I can invoke an import for 'batched  
files' or do a 'post' for the same XML with the same data-config  
mapping being applied.  Maybe it can be a separate handler that just  
refers to the same data-config.xml and not necessarily bundled with  
DIH handler code.


Looking for some advice.  If the DIH extension is the way to go then I  
would be happy to extend it and contribute that back to SOLR.


Regards,
Guna
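
For reference, the batched-files path mentioned above is driven by a plain GET
against the DataImportHandler endpoint; a sketch, assuming the handler is
registered at /dataimport in solrconfig.xml:

http://localhost:8080/solr/dataimport?command=full-import&commit=true

The 'post' variant proposed here would be a new endpoint that accepts one XML
document in the request body and reuses the same data-config.xml mapping.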


Re: Solr stemming -> preserve original words

2009-01-23 Thread Chris Harris
It seems like what's desired is not so much a stemmer as what you might call
a "canonicalizer", which would translate each source word not into its
"stem" but into its "most canonical form". Critically, the latter, by
definition, is always a legitimate word, e.g. "run". What's more, it's
always the "most appropriate word" or "most general word", or some such.

I'm not sure you could implement this except through a massive dictionary.
And you'd have trouble because some words would probably be ambiguous
between whether they should canonicalize this way or that.

On Fri, Jan 23, 2009 at 11:53 AM, Thushara Wijeratna wrote:

> hi Ahmet,
>
> thanks. when i look at the non_stemmed_text field to get the top terms, i
> will not be getting the useful feature of aggregating many related words
> into one (which is done by stemming).
>
> for ex: if a document has run(10), running(20), runner(2), runners(8) - i
> would like to see a "top term" of "run" here. i think with the
> non-stemmed solution, i will see run, running, runner, runners as separate
> top terms so if the term "weather" happens to occur 21 times in the
> document, it will replace any version of "run" as the top term.
>
> of course i could go back to the text field for top terms where i will see
> "run", but some of the terms in the text field will be non-english (stemmed
> beyond english, ex: archiv, perman). so how can i tell if a term i see in
> the text field is a "badly stemmed" word or not?
>
> maybe at this point i could use a dictionary? if a term in the text field
> is
> not in the dictionary, i would try to find a prefix match from the
> non-stemmed field? or maybe there's a better way?
>
> thanks,
> thushara
>
> On Fri, Jan 23, 2009 at 11:37 AM, AHMET ARSLAN  wrote:
>
> > I think the best way to get non-stemmed top terms is to index the field using
> > a fieldType that does not employ any stem filter. For example:
> >
> > <fieldType name="non_stemmed_text" class="solr.TextField">
> >   <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
> > </fieldType>
> >
> > By using copyField you can store two (or more) versions of a field.
> Stemmed
> > and non-stemmed.
> >
> > Just a new field:
> > <field name="non_stemmed_text" type="non_stemmed_text" indexed="true" stored="false"/>
> >
> > And a copy field:
> > <copyField source="text" dest="non_stemmed_text"/>
> >
> > Schema Browser (Field: text) will give you top terms.
> >
> > > Is it possible to retrieve the original words once solr
> > > (Porter algorithm)
> > > stems them?
> > > I need to index a bunch of data, store it in solr, and get
> > > back a list of
> > > most frequent terms out of solr. and i want to see the
> > > non-stemmed version
> > > of this data.
> > >
> > > so basically, i want to enhance this:
> > > http://localhost:8983/solr/admin/schema.jsp to see the
> > > "top terms" in
> > > non-stemmed form.
> > >
> > > thanks,
> > > thushara
> >
> >
> >
> >
>
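
One rough way to approximate such a canonicalizer in stock Solr is to treat
SynonymFilterFactory as the dictionary, mapping each known variant to its
canonical word at index time. A sketch only; the type name and mapping file are
invented, and the dictionary still has to be built by hand, which is exactly
the problem Chris describes:

<fieldType name="canonical_text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="canonical.txt"
            ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>

where canonical.txt holds explicit mappings such as:

running, runner, runners => run
archived, archives, archiving => archive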


Re: Solr stemming -> preserve original words

2009-01-23 Thread AHMET ARSLAN
I didn't understand what exactly you want.

if a document has run(10), running(20), runner(2), runners(8):
(assuming stemmer reduces all those words to run)
with non-stemmed you will see: 
running(20)
run(10)
runners(8)
runner(2)

with stemmed you will see: 
run(40)

You want to see run as a top term but also you want to see the original words 
that formed that term?
run(40) => 20 from running, 10 from run, 8 from runners, 2 from runner

Or do you want to see the most frequent terms that passed through the stem filter
verbatim? (terms that the stemmer didn't change/modify)

What do you mean by saying "badly stemmed" word?


> hi Ahmet,
> 
> thanks. when i look at the non_stemmed_text field to get
> the top terms, i
> will not be getting the useful feature of aggregating many
> related words
> into one (which is done by stemming).
> 
> for ex: if a document has run(10), running(20), runner(2),
> runners(8) - i
> would like to see a "top term" of
> "run" here. i think with the
> non-stemmed solution, i will see run, running, runner,
> runners as separate
> top terms so if the term "weather" happens to
> occur 21 times in the
> document, it will replace any version of "run" as
> the top term.
> 
> of course i could go back to the text field for top terms
> where i will see
> "run", but some of the terms in the text field
> will be non-english (stemmed
> beyond english, ex: archiv, perman). so how can i tell if a
> term i see in
> the text field is a "badly stemmed" word or not?
> 
> maybe at this point i could use a dictionary? if a term in
> the text field is
> not in the dictionary, i would try to find a prefix match
> from the
> non-stemmed field? or maybe there's a better way?
> 
> thanks,
> thushara


  


Re: Solr stemming -> preserve original words

2009-01-23 Thread Thushara Wijeratna
Chris, Ahmet - thanks for the responses.

Ahmet - yes, i want to see "run" as a top term + the original words that
formed that term.
The reason is that due to mis-stemming, the terms could become non-english.
ex:  "permanent" would stem to "perm", "archive" would become "archiv".

I need to extract a set of keywords from the indexed content - I'd like
these to be correct full english words.

thanks,
thushara

On Fri, Jan 23, 2009 at 2:12 PM, AHMET ARSLAN  wrote:

> I didn't understand what exactly you want.
>
> if a document has run(10), running(20), runner(2), runners(8):
> (assuming stemmer reduces all those words to run)
> with non-stemmed you will see:
> running(20)
> run(10)
> runners(8)
> runner(2)
>
> with stemmed you will see:
> run(40)
>
> You want to see run as a top term but also you want to see the original
> words that formed that term?
> run(40) => 20 from running, 10 from run, 8 from runners, 2 from runner
>
> Or do you want to see most frequent terms that passed through stem filter
> verbatim? (terms that stemmer didn't change/modify)
>
> What do you mean by saying "badly stemmed" word?
>
>
> > hi Ahmet,
> >
> > thanks. when i look at the non_stemmed_text field to get
> > the top terms, i
> > will not be getting the useful feature of aggregating many
> > related words
> > into one (which is done by stemming).
> >
> > for ex: if a document has run(10), running(20), runner(2),
> > runners(8) - i
> > would like to see a a "top term" to be
> > "run" here. i think with the
> > non-stemmed solution, i will see run, running, runner,
> > runners as separate
> > top terms so if the term "weather" happens to
> > occur 21 times in the
> > document, it will replace any version of "run" as
> > the top term.
> >
> > of course i could go back to the text field for top terms
> > where i will see
> > "run", but some of the terms in the text field
> > will be non-english (stemmed
> > beyond english, ex: archiv, perman). so how can i tell if a
> > term i see in
> > the text field is a "badly stemmed" word or not?
> >
> > maybe at this point i could use a dictionary? if a term in
> > the text field is
> > not in the dictionary, i would try to find a prefix match
> > from the
> > non-stemmed field? or maybe there's a better way?
> >
> > thanks,
> > thushara
>
>
>
>


DataImport TXT file entity processor

2009-01-23 Thread Nathan Adams
Is there a way to use Data Import Handler to index non-XML (i.e. simple
text) files (either via HTTP or FileSystem)?  I need to put the entire
contents of a text file into a single field of a document and the other
fields are being pulled out of Oracle...

 

-Nathan
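
For context, the Oracle side of this already has a natural shape in
data-config.xml; it is only the text-file column that has no processor. A
sketch with invented table, column, and connection names:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="oracle.jdbc.OracleDriver"
              url="jdbc:oracle:thin:@//dbhost:1521/orcl" user="solr" password="***"/>
  <document>
    <entity name="item" query="select id, title, text_path from items">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <!-- text_path points at the flat file whose whole contents should land in
           one field; this is the piece DIH does not cover out of the box -->
    </entity>
  </document>
</dataConfig>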



faceting question

2009-01-23 Thread Cam Bazz
Hello;

I have a multiValued field named tagList which may contain multiple tags. I am
making a query like:

tagList:a AND tagList:b AND tagList:c

and I am also getting a tagList facet returning me some values.

What I would like is Solr to return me facets as if the query was:
tagList:a AND tagList:b

is it even possible?

Best,
-C.B.


Results not appearing

2009-01-23 Thread Johnny X

I've indexed my XML using the below in the schema:

[field definitions stripped by the list archive: <field .../> declarations for
Message-ID, Date, From, To, Subject, the X-* headers, and Content]

 <uniqueKey>Message-ID</uniqueKey>

However, searching via the Message-ID or Content fields returns 0 results. Using
Luke I can still see that these fields are stored.

Out of interest, by setting the other fields to just "stored=true", can they
be returned in a query as part of a search?


Cheers.



Re: Results not appearing

2009-01-23 Thread Chris Harris
These might be obvious, but:

* I assume you did a Solr commit command after indexing, right?

* If you are using the fieldtype definitions from the default
schema.xml, then your "string" fields are not being analyzed, which
means you should expect search results only if you enter the entire,
exact value of one of the Message-ID or Date fields in your query. Is
that your intention?

And yes, your analysis of "stored" seems correct. Stored fields are
those whose values you need back at query time, and indexed fields are
those you can do queries on. For a few complications, see
http://wiki.apache.org/solr/FieldOptionsByUseCase

On Fri, Jan 23, 2009 at 8:04 PM, Johnny X  wrote:
>
> I've indexed my XML using the below in the schema:
>
> [field definitions stripped by the list archive]
>
>  <uniqueKey>Message-ID</uniqueKey>
>
> However searching via the Message-ID or Content fields returns 0. Using Luke
> I can still see these fields are stored however.
>
> Out of interest, by setting the other fields to just "stored=true", can they
> be returned in a query as part of a search?
>
>
> Cheers.
>
>
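
If the intent is free-text search on the message body, the minimal schema change
implied by the second point above is to retype Content; a sketch, assuming
Content was declared as a string field, as the symptoms suggest:

<field name="Content" type="text" indexed="true" stored="true"/>

As a string, Content matches only when the query repeats the entire stored
value; as text it is tokenized and lowercased, so a query like Content:racing
can hit the sample message. The collection must be reindexed after the change.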


Re: Should I extend DIH to handle POST too?

2009-01-23 Thread Shalin Shekhar Mangar
There's another option. Using DIH with Solrj. Take a look at:

https://issues.apache.org/jira/browse/SOLR-853

There's a patch there but it hasn't been updated to trunk. A contribution
would be most welcome.

On Sat, Jan 24, 2009 at 3:11 AM, Gunaranjan Chandraraju <
chandrar...@apple.com> wrote:

> Hi
> I had earlier described my requirement of needing to 'post XMLs as-is' to
> SOLR and have it handled just as the DIH would do on import using the
> mapping in data-config.xml.  I got multiple answers for the 'post approach'
> - the top two being
>
> - Use SOLR CELL
> - Use SOLRJ
>
> In general I would like to keep all the 'data conversion' inside the SOLR
> powered search system rather than having clients do the XSL and transforming
> the XML before sending them (CELL approach).
>
> My question is? How should I design this
>  - Tomcat Servlet that provides this 'post' endpoint.  Accepts the XML over
> HTTP, transforms it and calls SOLRJ to update.  This is the same TOMCAT that
> houses SOLR.
>  - SOLR Handler (Is this the right way?)
> - Take this a step further and implement it as an extension to DIH - a
> handler that will refer to DIH data-config xml and use the same
> transformation.  This way I can invoke an import for 'batched files' or do a
> 'post 'for the same XML with the same data-config mapping being applied.
>  Maybe it can be a separate handler that just refers to the same
> data-config.xml and not necessarily bundled with DIH handler code.
>
> Looking for some advise.  If the DIH extension is the way to go then I
> would be happy to extend it and contribute that back to SOLR.
>
> Regards,
> Guna
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: DataImport TXT file entity processor

2009-01-23 Thread Shalin Shekhar Mangar
On Sat, Jan 24, 2009 at 5:56 AM, Nathan Adams  wrote:

> Is there a way to us Data Import Handler to index non-XML (i.e. simple
> text) files (either via HTTP or FileSystem)?  I need to put the entire
> contents of a text file into a single field of a document and the other
> fields are being pulled out of Oracle...


Not yet, but I think it would be nice to have. Can you open an issue in Jira?

I think importing from HTTP was something another user had asked for
recently. How do you get the url/path of this text file? That would help
decide if we need a Transformer or EntityProcessor for these tasks.
-- 
Regards,
Shalin Shekhar Mangar.


Re: faceting question

2009-01-23 Thread Shalin Shekhar Mangar
On Sat, Jan 24, 2009 at 6:56 AM, Cam Bazz  wrote:

> Hello;
>
> I got a multiField named tagList which may contain multiple tags. I am
> making a query like:
>
> tagList:a AND tagList:b AND tagList:c
>
> and I am also getting a tagList facet returning me some values.
>
> What I would like is Solr to return me facets as if the query was:
> tagList:a AND tagList:b
>
> is it even possible?
>

If I understand correctly,
1. You want to query for tagList:a AND tagList:b AND tagList:c
2. At the same time, you want to request facets for tagList but only for
tagList:a and tagList:b

If that is correct, you can use the features introduced by
https://issues.apache.org/jira/browse/SOLR-911

However, you may need to put #1 in fq instead of q.
-- 
Regards,
Shalin Shekhar Mangar.
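
A sketch of the request shape SOLR-911 enables, with an invented tag name and
URL encoding omitted for readability (the patch is not yet in a release):

q=*:*
&fq=tagList:a AND tagList:b
&fq={!tag=c}tagList:c
&facet=true
&facet.field={!ex=c}tagList

The facet counts for tagList then ignore the filter tagged "c", i.e. they are
computed as if the query were only tagList:a AND tagList:b.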