Re: Is there a DTD/XSD for XML response?

2008-11-18 Thread Ryan McKinley

nope... solr does not have a DTD.


On Nov 18, 2008, at 1:44 PM, Simon Hu wrote:



Hi,

I assume there is a schema definition or DTD for the XML response but
could not find it anywhere.

Is there one?

thanks
-Simon

--
View this message in context: 
http://www.nabble.com/Is-there-a-DTD-XSD-for-XML-response--tp20565773p20565773.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: Error in indexing timestamp format.

2008-11-18 Thread con

Hi Noble

I have cross checked. This is my copy field in schema.xml:

   <copyField source="..." dest="..." />

I am still getting that error.
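For comparison, a copyField whose source matches the CREATED_DATE column
would look something like this (a sketch only -- the destination field name
here is just an assumption):

   <copyField source="CREATED_DATE" dest="text" />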

thanks
con



Noble Paul നോബിള്‍ नोब्ळ् wrote:
> 
> your copyField has the wrong source field name. The field name is not
> "date", it is 'CREATED_DATE'.
> 
> On Wed, Nov 19, 2008 at 11:49 AM, con <[EMAIL PROTECTED]> wrote:
>>
>> Hi Shalin
>> Please find the log data.
>>
>> 10:18:30,819 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.servlet.SolrDispatchFilter init
>> INFO: SolrDispatchFilter.init()
>> 10:18:30,838 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.SolrResourceLoader locateInstanceDir
>> INFO: No /solr/home in JNDI
>> 10:18:30,839 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.SolrResourceLoader locateInstanceDir
>> INFO: using system property solr.solr.home: C:\Search\solr
>> 10:18:30,844 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.CoreContainer$Initializer initialize
>> INFO: looking for solr.xml: C:\Search\solr\solr.xml
>> 10:18:30,845 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.SolrResourceLoader 
>> INFO: Solr home set to 'C:\Search\solr/'
>> 10:18:30,846 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.SolrResourceLoader createClassLoader
>> INFO: Adding 'file:/C:/Search/solr/lib/jetty-6.1.3.jar' to Solr
>> classloader
>> 10:18:30,847 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.SolrResourceLoader createClassLoader
>> INFO: Adding 'file:/C:/Search/solr/lib/jetty-util-6.1.3.jar' to Solr
>> classloader
>> 10:18:30,848 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.SolrResourceLoader createClassLoader
>> INFO: Adding 'file:/C:/Search/solr/lib/jsp-2.1/' to Solr classloader
>> 10:18:30,848 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.SolrResourceLoader createClassLoader
>> INFO: Adding 'file:/C:/Search/solr/lib/ojdbc6-11.1.0.6.0.1.jar' to Solr
>> classloader
>> 10:18:30,849 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.SolrResourceLoader createClassLoader
>> INFO: Adding 'file:/C:/Search/solr/lib/servlet-api-2.5-6.1.3.jar' to Solr
>> classloader
>> 10:18:30,864 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.CoreContainer load
>> INFO: loading shared library: C:\Search\solr\lib
>> 10:18:30,867 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.SolrResourceLoader createClassLoader
>> INFO: Adding 'file:/C:/Search/solr/lib/jetty-6.1.3.jar' to Solr
>> classloader
>> 10:18:30,870 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.SolrResourceLoader createClassLoader
>> INFO: Adding 'file:/C:/Search/solr/lib/jetty-util-6.1.3.jar' to Solr
>> classloader
>> 10:18:30,870 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.SolrResourceLoader createClassLoader
>> INFO: Adding 'file:/C:/Search/solr/lib/jsp-2.1/' to Solr classloader
>> 10:18:30,871 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.SolrResourceLoader createClassLoader
>> INFO: Adding 'file:/C:/Search/solr/lib/ojdbc6-11.1.0.6.0.1.jar' to Solr
>> classloader
>> 10:18:30,872 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.SolrResourceLoader createClassLoader
>> INFO: Adding 'file:/C:/Search/solr/lib/servlet-api-2.5-6.1.3.jar' to Solr
>> classloader
>> 10:18:30,896 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.SolrResourceLoader 
>> INFO: Solr home set to 'C:\Search\solr\feedback/'
>> 10:18:30,896 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
>> org.apache.solr.core.SolrResourceLoader createClassLoader
>> INFO: Reusing parent classloader
>> 10:18:31,328 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
>> org.apache.solr.core.SolrConfig 
>> INFO: Loaded SolrConfig: solrconfig.xml
>> 10:18:31,370 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
>> org.apache.solr.schema.IndexSchema readSchema
>> INFO: Reading Solr Schema
>> 10:18:31,381 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
>> org.apache.solr.schema.IndexSchema readSchema
>> INFO: Schema name=feedback schema
>> 10:18:31,403 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
>> org.apache.solr.util.plugin.AbstractPluginLoader load
>> INFO: created string: org.apache.solr.schema.StrField
>> 10:18:31,416 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
>> org.apache.solr.util.plugin.AbstractPluginLoader load
>> INFO: created boolean: org.apache.solr.schema.BoolField
>> 10:18:31,425 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
>> org.apache.solr.util.plugin.AbstractPluginLoader load
>> INFO: created integer: org.apache.solr.schema.IntField
>> 10:18:31,433 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
>> org.apache.solr.util.plugin.AbstractPluginLoader load
>> INFO: created long: org.apache.solr.schema.LongField
>> 10:18:31,441 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
>> org.apache.solr.util.plugin.AbstractPluginLoader load
>> INFO: created float: org.apache.solr.schema.FloatField
>> 10:18:31,449 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
>> org.apache.solr.util.plugin.AbstractPlugin

Re: Error in indexing timestamp format.

2008-11-18 Thread con

Hi Shalin
Please find the log data.

10:18:30,819 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.servlet.SolrDispatchFilter init
INFO: SolrDispatchFilter.init()
10:18:30,838 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.SolrResourceLoader locateInstanceDir
INFO: No /solr/home in JNDI
10:18:30,839 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.SolrResourceLoader locateInstanceDir
INFO: using system property solr.solr.home: C:\Search\solr
10:18:30,844 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.CoreContainer$Initializer initialize
INFO: looking for solr.xml: C:\Search\solr\solr.xml
10:18:30,845 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.SolrResourceLoader 
INFO: Solr home set to 'C:\Search\solr/'
10:18:30,846 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.SolrResourceLoader createClassLoader
INFO: Adding 'file:/C:/Search/solr/lib/jetty-6.1.3.jar' to Solr classloader
10:18:30,847 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.SolrResourceLoader createClassLoader
INFO: Adding 'file:/C:/Search/solr/lib/jetty-util-6.1.3.jar' to Solr
classloader
10:18:30,848 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.SolrResourceLoader createClassLoader
INFO: Adding 'file:/C:/Search/solr/lib/jsp-2.1/' to Solr classloader
10:18:30,848 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.SolrResourceLoader createClassLoader
INFO: Adding 'file:/C:/Search/solr/lib/ojdbc6-11.1.0.6.0.1.jar' to Solr
classloader
10:18:30,849 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.SolrResourceLoader createClassLoader
INFO: Adding 'file:/C:/Search/solr/lib/servlet-api-2.5-6.1.3.jar' to Solr
classloader
10:18:30,864 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.CoreContainer load
INFO: loading shared library: C:\Search\solr\lib
10:18:30,867 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.SolrResourceLoader createClassLoader
INFO: Adding 'file:/C:/Search/solr/lib/jetty-6.1.3.jar' to Solr classloader
10:18:30,870 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.SolrResourceLoader createClassLoader
INFO: Adding 'file:/C:/Search/solr/lib/jetty-util-6.1.3.jar' to Solr
classloader
10:18:30,870 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.SolrResourceLoader createClassLoader
INFO: Adding 'file:/C:/Search/solr/lib/jsp-2.1/' to Solr classloader
10:18:30,871 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.SolrResourceLoader createClassLoader
INFO: Adding 'file:/C:/Search/solr/lib/ojdbc6-11.1.0.6.0.1.jar' to Solr
classloader
10:18:30,872 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.SolrResourceLoader createClassLoader
INFO: Adding 'file:/C:/Search/solr/lib/servlet-api-2.5-6.1.3.jar' to Solr
classloader
10:18:30,896 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.SolrResourceLoader 
INFO: Solr home set to 'C:\Search\solr\feedback/'
10:18:30,896 ERROR [STDERR] 19 Nov, 2008 10:18:30 AM
org.apache.solr.core.SolrResourceLoader createClassLoader
INFO: Reusing parent classloader
10:18:31,328 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
org.apache.solr.core.SolrConfig 
INFO: Loaded SolrConfig: solrconfig.xml
10:18:31,370 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
org.apache.solr.schema.IndexSchema readSchema
INFO: Reading Solr Schema
10:18:31,381 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
org.apache.solr.schema.IndexSchema readSchema
INFO: Schema name=feedback schema
10:18:31,403 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created string: org.apache.solr.schema.StrField
10:18:31,416 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created boolean: org.apache.solr.schema.BoolField
10:18:31,425 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created integer: org.apache.solr.schema.IntField
10:18:31,433 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created long: org.apache.solr.schema.LongField
10:18:31,441 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created float: org.apache.solr.schema.FloatField
10:18:31,449 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created double: org.apache.solr.schema.DoubleField
10:18:31,456 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created sint: org.apache.solr.schema.SortableIntField
10:18:31,464 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created slong: org.apache.solr.schema.SortableLongField
10:18:31,474 ERROR [STDERR] 19 Nov, 2008 10:18:31 AM
org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created sfloat: org.apache.solr.schema.SortableFloatField
10:18:31,482 ERROR [STDERR

Re: Error in indexing timestamp format.

2008-11-18 Thread Noble Paul നോബിള്‍ नोब्ळ्
Do you have a stacktrace?

On Wed, Nov 19, 2008 at 10:24 AM, con <[EMAIL PROTECTED]> wrote:
>
> Hi
> Thanks for your quick reply Shalin
>
> I have updated my data-config like:
> <entity name="..." transformer="TemplateTransformer,DateFormatTransformer" pk="EMP_ID"
> query="select EMP_ID, CREATED_DATE, CUST_ID FROM EMP, CUST where EMP.EMP_ID
> = CUST.EMP_ID" >
>
> <field column="CREATED_DATE" sourceColName="CREATED_DATE"
> dateTimeFormat="dd-MM-yy HH:mm:ss.S a" />
> </entity>
> 
>
> This is an example of the date in my database: 22-10-08 03:57:11.63700
> PM
>
> In the schema.xml I have:
> <copyField source="date" dest="..." />
> Where the 'date' points to the usual solr.DateField
>
> But I am getting the exception while starting the server itself, SEVERE:
> org.apache.solr.common.SolrException: undefined field date.
>
> Is there anything else to be configured, or did I miss something?
>
> Thanks in advance
> con
>
>
>
>
> Shalin Shekhar Mangar wrote:
>>
>> Take a look at the DateFormatTransformer. You can find documentation on
>> the
>> DataImportHandler wiki.
>>
>> http://wiki.apache.org/solr/DataImportHandler
>>
>> On Tue, Nov 18, 2008 at 10:41 PM, con <[EMAIL PROTECTED]> wrote:
>>
>>>
>>>
>>> Hi Noble,
>>> I am using DIH.
>>>
>>>
>>>
>>> Noble Paul നോബിള്‍ नोब्ळ् wrote:
>>> >
>>> > How are you indexing the data ? by posting xml? or using DIH?
>>> >
>>> >
>>> > On Tue, Nov 18, 2008 at 3:53 PM, con <[EMAIL PROTECTED]> wrote:
>>> >>
>>> >> Hi Guys
>>> >> I have timestamp fields in my database in the format,
>>> >> ddmmyyhhmmss.Z AM
>>> >> eg: 26-05-08 10:45:53.66100 AM
>>> >> But I think since the solr date format is different, I am unable
>>> to
>>> >> index the document with the solr.DateField.
>>> >> So is there any option by which I can give my timestamp format to solr
>>> >> and
>>> >> index and search data based on that.
>>> >> I have already set up solr and it is working fine for other datatypes.
>>> >> So is it possible to set this new date format in some config files.
>>> >>
>>> >> Expecting suggestions/advices
>>> >> Thanks in advance
>>> >> con
>>> >> --
>>> >> View this message in context:
>>> >>
>>> http://www.nabble.com/Error-in-indexing-timestamp-format.-tp20556862p20556862.html
>>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>>> >>
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > --Noble Paul
>>> >
>>> >
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Error-in-indexing-timestamp-format.-tp20556862p20564005.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/Error-in-indexing-timestamp-format.-tp20556862p20573758.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
--Noble Paul


Re: Error in indexing timestamp format.

2008-11-18 Thread con

Hi 
Thanks for your quick reply Shalin

I have updated my data-config like: 

<entity name="..." transformer="TemplateTransformer,DateFormatTransformer" pk="EMP_ID"
    query="select EMP_ID, CREATED_DATE, CUST_ID FROM EMP, CUST where EMP.EMP_ID = CUST.EMP_ID" >

  <field column="CREATED_DATE" sourceColName="CREATED_DATE"
      dateTimeFormat="dd-MM-yy HH:mm:ss.S a" />
</entity>

This is an example of the date in my database: 22-10-08 03:57:11.63700
PM

In the schema.xml I have:
<copyField source="date" dest="..." />
Where the 'date' points to the usual solr.DateField

But I am getting the exception while starting the server itself, SEVERE:
org.apache.solr.common.SolrException: undefined field date.

Is there anything else to be configured, or did I miss something?

Thanks in advance
con




Shalin Shekhar Mangar wrote:
> 
> Take a look at the DateFormatTransformer. You can find documentation on
> the
> DataImportHandler wiki.
> 
> http://wiki.apache.org/solr/DataImportHandler
> 
> On Tue, Nov 18, 2008 at 10:41 PM, con <[EMAIL PROTECTED]> wrote:
> 
>>
>>
>> Hi Noble,
>> I am using DIH.
>>
>>
>>
>> Noble Paul നോബിള്‍ नोब्ळ् wrote:
>> >
>> > How are you indexing the data ? by posting xml? or using DIH?
>> >
>> >
>> > On Tue, Nov 18, 2008 at 3:53 PM, con <[EMAIL PROTECTED]> wrote:
>> >>
>> >> Hi Guys
>> >> I have timestamp fields in my database in the format,
>> >> ddmmyyhhmmss.Z AM
>> >> eg: 26-05-08 10:45:53.66100 AM
>> >> But I think since the solr date format is different, I am unable
>> to
>> >> index the document with the solr.DateField.
>> >> So is there any option by which I can give my timestamp format to solr
>> >> and
>> >> index and search data based on that.
>> >> I have already set up solr and it is working fine for other datatypes.
>> >> So is it possible to set this new date format in some config files.
>> >>
>> >> Expecting suggestions/advices
>> >> Thanks in advance
>> >> con
>> >> --
>> >> View this message in context:
>> >>
>> http://www.nabble.com/Error-in-indexing-timestamp-format.-tp20556862p20556862.html
>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > --Noble Paul
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Error-in-indexing-timestamp-format.-tp20556862p20564005.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Error-in-indexing-timestamp-format.-tp20556862p20573758.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Wait Flush, Wait Searcher and commit Scenarios

2008-11-18 Thread Ryan McKinley
I am using waitSearcher=false with a crawler.  The crawling thread  
finishes a set of stuff, and calls <commit/>.  It does not want to
search; it gets back to crawling ASAP.
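For instance, the update message such a client posts could be (a sketch --
both attributes default to true when omitted):

  <commit waitFlush="false" waitSearcher="false"/>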



On Nov 18, 2008, at 11:35 PM, Grant Ingersoll wrote:

That explains true, but what about false?  Why would I ever set it  
to false?  If I don't wait, how will I ever know when the new  
searcher is ready?


On Nov 18, 2008, at 10:27 PM, Ryan McKinley wrote:


waitFlush I'm not sure...

waitSearcher=true  it will wait until a new searcher is opened  
after your commit, that way the client is guaranteed to have the  
results that were just sent in the index.  if waitSearcher=false, a  
query could hit a searcher that does not have the new documents in  
the index.



On Nov 18, 2008, at 9:13 PM, Grant Ingersoll wrote:

Was wondering if anyone can fill me in on the when and why I would  
set waitFlush and waitSearcher to false when sending a commit  
command?  I think I understand what they do technically (I've  
looked at the code), but I am not clear about why I would want to  
do it.  Is there a risk in setting them to false? Wouldn't my  
response return a lot faster if it was set to false?  Yet, the  
default is true for both, so I assume there is a reason I want to  
wait...


Thanks,
Grant









Re: Wait Flush, Wait Searcher and commit Scenarios

2008-11-18 Thread Grant Ingersoll
That explains true, but what about false?  Why would I ever set it to  
false?  If I don't wait, how will I ever know when the new searcher  
is ready?


On Nov 18, 2008, at 10:27 PM, Ryan McKinley wrote:


waitFlush I'm not sure...

waitSearcher=true  it will wait until a new searcher is opened after  
your commit, that way the client is guaranteed to have the results  
that were just sent in the index.  if waitSearcher=false, a query  
could hit a searcher that does not have the new documents in the  
index.



On Nov 18, 2008, at 9:13 PM, Grant Ingersoll wrote:

Was wondering if anyone can fill me in on the when and why I would  
set waitFlush and waitSearcher to false when sending a commit  
command?  I think I understand what they do technically (I've  
looked at the code), but I am not clear about why I would want to  
do it.  Is there a risk in setting them to false? Wouldn't my  
response return a lot faster if it was set to false?  Yet, the  
default is true for both, so I assume there is a reason I want to  
wait...


Thanks,
Grant







Re: Using properties from core configuration in data-config.xml

2008-11-18 Thread Noble Paul നോബിള്‍ नोब्ळ्
Thanks gistolero.
I have added this to the FAQ
http://wiki.apache.org/solr/DataImportHandlerFaq

On Wed, Nov 19, 2008 at 2:34 AM,  <[EMAIL PROTECTED]> wrote:
> Very cool :-)
>
> Both suggestions work fine! But only with solr version 1.4:
> https://issues.apache.org/jira/browse/SOLR-823
>
> Use a nightly build (e.g. 2008-11-17 works):
> http://people.apache.org/builds/lucene/solr/nightly/
>
> See below for examples for both solutions...
>
>
>
> ((( 1 )))
>
>> There may be one way to do this.
>>
>> Add your property in the invariant section of solrconfig's
>> DataImportHandler
>> element. For example, add this section:
>>
>> <lst name="invariants">
>>   <str name="xmlDataDir">${xmlDataDir}</str>
>> </lst>
>>
>> Then you can use it as ${dataimporter.request.xmlDataDir} in your
>> data-config to access this.
>
>
>
> // *** solr.xml ***
> <solr persistent="true">
>   <cores adminPath="/admin/cores">
>     <core name="core1" instanceDir="core1">
>       <property name="xmlDataDir" value="..." />
>     </core>
>   </cores>
> </solr>
>
>
>
> // *** solrconfig.xml ***
> <requestHandler name="/dataimport"
>   class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">./data-config.xml</str>
>   </lst>
>   <lst name="invariants">
>     <str name="xmlDataDir">${xmlDataDir}</str>
>   </lst>
> </requestHandler>
>
>
>
> // *** data-config.xml ***
> <dataConfig>
>   <dataSource type="FileDataSource" />
>   <document>
>     <entity name="xmlFile"
>       processor="FileListEntityProcessor"
>       baseDir="${dataimporter.request.xmlDataDir}"
>       ... >
>     ...
>   </document>
> </dataConfig>
>
>
> URL for full-import:
> http://localhost:8983/solr/core1/dataimport?command=full-import&commit=true
>
>
>
>
> ((( 2 )))
>
>
>> > nope . It is not possible as of now. the placeholders are not aware of
>> > the core properties.
>> > Is it possible to pass the values as request params? Request
>> > parameters can be accessed .
>
>
> // *** data-config.xml ***
> <dataConfig>
>   <dataSource type="FileDataSource" />
>   <document>
>     <entity name="xmlFile"
>       processor="FileListEntityProcessor"
>       baseDir="${dataimporter.request.xmlDataDir}"
>       ... >
>     ...
>   </document>
> </dataConfig>
>
>
> URL for full-import:
> http://localhost:8983/solr/core1/dataimport?command=full-import&commit=true&xmlDataDir=%2Fhome%2Fcore1
>
>
> That's all.
> Gisto
>
>
>
>> > On Mon, Nov 17, 2008 at 7:57 PM,  <[EMAIL PROTECTED]> wrote:
>> > > Hello,
>> > >
>> > > is it possible to use properties from core configuration in
>> > data-config.xml?
>> > > I want to define the "baseDir" for DataImportHandler.
>> > >
>> > >
>> > > I tried the following configuration:
>> > >
>> > >
>> > > *** solr.xml ***
>> > >
>> > > <solr persistent="true">
>> > >   <cores adminPath="/admin/cores">
>> > >     <core name="core1" instanceDir="core1">
>> > >       <property name="xmlDataDir" value="..." />
>> > >       ...
>> > >     </core>
>> > >   </cores>
>> > > </solr>
>> > >
>> > >
>> > >
>> > >
>> > > *** data-config.xml ***
>> > >
>> > > <dataConfig>
>> > >   <dataSource type="FileDataSource" />
>> > >   <document>
>> > >     <entity name="xmlFile"
>> > >       processor="FileListEntityProcessor"
>> > >       baseDir="${xmlDataDir}"
>> > >       fileName="id-.*\.xml"
>> > >       rootEntity="false"
>> > >       dataSource="null">
>> > >       <entity name="data" pk="id"
>> > >         url="${xmlFile.fileAbsolutePath}"
>> > >         processor="XPathEntityProcessor"
>> > >         ...
>> > > 
>> > >
>> > >
>> > >
>> > > But this is the result:
>> > >
>> > > ...
>> > > Nov 17, 2008 1:50:08 PM
>> org.apache.solr.handler.dataimport.DataImporter
>> > doFullImport
>> > > INFO: Starting Full Import
>> > > Nov 17, 2008 1:50:08 PM org.apache.solr.core.SolrCore execute
>> > > INFO: [posts-politics] webapp=/solr path=/dataimport
>> >
>> params={optimize=true&commit=true&command=full-import&qt=/dataimport&wt=javabin&version=2.2}
>> > status=0 QTime=66
>> > > Nov 17, 2008 1:50:08 PM org.apache.solr.core.SolrCore execute
>> > > INFO: [posts-politics] webapp=/solr path=/dataimport
>> > params={qt=/dataimport&wt=javabin&version=2.2} status=0 QTime=0
>> > > Nov 17, 2008 1:50:08 PM org.apache.solr.update.DirectUpdateHandler2
>> > deleteAll
>> > > INFO: [posts-politics] REMOVING ALL DOCUMENTS FROM INDEX
>> > > Nov 17, 2008 1:50:08 PM
>> org.apache.solr.handler.dataimport.DataImporter
>> > doFullImport
>> > > SEVERE: Full Import failed
>> > > org.apache.solr.handler.dataimport.DataImportHandlerException:
>> 'baseDir'
>> > should point to a directory Processing Document # 1
>> > >  at
>> >
>> org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:81)
>> > > ...
>> > >
>> > >
>> > >
>> > >
>> > > I tried also to configure all dataimport settings in solrconfig.xml,
>> but
>> > I don't know how to do this exactly. Among other things, I tried this
>> > format:
>> > >
>> > >
>> > > *** solrconfig.xml ***
>> > >
>> > > ...
>> > > > > class="org.apache.solr.handler.dataimport.DataImportHandler">
>> > >  
>> > >  
>> > >   FileDataSource
>> > >   
>> > >
>> > > xmlFile
>> > > FileListEntityProcessor
>> > > ${xmlDataDir}
>> > > id-.*\.xml
>> > > false
>> > > null"
>> > > 
>> > >   data
>> > >   id
>> > >   ${xmlFile.fileAbsolutePath}
>> > > ...
>> > > 
>> > > ...
>> > >
>> > >
>> > >
>> > > But all my tests (with different "dataimport" formats in
>> solrconfig.xml)
>> > failed:
>> > >
>> > >
>> > > ...
>> > > INFO: Reusing parent classloader
>> > > Nov 17, 2008 2:18:14 PM org.apache.solr.common.SolrException log
>> > > SEVERE: Error in solrconfig.xml:org.apache.solr.common.SolrException:
>> No
>> > system property or default value specified for xmlFile.fileAbsolutePat

Re: Software Announcement: LuSql: Database to Lucene indexing

2008-11-18 Thread Noble Paul നോബിള്‍ नोब्ळ्
Hi Glen ,
You can post all the queries first on solr-dev and all the valid ones
can be moved to JIRA

thanks,
Noble

On Wed, Nov 19, 2008 at 3:26 AM, Glen Newton <[EMAIL PROTECTED]> wrote:
> Yes, I've found it.
>
> Do you want my comments here or in solr-dev or on jira?
>
>  Glen
>
> 2008/11/18 Shalin Shekhar Mangar <[EMAIL PROTECTED]>:
>> Hi Glen,
>>
>> There is an issue open for making DIH API friendly. Take a look and let us
>> know what you think.
>>
>> https://issues.apache.org/jira/browse/SOLR-853
>>
>> On Tue, Nov 18, 2008 at 8:26 PM, Glen Newton <[EMAIL PROTECTED]> wrote:
>>
>>> Erik,
>>>
>>> Right now there is no real abstraction like DIH in LuSql. But as
>>> indicated in the TODO section of the documentation, I was planning on
>>> implementing or straight borrowing DIH in the near future.
>>>
>>> I am assuming that Solr is all multi-threaded & as performant as it
>>> can be. Is there a test SQL database that is used to test Solr, so I
>>> might try to do some comparisons?
>>>
>>> Not being a Solr user, it is hard for me to know of any advantages of
>>> LuSql over Solr. Hopefully some in the community can identify possible
>>> overlaps / use cases. I will see what I can figure out.
>>>
>>> Thanks,
>>>
>>> -Glen
>>>
>>> 2008/11/18 Erik Hatcher <[EMAIL PROTECTED]>:
>>> > Glen,
>>> >
>>> > The thing is, Solr has a database integration built-in with the new
>>> > DataImportHandler.   So I'm not sure how much interest Solr users would
>>> have
>>> > in LuSql by itself.
>>> >
>>> > Maybe there are LuSql features that DIH could borrow from?  Or vice
>>> versa?
>>> >
>>> >Erik
>>> >
>>> >
>>> > On Nov 17, 2008, at 11:03 PM, Glen Newton wrote:
>>> >>
>>> >> That said, I am very interested in making LuSql useful to the Solr
>>> >> community as well as the broader Lucene community, so if any of you
>>> >> can offer any feedback on how LuSql can be changed to better support
>>> >> Solr, I would appreciate it.
>>> >>
>>> >> thanks,
>>> >>
>>> >> Glen Newton
>>> >>
>>> >> ---
>>> >> From    Erik Hatcher <[EMAIL PROTECTED]>
>>> >> Subject Re: Software Announcement: LuSql: Database to Lucene indexing
>>> >> Date    Mon, 17 Nov 2008 20:12:35 GMT
>>> >>
>>> >> Yeah, it'd work, though not only does the version of Lucene need to
>>> >> match, but the field indexing/storage attributes need to jive as well
>>> >> - and that is the trickier part of the equation.
>>> >>
>>> >> But yeah, LuSQL looks slick!
>>> >>
>>> >>Erik
>>> >>
>>> >>
>>> >> On Nov 17, 2008, at 2:17 PM, Matthew Runo wrote:
>>> >>
>>> >>> Hello -
>>> >>>
>>> >>> I wanted to forward this on, since I thought that people here might
>>> >>> be able to use this to build indexes. So long as the lucene version
>>> >>> in LuSQL matches the version in Solr, it would work fine for
>>> >>> indexing - yea?
>>> >>>
>>> >>> Thanks for your time!
>>> >>>
>>> >>> Matthew Runo
>>> >>> Software Engineer, Zappos.com
>>> >>> [EMAIL PROTECTED] - 702-943-7833
>>> >>>
>>> >>> Begin forwarded message:
>>> >>>
>>>  From: "Glen Newton" <[EMAIL PROTECTED]>
>>>  Date: November 17, 2008 4:32:18 AM PST
>>>  To: [EMAIL PROTECTED]
>>>  Subject: Software Announcement: LuSql: Database to Lucene indexing
>>>  Reply-To: [EMAIL PROTECTED]
>>> 
>>>  LuSql is a simple but powerful tool for building Lucene indexes from
>>>  relational databases. It is a command-line Java application for the
>>>  construction of a Lucene index from an arbitrary SQL query of a
>>>  JDBC-accessible SQL database. It allows a user to control a number of
>>>  parameters, including the SQL query to use, individual
>>>  indexing/storage/term-vector nature of fields, analyzer, stop word
>>>  list, and other tuning parameters. In its default mode it uses
>>>  threading to take advantage of multiple cores.
>>> 
>>>  LuSql can handle complex queries, allows for additional per record
>>>  sub-queries, and has a plug-in architecture for arbitrary Lucene
>>>  document manipulation. Its only dependencies are three Apache Commons
>>>  libraries, the Lucene core itself, and a JDBC driver.
>>> 
>>>  LuSql has been extensively tested, including a large 6+ million
>>>  full-text & metadata journal article document collection, producing
>>>  an
>>>  86GB Lucene index in ~13 hours.
>>> 
>>>  http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
>>> 
>>>  Glen Newton
>>> 
>>>  --
>>> 
>>>  -
>>> 
>>>  -
>>>  To unsubscribe, e-mail: [EMAIL PROTECTED]
>>>  For additional commands, e-mail: [EMAIL PROTECTED]
>>> 
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >>
>>> >> -
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>>
>>> -
>>>
>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>
>
>
> --
>
> -
>



-- 
--Noble Paul


Re: Wait Flush, Wait Searcher and commit Scenarios

2008-11-18 Thread Mark Miller
Does waitFlush do anything now? I only see it being set if eclipse is 
not missing a reference...


Ryan McKinley wrote:

waitFlush I'm not sure...

waitSearcher=true  it will wait until a new searcher is opened after 
your commit, that way the client is guaranteed to have the results 
that were just sent in the index.  if waitSearcher=false, a query could 
hit a searcher that does not have the new documents in the index.



On Nov 18, 2008, at 9:13 PM, Grant Ingersoll wrote:

Was wondering if anyone can fill me in on the when and why I would 
set waitFlush and waitSearcher to false when sending a commit 
command?  I think I understand what they do technically (I've looked 
at the code), but I am not clear about why I would want to do it.  Is 
there a risk in setting them to false? Wouldn't my response return a 
lot faster if it was set to false?  Yet, the default is true for 
both, so I assume there is a reason I want to wait...


Thanks,
Grant






Re: Wait Flush, Wait Searcher and commit Scenarios

2008-11-18 Thread Ryan McKinley

waitFlush I'm not sure...

waitSearcher=true  it will wait until a new searcher is opened after  
your commit, that way the client is guaranteed to have the results  
that were just sent in the index.  if waitSearcher=false, a query could  
hit a searcher that does not have the new documents in the index.
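To make that concrete, here is a minimal SolrJ sketch (my illustration, not
from the thread; it assumes the stock example URL):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class CommitWithoutWaiting {
      public static void main(String[] args) throws Exception {
          SolrServer server =
              new CommonsHttpSolrServer("http://localhost:8983/solr");
          // waitFlush=true, waitSearcher=false: returns once the index is
          // flushed, without blocking until the new searcher is registered,
          // so an immediate query may still hit the old searcher.
          server.commit(true, false);
      }
  }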



On Nov 18, 2008, at 9:13 PM, Grant Ingersoll wrote:

Was wondering if anyone can fill me in on the when and why I would  
set waitFlush and waitSearcher to false when sending a commit  
command?  I think I understand what they do technically (I've looked  
at the code), but I am not clear about why I would want to do it.   
Is there a risk in setting them to false? Wouldn't my response  
return a lot faster if it was set to false?  Yet, the default is  
true for both, so I assume there is a reason I want to wait...


Thanks,
Grant




Wait Flush, Wait Searcher and commit Scenarios

2008-11-18 Thread Grant Ingersoll
Was wondering if anyone can fill me in on the when and why I would set  
waitFlush and waitSearcher to false when sending a commit command?  I  
think I understand what they do technically (I've looked at the code),  
but I am not clear about why I would want to do it.  Is there a risk  
in setting them to false? Wouldn't my response return a lot faster if  
it was set to false?  Yet, the default is true for both, so I assume  
there is a reason I want to wait...


Thanks,
Grant


Re: Software Announcement: LuSql: Database to Lucene indexing

2008-11-18 Thread Mike Klaas

On 18-Nov-08, at 6:56 AM, Glen Newton wrote:


Erik,

Right now there is no real abstraction like DIH in LuSql. But as
indicated in the TODO section of the documentation, I was planning on
implementing or straight borrowing DIH in the near future.

I am assuming that Solr is all multi-threaded & as performant as it
can be. Is there a test SQL database that is used to test Solr, so I
might try to do some comparisons?


Actually, I think that Solr's multithreaded indexing could be  
improved.  It is really only analysis that is parallelizable ATM.


-Mike


Re: Software Announcement: LuSql: Database to Lucene indexing

2008-11-18 Thread Glen Newton
Yes, I've found it.

Do you want my comments here or in solr-dev or on jira?

 Glen

2008/11/18 Shalin Shekhar Mangar <[EMAIL PROTECTED]>:
> Hi Glen,
>
> There is an issue open for making DIH API friendly. Take a look and let us
> know what you think.
>
> https://issues.apache.org/jira/browse/SOLR-853
>
> On Tue, Nov 18, 2008 at 8:26 PM, Glen Newton <[EMAIL PROTECTED]> wrote:
>
>> Erik,
>>
>> Right now there is no real abstraction like DIH in LuSql. But as
>> indicated in the TODO section of the documentation, I was planning on
>> implementing or straight borrowing DIH in the near future.
>>
>> I am assuming that Solr is all multi-threaded & as performant as it
>> can be. Is there a test SQL database that is used to test Solr, so I
>> might try to do some comparisons?
>>
>> Not being a Solr user, it is hard for me to know of any advantages of
>> LuSql over Solr. Hopefully some in the community can identify possible
>> overlaps / use cases. I will see what I can figure out.
>>
>> Thanks,
>>
>> -Glen
>>
>> 2008/11/18 Erik Hatcher <[EMAIL PROTECTED]>:
>> > Glen,
>> >
>> > The thing is, Solr has a database integration built-in with the new
>> > DataImportHandler.   So I'm not sure how much interest Solr users would
>> have
>> > in LuSql by itself.
>> >
>> > Maybe there are LuSql features that DIH could borrow from?  Or vice
>> versa?
>> >
>> >Erik
>> >
>> >
>> > On Nov 17, 2008, at 11:03 PM, Glen Newton wrote:
>> >>
>> >> That said, I am very interested in making LuSql useful to the Solr
>> >> community as well as the broader Lucene community, so if any of you
>> >> can offer any feedback on how LuSql can be changed to better support
>> >> Solr, I would appreciate it.
>> >>
>> >> thanks,
>> >>
>> >> Glen Newton
>> >>
>> >> ---
>> >> From    Erik Hatcher <[EMAIL PROTECTED]>
>> >> Subject Re: Software Announcement: LuSql: Database to Lucene indexing
>> >> Date    Mon, 17 Nov 2008 20:12:35 GMT
>> >>
>> >> Yeah, it'd work, though not only does the version of Lucene need to
>> >> match, but the field indexing/storage attributes need to jive as well
>> >> - and that is the trickier part of the equation.
>> >>
>> >> But yeah, LuSQL looks slick!
>> >>
>> >>Erik
>> >>
>> >>
>> >> On Nov 17, 2008, at 2:17 PM, Matthew Runo wrote:
>> >>
>> >>> Hello -
>> >>>
>> >>> I wanted to forward this on, since I thought that people here might
>> >>> be able to use this to build indexes. So long as the lucene version
>> >>> in LuSQL matches the version in Solr, it would work fine for
>> >>> indexing - yea?
>> >>>
>> >>> Thanks for your time!
>> >>>
>> >>> Matthew Runo
>> >>> Software Engineer, Zappos.com
>> >>> [EMAIL PROTECTED] - 702-943-7833
>> >>>
>> >>> Begin forwarded message:
>> >>>
>>  From: "Glen Newton" <[EMAIL PROTECTED]>
>>  Date: November 17, 2008 4:32:18 AM PST
>>  To: [EMAIL PROTECTED]
>>  Subject: Software Announcement: LuSql: Database to Lucene indexing
>>  Reply-To: [EMAIL PROTECTED]
>> 
>>  LuSql is a simple but powerful tool for building Lucene indexes from
>>  relational databases. It is a command-line Java application for the
>>  construction of a Lucene index from an arbitrary SQL query of a
>>  JDBC-accessible SQL database. It allows a user to control a number of
>>  parameters, including the SQL query to use, individual
>>  indexing/storage/term-vector nature of fields, analyzer, stop word
>>  list, and other tuning parameters. In its default mode it uses
>>  threading to take advantage of multiple cores.
>> 
>>  LuSql can handle complex queries, allows for additional per record
>>  sub-queries, and has a plug-in architecture for arbitrary Lucene
>>  document manipulation. Its only dependencies are three Apache Commons
>>  libraries, the Lucene core itself, and a JDBC driver.
>> 
>>  LuSql has been extensively tested, including a large 6+ million
>>  full-text & metadata journal article document collection, producing
>>  an
>>  86GB Lucene index in ~13 hours.
>> 
>>  http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
>> 
>>  Glen Newton
>> 
>>  --
>> 
>>  -
>> 
>>  -
>>  To unsubscribe, e-mail: [EMAIL PROTECTED]
>>  For additional commands, e-mail: [EMAIL PROTECTED]
>> 
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >>
>> >> -
>> >
>> >
>>
>>
>>
>> --
>>
>> -
>>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 

-


Re: Deadlock with DirectUpdateHandler2

2008-11-18 Thread Mike Klaas

On 18-Nov-08, at 12:18 PM, Mark Miller wrote:


Mike Klaas wrote:



autoCommitCount is written in a CommitTracker.synchronized block  
only.  It is read to print stats in an unsynchronized fashion,  
which perhaps could be fixed, though I can't see how it could cause  
a problem


lastAddedTime is only written in a call path within a  
DirectUpdateHandler2.synchronized block.  It is only read in a  
CommitTracker.synchronized block.  It could read the wrong value,  
but I also don't see this causing a problem (a commit might fail to  
be scheduled).  This could probably also be improved, but doesn't  
seem important.
Right. I don't see these as causing a deadlock either, but whatever  
happens, its pretty much JVM undefined right, hence 'who  
knows' (I'll go with pretty doubtful ). I am not so sure its safe  
to read a value from an unsynced method whether you care about the  
result or not though. Its prob safe for atomic types and volatiles,  
but I'm fairly sure your playing with fire doing read/write in and  
out of sync. I don't think its just about stale values. But then  
again, it probably works 99.9% of the time or something.


Yeah, something like lastAddedTime is worth getting right, even if it  
works most of the time.  Things like autoCommitCount that read a long  
unsynchronized only to display on an admin screen don't concern me as  
much.


-Mike


Re: Processing of prx file for phrase queries: Whole position list for term read?

2008-11-18 Thread Erik Hatcher
Rather than attempt an answer to your questions directly, I'll mention  
how other projects have dealt with the very-common-word issue.  Nutch,  
for example, has a list of high frequency terms and concatenates them  
with the successive word in order to form less-frequent aggregate  
terms.  The original term is also indexed, but during querying in  
phrases, the common terms are again concatenated, thus making querying  
a lot faster.
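For example (my own rough illustration, not Nutch's exact token output):
with "the" on the common-terms list, indexing "the new economics" also
produces an aggregate token like "the-new", and a phrase query for the same
string can then be run against "the-new" and "economics" instead of scanning
the enormous postings list for "the" by itself.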


I may not have explained it entirely accurately, but that's the gist.   
Have a look at Nutch's Analyzer for more details.


Erik


On Nov 18, 2008, at 4:00 PM, Burton-West, Tom wrote:


Hello,

We are working with a very large index and with large documents (300+
page books.)  It appears that the bottleneck on our system is the disk
IO involved in reading position information from the prx file for
commonly occurring terms.

An example slow query is  "the new economics".

To process the above phrase query for the word "the", does the entire
part of the .prx file for the word "the" need to be read into  
memory or

only the fragments of the entries for the word "the" that contain
specific doc ids?

In reading the lucene index file formats document
(http://lucene.apache.org/java/2_4_0/fileformats.html) it's not clear
whether the .tis file stores a pointer into the .prx file for a term
(and therefore the entire list of doc_ids and positions for that term
needs to be read into memory), or if the .tis file stores a pointer to
the term **and doc id** in the prx file, in which case only the
positions for a given doc id would need to be read. Or if somehow the
.frq file has information on where to find the doc id in the .prx  
file.



The documentation for the .tis file says that it stores ProxDelta  
which

is based on the term (rather than the term/doc id).  On the other hand
the documentation for the .prx file states that Positions entries are
"ordered by increasing document number (the document number is  
implicit

from the .frq file)"


Tom




Re: Using properties from core configuration in data-config.xml

2008-11-18 Thread gistolero
Very cool :-)

Both suggestions work fine! But only with solr version 1.4:
https://issues.apache.org/jira/browse/SOLR-823

Use a nightly build (e.g. 2008-11-17 works):
http://people.apache.org/builds/lucene/solr/nightly/

See below for examples for both solutions...



((( 1 )))

> There may be one way to do this.
> 
> Add your property in the invariant section of solrconfig's
> DataImportHandler
> element. For example, add this section:
> 
> <lst name="invariants">
>   <str name="xmlDataDir">${xmlDataDir}</str>
> </lst>
> 
> Then you can use it as ${dataimporter.request.xmlDataDir} in your
> data-config to access this.



// *** solr.xml ***
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core1" instanceDir="core1">
      <property name="xmlDataDir" value="..." />
    </core>
  </cores>
</solr>

// *** solrconfig.xml ***
<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">./data-config.xml</str>
  </lst>
  <lst name="invariants">
    <str name="xmlDataDir">${xmlDataDir}</str>
  </lst>
</requestHandler>



// *** data-config.xml ***
<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <entity name="xmlFile"
      processor="FileListEntityProcessor"
      baseDir="${dataimporter.request.xmlDataDir}"
      ... >
    ...
  </document>
</dataConfig>

URL for full-import:
http://localhost:8983/solr/core1/dataimport?command=full-import&commit=true




((( 2 )))


> > nope . It is not possible as of now. the placeholders are not aware of
> > the core properties.
> > Is it possible to pass the values as request params? Request
> > parameters can be accessed .


// *** data-config.xml ***
<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <entity name="xmlFile"
      processor="FileListEntityProcessor"
      baseDir="${dataimporter.request.xmlDataDir}"
      ... >
    ...
  </document>
</dataConfig>

URL for full-import:
http://localhost:8983/solr/core1/dataimport?command=full-import&commit=true&xmlDataDir=%2Fhome%2Fcore1


That's all.
Gisto



> > On Mon, Nov 17, 2008 at 7:57 PM,  <[EMAIL PROTECTED]> wrote:
> > > Hello,
> > >
> > > is it possible to use properties from core configuration in
> > data-config.xml?
> > > I want to define the "baseDir" for DataImportHandler.
> > >
> > >
> > > I tried the following configuration:
> > >
> > >
> > > *** solr.xml ***
> > >
> > > <solr persistent="true">
> > >   <cores adminPath="/admin/cores">
> > >     <core name="core1" instanceDir="core1">
> > >       <property name="xmlDataDir" value="..." />
> > >       ...
> > >     </core>
> > >   </cores>
> > > </solr>
> > >
> > >
> > >
> > >
> > > *** data-config.xml ***
> > >
> > > <dataConfig>
> > >   <dataSource type="FileDataSource" />
> > >   <document>
> > >     <entity name="xmlFile"
> > >       processor="FileListEntityProcessor"
> > >       baseDir="${xmlDataDir}"
> > >       fileName="id-.*\.xml"
> > >       rootEntity="false"
> > >       dataSource="null">
> > >       <entity name="data" pk="id"
> > >         url="${xmlFile.fileAbsolutePath}"
> > >         processor="XPathEntityProcessor"
> > >         ...
> > > 
> > >
> > >
> > >
> > > But this is the result:
> > >
> > > ...
> > > Nov 17, 2008 1:50:08 PM
> org.apache.solr.handler.dataimport.DataImporter
> > doFullImport
> > > INFO: Starting Full Import
> > > Nov 17, 2008 1:50:08 PM org.apache.solr.core.SolrCore execute
> > > INFO: [posts-politics] webapp=/solr path=/dataimport
> >
> params={optimize=true&commit=true&command=full-import&qt=/dataimport&wt=javabin&version=2.2}
> > status=0 QTime=66
> > > Nov 17, 2008 1:50:08 PM org.apache.solr.core.SolrCore execute
> > > INFO: [posts-politics] webapp=/solr path=/dataimport
> > params={qt=/dataimport&wt=javabin&version=2.2} status=0 QTime=0
> > > Nov 17, 2008 1:50:08 PM org.apache.solr.update.DirectUpdateHandler2
> > deleteAll
> > > INFO: [posts-politics] REMOVING ALL DOCUMENTS FROM INDEX
> > > Nov 17, 2008 1:50:08 PM
> org.apache.solr.handler.dataimport.DataImporter
> > doFullImport
> > > SEVERE: Full Import failed
> > > org.apache.solr.handler.dataimport.DataImportHandlerException:
> 'baseDir'
> > should point to a directory Processing Document # 1
> > >  at
> >
> org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:81)
> > > ...
> > >
> > >
> > >
> > >
> > > I tried also to configure all dataimport settings in solrconfig.xml,
> but
> > I don't know how to do this exactly. Among other things, I tried this
> > format:
> > >
> > >
> > > *** solrconfig.xml ***
> > >
> > > ...
> > >  > class="org.apache.solr.handler.dataimport.DataImportHandler">
> > >  
> > >  
> > >   FileDataSource
> > >   
> > >
> > > xmlFile
> > > FileListEntityProcessor
> > > ${xmlDataDir}
> > > id-.*\.xml
> > > false
> > > null"
> > > 
> > >   data
> > >   id
> > >   ${xmlFile.fileAbsolutePath}
> > > ...
> > > 
> > > ...
> > >
> > >
> > >
> > > But all my tests (with different "dataimport" formats in
> solrconfig.xml)
> > failed:
> > >
> > >
> > > ...
> > > INFO: Reusing parent classloader
> > > Nov 17, 2008 2:18:14 PM org.apache.solr.common.SolrException log
> > > SEVERE: Error in solrconfig.xml:org.apache.solr.common.SolrException:
> No
> > system property or default value specified for xmlFile.fileAbsolutePath
> > >at
> > org.apache.solr.common.util.DOMUtil.substituteProperty(DOMUtil.java:311)
> > >at
> >
> org.apache.solr.common.util.DOMUtil.substituteProperties(DOMUtil.java:264)
> > > ...
> > >
> > >
> > >
> > > Thanks again for your excellent support!
> > >
> > > Gisto
> > >
> > > --
> > > Der GMX SmartSurfer hilft bis zu 70% Ihrer Onlinekosten zu sparen!
> > > Ideal für Modem und ISDN: http://www.gmx.net/de/go/smartsurfer
> > >
> >
> >
> >
> > --
> > --Noble Paul
> >
> 
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.

-- 
Der GMX SmartSurfer hilft bis zu 70% Ihrer Onlinekosten zu sparen! 
Ideal für Modem und ISDN: http://www.gmx.net/de/go/smartsurfer


Processing of prx file for phrase queries: Whole position list for term read?

2008-11-18 Thread Burton-West, Tom
Hello,

We are working with a very large index and with large documents (300+
page books.)  It appears that the bottleneck on our system is the disk
IO involved in reading position information from the prx file for
commonly occurring terms. 

An example slow query is  "the new economics".

To process the above phrase query for the word "the", does the entire
part of the .prx file for the word "the" need to be read into memory or
only the fragments of the entries for the word "the" that contain
specific doc ids?

In reading the lucene index file formats document
(http://lucene.apache.org/java/2_4_0/fileformats.html) it's not clear
whether the .tis file stores a pointer into the .prx file for a term
(and therefore the entire list of doc_ids and positions for that term
needs to be read into memory), or if the .tis file stores a pointer to
the term **and doc id** in the prx file, in which case only the
positions for a given doc id would need to be read. Or if somehow the
.frq file has information on where to find the doc id in the .prx file.


The documentation for the .tis file says that it stores ProxDelta which
is based on the term (rather than the term/doc id).  On the other hand
the documentation for the .prx file states that Positions entries are
"ordered by increasing document number (the document number is implicit
from the .frq file)"


Tom



Re: Deadlock with DirectUpdateHandler2

2008-11-18 Thread Toby Cole

On 18 Nov 2008, at 20:18, Mark Miller wrote:


Mike Klaas wrote:



autoCommitCount is written in a CommitTracker.synchronized block  
only.  It is read to print stats in an unsynchronized fashion,  
which perhaps could be fixed, though I can't see how it could cause  
a problem


lastAddedTime is only written in a call path within a  
DirectUpdateHandler2.synchronized block.  It is only read in a  
CommitTracker.synchronized block.  It could read the wrong value,  
but I also don't see this causing a problem (a commit might fail to  
be scheduled).  This could probably also be improved, but doesn't  
seem important.
Right. I don't see these as causing a deadlock either, but whatever  
happens, its pretty much JVM undefined right, hence 'who  
knows' (I'll go with pretty doubtful ). I am not so sure its safe  
to read a value from an unsynced method whether you care about the  
result or not though. Its prob safe for atomic types and volatiles,  
but I'm fairly sure you're playing with fire doing read/write in and  
out of sync. I don't think its just about stale values. But then  
again, it probably works 99.9% of the time or something.


pending seems to be the issue.  As long as commits are only  
triggered by autocommit, there is no issue as manipulation of  
pending is always performed inside CommitTracker.synchronized.  But  
didCommit()/didRollback() could be called via manual commit, and  
pending is directly manipulated during DUH2.close().  I'm having  
trouble coming up with a plausible deadlock scenario, but this  
needs to be fixed.  It isn't as easy as synchronizing didCommit/ 
didRollback, though--this would introduce definite deadlock  
scenarios.


Mark, is there any chance you could post the thread dump for the  
deadlocked process?  Do you issue manual commits during insertion?

Toby reported it. Thread dump Toby?


-Mike


I'll try and post a thread dump when I get to work, can't remote in  
from here.
I don't mind helping out with the fix, I've been getting to know  
solr's internals quite intimately recently after writing a few  
handlers/components for internal projects.


T


Re: Deadlock with DirectUpdateHandler2

2008-11-18 Thread Mark Miller

Mike Klaas wrote:



autoCommitCount is written in a CommitTracker.synchronized block 
only.  It is read to print stats in an unsynchronized fashion, which 
perhaps could be fixed, though I can't see how it could cause a problem


lastAddedTime is only written in a call path within a 
DirectUpdateHandler2.synchronized block.  It is only read in a 
CommitTracker.synchronized block.  It could read the wrong value, but 
I also don't see this causing a problem (a commit might fail to be 
scheduled).  This could probably also be improved, but doesn't seem 
important.
Right. I don't see these as causing a deadlock either, but whatever 
happens, its pretty much JVM undefined right, hence 'who knows' (I'll go 
with pretty doubtful ). I am not so sure its safe to read a value 
from an unsynced method whether you care about the result or not though. 
Its prob safe for atomic types and volatiles, but I'm fairly sure you're 
playing with fire doing read/write in and out of sync. I don't think its 
just about stale values. But then again, it probably works 99.9% of the 
time or something.
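As an aside, a minimal Java illustration of the visibility hazard being
described (my sketch, not Solr code):

  class Counter {
      private long count; // not volatile, not always read under the lock

      synchronized void increment() {
          count++;
      }

      // Unsynchronized read: legal Java, but the memory model allows a
      // stale value here, and for a non-volatile long it even permits a
      // torn (word-split) read.
      long peek() {
          return count;
      }
  }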


pending seems to be the issue.  As long as commits are only triggered 
by autocommit, there is no issue as manipulation of pending is always 
performed inside CommitTracker.synchronized.  But 
didCommit()/didRollback() could be called via manual commit, and 
pending is directly manipulated during DUH2.close().  I'm having 
trouble coming up with a plausible deadlock scenario, but this needs 
to be fixed.  It isn't as easy as synchronizing didCommit/didRollback, 
though--this would introduce definite deadlock scenarios.


Mark, is there any chance you could post the thread dump for the 
deadlocked process?  Do you issue manual commits during insertion?

Toby reported it. Thread dump Toby?


-Mike




Re: Deadlock with DirectUpdateHandler2

2008-11-18 Thread Mike Klaas


On 18-Nov-08, at 8:54 AM, Mark Miller wrote:


Mark Miller wrote:

Toby Cole wrote:
Has anyone else experienced a deadlock when the  
DirectUpdateHandler2 does an autocommit?
I'm using a recent snapshot from hudson (apache- 
solr-2008-11-12_08-06-21), and quite often when I'm loading data  
the server (tomcat 6) gets stuck at line 469 of  
DirectUpdateHandler2:


 // Check if there is a commit already scheduled for longer  
then this time

 if( pending != null &&
 pending.getDelay(TimeUnit.MILLISECONDS) >= commitMaxTime )

Anyone got any enlightening tips?



There is some inconsistent synchronization I think. Especially  
involving pending. Yuck 
I would say there are problems with pending, autoCommitCount, and  
lastAddedTime. That alone could probably cause a deadlock (who  
knows), but it also seems somewhat possible that there is an issue  
with the heavy intermingling of locks (there are a bunch of locks to be  
had in that class). I haven't looked for evidence of that though -  
prob makes sense to fix those 3 guys and see if you get reports from  
there.



autoCommitCount is written in a CommitTracker.synchronized block  
only.  It is read to print stats in an unsynchronized fashion, which  
perhaps could be fixed, though I can't see how it could cause a problem


lastAddedTime is only written in a call path within a  
DirectUpdateHandler2.synchronized block.  It is only read in a  
CommitTracker.synchronized block.  It could read the wrong value, but  
I also don't see this causing a problem (a commit might fail to be  
scheduled).  This could probably also be improved, but doesn't seem  
important.


pending seems to be the issue.  As long as commits are only triggered  
by autocommit, there is no issue as manipulation of pending is always  
performed inside CommitTracker.synchronized.  But didCommit()/ 
didRollback() could be called via manual commit, and pending is  
directly manipulated during DUH2.close().  I'm having trouble coming  
up with a plausible deadlock scenario, but this needs to be fixed.  It  
isn't as easy as synchronizing didCommit/didRollback, though--this  
would introduce definite deadlock scenarios.


Mark, is there any chance you could post the thread dump for the  
deadlocked process?  Do you issue manual commits during insertion?


-Mike


Re: solr-ruby gem

2008-11-18 Thread Matt Mitchell
I've been using solr-ruby with 1.3 for quite a while now. It's powering our
"experimental", open-source OPAC, "Blacklight":

blacklight.rubyforge.org

I've got a custom query builder and response wrapper, but it's using
solr-ruby underneath.

Matt

On Tue, Nov 18, 2008 at 2:57 PM, Erik Hatcher <[EMAIL PROTECTED]>wrote:

>
> On Nov 18, 2008, at 2:41 PM, Kashyap, Raghu wrote:
>
>> Anyone knows if the solr-ruby gem is compatible with solr 1.3??
>>
>
> Yes, the gem at rubyforge is compatible with 1.3.  Also, the library itself
> is distributed with the binary release of Solr, in client/ruby/solr-ruby/lib
>
>  Also anyone using the acts_as_solr plugin? Of late the website is down and
>> can't find any recent activities on that
>>
>
> From my perspective, acts_as_solr is a mess.  [My apologies for creating
> the initial hack that then morphed out of control]
>
> There are a lot of users of various versions of acts_as_solr, and
> discussion of that continues here: <
> http://groups.google.com/group/acts_as_solr>.  There are numerous github
> branches each with various patches applied - take your pick and run with one
> of them :)
>
> Or go lighter weight and roll-your-own acts_as_solr by simply putting in
> after_save/after_destroy hooks.  See slide 13 of <
> http://code4lib.org/files/solr-ruby.pdf>
>
>Erik
>
>
>


Re: solr-ruby gem

2008-11-18 Thread Erik Hatcher


On Nov 18, 2008, at 2:41 PM, Kashyap, Raghu wrote:

Anyone knows if the solr-ruby gem is compatible with solr 1.3??


Yes, the gem at rubyforge is compatible with 1.3.  Also, the library  
itself is distributed with the binary release of Solr, in
client/ruby/solr-ruby/lib


Also anyone using the acts_as_solr plugin? Of late the website is down  
and

can't find any recent activities on that


From my perspective, acts_as_solr is a mess.  [My apologies for  
creating the initial hack that then morphed out of control]


There are a lot of users of various versions of acts_as_solr, and  
discussion of that continues here: <http://groups.google.com/group/acts_as_solr>.  There are numerous github branches each with various patches  
applied - take your pick and run with one of them :)


Or go lighter weight and roll-your-own acts_as_solr by simply putting  
in after_save/after_destroy hooks.  See slide 13 of <http://code4lib.org/files/solr-ruby.pdf>


Erik




solr-ruby gem

2008-11-18 Thread Kashyap, Raghu
Anyone knows if the solr-ruby gem is compatible with solr 1.3??

 

Also anyone using the acts_as_solr plugin? Of late the website is down and
can't find any recent activities on that

 

-Raghu



Is there a DTD/XSD for XML response?

2008-11-18 Thread Simon Hu

Hi, 

I assume there is a schema definition or DTD for XML response but could not
find it anywhere. 

Is there one? 

thanks
-Simon 

-- 
View this message in context: 
http://www.nabble.com/Is-there-a-DTD-XSD-for-XML-response--tp20565773p20565773.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: EmbeddedSolrServer questions

2008-11-18 Thread Jeryl Cook
I am using EmbeddedSolrServer and simply have a queue that documents
are sent to, and a listener on that queue that writes them to the
index.

Or just keep it simple, and use a synchronized block around the
method in the write server that writes the document to the index.
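Something along these lines for the second option (a rough sketch; the
class and method names are made up):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.common.SolrInputDocument;

  // Serializes writes so concurrent producer threads never touch the
  // embedded index writer at the same time.
  public class WriteServer {
      private final SolrServer server; // e.g. an EmbeddedSolrServer

      public WriteServer(SolrServer server) {
          this.server = server;
      }

      public synchronized void write(SolrInputDocument doc) throws Exception {
          server.add(doc);
      }
  }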

Jeryl Cook
/^\ Pharaoh /^\
http://pharaohofkush.blogspot.com/
"Whether we bring our enemies to justice, or bring justice to our
enemies, justice will be done."
--George W. Bush, Address to a Joint Session of Congress and the
American People, September 20, 2001

On Tue, Nov 18, 2008 at 9:36 AM, Thierry Templier <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I have some questions regarding the use of the EmbeddedSolrServer in order to 
> embed a solr instance into a Java application.
>
> 1°) Is an instance of the EmbeddedSolrServer class threadsafe when used by 
> several concurrent threads?
>
> 2°) Regarding transactions, can an instance of the EmbeddedSolrServer 
> class be used to make two transactions at the same time by two 
> different threads?
>
> Thanks for your help,
> Thierry
>
>
>
>
>



-- 
Jeryl Cook
/^\ Pharaoh /^\
http://pharaohofkush.blogspot.com/
"Whether we bring our enemies to justice, or bring justice to our
enemies, justice will be done."
--George W. Bush, Address to a Joint Session of Congress and the
American People, September 20, 2001


RE: Query Response Doc Score -> Int Value

2008-11-18 Thread Nguyen, Joe
You don't need to hack the code since you can virtually treat these
scores 2.3518934 and 2.2173865 as if they were both equal (ignoring
digits after the decimal point).

Score = original score(2.3518934) + function(date_created)

You can scale the value of function(date_created) so that digits after
the decimal point in the original score do not significantly influence
the final score.

E.g.  
To treat digits after the decimal point as *insignificant*:
 Score  = 2.3518934 + 10.00 = 12.3518934
  
To make those significant, make function(date_created) result in a small
number:
 Score = 2.3518934 + 0.2   = 2.5518934

You can specify the function in the request URL
(http://wiki.apache.org/solr/FunctionQuery)
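
For example, with the dismax handler the date component can ride along as a
boost function right in the URL (a sketch: the recip/rord recipe is the one
from the FunctionQuery wiki page, while the host, field name and constants
are illustrative):

  http://localhost:8983/solr/select?qt=dismax&q=foo&bf=recip(rord(created_date),1,1000,1000)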



  

-Original Message-
From: Derek Springer [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 18, 2008 8:39 AM
To: solr-user@lucene.apache.org
Subject: Re: Query Response Doc Score -> Int Value

Better yet, does anyone know where the method that writes the score
lives?
For instance, a getScore() method that writes the score out that I could
override and truncate? Thanks!

-Derek

On Mon, Nov 17, 2008 at 9:59 PM, Derek Springer <[EMAIL PROTECTED]>
wrote:

> Thanks for the heads up. Can anyone point me to (or provide me with) 
> an example of writing a function query?
>
> -Derek
>
>
> On Mon, Nov 17, 2008 at 8:17 PM, Yonik Seeley <[EMAIL PROTECTED]>
wrote:
>
>> A function query is the likely candidate - no such quantization 
>> function exists, but it would be relatively easy to write one.
>>
>> -Yonik
>>
>> On Mon, Nov 17, 2008 at 8:17 PM, Derek Springer <[EMAIL PROTECTED]>
wrote:
>> > Hello,
>> > I am currently performing a query to a Solr index I've set up and 
>> > I'm
>> trying
>> > to 1) sort on the score and 2) sort on the date_created (a custom 
>> > field
>> I've
>> > added). The sort command looks like:
sort=score+desc,created_date+desc.
>> >
>> > The gist of it is that I will 1) first return the most relevant 
>> > results
>> then
>> > 2) within those results, return the most recent results. However, 
>> > the
>> issue
>> > I have is that the score is a decimal value that is far too precise
(e.g.
>> > 2.3518934 vs 2.2173865) and will therefore never "collide" and 
>> > trigger
>> the
>> > secondary sort on the date.
>> >
>> > The question I am asking is if anyone knows a way to produce a 
>> > score
>> that is
>> > more coarse, or if it is possible to force the score to return as 
>> > an integer. That way I could have the results collide on the score 
>> > more
>> often
>> > and therefore sort on the date as well.
>> >
>> > Thanks!
>> > -Derek
>> >
>>
>
>
>
> --
> Derek B. Springer
> Software Developer
> Mahalo.com, Inc.
> 902 Colorado Ave.,
> Santa Monica, CA 90401
> [EMAIL PROTECTED]
>



--
Derek B. Springer
Software Developer
Mahalo.com, Inc.
902 Colorado Ave.,
Santa Monica, CA 90401
[EMAIL PROTECTED]


Re: specifying Sort criteria through Solr admin ui ...

2008-11-18 Thread Chris Hostetter

: Is there a way to specify sort criteria through Solr admin ui. I tried 
: doing it through the query statement box but it did not work.

the search box on the admin gui is fairly limited ... it's just a quick 
and dirty way to run test queries.  other options like sorting, filtering, and 
faceting need to be added by hand when hitting solr via a browser.

you can just add "&sort=your_sort_field+desc" to your query URLs to get 
sorted results.
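
For example (hostname and field name are illustrative):

  http://localhost:8983/solr/select?q=solr&sort=score+desc,created_date+desc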


-Hoss



Re: Error in indexing timestamp format.

2008-11-18 Thread Shalin Shekhar Mangar
Take a look at the DateFormatTransformer. You can find documentation on the
DataImportHandler wiki.

http://wiki.apache.org/solr/DataImportHandler
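
A sketch of what the relevant data-config.xml entry might look like (entity
name, query and column are illustrative; dateTimeFormat takes a
java.text.SimpleDateFormat pattern, which has no true sub-millisecond
support, so the fractional seconds may need trimming first):

  <entity name="item" query="select * from item" transformer="DateFormatTransformer">
    <field column="CREATED_DATE" dateTimeFormat="dd-MM-yy hh:mm:ss a" />
  </entity>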

On Tue, Nov 18, 2008 at 10:41 PM, con <[EMAIL PROTECTED]> wrote:

>
>
> Hi Noble,
> I am using DIH.
>
>
>
> Noble Paul നോബിള്‍ नोब्ळ् wrote:
> >
> > How are you indexing the data ? by posting xml? or using DIH?
> >
> >
> > On Tue, Nov 18, 2008 at 3:53 PM, con <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi Guys
> >> I have timestamp fields in my database in the format,
> >> ddmmyyhhmmss.Z AM
> >> eg: 26-05-08 10:45:53.66100 AM
> >> But I think that since the Solr date format is different, I am unable to
> >> index the document with solr.DateField.
> >> So is there any option by which I can give my timestamp format to Solr
> >> and
> >> index and search data based on that.
> >> I have already set up Solr and it is working fine for other datatypes.
> >> So is it possible to set this new date format in some config files.
> >>
> >> Expecting suggestions/advices
> >> Thanks in advance
> >> con
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/Error-in-indexing-timestamp-format.-tp20556862p20556862.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> >
> > --
> > --Noble Paul
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Error-in-indexing-timestamp-format.-tp20556862p20564005.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Regards,
Shalin Shekhar Mangar.


Re: Error in indexing timestamp format.

2008-11-18 Thread con


Hi Noble,
I am using DIH.



Noble Paul നോബിള്‍ नोब्ळ् wrote:
> 
> How are you indexing the data ? by posting xml? or using DIH?
> 
> 
> On Tue, Nov 18, 2008 at 3:53 PM, con <[EMAIL PROTECTED]> wrote:
>>
>> Hi Guys
>> I have timestamp fields in my database in the format,
>> ddmmyyhhmmss.Z AM
>> eg: 26-05-08 10:45:53.66100 AM
>> But I think that since the Solr date format is different, I am unable to
>> index the document with solr.DateField.
>> So is there any option by which I can give my timestamp format to Solr
>> and
>> index and search data based on that.
>> I have already set up Solr and it is working fine for other datatypes.
>> So is it possible to set this new date format in some config files.
>>
>> Expecting suggestions/advices
>> Thanks in advance
>> con
>> --
>> View this message in context:
>> http://www.nabble.com/Error-in-indexing-timestamp-format.-tp20556862p20556862.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> 
> -- 
> --Noble Paul
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Error-in-indexing-timestamp-format.-tp20556862p20564005.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Programmatic way to know when an optimize is finished?

2008-11-18 Thread Phillip Farber
I'm using Perl LWP, which has a default 30 sec timeout on the http 
request.  I guess I can set it to a larger number, like 24 hours :-). 
How do you set your timeout?


Phil

Lance Norskog wrote:

The 'optimize' http command blocks. If you script your automation, you can
just make the http call, and then the next command in the script runs after the
optimize finishes. Hours later, in our case.

Lance
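
For comparison, a SolrJ sketch of the same blocking behavior (the URL is
illustrative, and setSoTimeout is assumed available on CommonsHttpSolrServer):
optimize() does not return until the optimize completes, so the next
statement only runs afterwards.

  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class OptimizeAndContinue {
    public static void main(String[] args) throws Exception {
      CommonsHttpSolrServer server =
          new CommonsHttpSolrServer("http://localhost:8983/solr");
      server.setSoTimeout(0);   // no read timeout, since the call may block for hours
      server.optimize();        // blocks until the optimize finishes
      System.out.println("optimize done, safe to run the next indexing step");
    }
  }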

-Original Message-
From: Phillip Farber [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 14, 2008 10:04 AM

To: solr-user@lucene.apache.org
Subject: Programmatic way to know when an optimize is finished?


I'd like to automate my indexing processes.  Is there a slick method to know
when an optimize on an index has completed?

Thanks,

Phil



Re: Deadlock with DirectUpdateHandler2

2008-11-18 Thread Mark Miller

Mark Miller wrote:

Toby Cole wrote:
Has anyone else experienced a deadlock when the DirectUpdateHandler2 
does an autocommit?
I'm using a recent snapshot from hudson 
(apache-solr-2008-11-12_08-06-21), and quite often when I'm loading 
data the server (tomcat 6) gets stuck at line 469 of 
DirectUpdateHandler2:


  // Check if there is a commit already scheduled for longer then 
this time

  if( pending != null &&
  pending.getDelay(TimeUnit.MILLISECONDS) >= commitMaxTime )

Anyone got any enlightening tips?
Cheers,

Toby Cole
Software Engineer

Semantico
Lees House, Floor 1, 21-23 Dyke Road, Brighton BN1 3FE
T: +44 (0)1273 358 238
F: +44 (0)1273 723 232
E: [EMAIL PROTECTED]
W: www.semantico.com

There is some inconsistent synchronization, I think, especially 
involving pending. Yuck.
I would say there are problems with pending, autoCommitCount, and 
lastAddedTime. That alone could probably cause a deadlock (who knows), 
but it also seems somewhat possible that there is an issue with the 
heavy intermingling of locks (there are a bunch of locks to be had in that 
class). I haven't looked for evidence of that though - prob makes sense 
to fix those 3 guys and see if you get reports from there.
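
As a generic illustration of what consistent synchronization would mean here
(this is not the actual Solr code; the field and method names are made up),
every read and write of the related state goes through one lock:

  import java.util.concurrent.ScheduledFuture;
  import java.util.concurrent.TimeUnit;

  public class CommitTracker {
    private final Object lock = new Object();
    private ScheduledFuture<?> pending; // guarded by lock
    private int autoCommitCount;        // guarded by lock
    private long lastAddedTime;         // guarded by lock

    public void recordAdd() {
      synchronized (lock) { lastAddedTime = System.currentTimeMillis(); }
    }

    public void commitScheduled(ScheduledFuture<?> f) {
      synchronized (lock) { pending = f; autoCommitCount++; }
    }

    public boolean hasLongerPending(long commitMaxTime) {
      synchronized (lock) {
        return pending != null
            && pending.getDelay(TimeUnit.MILLISECONDS) >= commitMaxTime;
      }
    }
  }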


Re: TextProfileSignature using deduplication

2008-11-18 Thread Ken Krugler

Marc Sturlese wrote:

Hey there, I've been testing and checking the source of
TextProfileSignature.java to avoid similar entries at indexing time.
What I understood is that it is useful for huge texts, where the frequency of
the tokens (the words in lowercase, just with numbers and letters in that case)
is important. If you want to detect duplicates in text that is not huge, and
without giving a lot of importance to the frequencies, it doesn't work...
The hash will be made just with the terms whose frequency is higher than a
QUANTUM (whose value is derived from the max frequency among all the
terms). So it will say that:

aaa sss ddd fff ggg hhh aaa kkk lll ooo
aaa xxx iii www qqq aaa jjj eee zzz nnn

are duplicates, because quantum here would be 2 and the frequency of aaa
would be 2 as well. So, to make the hash, just the term aaa would be used.

In this case:
aaa sss ddd fff ggg hhh kkk lll ooo
apa sss ddd fff ggg hhh kkk lll ooo

Here quantum would be 1 and the frequencies of all terms would be 1, so all
terms would be used for the hash. It will consider these two strings not
similar.

As I understood the algorithm, there's no way to make it understand that in
my second case both strings are similar. I wish I were wrong...

I have my own duplication system to detect that, but I use String comparison,
so it works really slow... I would like to know if there is any tuning
possibility to do that with TextProfileSignature.
Don't know if I should post this here or in the developers forum...


Hi Marc,

TextProfileSignature is a rather crude 
implementation of approximate similarity, and as 
you pointed out it's best suited for large 
texts. The original purpose of this Signature 
was to deduplicate web pages in large amounts of 
crawled pages (in Nutch), where it worked 
reasonably well. Its advantage is also that it's 
easy to compute and doesn't require multiple 
passes over the corpus.


As it is implemented now, it breaks badly in the 
case you describe. You could modify this 
implementation to include also word-level 
ngrams, i.e. sequences of more than 1 word, up 
to N (e.g. 5) - this should work in your case.


Ultimately, what you are probably looking for is 
a shingle-based algorithm, but it's relatively 
costly and requires multiple passes.


There's an intermediate approach we use...

* Generate separate hashes for each of the quantized bands
* Create additional fingerprint values (depends on the nature of the data)
* Find potentially similar files using the above
* Then apply an accurate but slower comparison to determine true similarity

From our data, it's common to get files where 
(due to small text changes) the frequency of a 
term moves between quantized bands. This then 
changes the über hash that you get from combining 
all terms, but with 10 or so bands we still get 
some matches on the hashes from the individual 
bands.


The "find potentially similar files" uses a 
simple Lucene scoring function, based on the 
number of matching fingerprint values.
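
A rough sketch of the per-band hashing step (the names are hypothetical, and
the real pipeline also adds the extra fingerprint values and the slower
accurate comparison):

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;
  import java.util.Map;

  public class BandedSignature {
    // One hash per quantized frequency band: a term drifting between
    // two adjacent bands only invalidates those two band hashes.
    public static List<String> bandHashes(Map<String, Integer> termFreqs,
                                          int numBands, int maxFreq) {
      StringBuilder[] bands = new StringBuilder[numBands];
      for (int i = 0; i < numBands; i++) bands[i] = new StringBuilder();
      List<String> terms = new ArrayList<String>(termFreqs.keySet());
      Collections.sort(terms); // stable order keeps the hashes reproducible
      for (String term : terms) {
        int band = Math.min(numBands - 1,
            termFreqs.get(term) * numBands / (maxFreq + 1));
        bands[band].append(term).append(' ');
      }
      List<String> hashes = new ArrayList<String>(numBands);
      for (StringBuilder b : bands)
        hashes.add(Integer.toHexString(b.toString().hashCode()));
      return hashes; // index these as multiple fingerprint values per document
    }
  }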


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"


Re: Query Response Doc Score -> Int Value

2008-11-18 Thread Derek Springer
Better yet, does anyone know where the method that writes the score lives?
For instance, a getScore() method that writes the score out that I could
override and truncate? Thanks!

-Derek

On Mon, Nov 17, 2008 at 9:59 PM, Derek Springer <[EMAIL PROTECTED]> wrote:

> Thanks for the heads up. Can anyone point me to (or provide me with) an
> example of writing a function query?
>
> -Derek
>
>
> On Mon, Nov 17, 2008 at 8:17 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
>> A function query is the likely candidate - no such quantization
>> function exists, but it would be relatively easy to write one.
>>
>> -Yonik
>>
>> On Mon, Nov 17, 2008 at 8:17 PM, Derek Springer <[EMAIL PROTECTED]> wrote:
>> > Hello,
>> > I am currently performing a query to a Solr index I've set up and I'm
>> trying
>> > to 1) sort on the score and 2) sort on the date_created (a custom field
>> I've
>> > added). The sort command looks like: sort=score+desc,created_date+desc.
>> >
>> > The gist of it is that I will 1) first return the most relevant results
>> then
>> > 2) within those results, return the most recent results. However, the
>> issue
>> > I have is that the score is a decimal value that is far too precise (e.g.
>> > 2.3518934 vs 2.2173865) and will therefore never "collide" and trigger
>> the
>> > secondary sort on the date.
>> >
>> > The question I am asking is if anyone knows a way to produce a score
>> that is
>> > more coarse, or if it is possible to force the score to return as an
>> > integer. That way I could have the results collide on the score more
>> often
>> > and therefore sort on the date as well.
>> >
>> > Thanks!
>> > -Derek
>> >
>>
>
>
>
> --
> Derek B. Springer
> Software Developer
> Mahalo.com, Inc.
> 902 Colorado Ave.,
> Santa Monica, CA 90401
> [EMAIL PROTECTED]
>



-- 
Derek B. Springer
Software Developer
Mahalo.com, Inc.
902 Colorado Ave.,
Santa Monica, CA 90401
[EMAIL PROTECTED]
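
Until such a function query is written, a client-side stopgap sketch with
SolrJ (this assumes the documents were fetched with fl=*,score so "score"
comes back as a Float; the field names are illustrative): floor the score to
one decimal place and break ties on the date.

  import java.util.Collections;
  import java.util.Comparator;
  import java.util.Date;
  import java.util.List;
  import org.apache.solr.common.SolrDocument;

  public class CoarseScoreSort {
    public static void sort(List<SolrDocument> docs) {
      Collections.sort(docs, new Comparator<SolrDocument>() {
        public int compare(SolrDocument a, SolrDocument b) {
          // quantize: 2.3518934 and 2.3173865 both become 2.3 and "collide"
          double qa = Math.floor((Float) a.getFieldValue("score") * 10) / 10.0;
          double qb = Math.floor((Float) b.getFieldValue("score") * 10) / 10.0;
          if (qa != qb) return qa > qb ? -1 : 1;   // higher score first
          Date da = (Date) a.getFieldValue("created_date");
          Date db = (Date) b.getFieldValue("created_date");
          return db.compareTo(da);                 // newer date first
        }
      });
    }
  }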


Re: Deadlock with DirectUpdateHandler2

2008-11-18 Thread Mark Miller

Toby Cole wrote:
Has anyone else experienced a deadlock when the DirectUpdateHandler2 
does an autocommit?
I'm using a recent snapshot from hudson 
(apache-solr-2008-11-12_08-06-21), and quite often when I'm loading 
data the server (tomcat 6) gets stuck at line 469 of 
DirectUpdateHandler2:


  // Check if there is a commit already scheduled for longer then 
this time

  if( pending != null &&
  pending.getDelay(TimeUnit.MILLISECONDS) >= commitMaxTime )

Anyone got any enlightening tips?
Cheers,

Toby Cole
Software Engineer

Semantico
Lees House, Floor 1, 21-23 Dyke Road, Brighton BN1 3FE
T: +44 (0)1273 358 238
F: +44 (0)1273 723 232
E: [EMAIL PROTECTED]
W: www.semantico.com

There is some inconsistent synchronization, I think, especially involving 
pending. Yuck.


Deadlock with DirectUpdateHandler2

2008-11-18 Thread Toby Cole
Has anyone else experienced a deadlock when the DirectUpdateHandler2  
does an autocommit?
I'm using a recent snapshot from hudson (apache- 
solr-2008-11-12_08-06-21), and quite often when I'm loading data the  
server (tomcat 6) gets stuck at line 469 of DirectUpdateHandler2:


  // Check if there is a commit already scheduled for longer then  
this time

  if( pending != null &&
  pending.getDelay(TimeUnit.MILLISECONDS) >= commitMaxTime )

Anyone got any enlightening tips?
Cheers,

Toby Cole
Software Engineer

Semantico
Lees House, Floor 1, 21-23 Dyke Road, Brighton BN1 3FE
T: +44 (0)1273 358 238
F: +44 (0)1273 723 232
E: [EMAIL PROTECTED]
W: www.semantico.com



Re: Software Announcement: LuSql: Database to Lucene indexing

2008-11-18 Thread Shalin Shekhar Mangar
Hi Glen,

There is an issue open for making DIH API friendly. Take a look and let us
know what you think.

https://issues.apache.org/jira/browse/SOLR-853

On Tue, Nov 18, 2008 at 8:26 PM, Glen Newton <[EMAIL PROTECTED]> wrote:

> Erik,
>
> Right now there is no real abstraction like DIH in LuSql. But as
> indicated in the TODO section of the documentation, I was planning on
> implementing or straight borrowing DIH in the near future.
>
> I am assuming that Solr is all multi-threaded & as performant as it
> can be. Is there a test SQL database that is used to test Solr, so I
> might try to do some comparisons?
>
> Not being a Solr user, it is hard for me to know of any advantages of
> LuSql over Solr. Hopefully some in the community can identify possible
> overlaps / use cases. I will see what I can figure out.
>
> Thanks,
>
> -Glen
>
> 2008/11/18 Erik Hatcher <[EMAIL PROTECTED]>:
> > Glen,
> >
> > The thing is, Solr has a database integration built-in with the new
> > DataImportHandler.   So I'm not sure how much interest Solr users would
> have
> > in LuSql by itself.
> >
> > Maybe there are LuSql features that DIH could borrow from?  Or vice
> versa?
> >
> >Erik
> >
> >
> > On Nov 17, 2008, at 11:03 PM, Glen Newton wrote:
> >>
> >> That said, I am very interested in making LuSql useful to the Solr
> >> community as well as the broader Lucene community, so if any of you
> >> can offer any feedback on how LuSql can be changed to better support
> >> Solr, I would appreciate it.
> >>
> >> thanks,
> >>
> >> Glen Newton
> >>
> >> ---
> >> FromErik Hatcher <[EMAIL PROTECTED]>
> >> Subject Re: Software Announcement: LuSql: Database to Lucene indexing
> >> DateMon, 17 Nov 2008 20:12:35 GMT
> >>
> >> Yeah, it'd work, though not only does the version of Lucene need to
> >> match, but the field indexing/storage attributes need to jive as well
> >> - and that is the trickier part of the equation.
> >>
> >> But yeah, LuSQL looks slick!
> >>
> >>Erik
> >>
> >>
> >> On Nov 17, 2008, at 2:17 PM, Matthew Runo wrote:
> >>
> >>> Hello -
> >>>
> >>> I wanted to forward this on, since I thought that people here might
> >>> be able to use this to build indexes. So long as the lucene version
> >>> in LuSQL matches the version in Solr, it would work fine for
> >>> indexing - yea?
> >>>
> >>> Thanks for your time!
> >>>
> >>> Matthew Runo
> >>> Software Engineer, Zappos.com
> >>> [EMAIL PROTECTED] - 702-943-7833
> >>>
> >>> Begin forwarded message:
> >>>
>  From: "Glen Newton" <[EMAIL PROTECTED]>
>  Date: November 17, 2008 4:32:18 AM PST
>  To: [EMAIL PROTECTED]
>  Subject: Software Announcement: LuSql: Database to Lucene indexing
>  Reply-To: [EMAIL PROTECTED]
> 
>  LuSql is a simple but powerful tool for building Lucene indexes from
>  relational databases. It is a command-line Java application for the
>  construction of a Lucene index from an arbitrary SQL query of a
>  JDBC-accessible SQL database. It allows a user to control a number of
>  parameters, including the SQL query to use, individual
>  indexing/storage/term-vector nature of fields, analyzer, stop word
>  list, and other tuning parameters. In its default mode it uses
>  threading to take advantage of multiple cores.
> 
>  LuSql can handle complex queries, allows for additional per record
>  sub-queries, and has a plug-in architecture for arbitrary Lucene
>  document manipulation. Its only dependencies are three Apache Commons
>  libraries, the Lucene core itself, and a JDBC driver.
> 
>  LuSql has been extensively tested, including a large 6+ million
>  full-text & metadata journal article document collection, producing
>  an
>  86GB Lucene index in ~13 hours.
> 
>  http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
> 
>  Glen Newton
> 
>  --
> 
>  -
> 
>  -
>  To unsubscribe, e-mail: [EMAIL PROTECTED]
>  For additional commands, e-mail: [EMAIL PROTECTED]
> 
> >>
> >>
> >>
> >>
> >> --
> >>
> >> -
> >
> >
>
>
>
> --
>
> -
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: Software Announcement: LuSql: Database to Lucene indexing

2008-11-18 Thread Glen Newton
Erik,

Right now there is no real abstraction like DIH in LuSql. But as
indicated in the TODO section of the documentation, I was planning on
implementing or straight borrowing DIH in the near future.

I am assuming that Solr is all multi-threaded & as performant as it
can be. Is there a test SQL database that is used to test Solr, so I
might try to do some comparisons?

Not being a Solr user, it is hard for me to know of any advantages of
LuSql over Solr. Hopefully some in the community can identify possible
overlaps / use cases. I will see what I can figure out.

Thanks,

-Glen

2008/11/18 Erik Hatcher <[EMAIL PROTECTED]>:
> Glen,
>
> The thing is, Solr has a database integration built-in with the new
> DataImportHandler.   So I'm not sure how much interest Solr users would have
> in LuSql by itself.
>
> Maybe there are LuSql features that DIH could borrow from?  Or vice versa?
>
>Erik
>
>
> On Nov 17, 2008, at 11:03 PM, Glen Newton wrote:
>>
>> That said, I am very interested in making LuSql useful to the Solr
>> community as well as the broader Lucene community, so if any of you
>> can offer any feedback on how LuSql can be changed to better support
>> Solr, I would appreciate it.
>>
>> thanks,
>>
>> Glen Newton
>>
>> ---
>> FromErik Hatcher <[EMAIL PROTECTED]>
>> Subject Re: Software Announcement: LuSql: Database to Lucene indexing
>> DateMon, 17 Nov 2008 20:12:35 GMT
>>
>> Yeah, it'd work, though not only does the version of Lucene need to
>> match, but the field indexing/storage attributes need to jive as well
>> - and that is the trickier part of the equation.
>>
>> But yeah, LuSQL looks slick!
>>
>>Erik
>>
>>
>> On Nov 17, 2008, at 2:17 PM, Matthew Runo wrote:
>>
>>> Hello -
>>>
>>> I wanted to forward this on, since I thought that people here might
>>> be able to use this to build indexes. So long as the lucene version
>>> in LuSQL matches the version in Solr, it would work fine for
>>> indexing - yea?
>>>
>>> Thanks for your time!
>>>
>>> Matthew Runo
>>> Software Engineer, Zappos.com
>>> [EMAIL PROTECTED] - 702-943-7833
>>>
>>> Begin forwarded message:
>>>
 From: "Glen Newton" <[EMAIL PROTECTED]>
 Date: November 17, 2008 4:32:18 AM PST
 To: [EMAIL PROTECTED]
 Subject: Software Announcement: LuSql: Database to Lucene indexing
 Reply-To: [EMAIL PROTECTED]

 LuSql is a simple but powerful tool for building Lucene indexes from
 relational databases. It is a command-line Java application for the
 construction of a Lucene index from an arbitrary SQL query of a
 JDBC-accessible SQL database. It allows a user to control a number of
 parameters, including the SQL query to use, individual
 indexing/storage/term-vector nature of fields, analyzer, stop word
 list, and other tuning parameters. In its default mode it uses
 threading to take advantage of multiple cores.

 LuSql can handle complex queries, allows for additional per record
 sub-queries, and has a plug-in architecture for arbitrary Lucene
 document manipulation. Its only dependencies are three Apache Commons
 libraries, the Lucene core itself, and a JDBC driver.

 LuSql has been extensively tested, including a large 6+ million
 full-text & metadata journal article document collection, producing
 an
 86GB Lucene index in ~13 hours.

 http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

 Glen Newton

 --

 -

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]

>>
>>
>>
>>
>> --
>>
>> -
>
>



-- 

-


Re: TextProfileSignature using deduplication

2008-11-18 Thread Andrzej Bialecki

Marc Sturlese wrote:

Hey there, I've been testing and checking the source of
TextProfileSignature.java to avoid similar entries at indexing time.
What I understood is that it is useful for huge texts, where the frequency of
the tokens (the words in lowercase, just with numbers and letters in that case)
is important. If you want to detect duplicates in text that is not huge, and
without giving a lot of importance to the frequencies, it doesn't work...
The hash will be made just with the terms whose frequency is higher than a
QUANTUM (whose value is derived from the max frequency among all the
terms). So it will say that:

aaa sss ddd fff ggg hhh aaa kkk lll ooo
aaa xxx iii www qqq aaa jjj eee zzz nnn

are duplicates, because quantum here would be 2 and the frequency of aaa
would be 2 as well. So, to make the hash, just the term aaa would be used.

In this case:
aaa sss ddd fff ggg hhh kkk lll ooo
apa sss ddd fff ggg hhh kkk lll ooo

Here quantum would be 1 and the frequencies of all terms would be 1, so all
terms would be used for the hash. It will consider these two strings not
similar.

As I understood the algorithm, there's no way to make it understand that in
my second case both strings are similar. I wish I were wrong...

I have my own duplication system to detect that, but I use String comparison,
so it works really slow... I would like to know if there is any tuning
possibility to do that with TextProfileSignature.


Don't know if I should post this here or in the developers forum...


Hi Marc,

TextProfileSignature is a rather crude implementation of approximate 
similarity, and as you pointed out it's best suited for large texts. The 
original purpose of this Signature was to deduplicate web pages in large 
amounts of crawled pages (in Nutch), where it worked reasonably well. 
Its advantage is also that it's easy to compute and doesn't require 
multiple passes over the corpus.


As it is implemented now, it breaks badly in the case you describe. You 
could modify this implementation to include also word-level ngrams, i.e. 
sequences of more than 1 word, up to N (e.g. 5) - this should work in 
your case.
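
A sketch of the word-level ngram idea in isolation (generic code, not a
drop-in Signature subclass): feed ngrams of length 1..N into the frequency
profile instead of single tokens, so the shared pairs in Marc's second
example start to count.

  import java.util.ArrayList;
  import java.util.List;

  public class WordNgrams {
    // For "aaa sss ddd" and maxN = 2 this yields:
    // [aaa, sss, ddd, aaa sss, sss ddd]
    public static List<String> ngrams(String text, int maxN) {
      String[] words = text.toLowerCase().split("\\s+");
      List<String> out = new ArrayList<String>();
      for (int n = 1; n <= maxN; n++) {
        for (int i = 0; i + n <= words.length; i++) {
          StringBuilder sb = new StringBuilder();
          for (int j = i; j < i + n; j++) {
            if (j > i) sb.append(' ');
            sb.append(words[j]);
          }
          out.add(sb.toString());
        }
      }
      return out;
    }
  }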


Ultimately, what you are probably looking for is a shingle-based 
algorithm, but it's relatively costly and requires multiple passes.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: TextProfileSignature using deduplication

2008-11-18 Thread Marc Sturlese

>>
>> I have my own duplication system to detect that but I use String 
>> comparison
>> so it works really slow...
>>  
What are you doing for the String comparison? Not exact right?

hey,
My comparison method looks for similar (not just exact) matches... what I do is
compare two texts word by word. What I do after is decide a % of similarity,
for example:
aaa sss ddd fff ggg hhh jjj kkk lll ooo
bbb rrr ddd fff ggg hhh jjj kkk lll ooo

Deciding on 80% similarity and comparing word by word, these two Strings
would be similar. (I split texts into tokens and count how many similar ones I
have.)
(I use some stopwords and rules as well.)

I am going to try more tuning of the parameters of TextProfileSignature as
you say.
Don't know if you remember, but I asked you about this at ApacheCon and
you told me about the SOLR-799 JIRA issue. If I make it work it is definitely
much faster than my system...

About deduplication... I couldn't find anywhere the class that appears in the
wiki: org.apache.solr.update.processor.DeduplicateUpdateProcessorFactory
so I downloaded the patch and plugged it into my Solr source (I use
org.apache.solr.update.processor.TextProfileSignature instead of the one
written in the wiki).

Would appreciate any advice about the tuning params of TextProfileSignature.

Thank you for your time



markrmiller wrote:
> 
> 
>>>
>>> I have my own duplication system to detect that but I use String 
>>> comparison
>>> so it works really slow...
>>>  
> What are you doing for the String comparison? Not exact right?
> 
> 
-- 
View this message in context: 
http://www.nabble.com/TextProfileSigature-using-deduplication-tp20559155p20560828.html
Sent from the Solr - User mailing list archive at Nabble.com.



EmbeddedSolrServer questions

2008-11-18 Thread Thierry Templier
Hello,

I have some questions regarding the use of the EmbeddedSolrServer in order to 
embed a solr instance into a Java application.

1°) Is an instance of the EmbeddedSolrServer class threadsafe when used by
several concurrent threads?

2°) Regarding transactions, can an instance of the EmbeddedSolrServer class
be used in order to make two transactions at the same time by two different
threads?

Thanks for your help,
Thierry






Re: TextProfileSignature using deduplication

2008-11-18 Thread Mark Miller




I have my own duplication system to detect that but I use String 
comparison

so it works really slow...
 

What are you doing for the String comparison? Not exact right?


Re: TextProfileSignature using deduplication

2008-11-18 Thread Mark Miller
Have you tried the tuning params for TextProfileSignature? I probably 
have to update the dedupe wiki.


You can set the quantRate and the minTokenLength. Those are the 
variable names, and you set them just like signatureClass, 
signatureField, fields, etc.
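
A sketch of what the solrconfig.xml entry might look like with the SOLR-799
patch applied (the class and parameter names follow the patch-era wiki and
the description above; the exact spelling of the min-token-length parameter
and the defaults may differ):

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.DeduplicateUpdateProcessorFactory">
      <bool name="overwriteDupes">true</bool>
      <str name="signatureField">sig</str>
      <str name="fields">name,body</str>
      <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
      <float name="quantRate">0.2</float>
      <int name="minTokenLen">3</int>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>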


Whether or not you can tune it to meet your needs I am not quite sure. 
There are quite a few more advanced fuzzy hash algorithms out there, but 
frankly, most of them are still just making my head hurt. Hope to see 
some of them in solr at some point though. The rolling hash spamsum alg 
looks like it might be fairly doable...I've got half a dozen pdf papers 
on other algorithms as well, but they are not a joke for me to implement.


- Mark


Marc Sturlese wrote:

Hey there, I've been testing and checking the source of
TextProfileSignature.java to avoid similar entries at indexing time.
What I understood is that it is useful for huge texts, where the frequency of
the tokens (the words in lowercase, just with numbers and letters in that case)
is important. If you want to detect duplicates in text that is not huge, and
without giving a lot of importance to the frequencies, it doesn't work...
The hash will be made just with the terms whose frequency is higher than a
QUANTUM (whose value is derived from the max frequency among all the
terms). So it will say that:

aaa sss ddd fff ggg hhh aaa kkk lll ooo
aaa xxx iii www qqq aaa jjj eee zzz nnn

are duplicates, because quantum here would be 2 and the frequency of aaa
would be 2 as well. So, to make the hash, just the term aaa would be used.

In this case:
aaa sss ddd fff ggg hhh kkk lll ooo
apa sss ddd fff ggg hhh kkk lll ooo

Here quantum would be 1 and the frequencies of all terms would be 1, so all
terms would be used for the hash. It will consider these two strings not
similar.

As I understood the algorithm, there's no way to make it understand that in
my second case both strings are similar. I wish I were wrong...

I have my own duplication system to detect that, but I use String comparison,
so it works really slow... I would like to know if there is any tuning
possibility to do that with TextProfileSignature.

Don't know if I should post this here or in the developers forum...

Thanks in advance
  




Re: Use SOLR like the "MySQL LIKE"

2008-11-18 Thread Aleksander M. Stensby

Ah, okay!
Well, then I suggest you index the field in two different ways if you want  
both possible ways of searching: one where you treat the entire name as  
one token (in lowercase), so you can search for avera* and match, for  
instance, "average joe" etc., and another field where you tokenize on  
whitespace, if you want/need that possibility as well. Look at  
the Solr copy fields and try it out, it works like a charm :)
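
A schema.xml sketch of that two-field setup (the type and field names are
illustrative): one copy keeps the whole name as a single lowercased token
for prefix matching, the other stays tokenized.

  <fieldType name="lowercase_keyword" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="name" type="text" indexed="true" stored="true"/>
  <field name="name_prefix" type="lowercase_keyword" indexed="true" stored="false"/>
  <copyField source="name" dest="name_prefix"/>

A query like name_prefix:carsten\ l* (whitespace escaped with a backslash)
then matches "carsten l", "carsten larsen" and "Carsten Larsen".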


Cheers,
 Aleksander

On Tue, 18 Nov 2008 10:40:24 +0100, Carsten L <[EMAIL PROTECTED]> wrote:



Thanks for the quick reply!

It is supposed to work a little like Google Suggest or field
autocompletion.

I know I mentioned email and userid, but the problem lies with the name
field, because of the whitespaces in combination with the wildcard.

I looked at the solr.WordDelimiterFilterFactory, but it does not mention
anything about whitespaces - or wildcards.

A quick brushup:
I would like to mimic the LIKE functionality from MySQL using wildcards
at the end of the search query.
In MySQL whitespaces are treated as characters, not "splitters".


Aleksander M. Stensby wrote:


Hi there,

You should use LowerCaseTokenizerFactory as you point out yourself. As  
far
as I know, the StandardTokenizer "recognizes email addresses and  
internet

hostnames as one token". In your case, I guess you want an email, say
"[EMAIL PROTECTED]" to be split into four tokens: average joe  
apache

org, or something like that, which would indeed allow you to search for
"joe" or "average j*" and match. To do so, you could use the
WordDelimiterFilterFactory and split on intra-word delimiters (I think  
the

defaults here are non-alphanumeric chars).

Take a look at  
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

for more info on tokenizers and filters.

cheers,
  Aleks

On Tue, 18 Nov 2008 08:35:31 +0100, Carsten L <[EMAIL PROTECTED]>  
wrote:




Hello.

The data:
I have a dataset containing ~500.000 documents.
In each document there is an email, a name and a user ID.

The problem:
I would like to be able to search in it, but it should be like the  
"MySQL

LIKE".

So when a user enters the search term: "carsten", then the query looks
like:
"name:(carsten) OR name:(carsten*) OR email:(carsten) OR
email:(carsten*) OR userid:(carsten) OR userid:(carsten*)"

Then it should match:
carsten l
carsten larsen
Carsten Larsen
Carsten
CARSTEN
etc.

And when the user enters the term: "carsten l" the query looks like:
"name:(carsten l) OR name:(carsten l*) OR email:(carsten l) OR
email:(carsten l*) OR userid:(carsten l) OR userid:(carsten l*)"

Then it should match:
carsten l
carsten larsen
Carsten Larsen

Or written to the MySQL syntax: "... WHERE `name` LIKE 'carsten%'  OR
`email` LIKE 'carsten%' OR `userid` LIKE 'carsten%'..."

I know that I need to use the "solr.LowerCaseTokenizerFactory" on my  
name

and email field, to ensure case insensitive behavior.
The problem seems to be the wildcards and the whitespaces.




--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no








--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no


TextProfileSignature using deduplication

2008-11-18 Thread Marc Sturlese

Hey there, I've been testing and checking the source of
TextProfileSignature.java to avoid similar entries at indexing time.
What I understood is that it is useful for huge texts, where the frequency of
the tokens (the words in lowercase, just with numbers and letters in that case)
is important. If you want to detect duplicates in text that is not huge, and
without giving a lot of importance to the frequencies, it doesn't work...
The hash will be made just with the terms whose frequency is higher than a
QUANTUM (whose value is derived from the max frequency among all the
terms). So it will say that:

aaa sss ddd fff ggg hhh aaa kkk lll ooo
aaa xxx iii www qqq aaa jjj eee zzz nnn

are duplicates, because quantum here would be 2 and the frequency of aaa
would be 2 as well. So, to make the hash, just the term aaa would be used.

In this case:
aaa sss ddd fff ggg hhh kkk lll ooo
apa sss ddd fff ggg hhh kkk lll ooo

Here quantum would be 1 and the frequencies of all terms would be 1, so all
terms would be used for the hash. It will consider these two strings not
similar.

As I understood the algorithm, there's no way to make it understand that in
my second case both strings are similar. I wish I were wrong...

I have my own duplication system to detect that, but I use String comparison,
so it works really slow... I would like to know if there is any tuning
possibility to do that with TextProfileSignature.

Don't know if I should post this here or in the developers forum...

Thanks in advance
-- 
View this message in context: 
http://www.nabble.com/TextProfileSigature-using-deduplication-tp20559155p20559155.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Error in indexing timestamp format.

2008-11-18 Thread Noble Paul നോബിള്‍ नोब्ळ्
How are you indexing the data ? by posting xml? or using DIH?


On Tue, Nov 18, 2008 at 3:53 PM, con <[EMAIL PROTECTED]> wrote:
>
> Hi Guys
> I have timestamp fields in my database in the format,
> ddmmyyhhmmss.Z AM
> eg: 26-05-08 10:45:53.66100 AM
> But I think that since the Solr date format is different, I am unable to
> index the document with solr.DateField.
> So is there any option by which I can give my timestamp format to Solr and
> index and search data based on that.
> I have already set up Solr and it is working fine for other datatypes.
> So is it possible to set this new date format in some config files.
>
> Expecting suggestions/advices
> Thanks in advance
> con
> --
> View this message in context: 
> http://www.nabble.com/Error-in-indexing-timestamp-format.-tp20556862p20556862.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
--Noble Paul


Error in indexing timestamp format.

2008-11-18 Thread con

Hi Guys
I have timestamp fields in my database in the format,
 ddmmyyhhmmss.Z AM
eg: 26-05-08 10:45:53.66100 AM
But I think that since the Solr date format is different, I am unable to
index the document with solr.DateField.
So is there any option by which I can give my timestamp format to Solr and
index and search data based on that.
I have already set up Solr and it is working fine for other datatypes. 
So is it possible to set this new date format in some config files.

Expecting suggestions/advices
Thanks in advance
con
-- 
View this message in context: 
http://www.nabble.com/Error-in-indexing-timestamp-format.-tp20556862p20556862.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Software Announcement: LuSql: Database to Lucene indexing

2008-11-18 Thread Erik Hatcher

Glen,

The thing is, Solr has a database integration built-in with the new  
DataImportHandler.   So I'm not sure how much interest Solr users  
would have in LuSql by itself.


Maybe there are LuSql features that DIH could borrow from?  Or vice  
versa?


Erik


On Nov 17, 2008, at 11:03 PM, Glen Newton wrote:

That said, I am very interested in making LuSql useful to the Solr
community as well as the broader Lucene community, so if any of you
can offer any feedback on how LuSql can be changed to better support
Solr, I would appreciate it.

thanks,

Glen Newton

---
FromErik Hatcher <[EMAIL PROTECTED]>
Subject Re: Software Announcement: LuSql: Database to Lucene indexing
DateMon, 17 Nov 2008 20:12:35 GMT

Yeah, it'd work, though not only does the version of Lucene need to
match, but the field indexing/storage attributes need to jive as well
- and that is the trickier part of the equation.

But yeah, LuSQL looks slick!

Erik


On Nov 17, 2008, at 2:17 PM, Matthew Runo wrote:


Hello -

I wanted to forward this on, since I thought that people here might
be able to use this to build indexes. So long as the lucene version
in LuSQL matches the version in Solr, it would work fine for
indexing - yea?

Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
[EMAIL PROTECTED] - 702-943-7833

Begin forwarded message:


From: "Glen Newton" <[EMAIL PROTECTED]>
Date: November 17, 2008 4:32:18 AM PST
To: [EMAIL PROTECTED]
Subject: Software Announcement: LuSql: Database to Lucene indexing
Reply-To: [EMAIL PROTECTED]

LuSql is a simple but powerful tool for building Lucene indexes from
relational databases. It is a command-line Java application for the
construction of a Lucene index from an arbitrary SQL query of a
JDBC-accessible SQL database. It allows a user to control a number  
of

parameters, including the SQL query to use, individual
indexing/storage/term-vector nature of fields, analyzer, stop word
list, and other tuning parameters. In its default mode it uses
threading to take advantage of multiple cores.

LuSql can handle complex queries, allows for additional per record
sub-queries, and has a plug-in architecture for arbitrary Lucene
document manipulation. Its only dependencies are three Apache  
Commons

libraries, the Lucene core itself, and a JDBC driver.

LuSql has been extensively tested, including a large 6+ million
full-text & metadata journal article document collection, producing
an
86GB Lucene index in ~13 hours.

http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

Glen Newton

--

-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






--

-




Re: Use SOLR like the "MySQL LIKE"

2008-11-18 Thread Carsten L

Thanks for the quick reply!

It is supposed to work a little like Google Suggest or field
autocompletion.

I know I mentioned email and userid, but the problem lies with the name
field, because of the whitespaces in combination with the wildcard.

I looked at the solr.WordDelimiterFilterFactory, but it does not mention
anything about whitespaces - or wildcards.

A quick brushup:
I would like to mimic the LIKE functionality from MySQL using wildcards
at the end of the search query.
In MySQL whitespaces are treated as characters, not "splitters".


Aleksander M. Stensby wrote:
> 
> Hi there,
> 
> You should use LowerCaseTokenizerFactory as you point out yourself. As far  
> as I know, the StandardTokenizer "recognizes email addresses and internet  
> hostnames as one token". In your case, I guess you want an email, say  
> "[EMAIL PROTECTED]" to be split into four tokens: average joe apache  
> org, or something like that, which would indeed allow you to search for  
> "joe" or "average j*" and match. To do so, you could use the  
> WordDelimiterFilterFactory and split on intra-word delimiters (I think the  
> defaults here are non-alphanumeric chars).
> 
> Take a look at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters  
> for more info on tokenizers and filters.
> 
> cheers,
>   Aleks
> 
> On Tue, 18 Nov 2008 08:35:31 +0100, Carsten L <[EMAIL PROTECTED]> wrote:
> 
>>
>> Hello.
>>
>> The data:
>> I have a dataset containing ~500.000 documents.
>> In each document there is an email, a name and a user ID.
>>
>> The problem:
>> I would like to be able to search in it, but it should be like the "MySQL
>> LIKE".
>>
>> So when a user enters the search term: "carsten", then the query looks  
>> like:
>> "name:(carsten) OR name:(carsten*) OR email:(carsten) OR
>> email:(carsten*) OR userid:(carsten) OR userid:(carsten*)"
>>
>> Then it should match:
>> carsten l
>> carsten larsen
>> Carsten Larsen
>> Carsten
>> CARSTEN
>> etc.
>>
>> And when the user enters the term: "carsten l" the query looks like:
>> "name:(carsten l) OR name:(carsten l*) OR email:(carsten l) OR
>> email:(carsten l*) OR userid:(carsten l) OR userid:(carsten l*)"
>>
>> Then it should match:
>> carsten l
>> carsten larsen
>> Carsten Larsen
>>
>> Or written to the MySQL syntax: "... WHERE `name` LIKE 'carsten%'  OR
>> `email` LIKE 'carsten%' OR `userid` LIKE 'carsten%'..."
>>
>> I know that I need to use the "solr.LowerCaseTokenizerFactory" on my name
>> and email field, to ensure case insensitive behavior.
>> The problem seems to be the wildcards and the whitespaces.
> 
> 
> 
> -- 
> Aleksander M. Stensby
> Senior software developer
> Integrasco A/S
> www.integrasco.no
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Use-SOLR-like-the-%22MySQL-LIKE%22-tp20554732p20556271.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Use SOLR like the "MySQL LIKE"

2008-11-18 Thread Aleksander M. Stensby

Hi there,

You should use LowerCaseTokenizerFactory as you point out yourself. As far  
as I know, the StandardTokenizer "recognizes email addresses and internet  
hostnames as one token". In your case, I guess you want an email, say  
"[EMAIL PROTECTED]" to be split into four tokens: average joe apache  
org, or something like that, which would indeed allow you to search for  
"joe" or "average j*" and match. To do so, you could use the  
WordDelimiterFilterFactory and split on intra-word delimiters (I think the  
defaults here are non-alphanumeric chars).


Take a look at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters  
for more info on tokenizers and filters.


cheers,
 Aleks

On Tue, 18 Nov 2008 08:35:31 +0100, Carsten L <[EMAIL PROTECTED]> wrote:



Hello.

The data:
I have a dataset containing ~500.000 documents.
In each document there is an email, a name and a user ID.

The problem:
I would like to be able to search in it, but it should be like the "MySQL
LIKE".

So when a user enters the search term: "carsten", then the query looks  
like:

"name:(carsten) OR name:(carsten*) OR email:(carsten) OR
email:(carsten*) OR userid:(carsten) OR userid:(carsten*)"

Then it should match:
carsten l
carsten larsen
Carsten Larsen
Carsten
CARSTEN
etc.

And when the user enters the term: "carsten l" the query looks like:
"name:(carsten l) OR name:(carsten l*) OR email:(carsten l) OR
email:(carsten l*) OR userid:(carsten l) OR userid:(carsten l*)"

Then it should match:
carsten l
carsten larsen
Carsten Larsen

Or written to the MySQL syntax: "... WHERE `name` LIKE 'carsten%'  OR
`email` LIKE 'carsten%' OR `userid` LIKE 'carsten%'..."

I know that I need to use the "solr.LowerCaseTokenizerFactory" on my name
and email field, to ensure case insensitive behavior.
The problem seems to be the wildcards and the whitespaces.




--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no