Solr 3.3 Grouping vs Deduplication, and a Deduplication Use Case

2011-08-29 Thread Pranav Prakash
Solr 3.3 has a feature called "Grouping". Is it practically the same as deduplication?

Here is my use case for duplicates removal -

We have many documents with similar (up to 99%) content. Upon some search
queries, almost all of them come up on the first page of results. Of all these
documents, essentially one is the original and the others are duplicates. We are
able to identify the original content on the basis of a number of factors - who
uploaded it, when, and how many viral shares. It is also possible that the
duplicates are uploaded earlier (and hence exist in the search index) while the
original is uploaded later (and gets added later to the index).

AFAIK, Deduplication targets index time. Is there a means by which I can specify
the original, which should be returned, and the duplicates, which should be kept
from coming up?


*Pranav Prakash*

"temet nosce"

Twitter  | Blog  |
Google 
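For reference, grouping in Solr 3.3 collapses duplicates at query time rather than at index time, which means the choice of which document represents a group can be made per query. A minimal SolrJ sketch, assuming an existing SolrServer instance named solr, a common "signature" field shared by the near-duplicates (for example produced by the SignatureUpdateProcessorFactory), and a hypothetical "shares" field used to rank the original first:

SolrQuery q = new SolrQuery("some query");
q.set("group", true);
q.set("group.field", "signature");   // documents with the same signature collapse into one group
q.set("group.limit", 1);             // keep only the top document of each group
q.set("group.sort", "shares desc");  // rank the "original" first inside each group
QueryResponse rsp = solr.query(q);

The trade-off versus the Deduplication update processor is that grouping re-collapses the results on every query, while deduplication decides once at index time.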


Re: How to list all dynamic fields of a document using solrj?

2011-08-29 Thread Michael Szalay
Hi Juan

I tried with the following code first:

final SolrQuery allDocumentsQuery = new  SolrQuery();
allDocumentsQuery.setQuery("id:" + myId);
allDocumentsQuery.setFields("*");
allDocumentsQuery.setRows(1);
QueryResponse response = solr.query(allDocumentsQuery, METHOD.POST);


With this, only non-dynamic fields are returned.
Then I wrote the following helper method:

 private Set<String> getDynamicFields() throws SolrServerException, IOException {
    final LukeRequest luke = new LukeRequest();
    luke.setShowSchema(false);
    final LukeResponse process = luke.process(solr);
    final Map<String, LukeResponse.FieldInfo> fieldInfo = process.getFieldInfo();
    final Set<String> dynamicFields = new HashSet<String>();
    for (final String key : fieldInfo.keySet()) {
        if (key.endsWith("_string") || key.endsWith("_dateTime")) {
            dynamicFields.add(key);
        }
    }
    return dynamicFields;
}

where _string and _dateTime are the suffixes of my dynamic fields.
This one really returns all stored fields of the document:

final Set<String> dynamicFields = getDynamicFields();
final SolrQuery allDocumentsQuery = new SolrQuery();
allDocumentsQuery.setQuery("uri:" + myId);
allDocumentsQuery.setFields("*");
for (final String df : dynamicFields) {
    allDocumentsQuery.addField(df);
}

allDocumentsQuery.setRows(1);
QueryResponse response = solr.query(allDocumentsQuery, METHOD.POST);

Is there a more elegant way to do this? We are using solrj 3.1.0 and solr 3.1.0.

Regards
Michael
--
Michael Szalay
Senior Software Engineer

basis06 AG, Birkenweg 61, CH-3013 Bern - Fon +41 31 311 32 22
http://www.basis06.ch - source of smart business

- Original Message -
From: "Juan Grande" 
To: solr-user@lucene.apache.org
Sent: Monday, 29 August 2011 18:19:05
Subject: Re: How to list all dynamic fields of a document using solrj?

Hi Michael,

It's supposed to work. Can we see a snippet of the code you're using to
retrieve the fields?

*Juan*



On Mon, Aug 29, 2011 at 8:33 AM, Michael Szalay
wrote:

> Hi all
>
> how can I list all dynamic fields and their values of a document using
> solrj?
> The dynamic fields are never returned when I use setFields(*).
>
> Thanks
>
> Michael
>
> --
> Michael Szalay
> Senior Software Engineer
>
> basis06 AG, Birkenweg 61, CH-3013 Bern - Fon +41 31 311 32 22
> http://www.basis06.ch - source of smart business
>
>


SolrJ and DIH query

2011-08-29 Thread Jagdish Kumar

Hi All
 
I have been using DataImportHandler and full import in Solr. Now I want SolrJ
to index files from a directory, either by using DataImportHandler or some other
way.
 
I am not sure how to go about it:
1. Do I just need to copy SolrJ and relevant jars to the lib folder of Solr?
2. Do I need to modify some specific classes of the SolrJ jars and rebundle them?
3. Can anyone mention the specific steps required to get SolrJ working?
 
Thanks and regards
Jagdish   
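A minimal sketch of one way to do this without touching Solr's lib folder: SolrJ runs inside your own application (only the SolrJ jars need to be on your classpath), connects to the running Solr instance over HTTP, and posts documents built from the files. The URL, directory path, and field names (id, text) below are assumptions and must match your schema; for rich formats such as PDF or Word you would post the files to the ExtractingRequestHandler instead of reading them as plain text.

import java.io.File;
import java.util.Scanner;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DirectoryIndexer {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    for (File f : new File("/path/to/files").listFiles()) {
      if (!f.isFile()) continue;
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", f.getName());
      // read the whole file as one string (fine for small plain-text files)
      Scanner s = new Scanner(f, "UTF-8").useDelimiter("\\A");
      doc.addField("text", s.hasNext() ? s.next() : "");
      s.close();
      solr.add(doc);
    }
    solr.commit();
  }
}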

how to get update record from database using delta-query?

2011-08-29 Thread vighnesh
hi all 

I am facing a problem getting updated records from the database using a delta
query in Solr. Please help me find a solution; my delta query is:

   


   


Is there anything wrong in this code? Please let me know.

thanks in advance.

Regards,
Vighnesh.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-get-update-record-from-database-using-delta-query-tp3294510p3294510.html
Sent from the Solr - User mailing list archive at Nabble.com.
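The inlined configuration was stripped from the message, so for reference, a typical delta setup in data-config.xml looks roughly like the sketch below; the table, columns, and connection details are placeholders. deltaQuery returns the primary keys of rows changed since ${dataimporter.last_index_time}, and deltaImportQuery fetches each of those rows by ${dataimporter.delta.id} when /dataimport?command=delta-import is called.

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
  <document>
    <entity name="item" pk="id"
            query="SELECT id, name FROM item"
            deltaQuery="SELECT id FROM item
                        WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, name FROM item
                              WHERE id = '${dataimporter.delta.id}'">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
    </entity>
  </document>
</dataConfig>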


How to send an OpenBitSet object from Solr server?

2011-08-29 Thread Satish Talim
We have a need to query and fetch millions of document ids from a Solr 3.3
index and convert them to a BitSet. To speed things up, we want to
convert these document ids into an OpenBitSet on the server side, put it into
the response object, and read it on the client side.

To achieve this, we wrote our own RequestHandler and overrode
the handleRequest method. Using this RequestHandler we do get the response
object, but when we try to fetch the OpenBitSet we get an error -

Exception in thread "main" java.lang.ClassCastException: java.lang.String
cannot be cast to org.apache.lucene.util.OpenBitSet

The documentation at -
http://lucene.apache.org/solr/api/org/apache/solr/response/SolrQueryResponse.html

says that "Other data types may be added to the SolrQueryResponse, but there
is no guarantee that QueryResponseWriters will be able to deal with
unexpected types."

Is there a work-around wherein I can send an OpenBitSet object?

Satish
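One hedged workaround: most response writers don't know what to do with an OpenBitSet, but they can carry a byte[] (at least with wt=javabin), so the handler can flatten the bitset's backing words into bytes and the client can rebuild it. A self-contained round-trip sketch of that idea (the Solr plumbing, i.e. rsp.add(...) on the server and reading the value out of the SolrJ response on the client, is left out):

import java.nio.ByteBuffer;
import org.apache.lucene.util.OpenBitSet;

public class BitSetRoundTrip {
  public static void main(String[] args) {
    // pretend these are the matching internal doc ids collected on the server
    OpenBitSet bits = new OpenBitSet(5000000);
    bits.set(42);
    bits.set(4999999);

    // server side: flatten the backing words into a byte[] for the response
    long[] words = bits.getBits();
    int numWords = bits.getNumWords();
    ByteBuffer out = ByteBuffer.allocate(numWords * 8);
    for (int i = 0; i < numWords; i++) {
      out.putLong(words[i]);
    }
    byte[] wire = out.array();           // e.g. rsp.add("bits", wire) in the handler

    // client side: rebuild the OpenBitSet from the bytes
    ByteBuffer in = ByteBuffer.wrap(wire);
    long[] restored = new long[numWords];
    for (int i = 0; i < numWords; i++) {
      restored[i] = in.getLong();
    }
    OpenBitSet rebuilt = new OpenBitSet(restored, numWords);
    System.out.println(rebuilt.get(42) && rebuilt.get(4999999));  // true
  }
}

The word count has to travel alongside the bytes (or be derived from the byte array length) so the client knows how large a bitset to rebuild.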


Search the contents of given URL in Solr.

2011-08-29 Thread Sheetal
Hi,

Is it possible to give Solr the URL address of a site, have the search server
read the contents of the given site, and recommend similar projects? I
scraped the web contents from the given URL address and now
have the contents as plain text. But when I pass that
scraped text as a query into Solr, it doesn't work because the query is too
large (depending on the contents of the URL).

I read somewhere that this is possible - given the URL address, it outputs
the projects relevant to it. But I don't remember whether it was using Solr
or another search engine.

Does anyone have any ideas or suggestions for this? I would highly appreciate
your comments.

Thank you in advance..

-
Sheetal
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-the-contents-of-given-URL-in-Solr-tp3294376p3294376.html
Sent from the Solr - User mailing list archive at Nabble.com.
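One approach that comes up for this: instead of turning the scraped text into a huge q parameter, hand it to the MoreLikeThisHandler as a content stream and let it pick the interesting terms itself. A hedged SolrJ sketch, assuming the handler is registered at /mlt in solrconfig.xml, the similarity field is called text, solr is an existing SolrServer, and scrapedText holds the page text:

ModifiableSolrParams p = new ModifiableSolrParams();
p.set("qt", "/mlt");                      // route to the MoreLikeThisHandler (or hit the /mlt path directly)
p.set("stream.body", scrapedText);        // the scraped page text acts as the source "document"
p.set("mlt.fl", "text");                  // field(s) to mine for interesting terms
p.set("mlt.interestingTerms", "list");
p.set("rows", 10);
QueryResponse rsp = solr.query(p, METHOD.POST);

The fields named in mlt.fl work best when they are stored or have term vectors enabled.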


RE: Newbie question, ant target for packaging source files from local copy?

2011-08-29 Thread syyang
Hi Steve,

I've filed a new JIRA issue along with the patch, which can be found at
;.

Please let me know if you see any problem.

Thanks!
-Sid

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Newbie-question-ant-target-for-packaging-source-files-from-local-copy-tp3282787p3294320.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: what is scheduling? why should we do this? how to achieve this?

2011-08-29 Thread Alexei Martchenko
Since Solr is basically an HTTP server, all you need is a scheduler that requests
specific pages.

On Windows, you can try the Task Scheduler (the clock icon in the
'Administrative Tools' section).

ColdFusion, for instance, has its own scheduler, and other languages such as PHP
may have one you can use.

Hope it helps.
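For example, if the goal were an hourly DIH delta-import on Windows, a hedged sketch using the built-in schtasks command (assuming a command-line HTTP client such as curl or wget is installed, and that the import handler lives at /solr/dataimport) would be:

schtasks /Create /TN "SolrDeltaImport" /SC HOURLY ^
  /TR "curl http://localhost:8983/solr/dataimport?command=delta-import"

The same idea works for any Solr URL you want hit on a schedule (replication, warmup queries, and so on).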

2011/8/29 nagarjuna 

> Hi pravesh...
> I already saw the wiki page that you pointed to... from that I got
> the points about collection distribution etc...
> but I didn't get any link which explains the cron job process step
> by
> step for the Windows OS.
> Can you please tell me how to do it for Windows?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/what-is-scheduling-why-should-we-do-this-how-to-achieve-this-tp3287115p3292221.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 

*Alexei Martchenko* | *CEO* | Superdownloads
ale...@superdownloads.com.br | ale...@martchenko.com.br | (11)
5083.1018/5080.3535/5080.3533


Re: Can't Get Past Step 2 in the Tutorial

2011-08-29 Thread Waldo Jaquith
Well, heck, that fixed things just fine. Thanks, Mauricio! Now I can actually 
get to seeing the first thing about Solr. netstat reveals that a defunct Java 
process was parked on port 8983, presumably from a failed attempt to launch 
Solr. I killed that process and now Solr runs just fine on 8983.

I appreciate your help!

Best,
Waldo


On Aug 29, 2011, at 9:51 PM, Mauricio Scheffer wrote:
> The error says that port 8983 is already in use by some other application.
> Try some other port:
> 
> java -Djetty.port=8984 -jar start.jar
> 
> --
> Mauricio
> 
> 
> On Mon, Aug 29, 2011 at 10:39 PM, Waldo Jaquith  wrote:
> 
>> Howdy,
>> 
>> I’m having an incredibly frustrating time getting started with Solr. I’ve
>> got a relatively fresh Slicehost server, running CentOS 5.3. I followed the
>> instructions in the tutorial [1], installing Oracle Java JDK as per [2], and
>> got only as far as “java -jar start.jar.” (You can see the output of that at
>> the bottom of my e-mail.) Now if I open up
>> http://localhost:8983/solr/admin/ , I get a 404 from “Jetty://“, saying
>> "Problem accessing /solr/admin/. Reason: NOT_FOUND.” I have no idea of what
>> to do with this. I have zero experience with Solr. I figured the surest way
>> to see if it was something I wanted to use was to just try it, and obviously
>> that hasn’t gone very well.
>> 
>> Note that I have zero experience with Java, other than finding that every
>> interaction with it over the past 15 years has deepened my loathing for it.
>> Assume that I know nothing about Java and you will overestimate me. I’m only
>> e-mailing y’all because I was bitching about this on Twitter after a couple
>> of hours or attempting to make this work, and a couple of people encouraged
>> me to just try querying this list before nuking Java from orbit, having a
>> few stiff drinks, and forgetting that this ever happened.
>> 
>> Any suggestions?
>> 
>> Best,
>> Waldo
>> 
>> 
>> [1] http://lucene.apache.org/solr/tutorial.html
>> [2]
>> http://www.if-not-true-then-false.com/2010/install-sun-oracle-java-jdk-jre-7-on-fedora-centos-red-hat-rhel/
>> 
>> 
>> 
>> $ java -jar start.jar
>> 2011-08-29 20:50:51.876:INFO::Logging to STDERR via
>> org.mortbay.log.StdErrLog
>> 2011-08-29 20:50:52.061:INFO::jetty-6.1-SNAPSHOT
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
>> locateSolrHome
>> INFO: JNDI not configured for solr (NoInitialContextEx)
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
>> locateSolrHome
>> INFO: solr home defaulted to 'solr/' (could not find system property or
>> JNDI)
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
>> INFO: Solr home set to 'solr/'
>> Aug 29, 2011 8:50:52 PM org.apache.solr.servlet.SolrDispatchFilter init
>> INFO: SolrDispatchFilter.init()
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
>> locateSolrHome
>> INFO: JNDI not configured for solr (NoInitialContextEx)
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
>> locateSolrHome
>> INFO: solr home defaulted to 'solr/' (could not find system property or
>> JNDI)
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.CoreContainer$Initializer
>> initialize
>> INFO: looking for solr.xml:
>> /home/waldo/apache-solr-3.3.0/example/solr/solr.xml
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
>> locateSolrHome
>> INFO: JNDI not configured for solr (NoInitialContextEx)
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
>> locateSolrHome
>> INFO: solr home defaulted to 'solr/' (could not find system property or
>> JNDI)
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.CoreContainer 
>> INFO: New CoreContainer: solrHome=solr/ instance=493847021
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
>> INFO: Solr home set to 'solr/'
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
>> INFO: Solr home set to 'solr/./'
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrConfig initLibs
>> INFO: Adding specified lib dirs to ClassLoader
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
>> replaceClassLoader
>> INFO: Adding
>> 'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/jempbox-LICENSE-ASL.txt'
>> to classloader
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
>> replaceClassLoader
>> INFO: Adding
>> 'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/tika-parsers-LICENSE-ASL.txt'
>> to classloader
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
>> replaceClassLoader
>> INFO: Adding
>> 'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/poi-ooxml-LICENSE-ASL.txt'
>> to classloader
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
>> replaceClassLoader
>> INFO: Adding
>> 'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/tika-parsers-0.8.jar'
>> to classloader
>> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
>> replaceClassLoader
>> INFO: Adding
>

Re: Can't Get Past Step 2 in the Tutorial

2011-08-29 Thread Mauricio Scheffer
The error says that port 8983 is already in use by some other application.
Try some other port:

java -Djetty.port=8984 -jar start.jar

--
Mauricio


On Mon, Aug 29, 2011 at 10:39 PM, Waldo Jaquith  wrote:

> Howdy,
>
> I’m having an incredibly frustrating time getting started with Solr. I’ve
> got a relatively fresh Slicehost server, running CentOS 5.3. I followed the
> instructions in the tutorial [1], installing Oracle Java JDK as per [2], and
> got only as far as “java -jar start.jar.” (You can see the output of that at
> the bottom of my e-mail.) Now if I open up
> http://localhost:8983/solr/admin/ , I get a 404 from “Jetty://“, saying
> "Problem accessing /solr/admin/. Reason: NOT_FOUND.” I have no idea of what
> to do with this. I have zero experience with Solr. I figured the surest way
> to see if it was something I wanted to use was to just try it, and obviously
> that hasn’t gone very well.
>
> Note that I have zero experience with Java, other than finding that every
> interaction with it over the past 15 years has deepened my loathing for it.
> Assume that I know nothing about Java and you will overestimate me. I’m only
> e-mailing y’all because I was bitching about this on Twitter after a couple
> of hours or attempting to make this work, and a couple of people encouraged
> me to just try querying this list before nuking Java from orbit, having a
> few stiff drinks, and forgetting that this ever happened.
>
> Any suggestions?
>
> Best,
> Waldo
>
>
> [1] http://lucene.apache.org/solr/tutorial.html
> [2]
> http://www.if-not-true-then-false.com/2010/install-sun-oracle-java-jdk-jre-7-on-fedora-centos-red-hat-rhel/
>
>
>
> $ java -jar start.jar
> 2011-08-29 20:50:51.876:INFO::Logging to STDERR via
> org.mortbay.log.StdErrLog
> 2011-08-29 20:50:52.061:INFO::jetty-6.1-SNAPSHOT
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
> locateSolrHome
> INFO: JNDI not configured for solr (NoInitialContextEx)
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
> locateSolrHome
> INFO: solr home defaulted to 'solr/' (could not find system property or
> JNDI)
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
> INFO: Solr home set to 'solr/'
> Aug 29, 2011 8:50:52 PM org.apache.solr.servlet.SolrDispatchFilter init
> INFO: SolrDispatchFilter.init()
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
> locateSolrHome
> INFO: JNDI not configured for solr (NoInitialContextEx)
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
> locateSolrHome
> INFO: solr home defaulted to 'solr/' (could not find system property or
> JNDI)
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.CoreContainer$Initializer
> initialize
> INFO: looking for solr.xml:
> /home/waldo/apache-solr-3.3.0/example/solr/solr.xml
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
> locateSolrHome
> INFO: JNDI not configured for solr (NoInitialContextEx)
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
> locateSolrHome
> INFO: solr home defaulted to 'solr/' (could not find system property or
> JNDI)
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.CoreContainer 
> INFO: New CoreContainer: solrHome=solr/ instance=493847021
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
> INFO: Solr home set to 'solr/'
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
> INFO: Solr home set to 'solr/./'
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrConfig initLibs
> INFO: Adding specified lib dirs to ClassLoader
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
> replaceClassLoader
> INFO: Adding
> 'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/jempbox-LICENSE-ASL.txt'
> to classloader
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
> replaceClassLoader
> INFO: Adding
> 'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/tika-parsers-LICENSE-ASL.txt'
> to classloader
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
> replaceClassLoader
> INFO: Adding
> 'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/poi-ooxml-LICENSE-ASL.txt'
> to classloader
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
> replaceClassLoader
> INFO: Adding
> 'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/tika-parsers-0.8.jar'
> to classloader
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
> replaceClassLoader
> INFO: Adding
> 'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/bcmail-NOTICE.txt'
> to classloader
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
> replaceClassLoader
> INFO: Adding
> 'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/pdfbox-1.3.1.jar'
> to classloader
> Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader
> replaceClassLoader
> INFO: Adding
> 'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/poi-ooxml-schemas-NOTICE.txt'
> to classloader
> Aug 29, 2011 8:5

Can't Get Past Step 2 in the Tutorial

2011-08-29 Thread Waldo Jaquith
Howdy,

I’m having an incredibly frustrating time getting started with Solr. I’ve got a 
relatively fresh Slicehost server, running CentOS 5.3. I followed the 
instructions in the tutorial [1], installing Oracle Java JDK as per [2], and 
got only as far as “java -jar start.jar.” (You can see the output of that at 
the bottom of my e-mail.) Now if I open up http://localhost:8983/solr/admin/ , 
I get a 404 from “Jetty://“, saying "Problem accessing /solr/admin/. Reason: 
NOT_FOUND.” I have no idea of what to do with this. I have zero experience with 
Solr. I figured the surest way to see if it was something I wanted to use was 
to just try it, and obviously that hasn’t gone very well.

Note that I have zero experience with Java, other than finding that every 
interaction with it over the past 15 years has deepened my loathing for it. 
Assume that I know nothing about Java and you will overestimate me. I’m only 
e-mailing y’all because I was bitching about this on Twitter after a couple of 
hours or attempting to make this work, and a couple of people encouraged me to 
just try querying this list before nuking Java from orbit, having a few stiff 
drinks, and forgetting that this ever happened.

Any suggestions?

Best,
Waldo


[1] http://lucene.apache.org/solr/tutorial.html
[2] 
http://www.if-not-true-then-false.com/2010/install-sun-oracle-java-jdk-jre-7-on-fedora-centos-red-hat-rhel/



$ java -jar start.jar
2011-08-29 20:50:51.876:INFO::Logging to STDERR via org.mortbay.log.StdErrLog
2011-08-29 20:50:52.061:INFO::jetty-6.1-SNAPSHOT
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader locateSolrHome
INFO: JNDI not configured for solr (NoInitialContextEx)
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader locateSolrHome
INFO: solr home defaulted to 'solr/' (could not find system property or JNDI)
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
INFO: Solr home set to 'solr/'
Aug 29, 2011 8:50:52 PM org.apache.solr.servlet.SolrDispatchFilter init
INFO: SolrDispatchFilter.init()
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader locateSolrHome
INFO: JNDI not configured for solr (NoInitialContextEx)
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader locateSolrHome
INFO: solr home defaulted to 'solr/' (could not find system property or JNDI)
Aug 29, 2011 8:50:52 PM org.apache.solr.core.CoreContainer$Initializer 
initialize
INFO: looking for solr.xml: /home/waldo/apache-solr-3.3.0/example/solr/solr.xml
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader locateSolrHome
INFO: JNDI not configured for solr (NoInitialContextEx)
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader locateSolrHome
INFO: solr home defaulted to 'solr/' (could not find system property or JNDI)
Aug 29, 2011 8:50:52 PM org.apache.solr.core.CoreContainer 
INFO: New CoreContainer: solrHome=solr/ instance=493847021
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
INFO: Solr home set to 'solr/'
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
INFO: Solr home set to 'solr/./'
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrConfig initLibs
INFO: Adding specified lib dirs to ClassLoader
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
replaceClassLoader
INFO: Adding 
'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/jempbox-LICENSE-ASL.txt'
 to classloader
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
replaceClassLoader
INFO: Adding 
'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/tika-parsers-LICENSE-ASL.txt'
 to classloader
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
replaceClassLoader
INFO: Adding 
'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/poi-ooxml-LICENSE-ASL.txt'
 to classloader
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
replaceClassLoader
INFO: Adding 
'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/tika-parsers-0.8.jar'
 to classloader
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
replaceClassLoader
INFO: Adding 
'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/bcmail-NOTICE.txt' 
to classloader
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
replaceClassLoader
INFO: Adding 
'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/pdfbox-1.3.1.jar' to 
classloader
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
replaceClassLoader
INFO: Adding 
'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/poi-ooxml-schemas-NOTICE.txt'
 to classloader
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
replaceClassLoader
INFO: Adding 
'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/log4j-1.2.14.jar' to 
classloader
Aug 29, 2011 8:50:52 PM org.apache.solr.core.SolrResourceLoader 
replaceClassLoader
INFO: Adding 
'file:/home/waldo/apache-solr-3.3.0/contrib/extraction/lib/log4j-LICENSE-ASL.txt'
 to classloader
Aug 29, 20

Re: Changing the DocCollector

2011-08-29 Thread Jamie Johnson
Thanks Hoss.  I am actually ok with that, I think something like
50,000 results from each shard as a max would be reasonable since my
check takes about 1s for 50,000 records.  I'll give this a whirl and
see how it goes.

On Mon, Aug 29, 2011 at 6:46 PM, Chris Hostetter
 wrote:
>
> : Also I see that this is before sorting, is there a way to do something
> : similar after sorting?  The reason is that I'm ok with the total
> : result not being completely accurate so long as the first say 10 pages
> : are accurate.  The results could get more accurate as you page through
> : them though.  Does that make sense?
>
> munging results after sorting is dangerous in the general case, but if you
> have a specific usecase where you're okay with only guaranteeing accurate
> results up to result #X, then you might be able to get away with something
> like...
>
> * custom SearchComponent
> * configure to run after QueryComponent
> * in prepare, record the start & rows params, and replace them with 0 &
> (MAX_PAGE_NUM * rows)
> * in process, iterate over the the DocList and build up your own new
> DocSlice based on the docs that match your special criteria - then use the
> original start/rows to generate a subset and return that
>
> ...getting this to play nicely with stuff like faceting may be possible with
> more work, and manipulation of the DocSet (assuming you're okay with the
> facet counts only being as accurate as the DocList is -- filtered
> up to row X).
>
> it could fail miserably with distributed search since you have no idea
> how many results will pass your filter.
>
> (note: this is all off the top of my head ... no idea if it would actually
> work)
>
>
>
> -Hoss
>


Re: Solr Faceting & DIH

2011-08-29 Thread Way Cool
I think you need to set up an entity hierarchy, with product as the top-level
entity and attribute as a child entity under product; otherwise records
#2 and #3 will override the first one.
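A hedged data-config.xml sketch of that parent/child layout, with the table name and connection details as placeholders; attributeid and valueid would be declared multiValued="true" in schema.xml so each product document keeps all of its attribute/value pairs:

<document>
  <entity name="product" pk="productid"
          query="SELECT DISTINCT productid, categoryid FROM product_attribute">
    <field column="productid" name="productid"/>
    <field column="categoryid" name="categoryid"/>
    <entity name="attribute"
            query="SELECT attributeid, valueid FROM product_attribute
                   WHERE productid = '${product.productid}'">
      <field column="attributeid" name="attributeid"/>
      <field column="valueid" name="valueid"/>
    </entity>
  </entity>
</document>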

On Mon, Aug 29, 2011 at 3:52 PM, Aaron Bains  wrote:

> Hello,
>
> I am trying to setup Solr Faceting on products by using the
> DataImportHandler to import data from my database. I have setup my
> data-config.xml with the proper queries and schema.xml with the fields.
> After the import/index is complete I can only search one productid record
> in
> Solr. For example of the three productid '10100039' records there are I am
> only able to search for one of those. Should I somehow disable unique ids?
> What is the best way of doing this?
>
> Below is the schema I am trying to index:
>
> +-----------+-------------+---------+------------+
> | productid | attributeid | valueid | categoryid |
> +-----------+-------------+---------+------------+
> |  10100039 |      331100 |    1580 |          1 |
> |  10100039 |      331694 |    1581 |          1 |
> |  10100039 |    33113319 | 1537370 |          1 |
> |  10100040 |      331100 |    1580 |          1 |
> |  10100040 |      331694 | 1540230 |          1 |
> |  10100040 |    33113319 | 1537370 |          1 |
> +-----------+-------------+---------+------------+
>
> Thanks!
>


Re: Changing the DocCollector

2011-08-29 Thread Chris Hostetter

: Also I see that this is before sorting, is there a way to do something
: similar after sorting?  The reason is that I'm ok with the total
: result not being completely accurate so long as the first say 10 pages
: are accurate.  The results could get more accurate as you page through
: them though.  Does that make sense?

munging results after sorting is dangerous in the general case, but if you 
have a specific usecase where you're okay with only guaranteeing accurate 
results up to result #X, then you might be able to get away with something 
like...

* custom SearchComponent
* configure to run after QueryComponent
* in prepare, record the start & rows params, and replace them with 0 & 
(MAX_PAGE_NUM * rows)
* in process, iterate over the the DocList and build up your own new 
DocSlice based on the docs that match your special criteria - then use the 
original start/rows to generate a subset and return that

...getting this to play nicely with stuff like faceting may be possible with 
more work, and manipulation of the DocSet (assuming you're okay with the 
facet counts only being as accurate as the DocList is -- filtered 
up to row X).

it could fail miserably with distributed search since you have no idea 
how many results will pass your filter.

(note: this is all off the top of my head ... no idea if it would actually 
work)



-Hoss
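A rough, untested sketch of the outline above, for a single (non-distributed) core; the class name, MAX_PAGE_NUM, and the allowed() check are all placeholders for your own logic:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.DocSlice;
import org.apache.solr.search.SolrIndexSearcher;

public class PostFilterComponent extends SearchComponent {
  private static final int MAX_PAGE_NUM = 10;   // only the first 10 pages are guaranteed accurate

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    SolrParams params = rb.req.getParams();
    // remember the requested window, then over-fetch
    rb.req.getContext().put("origStart", params.getInt(CommonParams.START, 0));
    int rows = params.getInt(CommonParams.ROWS, 10);
    rb.req.getContext().put("origRows", rows);
    ModifiableSolrParams mp = new ModifiableSolrParams(params);
    mp.set(CommonParams.START, 0);
    mp.set(CommonParams.ROWS, MAX_PAGE_NUM * rows);
    rb.req.setParams(mp);
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    int start = (Integer) rb.req.getContext().get("origStart");
    int rows = (Integer) rb.req.getContext().get("origRows");
    DocList full = rb.getResults().docList;
    // keep only the docs that pass the custom check, preserving their sorted order
    List<Integer> keep = new ArrayList<Integer>();
    DocIterator it = full.iterator();
    while (it.hasNext()) {
      int docid = it.nextDoc();
      if (allowed(rb.req.getSearcher(), docid)) {
        keep.add(docid);
      }
    }
    // cut the original start/rows window out of the filtered list
    int end = Math.min(start + rows, keep.size());
    int[] docs = new int[Math.max(end - start, 0)];
    for (int i = start; i < end; i++) {
      docs[i - start] = keep.get(i);
    }
    // the reported match count is only as accurate as the over-fetch allows
    rb.getResults().docList = new DocSlice(0, docs.length, docs, null, keep.size(), 0.0f);
  }

  private boolean allowed(SolrIndexSearcher searcher, int docid) {
    return true;   // the per-document check goes here
  }

  public String getDescription() { return "post-filtering sketch"; }
  public String getSource() { return ""; }
  public String getSourceId() { return ""; }
  public String getVersion() { return ""; }
}

It would be registered in solrconfig.xml as a last-component on the request handler so that it runs after QueryComponent.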


Solr Faceting & DIH

2011-08-29 Thread Aaron Bains
Hello,

I am trying to set up Solr faceting on products by using the
DataImportHandler to import data from my database. I have set up my
data-config.xml with the proper queries and schema.xml with the fields.
After the import/index is complete I can only find one record per productid in
Solr. For example, of the three productid '10100039' records, I am
only able to search for one. Should I somehow disable unique ids?
What is the best way of doing this?

Below is the schema I am trying to index:

+-----------+-------------+---------+------------+
| productid | attributeid | valueid | categoryid |
+-----------+-------------+---------+------------+
|  10100039 |      331100 |    1580 |          1 |
|  10100039 |      331694 |    1581 |          1 |
|  10100039 |    33113319 | 1537370 |          1 |
|  10100040 |      331100 |    1580 |          1 |
|  10100040 |      331694 | 1540230 |          1 |
|  10100040 |    33113319 | 1537370 |          1 |
+-----------+-------------+---------+------------+

Thanks!


How to get all the terms in a document as Luke does?

2011-08-29 Thread Gabriele Kahlout
Hello,

This time I'm trying to duplicate Luke's functionality of knowing which
terms occur in a search result/document (w/o parsing it again). Any Solrj
API to do that?

P.S. I've also posted the question on SO.

On Wed, Jul 6, 2011 at 11:09 AM, Gabriele Kahlout
wrote:

> From your patch I see TermFreqVector, which provides the information I
> want.
>
> I also found FieldInvertState.getLength(), which seems to be exactly what I
> want. I'm after the word count (sum of tf for every term in the doc). I'm
> just not sure whether FieldInvertState.getLength() returns just the number
> of terms (not multiplied by the frequency of each term - word count) or not
> though. It seems as if it returns word count, but I've not tested it
> sufficiently.
>
>
> On Wed, Jul 6, 2011 at 1:39 AM, Trey Grainger 
> wrote:
>
>> Gabriele,
>>
>> I created a patch that does this about a year ago.  See
>> https://issues.apache.org/jira/browse/SOLR-1837.  It was written for Solr
>> 1.4 and is based upon the Document Reconstructor in Luke.  The patch adds
>> a
>> link to the main solr admin page to a docinspector page which will
>> reconstruct the document given a uniqueid (required).  Keep in mind that
>> you're only looking at what's "in" the index for non-stored fields, not
>> the
>> original text.
>>
>> If you have any issues using this on the most recent release, let me know
>> and I'd be happy to create a new patch for solr 3.3.  One of these days
>> I'll
>> remove the JSP dependency and this may eventually making it into trunk.
>>
>> Thanks,
>>
>> -Trey Grainger
>> Search Technology Development Team Lead, Careerbuilder.com
>> Site Architect, Celiaccess.com
>>
>>
>> On Tue, Jul 5, 2011 at 3:59 PM, Gabriele Kahlout
>> wrote:
>>
>> > Hello,
>> >
>> > With an inverted index the term is the key, and the documents are the
>> > values. Is it still however possible that given a document id I get the
>> > terms indexed for that document?
>> >
>> > --
>> > Regards,
>> > K. Gabriele
>> >
>> > --- unchanged since 20/9/10 ---
>> > P.S. If the subject contains "[LON]" or the addressee acknowledges the
>> > receipt within 48 hours then I don't resend the email.
>> > subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>> > time(x)
>> > < Now + 48h) ⇒ ¬resend(I, this).
>> >
>> > If an email is sent by a sender that is not a trusted contact or the
>> email
>> > does not contain a valid code then the email is not received. A valid
>> code
>> > starts with a hyphen and ends with "X".
>> > ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
>> > L(-[a-z]+[0-9]X)).
>> >
>>
>
>
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).
>
>


-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


Re: Solr warming when using master/slave replication

2011-08-29 Thread Mike Austin
Correction: Will traffic be served with a non "warmed" index searcher at any
point?

Thanks,
Mike

On Mon, Aug 29, 2011 at 4:52 PM, Mike Austin  wrote:

> "Distribution/Replication gives you a 'new' index on the slave. When Solr
> is told to use the new index, the old caches have to be discarded along with
> the old Index Searcher. That's when autowarming occurs.  If the current
> Index Searcher is serving requests and when a new searcher is opened, the
> new one is 'warmed' while the current one is serving external requests. When
> the new one is ready, it is registered so it can serve any new requests
> while the original one first finishes the requests it is handling. "
>
> So if warming is configured, the new index will warm before going live?
> How does that work with the copying to the new directory? Does it get warmed
> while in the temp directory before copied over?  My question is basically,
> will traffic be served with a non indexed searcher at any point?
>
> Thanks,
> Mike
>
>
> On Mon, Aug 29, 2011 at 4:45 PM, Rob Casson  wrote:
>
>> it's always been my understanding that the caches are discarded, then
>> rebuilt/warmed:
>>
>>
>> http://wiki.apache.org/solr/SolrCaching#Caching_and_Distribution.2BAC8-Replication
>>
>> hth,
>> rob
>>
>> On Mon, Aug 29, 2011 at 5:30 PM, Mike Austin 
>> wrote:
>> > How does warming work when a collection is being distributed to a slave.
>>  I
>> > understand that a temp directory is created and it is eventually copied
>> to
>> > the live folder, but what happens to the cache that was built in with
>> the
>> > old index?  Does the cache get rebuilt, can we warm it before it becomes
>> > live, or can we keep the old cache?
>> >
>> > Thanks,
>> > Mike
>> >
>>
>
>


Re: Solr warming when using master/slave replication

2011-08-29 Thread Mike Austin
"Distribution/Replication gives you a 'new' index on the slave. When Solr is
told to use the new index, the old caches have to be discarded along with
the old Index Searcher. That's when autowarming occurs.  If the current
Index Searcher is serving requests and when a new searcher is opened, the
new one is 'warmed' while the current one is serving external requests. When
the new one is ready, it is registered so it can serve any new requests
while the original one first finishes the requests it is handling. "

So if warming is configured, the new index will warm before going live?  How
does that work with the copying to the new directory? Does it get warmed
while in the temp directory before copied over?  My question is basically,
will traffic be served with a non indexed searcher at any point?

Thanks,
Mike

On Mon, Aug 29, 2011 at 4:45 PM, Rob Casson  wrote:

> it's always been my understanding that the caches are discarded, then
> rebuilt/warmed:
>
>
> http://wiki.apache.org/solr/SolrCaching#Caching_and_Distribution.2BAC8-Replication
>
> hth,
> rob
>
> On Mon, Aug 29, 2011 at 5:30 PM, Mike Austin 
> wrote:
> > How does warming work when a collection is being distributed to a slave.
>  I
> > understand that a temp directory is created and it is eventually copied
> to
> > the live folder, but what happens to the cache that was built in with the
> > old index?  Does the cache get rebuilt, can we warm it before it becomes
> > live, or can we keep the old cache?
> >
> > Thanks,
> > Mike
> >
>


Re: Solr warming when using master/slave replication

2011-08-29 Thread Rob Casson
it's always been my understanding that the caches are discarded, then
rebuilt/warmed:

 
http://wiki.apache.org/solr/SolrCaching#Caching_and_Distribution.2BAC8-Replication

hth,
rob

On Mon, Aug 29, 2011 at 5:30 PM, Mike Austin  wrote:
> How does warming work when a collection is being distributed to a slave.  I
> understand that a temp directory is created and it is eventually copied to
> the live folder, but what happens to the cache that was built in with the
> old index?  Does the cache get rebuilt, can we warm it before it becomes
> live, or can we keep the old cache?
>
> Thanks,
> Mike
>


RE: how to deal with URLDatasource which needs authorization?

2011-08-29 Thread Jaeger, Jay - DOT
So, the question then seems to be: is there a way to place credentials in the 
URLDataSource?

There doesn't seem to be an explicit user ID or password ( 
http://wiki.apache.org/solr/DataImportHandler#Configuration_of_URLDataSource_or_HttpDataSource
 ) but perhaps you can include them in URL fashion:

http://user:password@host/yadayada

(See http://www.cs.rutgers.edu/~watrous/user-pass-url.html ).

Otherwise, if that doesn't work, I guess you will have to use some other way to 
get the data other than DIH/URLDataSource (such as Tika, which does support 
passwords).

JRJ

-Original Message-
From: deniz [mailto:denizdurmu...@gmail.com] 
Sent: Thursday, August 25, 2011 8:17 PM
To: solr-user@lucene.apache.org
Subject: RE: how to deal with URLDatasource which needs authorization?

Well, let me explain the problem in detail...

I have a website www.blablabla.com on which users can have profiles, with
any kind of information. And each user has an id, something like user_xyz.
So www.blablabla.com/user_xyz shows user profile, and
www.blablabla.com/solr/index/user_xyz shows an xml file, holding all of the
static information about the user. Solr uses
www.blablabla.com/solr/index/user_xyz to index the data.

Currently www.blablabla.com/solr/index/user_xyz is accessible by everyone,
both users and non-users of the site... 

I would like to put some kind of security mechanism in place which only allows Solr to
access www.blablabla.com/solr/index/user_xyz, preventing both users and
non-users from accessing it. So that link will be a 'Solr only' link.

Are there any other options than restricting access to this link by IP address,
or is that the only option?

-
Zeki ama calismiyor... Calissa yapar...
--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-deal-with-URLDatasource-which-needs-authorization-tp3280515p3285579.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr warming when using master/slave replication

2011-08-29 Thread Mike Austin
How does warming work when a collection is being distributed to a slave? I
understand that a temp directory is created and it is eventually copied to
the live folder, but what happens to the cache that was built with the
old index? Does the cache get rebuilt, can we warm it before it becomes
live, or can we keep the old cache?

Thanks,
Mike
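For what it's worth, the knobs that control this live in solrconfig.xml on the slave: autowarmCount on the caches copies the hottest entries from the old caches (re-executed against the new index), a newSearcher listener runs explicit warmup queries against the searcher opened after replication, and useColdSearcher=false blocks requests rather than serving them from a searcher that has not finished warming (relevant when no searcher is registered yet). A hedged sketch, with the query and sort field as placeholders:

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">popular query</str><str name="sort">price asc</str></lst>
  </arr>
</listener>

<useColdSearcher>false</useColdSearcher>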


RE: SolrServer instances

2011-08-29 Thread Jaeger, Jay - DOT
It sounds like the correspondent (Jonty) is thinking just in terms of SolrJ -- 
wanting to share that across multiple threads in an application server.

In which case the question would be whether it would be possible/safe/efficient 
to share a single instantiation of the SolrJ class(es) across multiple threads. 
 Of that I have insufficient knowledge, having not worked with SolrJ.

However, if instead the correspondent were to just connect directly to Solr via 
HTTP to a single, shared, Solr instance (without SolrJ) and manage the XML 
himself/herself, then there would not be a problem at all.

JRJ
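A hedged sketch of the single-shared-instance approach: CommonsHttpSolrServer is generally treated as safe to share across threads, so one instance backed by a MultiThreadedHttpConnectionManager can serve all request threads in the webapp. The URL, connection limits, and class name below are illustrative.

import java.net.MalformedURLException;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public final class SolrServerHolder {
  private static final String SOLR_URL = "http://localhost:8983/solr";
  private static volatile SolrServer instance;

  private SolrServerHolder() {}

  public static SolrServer get() throws MalformedURLException {
    if (instance == null) {
      synchronized (SolrServerHolder.class) {
        if (instance == null) {
          MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
          mgr.getParams().setMaxTotalConnections(20);           // ~20 concurrent requests
          mgr.getParams().setDefaultMaxConnectionsPerHost(20);
          instance = new CommonsHttpSolrServer(SOLR_URL, new HttpClient(mgr));
        }
      }
    }
    return instance;
  }
}

There is nothing to close per request; the connection manager pools and reuses the HTTP connections, which also speaks to the earlier question in this thread about whether the connection needs to be closed.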

-Original Message-
From: François Schiettecatte [mailto:fschietteca...@gmail.com] 
Sent: Friday, August 26, 2011 6:12 AM
To: solr-user@lucene.apache.org
Subject: Re: SolrServer instances

Sounds to me that you are looking for HTTP Persistent Connections (connection 
keep-alive as opposed to close), and a singleton object. This would be outside 
SOLR per se.

A few caveats though: I am not sure if Tomcat supports keep-alive, I am not sure how 
Solr deals with multiple requests coming down the pipe, you will need to deal with 
concurrency, and I am not sure what you are looking to gain from this, since opening 
an HTTP connection is pretty cheap.

François

On Aug 26, 2011, at 2:09 AM, Jonty Rhods wrote:

> do I also required to close the connection from solr server
> (CommonHttpSolrServer).
> 
> regards
> 
> On Fri, Aug 26, 2011 at 9:45 AM, Jonty Rhods  wrote:
> 
>> Deal all please help I am stuck here as I have not much experience..
>> 
>> thanks
>> 
>> On Thu, Aug 25, 2011 at 6:51 PM, Jonty Rhods wrote:
>> 
>>> Hi All,
>>> 
>>> I am using SolrJ (3.1) and Tomcat 6.x. I want to open solr server once (20
>>> concurrence) and reuse this across all the site. Or something like
>>> connection pool like we are using for DB (ie Apache DBCP). There is a way to
>>> use static method which is a way but I want better solution from you people.
>>> 
>>> 
>>> 
>>> I read one threade where Ahmet suggest to use something like that
>>> 
>>> String serverPath = "http://localhost:8983/solr";
>>> HttpClient client = new HttpClient(new
>>> MultiThreadedHttpConnectionManager());
>>> URL url = new URL(serverPath);
>>> CommonsHttpSolrServer solrServer = new CommonsHttpSolrServer(url, client);
>>> 
>>> But how to use instance of this across all class.
>>> 
>>> Please suggest.
>>> 
>>> regards
>>> Jonty
>>> 
>> 
>> 



Re: Solr Geodist

2011-08-29 Thread solrnovice
Eric, thanks for the update. I thought Solr 4.0 should have the pseudo
fields, and that I am using the right version. Did you ever work on a query
that returns the distance where no long/lat is used in the where
clause? I mean not a radial search, but a city search that still displays the
distance. My thought was to pass the long and lat to geodist(), along
with the coordinates (long and lat) of every record, and let geodist() compute
the distance. Can you please let me know if this worked for you?


thanks
SN

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Geodist-tp3287005p3293779.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Geo spatial search with multi-valued locations (SOLR-2155 / lucene-spatial-playground)

2011-08-29 Thread Smiley, David W.
Hi Mike.

I have hopes that LSP will be ready in time for Solr 4. It's usable now with 
the understanding that it's still fairly early and so there are bound to be 
bugs. I've been focusing a lot on testing lately.  You could try applying 
SOLR-2155 but I think there was some Lucene/Solr code re-organization regarding 
the ValueSource API. It shouldn't be hard to update.  I don't think JTeam's 
plugin handles multi-value but I could be wrong (Chris Male will be sure to 
jump in and correct me if so).  QBase/Metacarta has a Solr plugin I've used 
indirectly through a packaged deal with their products 
http://www.metacarta.com/products-overview.htm  I have no idea if you can get 
it stand-alone. As of a few months ago, it was based on a version of Solr trunk 
from March 2010 and they have yet to update it.

~ David Smiley

On Aug 29, 2011, at 2:27 PM, Mike Austin wrote:

> Besides the full integration into solr for this, would you recommend any
> third party solr plugins such as "
> http://www.jteam.nl/products/spatialsolrplugin.html", or others?
> 
> I can understand that spacial features can get complex and there could be
> many use cases, but this seems like a "basic" feature that you would use
> with a standard set of spacial features like what is in solr4 now.
> 
> Thanks,
> Mike
> 
> On Mon, Aug 29, 2011 at 12:38 PM, Darren Govoni  wrote:
> 
>> It doesn't.
>> 
>> 
>> On 08/29/2011 01:37 PM, Mike Austin wrote:
>> 
>>> I've been trying to follow the progress of this and I'm not sure what the
>>> current status is.  Can someone update me on what is currently in Solr4
>>> and
>>> does it support multi-valued location in a single document?  I saw that
>>> SOLR-2155 was not included and is now lucene-spatial-playground.
>>> 
>>> Thanks,
>>> Mike
>>> 
>>> 
>> 



Re: Solr Geodist

2011-08-29 Thread Erik Hatcher
Based on the date, this was a LucidWorks Enterprise distribution that predated 
the addition of pseudo-fields to Solr for returning function values like that.  
We have another release of LWE coming out, oh I dunno exactly, in a month or 
so?  It'll have the latest greatest Solr "4.0" (which is constantly changing, 
so there is no single "4.0" at this point).

Erik
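For reference, on a Solr 4.0 build that does have pseudo-fields, returning the distance without filtering on it looks roughly like this (the location field name and coordinates are placeholders, and sfield must be the LatLonType field holding each record's point):

q=city:Jacksonville AND state:TN
&sfield=coordinates
&pt=<lat>,<lon>
&fl=name,street,score,dist:geodist()

geodist() picks up sfield and pt from the request, so every hit carries a dist value while the query itself stays a plain city/state search rather than a radial one.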

On Aug 29, 2011, at 15:33 , solrnovice wrote:

> hi Eric, thank you for the response. We use this product , which comes with
> solr, http://www.lucidimagination.com/.  
> They have a UI on top of solr, its easy to add fields or edit their
> properties using their UI.
> The queries I write are using the solr which comes with lucid imagination ,
> it says its SOLR version as below.
> 
> Solr Specification Version: 4.0.0.2011.03.21.13.41.34. 
> Solr Implementation Version: 4.0-SNAPSHOT exported - markrmiller -
> 2011-03-21 13:41:34
> Lucene Specification Version: 4.0-SNAPSHOT
> Lucene Implementation Version: 4.0-SNAPSHOT exported - 2011-03-21 13:42:35
> 
> I am assuming this is SOLR versin 4.0. I haven't downloaded just solr, and
> run my queries, i was using Lucidimagination's solr.  
> 
> Also is it possible to return the distance,  in the below scenario, we dont
> want to perform a lat long search. Suppose  the user types in
> "Jacksonville", "TN", can we run the query and return all the search
> results, and show the distance from Jacksonville TN on all search results. I
> mean we know the lat and long , but we dont want to use that and perform a
> radial search. We just want to perform a city, state search and return
> distance. I tried passing in fl=geodist(lat,
> long,coordinates_0_coordinate,coordinates_1_coordinate),score,name,street...etc,
>  
> where the coordinates_0 and coordinates_1 are the lat and long of all the
> records in the solr index. This is not returning the distance. 
> Can you please let me know if you tried this option?
> 
> 
> thanks
> SN
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Geodist-tp3287005p3293551.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Geodist

2011-08-29 Thread solrnovice
hi Eric, thank you for the response. We use this product, which comes with
solr: http://www.lucidimagination.com/.
They have a UI on top of Solr; it's easy to add fields or edit their
properties using their UI.
The queries I write use the Solr which comes with Lucid Imagination;
it reports its Solr version as below.

Solr Specification Version: 4.0.0.2011.03.21.13.41.34. 
Solr Implementation Version: 4.0-SNAPSHOT exported - markrmiller -
2011-03-21 13:41:34
Lucene Specification Version: 4.0-SNAPSHOT
Lucene Implementation Version: 4.0-SNAPSHOT exported - 2011-03-21 13:42:35

I am assuming this is Solr version 4.0. I haven't downloaded plain Solr and
run my queries; I was using Lucid Imagination's Solr.

Also, is it possible to return the distance in the below scenario, where we don't
want to perform a lat/long search? Suppose the user types in
"Jacksonville", "TN"; can we run the query, return all the search
results, and show the distance from Jacksonville, TN on every search result? I
mean, we know the lat and long, but we don't want to use that to perform a
radial search. We just want to perform a city/state search and return the
distance. I tried passing in fl=geodist(lat,
long,coordinates_0_coordinate,coordinates_1_coordinate),score,name,street...etc,
where coordinates_0 and coordinates_1 are the lat and long of all the
records in the solr index. This is not returning the distance.
Can you please let me know if you have tried this option?


thanks
SN

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Geodist-tp3287005p3293551.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Index directories on slaves

2011-08-29 Thread Ian Connor
This turned out to be a missing SolrDeletionPolicy in the configuration.

Once the slaves had a SolrDeletionPolicy, they stopped growing out of
control.

Ian.
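For reference, the deletion policy is configured inside <mainIndex> (or <indexDefaults>) in solrconfig.xml; the stock example configuration ships with roughly this, which keeps only the latest commit point:

<deletionPolicy class="solr.SolrDeletionPolicy">
  <str name="maxCommitsToKeep">1</str>
  <str name="maxOptimizedCommitsToKeep">0</str>
</deletionPolicy>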

On Wed, Aug 17, 2011 at 8:46 AM, Ian Connor  wrote:

> Hi,
>
> We have noticed that many index.* directories are appearing on slaves (some
> more than others).
>
> e.g. ls shows
>
> index/index.20110101021510/ index.20110105030400/
> index.20110106040701/ index.20110130031416/
> index.20101222081713/ index.20110101034500/ index.20110105075100/
> index.20110107085605/ index.20110812153349/
> index.20101231011754/ index.20110105022600/ index.20110106024902/
> index.20110108014100/ index.20110814204200/
>
> Are these harmful? Should I clean them out? I see a command for backup
> cleanup but am not sure of the best way to clean these up (apart from removing
> all index* directories and getting a fresh replica).
>
> We have also seen on the latest 3.4 build that replicas are getting 1000s
> of files even though the masters have fewer than 100 each. It seems as
> though they are not deleting after some replications, and we are not sure if this is
> also related. We are trying to monitor this to see if we can find out how to
> reproduce it, or at least the conditions that tend to reproduce it.
>
> --
> Regards,
>
> Ian Connor
> 1 Leighton St #723
> Cambridge, MA 02141
> Call Center Phone: +1 (714) 239 3875 (24 hrs)
> Fax: +1(770) 818 5697
> Skype: ian.connor
>


Re: dependency injection in solr

2011-08-29 Thread Federico Fissore

Tomás Fernández Löbbe wrote on 29/08/2011 17:58:

 I think I get it. Many of the objects that depend on the configuration
are instantiated by using reflection, is that an option for you?


Yes, it is.
What do you propose?


Possible bug in MoreLikeThisHandler debugging?

2011-08-29 Thread Andrés Cobas
Hi.

After upgrading to solr 3.3.0 from 1.4.0, I noticed that I couldn't get the
MoreLikeThisHandler to return debugging data. I tried the debug parameters
 debugQuery and debug, but all I got was:
 true

I took a look at the code for the MoreLikeThisHandler, and noted in the
debugging part that the handler is adding the variable dbg to the response
(line 211):
 rsp.add("debug", dbg);

That variable is created at line 197:

boolean dbg = req.getParams().getBool(CommonParams.DEBUG_QUERY, false);


I suppose the correct variable to add to the response would be dbgInfo:

NamedList<Object> dbgInfo = SolrPluginUtils.doStandardDebug(req, q,
    mlt.getRawMLTQuery(), mltDocs.docList);
if (null != dbgInfo) {
  if (null != filters) {
    dbgInfo.add("filter_queries", req.getParams().getParams(CommonParams.FQ));
    List<String> fqs = new ArrayList<String>(filters.size());
    for (Query fq : filters) {
      fqs.add(QueryParsing.toString(fq, req.getSchema()));
    }
    dbgInfo.add("parsed_filter_queries", fqs);
  }

Summarizing, i believe line 211 should be changed to:

rsp.add("debug", dbgInfo);

Thanks a lot,

Andrés Cobas


Re: dependency injection in solr

2011-08-29 Thread Tomás Fernández Löbbe
You can use reflection to instantiate the correct object (specify the class
name in a parameter in solrconfig.xml and then invoke the constructor via
reflection). You'll have to manage the life-cycle of the object yourself.
If I understand your requirement, you have probably created a
SearchComponent that uses that "retriever", right?
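A hedged sketch of that approach; the Retriever interface and the retrieverClass parameter name are made up for illustration, and the collaborator's life-cycle is left to the component:

import java.io.IOException;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class RetrieverComponent extends SearchComponent {

  /** Hypothetical collaborator configured via solrconfig.xml. */
  public interface Retriever {
    String fetch(String id);
  }

  private Retriever retriever;

  @Override
  @SuppressWarnings("rawtypes")
  public void init(NamedList args) {
    super.init(args);
    // e.g. <str name="retrieverClass">com.example.DbRetriever</str> in the component's config
    String className = (String) args.get("retrieverClass");
    try {
      retriever = (Retriever) Class.forName(className).newInstance();
    } catch (Exception e) {
      throw new RuntimeException("Could not instantiate " + className, e);
    }
  }

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // use this.retriever here
  }

  public String getDescription() { return "component with a reflectively created retriever"; }
  public String getSource() { return ""; }
  public String getSourceId() { return ""; }
  public String getVersion() { return ""; }
}

If the class should be resolved with Solr's own short-name conventions, implementing SolrCoreAware and using the core's SolrResourceLoader in inform() is another route.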

On Mon, Aug 29, 2011 at 1:30 PM, Federico Fissore  wrote:

> Tomás Fernández Löbbe, il 29/08/2011 17:58, ha scritto:
>
>   I think I get it. Many of the objects that depend on the
>> configuration
>> are instantiated by using reflection, is that an option for you?
>>
>>
> yes it is
> what do you propose?
>


Re: Geo spatial search with multi-valued locations (SOLR-2155 / lucene-spatial-playground)

2011-08-29 Thread Mike Austin
Besides the full integration into solr for this, would you recommend any
third party solr plugins such as "
http://www.jteam.nl/products/spatialsolrplugin.html", or others?

I can understand that spatial features can get complex and there could be
many use cases, but this seems like a "basic" feature that you would use
with a standard set of spatial features like what is in Solr 4 now.

Thanks,
Mike

On Mon, Aug 29, 2011 at 12:38 PM, Darren Govoni  wrote:

> It doesn't.
>
>
> On 08/29/2011 01:37 PM, Mike Austin wrote:
>
>> I've been trying to follow the progress of this and I'm not sure what the
>> current status is.  Can someone update me on what is currently in Solr4
>> and
>> does it support multi-valued location in a single document?  I saw that
>> SOLR-2155 was not included and is now lucene-spatial-playground.
>>
>> Thanks,
>> Mike
>>
>>
>


Re: Post Processing Solr Results

2011-08-29 Thread Erik Hatcher
Sounds like you're looking for https://issues.apache.org/jira/browse/SOLR-2429 
which has been committed to trunk and also the 3_x branch (after the release of 
3.3).

Erik

On Aug 29, 2011, at 11:46 , Jamie Johnson wrote:

> Thanks guys, perhaps I am just going about this the wrong way.  So let
> me explain my problem and perhaps there is a more appropriate
> solution.  What I need to do is basically hide certain results based
> on some passed in user parameter (say their service tier for
> instance).  What I'd like to do is have some way to plugin my custom
> logic to basically remove certain documents from the result set using
> this information.  Now that being said I technically don't need to
> remove the documents from the full result set, I really only need to
> remove them from current page (but still ensuring that a page is
> filled and sorted).  At present I'm trying to see if there is a way
> for me to add this type of logic after the QueryComponent has
> executed, perhaps by going through the DocIdandSet at this point and
> then intersecting the DocIdSet with a DocIdSet which would filter out
> the stuff I don't want seen.  Does this sound reasonable or like a
> fools errand?
> 
> 
> 
> On Mon, Aug 29, 2011 at 10:51 AM, Erik Hatcher  wrote:
>> I haven't followed the details, but what I'm guessing you want here is 
>> Lucene's FieldCache.  Perhaps something along the lines of how faceting uses 
>> it (in SimpleFacets.java) -
>> 
>>   FieldCache.DocTermsIndex si = 
>> FieldCache.DEFAULT.getTermsIndex(searcher.getIndexReader(), fieldName);
>> 
>>Erik
>> 
>> On Aug 29, 2011, at 09:58 , Erick Erickson wrote:
>> 
>>> If you're asking whether there's a way to find, say,
>>> all the values for the "auth" field associated with
>>> a document... no. The nature of an inverted
>>> index makes this hard (think of finding all
>>> the definitions in a dictionary where the word
>>> "earth" was in the definition).
>>> 
>>> Best
>>> Erick
>>> 
>>> On Mon, Aug 29, 2011 at 9:21 AM, Jamie Johnson  wrote:
 Thanks Erick, if I did not know the token up front that could be in
 the index is there not an efficient way to get the field for a
 specific document and do some custom processing on it?
 
 On Mon, Aug 29, 2011 at 8:34 AM, Erick Erickson  
 wrote:
> Start here I think:
> 
> http://lucene.apache.org/java/3_0_2/api/core/index.html?org/apache/lucene/index/TermDocs.html
> 
> Best
> Erick
> 
> On Mon, Aug 29, 2011 at 8:24 AM, Jamie Johnson  wrote:
>> Thanks for the reply.  The fields I want are indexed, but how would I
>> go directly at the fields I wanted?
>> 
>> In regards to indexing the auth tokens I've thought about this and am
>> trying to get confirmation if that is reasonable given our
>> constraints.
>> 
>> On Mon, Aug 29, 2011 at 8:20 AM, Erick Erickson 
>>  wrote:
>>> Yeah, loading the document inside a Collector is a
>>> definite no-no. Have you tried going directly
>>> at the fields you want (assuming they're
>>> indexed)? That *should* be much faster, but
>>> whether it'll be fast enough is a good question. I'm
>>> thinking some of the Terms methods here. You
>>> *might* get some joy out of making sure lazy
>>> field loading is enabled (and make sure the
>>> fields you're accessing for your logic are
>>> indexed), but I'm not entirely sure about
>>> that bit.
>>> 
>>> This kind of problem is sometimes handled
>>> by indexing "auth tokens" with the documents
>>> and including an OR clause on the query
>>> with the authorizations for a particular
>>> user, but that works best if there is an upper
>>> limit (in the 100s) of tokens that a user can possibly
>>> have, often this works best with some kind of
>>> grouping. Making this work when a user can
>>> have tens of thousands of auth tokens is...er...
>>> contra-indicated...
>>> 
>>> Hope this helps a bit...
>>> Erick
>>> 
>>> On Sun, Aug 28, 2011 at 11:59 PM, Jamie Johnson  
>>> wrote:
 Just a bit more information.  Inside my class which extends
 FilteredDocIdSet all of the time seems to be getting spent in
 retrieving the document from the readerCtx, doing this
 
 Document doc = readerCtx.reader.document(docid);
 
 If I comment out this and just return true things fly along as I
 expect.  My query is returning a total of 2 million documents also.
 
 On Sun, Aug 28, 2011 at 11:39 AM, Jamie Johnson  
 wrote:
> I have a need to post process Solr results based on some access
> controls which are setup outside of Solr, currently we've written
> something that extends SearchComponent and in the prepare method I'm
> doing something like this
> 
>QueryWrapperFilter qwf = new
> QueryWra

Re: Post Processing Solr Results

2011-08-29 Thread Erick Erickson
It's reasonable, but post-filtering is often difficult because you have
too many documents to wade through. If you can see any way
at all to just include a clause in the query, you'll save a world
of effort...

Is there any way you can include a value in some kind of
"permissions" field? Let's say you have a document that
is only to be visible for "tier 1" customers. If your permissions
field contained the tiers (e.g. tier0, tier1), then a simple
AND permissions:tier1 would do the trick...

I know this is a trivial example, but you see where this is headed.
The documents can contain as many of these tokens in permissions
as you want. As long as you can string together a clause
like "AND permissions:(A OR B OR C)" and not have the clause
get ridiculously long (as in thousands of values), that works best.
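Purely as a sketch (SolrJ, with a made-up "permissions" field and made-up
tokens), the client side could tack that clause on as a filter query:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    // 'solr' is whatever SolrServer instance you already use
    SolrQuery query = new SolrQuery("some user query");
    // only documents carrying at least one of this user's tokens survive
    query.addFilterQuery("permissions:(tier1 OR groupA OR groupB)");
    QueryResponse rsp = solr.query(query);

A filter query also gets cached independently of the main query, which helps
when many users share the same token set.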

Any such scheme depends upon being able to assign the documents
some kind of code that doesn't change too often (because when it does
you have to re-index) and figure out, at query time, what permissions
a user has.

Using FieldCache or low-level Lucene routines can answer the question
"Does doc X contain token Y in field Z" reasonably easily. What it has
a hard time doing is answering "For document X, what are all the values
in the inverted index in field Z".

If this doesn't make sense, could you explain a bit more about your
permissions model?

Hope this helps
Erick

On Mon, Aug 29, 2011 at 11:46 AM, Jamie Johnson  wrote:
> Thanks guys, perhaps I am just going about this the wrong way.  So let
> me explain my problem and perhaps there is a more appropriate
> solution.  What I need to do is basically hide certain results based
> on some passed in user parameter (say their service tier for
> instance).  What I'd like to do is have some way to plugin my custom
> logic to basically remove certain documents from the result set using
> this information.  Now that being said I technically don't need to
> remove the documents from the full result set, I really only need to
> remove them from current page (but still ensuring that a page is
> filled and sorted).  At present I'm trying to see if there is a way
> for me to add this type of logic after the QueryComponent has
> executed, perhaps by going through the DocIdandSet at this point and
> then intersecting the DocIdSet with a DocIdSet which would filter out
> the stuff I don't want seen.  Does this sound reasonable or like a
> fools errand?
>
>
>
> On Mon, Aug 29, 2011 at 10:51 AM, Erik Hatcher  wrote:
>> I haven't followed the details, but what I'm guessing you want here is 
>> Lucene's FieldCache.  Perhaps something along the lines of how faceting uses 
>> it (in SimpleFacets.java) -
>>
>>   FieldCache.DocTermsIndex si = 
>> FieldCache.DEFAULT.getTermsIndex(searcher.getIndexReader(), fieldName);
>>
>>        Erik
>>
>> On Aug 29, 2011, at 09:58 , Erick Erickson wrote:
>>
>>> If you're asking whether there's a way to find, say,
>>> all the values for the "auth" field associated with
>>> a document... no. The nature of an inverted
>>> index makes this hard (think of finding all
>>> the definitions in a dictionary where the word
>>> "earth" was in the definition).
>>>
>>> Best
>>> Erick
>>>
>>> On Mon, Aug 29, 2011 at 9:21 AM, Jamie Johnson  wrote:
 Thanks Erick, if I did not know the token up front that could be in
 the index is there not an efficient way to get the field for a
 specific document and do some custom processing on it?

 On Mon, Aug 29, 2011 at 8:34 AM, Erick Erickson  
 wrote:
> Start here I think:
>
> http://lucene.apache.org/java/3_0_2/api/core/index.html?org/apache/lucene/index/TermDocs.html
>
> Best
> Erick
>
> On Mon, Aug 29, 2011 at 8:24 AM, Jamie Johnson  wrote:
>> Thanks for the reply.  The fields I want are indexed, but how would I
>> go directly at the fields I wanted?
>>
>> In regards to indexing the auth tokens I've thought about this and am
>> trying to get confirmation if that is reasonable given our
>> constraints.
>>
>> On Mon, Aug 29, 2011 at 8:20 AM, Erick Erickson 
>>  wrote:
>>> Yeah, loading the document inside a Collector is a
>>> definite no-no. Have you tried going directly
>>> at the fields you want (assuming they're
>>> indexed)? That *should* be much faster, but
>>> whether it'll be fast enough is a good question. I'm
>>> thinking some of the Terms methods here. You
>>> *might* get some joy out of making sure lazy
>>> field loading is enabled (and make sure the
>>> fields you're accessing for your logic are
>>> indexed), but I'm not entirely sure about
>>> that bit.
>>>
>>> This kind of problem is sometimes handled
>>> by indexing "auth tokens" with the documents
>>> and including an OR clause on the query
>>> with the authorizations for a particular
>>> user, but that works best if there is an upper
>>> limit (in the 100s) of tokens that a user can possibly
>

Re: Geo spatial search with multi-valued locations (SOLR-2155 / lucene-spatial-playground)

2011-08-29 Thread Darren Govoni

It doesn't.

On 08/29/2011 01:37 PM, Mike Austin wrote:

I've been trying to follow the progress of this and I'm not sure what the
current status is.  Can someone update me on what is currently in Solr4 and
does it support multi-valued location in a single document?  I saw that
SOLR-2155 was not included and is now lucene-spatial-playground.

Thanks,
Mike





Geo spatial search with multi-valued locations (SOLR-2155 / lucene-spatial-playground)

2011-08-29 Thread Mike Austin
I've been trying to follow the progress of this and I'm not sure what the
current status is.  Can someone update me on what is currently in Solr4 and
does it support multi-valued location in a single document?  I saw that
SOLR-2155 was not included and is now lucene-spatial-playground.

Thanks,
Mike


Re: DIH importing

2011-08-29 Thread Mark

Thanks, I'll give that a try.

On 8/26/11 9:54 AM, simon wrote:

It sounds as though you are optimizing the index after the delta import. If
you don't do that, then only new segments will be replicated and syncing
will be much faster.


On Fri, Aug 26, 2011 at 12:08 PM, Mark  wrote:


We are currently delta-importing using DIH after which all of our servers
have to download the full index (16G). This obviously puts quite a strain on
our slaves while they are syncing over the index. Is there anyway not to
sync over the whole index, but rather just the parts that have changed?

We would like to get to the point where are no longer using DIH but rather
we are constantly sending documents over HTTP to our master in realtime. We
would then like our slaves to download these changes as soon as possible. Is
something like this even possible?

Thanks for you help



Re: Changing the DocCollector

2011-08-29 Thread Yonik Seeley
On Mon, Aug 29, 2011 at 12:44 PM, Jamie Johnson  wrote:
> Also I see that this is before sorting, is there a way to do something
> similar after sorting?

If you want post-sorting, then you don't want anything based on Collector.
A custom search component that runs after the query component (or a
custom query component) would probably be the right approach.
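Something along these lines, as a very rough sketch (the class name and the
tier handling are invented; only the shape matters):

    import java.io.IOException;

    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;

    public class TierPostProcessComponent extends SearchComponent {

      @Override
      public void prepare(ResponseBuilder rb) throws IOException {
        // e.g. read the user's tier from rb.req.getParams() and stash it
        // in rb.req.getContext() for later
      }

      @Override
      public void process(ResponseBuilder rb) throws IOException {
        // register this after QueryComponent in solrconfig.xml; by the time
        // it runs, rb holds the sorted results, so adjust the page here
      }

      @Override
      public String getDescription() { return "post-sort filtering (sketch)"; }

      @Override
      public String getSource() { return ""; }

      @Override
      public String getSourceId() { return ""; }

      @Override
      public String getVersion() { return ""; }
    }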

> I am also not seeing this in my distribution, in what release of Solr did 
> this get introduced?

The PostFilter stuff is in the latest (unreleased) 3x and trunk (so it
will be 3.4 and 4.0 post release).

-Yonik
http://www.lucidimagination.com


Re: Changing the DocCollector

2011-08-29 Thread Jamie Johnson
I am also not seeing this in my distribution, in what release of Solr
did this get introduced?

On Mon, Aug 29, 2011 at 12:44 PM, Jamie Johnson  wrote:
> Also I see that this is before sorting, is there a way to do something
> similar after sorting?  The reason is that I'm ok with the total
> result not being completely accurate so long as the first say 10 pages
> are accurate.  The results could get more accurate as you page through
> them though.  Does that make sense?
>
> On Mon, Aug 29, 2011 at 12:41 PM, Jamie Johnson  wrote:
>> This is related to post processing of documents based on a users
>> attribute (like service tier for instance) when it is not feasible to
>> put the tier tokens into solr and just do a query on them.
>>
>> So basically if I always want to run this I could implement a custom
>> search component and in prepare modify the query to include this
>> additional filter?
>>
>> On Mon, Aug 29, 2011 at 12:31 PM, Yonik Seeley
>>  wrote:
>>> On Mon, Aug 29, 2011 at 12:24 PM, Jamie Johnson  wrote:
 Is there any configuration that can be done to change the Doc
 Collector used in SolrIndexSearcher?
>>>
>>> The most generic way would be to use a post-filter (which can insert a
>>> custom collector into the chain).
>>> http://wiki.apache.org/solr/CommonQueryParameters#Caching_of_filters
>>>
>>> People may have more ideas if you can explain what you are trying to
>>> accomplish at a higher level though.
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>
>


Re: Changing the DocCollector

2011-08-29 Thread Jamie Johnson
Also I see that this is before sorting, is there a way to do something
similar after sorting?  The reason is that I'm ok with the total
result not being completely accurate so long as the first say 10 pages
are accurate.  The results could get more accurate as you page through
them though.  Does that make sense?

On Mon, Aug 29, 2011 at 12:41 PM, Jamie Johnson  wrote:
> This is related to post processing of documents based on a users
> attribute (like service tier for instance) when it is not feasible to
> put the tier tokens into solr and just do a query on them.
>
> So basically if I always want to run this I could implement a custom
> search component and in prepare modify the query to include this
> additional filter?
>
> On Mon, Aug 29, 2011 at 12:31 PM, Yonik Seeley
>  wrote:
>> On Mon, Aug 29, 2011 at 12:24 PM, Jamie Johnson  wrote:
>>> Is there any configuration that can be done to change the Doc
>>> Collector used in SolrIndexSearcher?
>>
>> The most generic way would be to use a post-filter (which can insert a
>> custom collector into the chain).
>> http://wiki.apache.org/solr/CommonQueryParameters#Caching_of_filters
>>
>> People may have more ideas if you can explain what you are trying to
>> accomplish at a higher level though.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>


Re: Changing the DocCollector

2011-08-29 Thread Jamie Johnson
This is related to post-processing of documents based on a user's
attribute (like service tier, for instance) when it is not feasible to
put the tier tokens into solr and just do a query on them.

So basically if I always want to run this I could implement a custom
search component and in prepare modify the query to include this
additional filter?

On Mon, Aug 29, 2011 at 12:31 PM, Yonik Seeley
 wrote:
> On Mon, Aug 29, 2011 at 12:24 PM, Jamie Johnson  wrote:
>> Is there any configuration that can be done to change the Doc
>> Collector used in SolrIndexSearcher?
>
> The most generic way would be to use a post-filter (which can insert a
> custom collector into the chain).
> http://wiki.apache.org/solr/CommonQueryParameters#Caching_of_filters
>
> People may have more ideas if you can explain what you are trying to
> accomplish at a higher level though.
>
> -Yonik
> http://www.lucidimagination.com
>


Re: Changing the DocCollector

2011-08-29 Thread Yonik Seeley
On Mon, Aug 29, 2011 at 12:24 PM, Jamie Johnson  wrote:
> Is there any configuration that can be done to change the Doc
> Collector used in SolrIndexSearcher?

The most generic way would be to use a post-filter (which can insert a
custom collector into the chain).
http://wiki.apache.org/solr/CommonQueryParameters#Caching_of_filters

People may have more ideas if you can explain what you are trying to
accomplish at a higher level though.

-Yonik
http://www.lucidimagination.com


Re: dependency injection in solr

2011-08-29 Thread Federico Fissore

Tomás Fernández Löbbe, on 29/08/2011 17:58, wrote:

 I think I get it. Many of the objects that depend on the configuration
are instantiated by using reflection, is that an option for you?



yes it is
what do you propose?


Changing the DocCollector

2011-08-29 Thread Jamie Johnson
Is there any configuration that can be done to change the Doc
Collector used in SolrIndexSearcher?


Re: How to list all dynamic fields of a document using solrj?

2011-08-29 Thread Juan Grande
Hi Michael,

It's supposed to work. Can we see a snippet of the code you're using to
retrieve the fields?

*Juan*



On Mon, Aug 29, 2011 at 8:33 AM, Michael Szalay
wrote:

> Hi all
>
> how can I list all dynamic fields and their values of a document using
> solrj?
> The dynamic fields are never returned when I use setFields(*).
>
> Thanks
>
> Michael
>
> --
> Michael Szalay
> Senior Software Engineer
>
> basis06 AG, Birkenweg 61, CH-3013 Bern - Fon +41 31 311 32 22
> http://www.basis06.ch - source of smart business
>
>


Re: dependency injection in solr

2011-08-29 Thread Tomás Fernández Löbbe
 I think I get it. Many of the objects that depend on the configuration
are instantiated using reflection; is that an option for you?
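If it helps, the pattern is just plain reflection driven by an init argument;
a stripped-down sketch (all names invented):

    // the interface your component programs against
    interface ClassificationRetriever {
      String classify(String text) throws Exception;
    }

    public class RetrieverFactory {
      // 'className' would typically come from the NamedList passed to your
      // component's init(), i.e. a <str name="retrieverClass"> child of the
      // <searchComponent> element in solrconfig.xml
      public static ClassificationRetriever create(String className) throws Exception {
        return (ClassificationRetriever) Class.forName(className).newInstance();
      }
    }

That way the hosted deployment and the customer's local install only differ
in the class name written in the config file.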

On Mon, Aug 29, 2011 at 12:33 PM, Federico Fissore  wrote:

> Tomás Fernández Löbbe, il 29/08/2011 16:39, ha scritto:
>
>  You can do a lot of dependency injection though solrconfig.xml and
>> schema.xml, Specify search components, update processors, filters,
>> similarity, etc. Solr doesn't use any DI framework, everything is built-in
>> in a pluggable manner. What kind of customizations do you need to apply?
>> maybe we can point you better.
>>
>
>
> for example, I have a classification component: it can retrieve
> classification data either from the file system or via a rest call.
>
> the "retriever" is a parameter of the component and we want to set it
> depending if we host the searcher or if the customer choose to install it
> locally
>
> i would like to inject a custom bean into the search component: at init
> time, it will just call the retrieve method and that will return the
> classification
>
> (ps: we are trying to migrate our existing system to solr)
>
> thanks in advance
>
> federico
>


Re: Post Processing Solr Results

2011-08-29 Thread Jamie Johnson
Thanks guys, perhaps I am just going about this the wrong way.  So let
me explain my problem and perhaps there is a more appropriate
solution.  What I need to do is basically hide certain results based
on some passed in user parameter (say their service tier for
instance).  What I'd like to do is have some way to plugin my custom
logic to basically remove certain documents from the result set using
this information.  Now that being said I technically don't need to
remove the documents from the full result set, I really only need to
remove them from the current page (but still ensuring that a page is
filled and sorted).  At present I'm trying to see if there is a way
for me to add this type of logic after the QueryComponent has
executed, perhaps by going through the DocIdandSet at this point and
then intersecting the DocIdSet with a DocIdSet which would filter out
the stuff I don't want seen.  Does this sound reasonable or like a
fool's errand?



On Mon, Aug 29, 2011 at 10:51 AM, Erik Hatcher  wrote:
> I haven't followed the details, but what I'm guessing you want here is 
> Lucene's FieldCache.  Perhaps something along the lines of how faceting uses 
> it (in SimpleFacets.java) -
>
>   FieldCache.DocTermsIndex si = 
> FieldCache.DEFAULT.getTermsIndex(searcher.getIndexReader(), fieldName);
>
>        Erik
>
> On Aug 29, 2011, at 09:58 , Erick Erickson wrote:
>
>> If you're asking whether there's a way to find, say,
>> all the values for the "auth" field associated with
>> a document... no. The nature of an inverted
>> index makes this hard (think of finding all
>> the definitions in a dictionary where the word
>> "earth" was in the definition).
>>
>> Best
>> Erick
>>
>> On Mon, Aug 29, 2011 at 9:21 AM, Jamie Johnson  wrote:
>>> Thanks Erick, if I did not know the token up front that could be in
>>> the index is there not an efficient way to get the field for a
>>> specific document and do some custom processing on it?
>>>
>>> On Mon, Aug 29, 2011 at 8:34 AM, Erick Erickson  
>>> wrote:
 Start here I think:

 http://lucene.apache.org/java/3_0_2/api/core/index.html?org/apache/lucene/index/TermDocs.html

 Best
 Erick

 On Mon, Aug 29, 2011 at 8:24 AM, Jamie Johnson  wrote:
> Thanks for the reply.  The fields I want are indexed, but how would I
> go directly at the fields I wanted?
>
> In regards to indexing the auth tokens I've thought about this and am
> trying to get confirmation if that is reasonable given our
> constraints.
>
> On Mon, Aug 29, 2011 at 8:20 AM, Erick Erickson  
> wrote:
>> Yeah, loading the document inside a Collector is a
>> definite no-no. Have you tried going directly
>> at the fields you want (assuming they're
>> indexed)? That *should* be much faster, but
>> whether it'll be fast enough is a good question. I'm
>> thinking some of the Terms methods here. You
>> *might* get some joy out of making sure lazy
>> field loading is enabled (and make sure the
>> fields you're accessing for your logic are
>> indexed), but I'm not entirely sure about
>> that bit.
>>
>> This kind of problem is sometimes handled
>> by indexing "auth tokens" with the documents
>> and including an OR clause on the query
>> with the authorizations for a particular
>> user, but that works best if there is an upper
>> limit (in the 100s) of tokens that a user can possibly
>> have, often this works best with some kind of
>> grouping. Making this work when a user can
>> have tens of thousands of auth tokens is...er...
>> contra-indicated...
>>
>> Hope this helps a bit...
>> Erick
>>
>> On Sun, Aug 28, 2011 at 11:59 PM, Jamie Johnson  
>> wrote:
>>> Just a bit more information.  Inside my class which extends
>>> FilteredDocIdSet all of the time seems to be getting spent in
>>> retrieving the document from the readerCtx, doing this
>>>
>>> Document doc = readerCtx.reader.document(docid);
>>>
>>> If I comment out this and just return true things fly along as I
>>> expect.  My query is returning a total of 2 million documents also.
>>>
>>> On Sun, Aug 28, 2011 at 11:39 AM, Jamie Johnson  
>>> wrote:
 I have a need to post process Solr results based on some access
 controls which are setup outside of Solr, currently we've written
 something that extends SearchComponent and in the prepare method I'm
 doing something like this

                    QueryWrapperFilter qwf = new
 QueryWrapperFilter(rb.getQuery());
                    Filter filter = new CustomFilter(qwf);
                    FilteredQuery fq = new FilteredQuery(rb.getQuery(), 
 filter);
                    rb.setQuery(fq);

 Inside my CustomFilter I have a FilteredDocIdSet which checks if the
 document should be returned.  This works as I 

Possible bug in MoreLikeThisHandler debugging?

2011-08-29 Thread Andrés Cobas
Hi.

After upgrading to solr 3.3.0 from 1.4.0, I noticed that I couldn't get the
MoreLikeThisHandler to return debugging data. I tried the debug parameters
 debugQuery and debug, but all I got was:
 <bool name="debug">true</bool>

I took a look at the code for the MoreLikeThisHandler, and noted in the
debugging part that the handler is adding the variable dbg to the response
(line 211):
 rsp.add("debug", dbg);

Such variable is created at line 197:

boolean dbg = req.getParams().getBool(CommonParams.DEBUG_QUERY, false);


I suppose the correct variable to add to the response would be dbgInfo:

NamedList<Object> dbgInfo = SolrPluginUtils.doStandardDebug(req, q,
    mlt.getRawMLTQuery(), mltDocs.docList);
if (null != dbgInfo) {
  if (null != filters) {
    dbgInfo.add("filter_queries",req.getParams().getParams(CommonParams.FQ));
    List<String> fqs = new ArrayList<String>(filters.size());
    for (Query fq : filters) {
      fqs.add(QueryParsing.toString(fq, req.getSchema()));
    }
    dbgInfo.add("parsed_filter_queries",fqs);
  }

Summarizing, I believe line 211 should be changed to:

rsp.add("debug", dbgInfo);

Thanks a lot,

Andrés Cobas


Re: dependency injection in solr

2011-08-29 Thread Federico Fissore

Tomás Fernández Löbbe, on 29/08/2011 16:39, wrote:

You can do a lot of dependency injection though solrconfig.xml and
schema.xml, Specify search components, update processors, filters,
similarity, etc. Solr doesn't use any DI framework, everything is built-in
in a pluggable manner. What kind of customizations do you need to apply?
maybe we can point you better.



for example, I have a classification component: it can retrieve
classification data either from the file system or via a REST call.


The "retriever" is a parameter of the component and we want to set it
depending on whether we host the searcher or the customer chooses to install
it locally.


I would like to inject a custom bean into the search component: at init
time, it will just call the retrieve method and that will return the
classification.


(ps: we are trying to migrate our existing system to solr)

thanks in advance

federico


Re: Solr custom plugins: is it possible to have them persistent?

2011-08-29 Thread samuele.mattiuzzo
it's how I'm doing it now... but I'm not sure I'm placing the objects in
the right place

a significant part of my code is here: http://pastie.org/2448984

(I've omitted the method implementations since they are pretty long)

inside the method setLocation, I create the connection to the MySQL database

inside the method setFieldPosition, I create the categorization object

Then I started thinking I was creating and deleting those objects locally
every time Solr reads a document to index. So, where should I put them?
Inside the tothegocustom class constructor, after the super call?

I'm asking this because I'm not sure if my custom UpdateRequestProcessor is
created once or once per parsed document (I'm still learning Solr, but I
think I'm getting into it, bit by bit!)

Thanks again!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-custom-plugins-is-it-possible-to-have-them-persistent-tp3292781p3292928.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Post Processing Solr Results

2011-08-29 Thread Erik Hatcher
I haven't followed the details, but what I'm guessing you want here is Lucene's 
FieldCache.  Perhaps something along the lines of how faceting uses it (in 
SimpleFacets.java) -

   FieldCache.DocTermsIndex si = 
FieldCache.DEFAULT.getTermsIndex(searcher.getIndexReader(), fieldName);

Erik

On Aug 29, 2011, at 09:58 , Erick Erickson wrote:

> If you're asking whether there's a way to find, say,
> all the values for the "auth" field associated with
> a document... no. The nature of an inverted
> index makes this hard (think of finding all
> the definitions in a dictionary where the word
> "earth" was in the definition).
> 
> Best
> Erick
> 
> On Mon, Aug 29, 2011 at 9:21 AM, Jamie Johnson  wrote:
>> Thanks Erick, if I did not know the token up front that could be in
>> the index is there not an efficient way to get the field for a
>> specific document and do some custom processing on it?
>> 
>> On Mon, Aug 29, 2011 at 8:34 AM, Erick Erickson  
>> wrote:
>>> Start here I think:
>>> 
>>> http://lucene.apache.org/java/3_0_2/api/core/index.html?org/apache/lucene/index/TermDocs.html
>>> 
>>> Best
>>> Erick
>>> 
>>> On Mon, Aug 29, 2011 at 8:24 AM, Jamie Johnson  wrote:
 Thanks for the reply.  The fields I want are indexed, but how would I
 go directly at the fields I wanted?
 
 In regards to indexing the auth tokens I've thought about this and am
 trying to get confirmation if that is reasonable given our
 constraints.
 
 On Mon, Aug 29, 2011 at 8:20 AM, Erick Erickson  
 wrote:
> Yeah, loading the document inside a Collector is a
> definite no-no. Have you tried going directly
> at the fields you want (assuming they're
> indexed)? That *should* be much faster, but
> whether it'll be fast enough is a good question. I'm
> thinking some of the Terms methods here. You
> *might* get some joy out of making sure lazy
> field loading is enabled (and make sure the
> fields you're accessing for your logic are
> indexed), but I'm not entirely sure about
> that bit.
> 
> This kind of problem is sometimes handled
> by indexing "auth tokens" with the documents
> and including an OR clause on the query
> with the authorizations for a particular
> user, but that works best if there is an upper
> limit (in the 100s) of tokens that a user can possibly
> have, often this works best with some kind of
> grouping. Making this work when a user can
> have tens of thousands of auth tokens is...er...
> contra-indicated...
> 
> Hope this helps a bit...
> Erick
> 
> On Sun, Aug 28, 2011 at 11:59 PM, Jamie Johnson  wrote:
>> Just a bit more information.  Inside my class which extends
>> FilteredDocIdSet all of the time seems to be getting spent in
>> retrieving the document from the readerCtx, doing this
>> 
>> Document doc = readerCtx.reader.document(docid);
>> 
>> If I comment out this and just return true things fly along as I
>> expect.  My query is returning a total of 2 million documents also.
>> 
>> On Sun, Aug 28, 2011 at 11:39 AM, Jamie Johnson  
>> wrote:
>>> I have a need to post process Solr results based on some access
>>> controls which are setup outside of Solr, currently we've written
>>> something that extends SearchComponent and in the prepare method I'm
>>> doing something like this
>>> 
>>>QueryWrapperFilter qwf = new
>>> QueryWrapperFilter(rb.getQuery());
>>>Filter filter = new CustomFilter(qwf);
>>>FilteredQuery fq = new FilteredQuery(rb.getQuery(), 
>>> filter);
>>>rb.setQuery(fq);
>>> 
>>> Inside my CustomFilter I have a FilteredDocIdSet which checks if the
>>> document should be returned.  This works as I expect but for some
>>> reason is very very slow.  Even if I take out any of the machinery
>>> which does any logic with the document and only return true in the
>>> FilteredDocIdSets match method the query still takes an inordinate
>>> amount of time as compared to not including this custom filter.  So my
>>> question, is this the most appropriate way of handling this?  What
>>> should the performance out of such a setup be expected to be?  Any
>>> information/pointers would be greatly appreciated.
>>> 
>> 
> 
 
>>> 
>> 



Re: Solr custom plugins: is it possible to have them persistent?

2011-08-29 Thread Tomás Fernández Löbbe
Both of those features are needed at indexing time, right? If it is, the
best place to put it is on an UpdateRequestProcessor. See
http://wiki.apache.org/solr/UpdateRequestProcessor
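A bare-bones sketch of that shape (names invented; the point is that the
long-lived objects belong to the factory, while a new processor is created
per request and sees every document):

    import java.io.IOException;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class CategorizingProcessorFactory extends UpdateRequestProcessorFactory {

      // stand-in for the classifier4j object / MySQL connection: created once,
      // shared by every document that goes through the chain
      private final Object sharedCategorizer = new Object();

      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                                SolrQueryResponse rsp,
                                                UpdateRequestProcessor next) {
        return new CategorizingProcessor(sharedCategorizer, next);
      }

      static class CategorizingProcessor extends UpdateRequestProcessor {
        private final Object categorizer;

        CategorizingProcessor(Object categorizer, UpdateRequestProcessor next) {
          super(next);
          this.categorizer = categorizer;
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
          SolrInputDocument doc = cmd.getSolrInputDocument();
          // consult the shared categorizer / database and add fields to doc here
          super.processAdd(cmd); // pass the document on to the rest of the chain
        }
      }
    }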

Tomás

On Mon, Aug 29, 2011 at 11:06 AM, samuele.mattiuzzo wrote:

> I've posted a similar question few days ago, but our needs have gone a bit
> further.
>
> I need to develop two plugins which need to be persistent throu the whole
> indexing and updating process
>
> The first one need to open a connection to a mysql instance (and query that
> connection during every document processing)
>
> The second one uses a java library (classifier4j) which is a Bayesian
> categorization system (and doesn't talk to a db). This one learns while
> matching, so the object needs to be created at the very beginning of the
> indexing and should be available for all the documents (i cannot create the
> object for every object, because i'd miss the learning feature)
>
> For the first one i could use a dataimporthandler, but i'm not sure about
> it: i don't need to import the whole db, but just the occurencies matching
> a
> particular condition for each document. About the second one, i'm blind.
>
> Is there a place in solr where i can create the connection object and the
> categorizer object before everything else, and have them available to all
> documents?
>
> Thanks all in advance!
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-custom-plugins-is-it-possible-to-have-them-persistent-tp3292781p3292781.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: dependency injection in solr

2011-08-29 Thread Tomás Fernández Löbbe
You can do a lot of dependency injection through solrconfig.xml and
schema.xml: specify search components, update processors, filters,
similarity, etc. Solr doesn't use any DI framework; everything is built in
a pluggable manner. What kind of customizations do you need to apply?
Maybe we can point you in a better direction.

On Mon, Aug 29, 2011 at 10:20 AM, Federico Fissore  wrote:

> Hello everyone
>
> I need to hack solr by adding a couple custom search components.
> One small inconvenience is about configuring all the stuff. AFAIK
> solrconfig.xml is not a place where to do dependency injection, not yet at
> least.
>
> Have you ever had the need to use DI on a solr configuration? How have you
> managed it? Hard coding params in some delegate SearchComponent? Getting a
> reference of a spring application context via some static method? Any more
> elegant ways?
>
> thanks in advance
>
> federico
>


Solr custom plugins: is it possible to have them persistent?

2011-08-29 Thread samuele.mattiuzzo
I've posted a similar question a few days ago, but our needs have gone a bit
further.

I need to develop two plugins which need to be persistent through the whole
indexing and updating process.

The first one needs to open a connection to a MySQL instance (and query that
connection during every document's processing).

The second one uses a Java library (classifier4j) which is a Bayesian
categorization system (and doesn't talk to a db). This one learns while
matching, so the object needs to be created at the very beginning of the
indexing and should be available for all the documents (I cannot create a new
object for every document, because I'd lose the learning feature).

For the first one I could use a DataImportHandler, but I'm not sure about
it: I don't need to import the whole db, just the occurrences matching a
particular condition for each document. About the second one, I'm stuck.

Is there a place in Solr where I can create the connection object and the
categorizer object before everything else, and have them available to all
documents?

Thanks all in advance!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-custom-plugins-is-it-possible-to-have-them-persistent-tp3292781p3292781.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Post Processing Solr Results

2011-08-29 Thread Erick Erickson
If you're asking whether there's a way to find, say,
all the values for the "auth" field associated with
a document... no. The nature of an inverted
index makes this hard (think of finding all
the definitions in a dictionary where the word
"earth" was in the definition).

Best
Erick

On Mon, Aug 29, 2011 at 9:21 AM, Jamie Johnson  wrote:
> Thanks Erick, if I did not know the token up front that could be in
> the index is there not an efficient way to get the field for a
> specific document and do some custom processing on it?
>
> On Mon, Aug 29, 2011 at 8:34 AM, Erick Erickson  
> wrote:
>> Start here I think:
>>
>> http://lucene.apache.org/java/3_0_2/api/core/index.html?org/apache/lucene/index/TermDocs.html
>>
>> Best
>> Erick
>>
>> On Mon, Aug 29, 2011 at 8:24 AM, Jamie Johnson  wrote:
>>> Thanks for the reply.  The fields I want are indexed, but how would I
>>> go directly at the fields I wanted?
>>>
>>> In regards to indexing the auth tokens I've thought about this and am
>>> trying to get confirmation if that is reasonable given our
>>> constraints.
>>>
>>> On Mon, Aug 29, 2011 at 8:20 AM, Erick Erickson  
>>> wrote:
 Yeah, loading the document inside a Collector is a
 definite no-no. Have you tried going directly
 at the fields you want (assuming they're
 indexed)? That *should* be much faster, but
 whether it'll be fast enough is a good question. I'm
 thinking some of the Terms methods here. You
 *might* get some joy out of making sure lazy
 field loading is enabled (and make sure the
 fields you're accessing for your logic are
 indexed), but I'm not entirely sure about
 that bit.

 This kind of problem is sometimes handled
 by indexing "auth tokens" with the documents
 and including an OR clause on the query
 with the authorizations for a particular
 user, but that works best if there is an upper
 limit (in the 100s) of tokens that a user can possibly
 have, often this works best with some kind of
 grouping. Making this work when a user can
 have tens of thousands of auth tokens is...er...
 contra-indicated...

 Hope this helps a bit...
 Erick

 On Sun, Aug 28, 2011 at 11:59 PM, Jamie Johnson  wrote:
> Just a bit more information.  Inside my class which extends
> FilteredDocIdSet all of the time seems to be getting spent in
> retrieving the document from the readerCtx, doing this
>
> Document doc = readerCtx.reader.document(docid);
>
> If I comment out this and just return true things fly along as I
> expect.  My query is returning a total of 2 million documents also.
>
> On Sun, Aug 28, 2011 at 11:39 AM, Jamie Johnson  wrote:
>> I have a need to post process Solr results based on some access
>> controls which are setup outside of Solr, currently we've written
>> something that extends SearchComponent and in the prepare method I'm
>> doing something like this
>>
>>                    QueryWrapperFilter qwf = new
>> QueryWrapperFilter(rb.getQuery());
>>                    Filter filter = new CustomFilter(qwf);
>>                    FilteredQuery fq = new FilteredQuery(rb.getQuery(), 
>> filter);
>>                    rb.setQuery(fq);
>>
>> Inside my CustomFilter I have a FilteredDocIdSet which checks if the
>> document should be returned.  This works as I expect but for some
>> reason is very very slow.  Even if I take out any of the machinery
>> which does any logic with the document and only return true in the
>> FilteredDocIdSets match method the query still takes an inordinate
>> amount of time as compared to not including this custom filter.  So my
>> question, is this the most appropriate way of handling this?  What
>> should the performance out of such a setup be expected to be?  Any
>> information/pointers would be greatly appreciated.
>>
>

>>>
>>
>


RE: index full text as possible

2011-08-29 Thread Rode González
Hi again.

>In that case, you should be able to use a tokeniser to split
>the input into phrases, though you will probably need to write
>a custom tokeniser, depending on what characters you want to
>break phrases at. Please see
>http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

I have read this page but I didn't see anything there.
I thought there was already a filter implemented for this.


>It is also entirely possible to index the full text, and just do a
>phrase search later. This is probably the easiest option, unless
>you have a huge volume of text, and the volume of phrases to
>be indexed can be significantly lower.

How can I do that?

Thanks.
Rode.




Re: Post Processing Solr Results

2011-08-29 Thread Jamie Johnson
Thanks Erick, if I did not know the token up front that could be in
the index is there not an efficient way to get the field for a
specific document and do some custom processing on it?

On Mon, Aug 29, 2011 at 8:34 AM, Erick Erickson  wrote:
> Start here I think:
>
> http://lucene.apache.org/java/3_0_2/api/core/index.html?org/apache/lucene/index/TermDocs.html
>
> Best
> Erick
>
> On Mon, Aug 29, 2011 at 8:24 AM, Jamie Johnson  wrote:
>> Thanks for the reply.  The fields I want are indexed, but how would I
>> go directly at the fields I wanted?
>>
>> In regards to indexing the auth tokens I've thought about this and am
>> trying to get confirmation if that is reasonable given our
>> constraints.
>>
>> On Mon, Aug 29, 2011 at 8:20 AM, Erick Erickson  
>> wrote:
>>> Yeah, loading the document inside a Collector is a
>>> definite no-no. Have you tried going directly
>>> at the fields you want (assuming they're
>>> indexed)? That *should* be much faster, but
>>> whether it'll be fast enough is a good question. I'm
>>> thinking some of the Terms methods here. You
>>> *might* get some joy out of making sure lazy
>>> field loading is enabled (and make sure the
>>> fields you're accessing for your logic are
>>> indexed), but I'm not entirely sure about
>>> that bit.
>>>
>>> This kind of problem is sometimes handled
>>> by indexing "auth tokens" with the documents
>>> and including an OR clause on the query
>>> with the authorizations for a particular
>>> user, but that works best if there is an upper
>>> limit (in the 100s) of tokens that a user can possibly
>>> have, often this works best with some kind of
>>> grouping. Making this work when a user can
>>> have tens of thousands of auth tokens is...er...
>>> contra-indicated...
>>>
>>> Hope this helps a bit...
>>> Erick
>>>
>>> On Sun, Aug 28, 2011 at 11:59 PM, Jamie Johnson  wrote:
 Just a bit more information.  Inside my class which extends
 FilteredDocIdSet all of the time seems to be getting spent in
 retrieving the document from the readerCtx, doing this

 Document doc = readerCtx.reader.document(docid);

 If I comment out this and just return true things fly along as I
 expect.  My query is returning a total of 2 million documents also.

 On Sun, Aug 28, 2011 at 11:39 AM, Jamie Johnson  wrote:
> I have a need to post process Solr results based on some access
> controls which are setup outside of Solr, currently we've written
> something that extends SearchComponent and in the prepare method I'm
> doing something like this
>
>                    QueryWrapperFilter qwf = new
> QueryWrapperFilter(rb.getQuery());
>                    Filter filter = new CustomFilter(qwf);
>                    FilteredQuery fq = new FilteredQuery(rb.getQuery(), 
> filter);
>                    rb.setQuery(fq);
>
> Inside my CustomFilter I have a FilteredDocIdSet which checks if the
> document should be returned.  This works as I expect but for some
> reason is very very slow.  Even if I take out any of the machinery
> which does any logic with the document and only return true in the
> FilteredDocIdSets match method the query still takes an inordinate
> amount of time as compared to not including this custom filter.  So my
> question, is this the most appropriate way of handling this?  What
> should the performance out of such a setup be expected to be?  Any
> information/pointers would be greatly appreciated.
>

>>>
>>
>


dependency injection in solr

2011-08-29 Thread Federico Fissore

Hello everyone

I need to hack solr by adding a couple custom search components.
One small inconvenience is about configuring all the stuff. AFAIK 
solrconfig.xml is not a place where to do dependency injection, not yet 
at least.


Have you ever had the need to use DI on a solr configuration? How have 
you managed it? Hard coding params in some delegate SearchComponent? 
Getting a reference of a spring application context via some static 
method? Any more elegant ways?


thanks in advance

federico


Re: Does Solr flush to disk even before ramBufferSizeMB is hit?

2011-08-29 Thread Shawn Heisey

On 8/28/2011 11:18 PM, roz dev wrote:

I notice that even though InfoStream does not mention that data is being
flushed to disk, new segment files were created on the server.
Size of these files kept growing even though there was enough Heap available
and 856MB Ram was not even used.


With the caveat that I am not an expert and someone may correct me, I'll 
offer this:  It's been my experience that Solr will write the files that 
constitute stored fields as soon as they are available, because that 
information is always the same and nothing will change in those files 
based on the next chunk of data.


Thanks,
Shawn



Re: Error while decoding %DC (Ü) from URL - results in ?

2011-08-29 Thread François Schiettecatte
Merlin

Just to make sure I understand what is going on here, you are getting searches 
from external crawlers. These are coming in the form of an HTTP request I 
assume?

Have you checked the encoding specified in these requests (in the content type
header)? If the encoding is not specified then iso-8859-1 is usually assumed.
Also, have you checked the default encoding of your container? If you are using
Tomcat, that is set with URIEncoding on the connector, for example:

    <Connector ... URIEncoding="UTF-8"/>
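On the application side you can see the mismatch directly: the escaped bytes
only round-trip if they are decoded with the charset the client actually used.
A tiny plain-Java illustration (Merlin's code is PHP, but the idea is the same):

    import java.net.URLDecoder;

    public class DecodeDemo {
      public static void main(String[] args) throws Exception {
        // %DC is "Ü" only when treated as ISO-8859-1 ...
        System.out.println(URLDecoder.decode("%DCbersetzung", "ISO-8859-1"));
        // ... while UTF-8 uses the two-byte sequence %C3%9C for the same character
        System.out.println(URLDecoder.decode("%C3%9Cbersetzung", "UTF-8"));
      }
    }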

François

On Aug 28, 2011, at 3:10 PM, Merlin Morgenstern wrote:

> I double checked all code on that page and it looks like everything is in
> utf-8 and works just perfect. The problematic URLs are called always by bots
> like google bot. Looks like they are operating with a different encoding.
> The page itself has an utf-8 meta tag.
> 
> So it looks like I have to find a way that checks for the encoding and
> encodes apropriatly. this should be a common solr problem if all search
> engines treat utf-8 that way, right?
> 
> Any ideas how to fix that? Is there maybe a special solr functionality for
> this?
> 
> 2011/8/27 François Schiettecatte 
> 
>> Merlin
>> 
>> Ü encodes to two characters in utf-8 (C39C), and one in iso-8859-1 (%DC) so
>> it looks like there is a charset mismatch somewhere.
>> 
>> 
>> Cheers
>> 
>> François
>> 
>> 
>> 
>> On Aug 27, 2011, at 6:34 AM, Merlin Morgenstern wrote:
>> 
>>> Hello,
>>> 
>>> I am having problems with searches that are issued from spiders that
>> contain
>>> the ASCII encoded character "ü"
>>> 
>>> For example in : "Übersetzung"
>>> 
>>> The solr log shows following query request: /suche/%DCbersetzung
>>> which has been translated into solr query: q=?ersetzung
>>> 
>>> If you enter the search term directly as a user into the search box it
>> will
>>> result into:
>>> /suche/Übersetzung which returns perfect results.
>>> 
>>> I am decoding the URL within PHP: $term = trim(urldecode($q));
>>> 
>>> Somehow urldecode() translates the Character Ü (%DC) into a ? which is a
>>> illigeal first character in Solr.
>>> 
>>> I tried it without urldecode(), with rawurldecode() and with
>> utf8_decode()
>>> but all of those did not help.
>>> 
>>> Thank you for any help or hint on how to solve that problem.
>>> 
>>> Regards, Merlin
>> 
>> 



Re: Post Processing Solr Results

2011-08-29 Thread Erick Erickson
Start here I think:

http://lucene.apache.org/java/3_0_2/api/core/index.html?org/apache/lucene/index/TermDocs.html
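i.e. something along these lines (Lucene 3.x API; the "auth" field and token
are placeholders) to walk the postings of a single term without ever loading
stored documents:

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.util.OpenBitSet;

    public class AuthTokenDocs {
      // collect the internal ids of all docs that carry the given auth token
      public static OpenBitSet docsWithToken(IndexReader reader, String token)
          throws IOException {
        OpenBitSet allowed = new OpenBitSet(reader.maxDoc());
        TermDocs termDocs = reader.termDocs(new Term("auth", token));
        try {
          while (termDocs.next()) {
            allowed.set(termDocs.doc());
          }
        } finally {
          termDocs.close();
        }
        return allowed;
      }
    }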

Best
Erick

On Mon, Aug 29, 2011 at 8:24 AM, Jamie Johnson  wrote:
> Thanks for the reply.  The fields I want are indexed, but how would I
> go directly at the fields I wanted?
>
> In regards to indexing the auth tokens I've thought about this and am
> trying to get confirmation if that is reasonable given our
> constraints.
>
> On Mon, Aug 29, 2011 at 8:20 AM, Erick Erickson  
> wrote:
>> Yeah, loading the document inside a Collector is a
>> definite no-no. Have you tried going directly
>> at the fields you want (assuming they're
>> indexed)? That *should* be much faster, but
>> whether it'll be fast enough is a good question. I'm
>> thinking some of the Terms methods here. You
>> *might* get some joy out of making sure lazy
>> field loading is enabled (and make sure the
>> fields you're accessing for your logic are
>> indexed), but I'm not entirely sure about
>> that bit.
>>
>> This kind of problem is sometimes handled
>> by indexing "auth tokens" with the documents
>> and including an OR clause on the query
>> with the authorizations for a particular
>> user, but that works best if there is an upper
>> limit (in the 100s) of tokens that a user can possibly
>> have, often this works best with some kind of
>> grouping. Making this work when a user can
>> have tens of thousands of auth tokens is...er...
>> contra-indicated...
>>
>> Hope this helps a bit...
>> Erick
>>
>> On Sun, Aug 28, 2011 at 11:59 PM, Jamie Johnson  wrote:
>>> Just a bit more information.  Inside my class which extends
>>> FilteredDocIdSet all of the time seems to be getting spent in
>>> retrieving the document from the readerCtx, doing this
>>>
>>> Document doc = readerCtx.reader.document(docid);
>>>
>>> If I comment out this and just return true things fly along as I
>>> expect.  My query is returning a total of 2 million documents also.
>>>
>>> On Sun, Aug 28, 2011 at 11:39 AM, Jamie Johnson  wrote:
 I have a need to post process Solr results based on some access
 controls which are setup outside of Solr, currently we've written
 something that extends SearchComponent and in the prepare method I'm
 doing something like this

                    QueryWrapperFilter qwf = new
 QueryWrapperFilter(rb.getQuery());
                    Filter filter = new CustomFilter(qwf);
                    FilteredQuery fq = new FilteredQuery(rb.getQuery(), 
 filter);
                    rb.setQuery(fq);

 Inside my CustomFilter I have a FilteredDocIdSet which checks if the
 document should be returned.  This works as I expect but for some
 reason is very very slow.  Even if I take out any of the machinery
 which does any logic with the document and only return true in the
 FilteredDocIdSets match method the query still takes an inordinate
 amount of time as compared to not including this custom filter.  So my
 question, is this the most appropriate way of handling this?  What
 should the performance out of such a setup be expected to be?  Any
 information/pointers would be greatly appreciated.

>>>
>>
>


Re: Viewing the complete document from within the index

2011-08-29 Thread Erick Erickson
You can use Luke to re-construct the doc from
the indexed terms. It takes a while, because it's
not a trivial problem, so I'd use a small index for
verification first... If you have Luke show
you the doc, it'll return stored fields, but as I remember
there's a button like "reconstruct and edit" that does
what you want...

You can use the TermsComponent to see what's in
the inverted part of the index, but it doesn't tell
you which document is associated with the terms,
so might not help much.

But it seems you could do this empirically by
controlling the input to a small set of docs and then
querying on terms you *know* you didn't have in
the input but were in the synonyms

Best
Erick

On Mon, Aug 29, 2011 at 3:55 AM, pravesh  wrote:
> Reconstructing the document might not be possible, since only the stored
> fields are actually stored document-wise (un-inverted), whereas the
> indexed-only fields are stored in an inverted way.
> I don't think SOLR/Lucene currently provides any way for one to
> re-construct a document in the way you desire. (It's a sort of reverse
> engineering that isn't supported.)
>
> Thanx
> Pravesh
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Viewing-the-complete-document-from-within-the-index-tp3288076p3292111.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Post Processing Solr Results

2011-08-29 Thread Jamie Johnson
Thanks for the reply.  The fields I want are indexed, but how would I
go directly at the fields I wanted?

In regards to indexing the auth tokens I've thought about this and am
trying to get confirmation if that is reasonable given our
constraints.

On Mon, Aug 29, 2011 at 8:20 AM, Erick Erickson  wrote:
> Yeah, loading the document inside a Collector is a
> definite no-no. Have you tried going directly
> at the fields you want (assuming they're
> indexed)? That *should* be much faster, but
> whether it'll be fast enough is a good question. I'm
> thinking some of the Terms methods here. You
> *might* get some joy out of making sure lazy
> field loading is enabled (and make sure the
> fields you're accessing for your logic are
> indexed), but I'm not entirely sure about
> that bit.
>
> This kind of problem is sometimes handled
> by indexing "auth tokens" with the documents
> and including an OR clause on the query
> with the authorizations for a particular
> user, but that works best if there is an upper
> limit (in the 100s) of tokens that a user can possibly
> have, often this works best with some kind of
> grouping. Making this work when a user can
> have tens of thousands of auth tokens is...er...
> contra-indicated...
>
> Hope this helps a bit...
> Erick
>
> On Sun, Aug 28, 2011 at 11:59 PM, Jamie Johnson  wrote:
>> Just a bit more information.  Inside my class which extends
>> FilteredDocIdSet all of the time seems to be getting spent in
>> retrieving the document from the readerCtx, doing this
>>
>> Document doc = readerCtx.reader.document(docid);
>>
>> If I comment out this and just return true things fly along as I
>> expect.  My query is returning a total of 2 million documents also.
>>
>> On Sun, Aug 28, 2011 at 11:39 AM, Jamie Johnson  wrote:
>>> I have a need to post process Solr results based on some access
>>> controls which are setup outside of Solr, currently we've written
>>> something that extends SearchComponent and in the prepare method I'm
>>> doing something like this
>>>
>>>                    QueryWrapperFilter qwf = new
>>> QueryWrapperFilter(rb.getQuery());
>>>                    Filter filter = new CustomFilter(qwf);
>>>                    FilteredQuery fq = new FilteredQuery(rb.getQuery(), 
>>> filter);
>>>                    rb.setQuery(fq);
>>>
>>> Inside my CustomFilter I have a FilteredDocIdSet which checks if the
>>> document should be returned.  This works as I expect but for some
>>> reason is very very slow.  Even if I take out any of the machinery
>>> which does any logic with the document and only return true in the
>>> FilteredDocIdSets match method the query still takes an inordinate
>>> amount of time as compared to not including this custom filter.  So my
>>> question, is this the most appropriate way of handling this?  What
>>> should the performance out of such a setup be expected to be?  Any
>>> information/pointers would be greatly appreciated.
>>>
>>
>


Re: Post Processing Solr Results

2011-08-29 Thread Erick Erickson
Yeah, loading the document inside a Collector is a
definite no-no. Have you tried going directly
at the fields you want (assuming they're
indexed)? That *should* be much faster, but
whether it'll be fast enough is a good question. I'm
thinking some of the Terms methods here. You
*might* get some joy out of making sure lazy
field loading is enabled (and make sure the
fields you're accessing for your logic are
indexed), but I'm not entirely sure about
that bit.

This kind of problem is sometimes handled
by indexing "auth tokens" with the documents
and including an OR clause on the query
with the authorizations for a particular
user, but that works best if there is an upper
limit (in the 100s) of tokens that a user can possibly
have, often this works best with some kind of
grouping. Making this work when a user can
have tens of thousands of auth tokens is...er...
contra-indicated...

Hope this helps a bit...
Erick

On Sun, Aug 28, 2011 at 11:59 PM, Jamie Johnson  wrote:
> Just a bit more information.  Inside my class which extends
> FilteredDocIdSet all of the time seems to be getting spent in
> retrieving the document from the readerCtx, doing this
>
> Document doc = readerCtx.reader.document(docid);
>
> If I comment out this and just return true things fly along as I
> expect.  My query is returning a total of 2 million documents also.
>
> On Sun, Aug 28, 2011 at 11:39 AM, Jamie Johnson  wrote:
>> I have a need to post process Solr results based on some access
>> controls which are setup outside of Solr, currently we've written
>> something that extends SearchComponent and in the prepare method I'm
>> doing something like this
>>
>>                    QueryWrapperFilter qwf = new
>> QueryWrapperFilter(rb.getQuery());
>>                    Filter filter = new CustomFilter(qwf);
>>                    FilteredQuery fq = new FilteredQuery(rb.getQuery(), 
>> filter);
>>                    rb.setQuery(fq);
>>
>> Inside my CustomFilter I have a FilteredDocIdSet which checks if the
>> document should be returned.  This works as I expect but for some
>> reason is very very slow.  Even if I take out any of the machinery
>> which does any logic with the document and only return true in the
>> FilteredDocIdSets match method the query still takes an inordinate
>> amount of time as compared to not including this custom filter.  So my
>> question, is this the most appropriate way of handling this?  What
>> should the performance out of such a setup be expected to be?  Any
>> information/pointers would be greatly appreciated.
>>
>


Re: schema design question

2011-08-29 Thread Erick Erickson
I admit I just glanced at your problem statement, but
a couple of things come to mind...

1> have you looked at the "limited join" patch and would
that work?

2> try searching the list for "hierarchical", very similar
questions have been discussed before, although I
don't quite remember the answers

Best
Erick

On Sun, Aug 28, 2011 at 5:52 PM, Adeel Qureshi  wrote:
> Hi there
>
> I have a question regarding how to setup schema for some data. This data is
> basically parent-child data for different types of records .. so
>
> a bunch of records representing projects and subprojects where each
> subproject has a parent project .. and a project has many child sub projects
> another bunch of records reprensenting data for projects and linked projects
> .. same parent child relationship here
> another bunch representing project and linked people ..
>
> so there are two ways I was thinking this kind of data can be indexed
>
> 1. create a single store called lets say CollectionData. use dynamic fields
> to post all this different data but use a type field to identify the type of
> records . e.g. to post two docs one representing project->linkedproject and
> another project->linkedpeople info
>
> 
> 123
> LinkedProjects
> child project name
> child project status
> ...
> parent info
> ...
> 
>
> 
> 123
> LinkedPeople
> child person name
> ...
> parent info
> ...
> 
>
> now from the same store I can run queries to get the different data while
> restricting the resultset on one type of records using the fq param ..
>
> 2. approach would be to create multiple stores for each different type of
> records .. with pretty much the same schema but now we dont need the type
> field because linkedProjects are in a linkedProjects store and linkedPeople
> are in linkedPeople store .. only drawback i guess is that you could have a
> few stores
>
> my question to you guys is which approach makes more sense. I would
> appreciate any comments.
>
> Thanks
> Adeel
>


Re: Shingle and Query Performance

2011-08-29 Thread Erick Erickson
Oh, one other thing: have you profiled your machine
to see if you're swapping? How much memory are
you giving your JVM? What is the underlying
hardware setup?

Best
Erick

On Mon, Aug 29, 2011 at 8:09 AM, Erick Erickson  wrote:
> 200K docs and 36G index? It sounds like you're storing
> your documents in the Solr index. In and of itself, that
> shouldn't hurt your query times, *unless* you have
> lazy field loading turned off, have you checked that
> lazy field loading is enabled?
>
>
>
> Best
> Erick
>
> On Sun, Aug 28, 2011 at 5:30 AM, Lord Khan Han  
> wrote:
>> Another interesting thing is: all one word or more word queries including
>> phrase queries such as "barack obama"  slower in shingle configuration. What
>> i am doing wrong ? without shingle "barack obama" Querytime 300ms  with
>> shingle  780 ms..
>>
>>
>> On Sat, Aug 27, 2011 at 7:58 PM, Lord Khan Han 
>> wrote:
>>
>>> Hi,
>>>
>>> What is the difference between solr 3.3  and the trunk ?
>>> I will try 3.3  and let you know the results.
>>>
>>>
>>> Here the search handler:
>>>
>>> 
>>>      
>>>        explicit
>>>        10
>>>        
>>>  mrank:[0 TO 100]
>>>        explicit
>>>        10
>>>  edismax
>>>        
>>> title^1.05 url^1.2 content^1.7 m_title^10.0
>>>  
>>>  content^18.0 m_title^5.0
>>>  1
>>>  0
>>>  2<-25%
>>>  true
>>>  
>>> 5
>>>  subobjective
>>> false
>>>   
>>> 
>>>  true
>>>      
>>>
>>>
>>>
>>>
>>> On Sat, Aug 27, 2011 at 5:31 PM, Erik Hatcher wrote:
>>>
 I'm not sure what the issue could be at this point.   I see you've got
 qt=search - what's the definition of that request handler?

 What is the parsed query (from the debugQuery response)?

 Have you tried this with Solr 3.3 to see if there's any appreciable
 difference?

        Erik

 On Aug 27, 2011, at 09:34 , Lord Khan Han wrote:

 > When grouping off the query time ie 3567 ms  to 1912 ms . Grouping
 > increasing the query time and make useless to cache. But same config
 faster
 > without shingle still.
 >
 > We have and head to head test this wednesday tihs commercial search
 engine.
 > So I am looking for all suggestions.
 >
 >
 >
 > On Sat, Aug 27, 2011 at 3:37 PM, Erik Hatcher >>> >wrote:
 >
 >> Please confirm is this is caused by grouping.  Turn grouping off,
 what's
 >> query time like?
 >>
 >>
 >> On Aug 27, 2011, at 07:27 , Lord Khan Han wrote:
 >>
 >>> On the other hand We couldnt use the cache for below types queries. I
 >> think
 >>> its caused from grouping. Anyway we need to be sub second without
 cache.
 >>>
 >>>
 >>>
 >>> On Sat, Aug 27, 2011 at 2:18 PM, Lord Khan Han <
 khanuniver...@gmail.com
 >>> wrote:
 >>>
  Hi,
 
  Thanks for the reply.
 
  Here the solr log capture.:
 
  **
 
 
 >>
 hl.fragsize=100&spellcheck=true&spellcheck.q=X&group.limit=5&hl.simple.pre=&hl.fl=content&spellcheck.collate=true&wt=javabin&hl=true&rows=20&version=2&fl=score,approved,domain,host,id,lang,mimetype,title,tstamp,url,category&hl.snippets=3&start=0&q=%2B+-"X"+-"X"+-"XX"+-"XX"+-"XX"+-+-"XX"+-XXX+-"X"+-+-+-"X"+-"X"+-"X"+-+-""+-"X"+-"XX"+-"X"+-"XX"+-"XX"+-+-"X"+-"XX"+-+-"X"+-"X"+-X+-"X"+-"X"+-"X"+-"X"+-X+-"XX"+-"XX"+-XX+-X+-"X"+"X"+"X"+"XX"++&group.field=host&hl.simple.post=&group=true&qt=search&fq=mrank:[0+TO+100]&fq=word_count:[70+TO+*]
  **
 
   is the words. All phrases "x"  has two words inside.
 
  The timing from the DebugQuery:
 
  
  8654.0
  
  16.0
  
  16.0
  
  
  0.0
  
  
  0.0
  
  
  0.0
  
  
  0.0
  
  
  0.0
  
  
  0.0
  
  
  
  8638.0
  
  4473.0
  
  
  0.0
  
  
  0.0
  
  
  42.0
  
  
  0.0
  
  
  1.0
  
  
  4122.0
  
 
 
  The funny thing is if I removed the ShingleFilter from the below
 >> "sh_text"
  field and index normally  the query time is half of the current
 shingle
 >> one
  !. Shouldn't  be shingled index better for such heavy 2 word phrases
 >> search
  ? I am confused.
 
  On the other hand One of the on the shelf big FAT companies search
 >> engine
  doing the same query same machine 0.7 / 0.8 secs without cache . I am
  confident we can do better in solr but 

Re: Shingle and Query Performance

2011-08-29 Thread Erick Erickson
200K docs and 36G index? It sounds like you're storing
your documents in the Solr index. In and of itself, that
shouldn't hurt your query times, *unless* you have
lazy field loading turned off, have you checked that
lazy field loading is enabled?
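
For reference, the switch lives in the <query> section of solrconfig.xml:

<enableLazyFieldLoading>true</enableLazyFieldLoading>

With large stored documents and a short fl list, having it off means every
stored field gets loaded for every hit.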



Best
Erick

On Sun, Aug 28, 2011 at 5:30 AM, Lord Khan Han  wrote:
> Another insteresting thing is : all one word or more word queries including
> phrase queries such as "barack obama"  slower in shingle configuration. What
> i am doing wrong ? without shingle "barack obama" Querytime 300ms  with
> shingle  780 ms..
>
>
> On Sat, Aug 27, 2011 at 7:58 PM, Lord Khan Han wrote:
>
>> Hi,
>>
>> What is the difference between solr 3.3  and the trunk ?
>> I will try 3.3  and let you know the results.
>>
>>
>> Here the search handler:
>>
>> 
>>      
>>        explicit
>>        10
>>        
>>  mrank:[0 TO 100]
>>        explicit
>>        10
>>  edismax
>>        
>> title^1.05 url^1.2 content^1.7 m_title^10.0
>>  
>>  content^18.0 m_title^5.0
>>  1
>>  0
>>  2<-25%
>>  true
>>  
>> 5
>>  subobjective
>> false
>>   
>> 
>>  true
>>      
>>
>>
>>
>>
>> On Sat, Aug 27, 2011 at 5:31 PM, Erik Hatcher wrote:
>>
>>> I'm not sure what the issue could be at this point.   I see you've got
>>> qt=search - what's the definition of that request handler?
>>>
>>> What is the parsed query (from the debugQuery response)?
>>>
>>> Have you tried this with Solr 3.3 to see if there's any appreciable
>>> difference?
>>>
>>>        Erik
>>>
>>> On Aug 27, 2011, at 09:34 , Lord Khan Han wrote:
>>>
>>> > When grouping off the query time ie 3567 ms  to 1912 ms . Grouping
>>> > increasing the query time and make useless to cache. But same config
>>> faster
>>> > without shingle still.
>>> >
>>> > We have and head to head test this wednesday tihs commercial search
>>> engine.
>>> > So I am looking for all suggestions.
>>> >
>>> >
>>> >
>>> > On Sat, Aug 27, 2011 at 3:37 PM, Erik Hatcher >> >wrote:
>>> >
>>> >> Please confirm is this is caused by grouping.  Turn grouping off,
>>> what's
>>> >> query time like?
>>> >>
>>> >>
>>> >> On Aug 27, 2011, at 07:27 , Lord Khan Han wrote:
>>> >>
>>> >>> On the other hand We couldnt use the cache for below types queries. I
>>> >> think
>>> >>> its caused from grouping. Anyway we need to be sub second without
>>> cache.
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Sat, Aug 27, 2011 at 2:18 PM, Lord Khan Han <
>>> khanuniver...@gmail.com
>>> >>> wrote:
>>> >>>
>>>  Hi,
>>> 
>>>  Thanks for the reply.
>>> 
>>>  Here the solr log capture.:
>>> 
>>>  **
>>> 
>>> 
>>> >>
>>> hl.fragsize=100&spellcheck=true&spellcheck.q=X&group.limit=5&hl.simple.pre=&hl.fl=content&spellcheck.collate=true&wt=javabin&hl=true&rows=20&version=2&fl=score,approved,domain,host,id,lang,mimetype,title,tstamp,url,category&hl.snippets=3&start=0&q=%2B+-"X"+-"X"+-"XX"+-"XX"+-"XX"+-+-"XX"+-XXX+-"X"+-+-+-"X"+-"X"+-"X"+-+-""+-"X"+-"XX"+-"X"+-"XX"+-"XX"+-+-"X"+-"XX"+-+-"X"+-"X"+-X+-"X"+-"X"+-"X"+-"X"+-X+-"XX"+-"XX"+-XX+-X+-"X"+"X"+"X"+"XX"++&group.field=host&hl.simple.post=&group=true&qt=search&fq=mrank:[0+TO+100]&fq=word_count:[70+TO+*]
>>>  **
>>> 
>>>   is the words. All phrases "x"  has two words inside.
>>> 
>>>  The timing from the DebugQuery:
>>> 
>>>  
>>>  8654.0
>>>  
>>>  16.0
>>>  
>>>  16.0
>>>  
>>>  
>>>  0.0
>>>  
>>>  
>>>  0.0
>>>  
>>>  
>>>  0.0
>>>  
>>>  
>>>  0.0
>>>  
>>>  
>>>  0.0
>>>  
>>>  
>>>  0.0
>>>  
>>>  
>>>  
>>>  8638.0
>>>  
>>>  4473.0
>>>  
>>>  
>>>  0.0
>>>  
>>>  
>>>  0.0
>>>  
>>>  
>>>  42.0
>>>  
>>>  
>>>  0.0
>>>  
>>>  
>>>  1.0
>>>  
>>>  
>>>  4122.0
>>>  
>>> 
>>> 
>>>  The funny thing is if I removed the ShingleFilter from the below
>>> >> "sh_text"
>>>  field and index normally  the query time is half of the current
>>> shingle
>>> >> one
>>>  !. Shouldn't  be shingled index better for such heavy 2 word phrases
>>> >> search
>>>  ? I am confused.
>>> 
>>>  On the other hand One of the on the shelf big FAT companies search
>>> >> engine
>>>  doing the same query same machine 0.7 / 0.8 secs without cache . I am
>>>  confident we can do better in solr but couldnt find the way at the
>>> >> moment.
>>> 
>>>  thanks for helping..
>>> 
>>> 
>>> 
>>> 
>>>  On Sat, Aug 27, 2011 at 2:46 AM, Erik Hatcher <
>>> erik.hatc...@gmail.com
>>> >>> wrote:
>>> 
>>> >
>>> > On Aug 26, 2011, at 17:49 , Lord Khan Han wrote:
>>> >> We are indexing news  document from the various sites. Currently we
>>> >> have
>>> >> 200K docs indexed. Total index

Re: synonyms vs replacements

2011-08-29 Thread Erick Erickson
See here about the "multi word" problem
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

As for the rest, it's a tradeoff (surprise, surprise, surprise ).

You're right, expanding at index time leads to a somewhat
larger index, but less complex queries. And if you change
your synonyms file, you need to re-index from scratch

Indexing at query time lets you keep your synonyms up to
date. But the queries are more complex and somewhat
slower...

Which is "better" depends (tm), so pick your poison. One
strategy is to expand at index time, and *also* expand
at query time, but with a different synonym file. The idea
is that your query-time synonym file is the set of terms that
you want to add to your index-time expansion next
time you can re-index from scratch. Then periodically you
merge your query-time syns into your index-time syns, re-index
from scratch and empty your query-time syns. Rinse, repeat.

So, there isn't really a "right" answer. Personally I prefer to
expand at index time, but that's largely a preference.
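
For reference, both behaviors are expressed with the same synonyms.txt syntax
(a made-up two-line sample, not from your file):

# expansion: both spellings end up searchable
nunchuck, nunchuk
# replacement: the left-hand spelling is rewritten to the right-hand one
nunchuck => nunchuk

The comma form (with expand="true") adds all listed terms; the => form always
maps left to right.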

Best
Erick

On Fri, Aug 26, 2011 at 4:52 PM, Robert Petersen  wrote:
> Hello all,
>
>
>
> Which is better?   Say you add an index time synonym between nunchuck
> and nunchuk and then both words will be in the document and both will be
> searchable.   I can get the same exact behavior by putting an index time
> replacement of nunchuck => nunchuk and a search time replacement of the
> same.
>
>
>
> I figured the replacement strategy keeps the index size slightly
> smaller by only having the one term in the index, but the synonym
> strategy only requires you update the master, not the slave farm, and
> requires slightly less work for the searchers during a user query.  Are
> there any other considerations I should be aware of?
>
>
>
> Thanks
>
>
>
> BTW nunchuk is the correct spelling.  J
>
>
>
>
>
>


New to SOLR, installation issue

2011-08-29 Thread Stephen Lacy

Hi all,

Just started working with SOLR here.

I'm currently trying to replicate the live environment so I have a
better understanding of the system.

My first thought was that it would be so much easier if I stuck with
the binaries that are already in the ubuntu-server (not my choice)
package repos rather than compiling from source and having to manually
update each time.

However because the live environment uses a different version it's
harder to find out how it's configured.

The live is also multicore, which works fine on the dev, except that if there
is any issue with any configuration it simply displays the index.jsp
page on http://MYDEVSOLR:8080/solr as if it's running single core.

It's running under tomcat6 but the errors don't appear to be in the
tomcat6 logs

The current thing that is causing an issue is clustering. I pulled
down the 1.4.1 source code and followed the instructions on
http://wiki.apache.org/solr/ClusteringComponent to install the
clustering component, copying the jars from
contrib/clustering/lib and contrib/clustering/lib/downloads to
/usr/share/solr/WEB-INF/lib.
If I leave it there it throws an error when I go to
http://MYDEVSOLR:8080/solr saying that clustering component isn't
installed.
If I then add -Dsolr.clustering.enabled=true to $CATALINA_OPTS by
typing CATALINA_OPTS=" -Dsolr.clustering.enabled=true" at the bottom
of /etc/default/tomcat6 then it goes back to displaying as if there is
only one core.

Would appreciate any help or direction anyone can offer.
Thanks in advance

Stephen


Re: Solr Geodist

2011-08-29 Thread Erick Erickson
When you say you are using "LucidImagination", what is that?
The Dev version of LucidWorks? A certified distro (which I don't
think there are any for trunk)?

I'm using a recent (last week) trunk version that I built manually, but
I think this capability has been in trunk for a while.

Anyway, pasting in your field definition and query (from your
first post) returns the distance as a field just fine, so I suspect
something else you've changed somehow is interfering. Have
you re-indexed from scratch? I often delete the entire
data/index directory (the directory too!). Or, if you have an
earlier version than I think you do, have you looked at the JIRA
to see when it was applied to trunk?

One approach would be to just start with a stock Solr and try
the example, and then build up gradually to where you are now
to see what change you might have introduced is the problem,
but that's a guess...

Best
Erick

On Fri, Aug 26, 2011 at 3:37 PM, solrnovice  wrote:
> Eric, thanks for the quick response. I left out the "d" value, yes, when you
> perform a spatial query, we should have a distance of d>0, sorry about that.
>
> What is the setting of your "store" value, i mean in the schema, was it
> marked at LatLong. For some reason i dont see the geodist() being returned
> in the result set. my coordinates is setup as  "type=location", below is the
> snapshot from my schema.xml.
>
> <field name="coordinates" omitNorms="false" omitTermFreqAndPositions="true" stored="true"
> termVectors="false" type="location"/>
>
>
> We are using LucidImagination, so i guess it comes with Solr 4.0, please let
> me know if i am wrong. That may be the reason for geodist() not being
> returned.I checked the solr version by going to solr admin and checked the
> version. it shows 4.0.
>
> For now i found a work around, this works for me. the distance is returned
> in the form of "score".
>
> http://127.0.0.1:/solr/apex_dev/select/?q=*:*+_val_:%22geodist%28%29%22&rows=100&fq={!geofilt}&sfield=coordinates&pt=31.2225,-85.3931&d=50&sort=geodist%28%29%20asc&fl=*,score
>
> I read in a different post that , earlier versions of solr ( prior to 4.0),
> we have to use the score option.
>
> thanks for taking time to try the query.
>
>
> SN
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Geodist-tp3287005p3287806.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


RE: commas in synonyms.txt are not escaping

2011-08-29 Thread Moore, Gary
Hah, I knew it was something simple. :)  Thanks.
Gary

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Sunday, August 28, 2011 12:50 PM
To: solr-user@lucene.apache.org
Subject: Re: commas in synonyms.txt are not escaping

Turns out this isn't a bug - I was just tripped up by the analysis
changes to the example server.

Gary, you are probably just hitting the same thing.
The "text" fieldType is no longer used by any fields by default - for
example the "text" field uses the "text_general" fieldType.
This fieldType uses the standard tokenizer, which discards stuff like
commas (hence the synonym will never match).

-Yonik
http://www.lucidimagination.com


Re: index full text as possible

2011-08-29 Thread Gora Mohanty
2011/8/29 Rode González :
> Hi Gora.
>
> The phrases are separated by dots or commas (I think it's the easiest way to 
> do this).

In that case, you should be able to use a tokeniser to split
the input into phrases, though you will probably need to write
a custom tokeniser, depending on what characters you want to
break phrases at. Please see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

It is also entirely possible to index the full text, and just do a
phrase search later. This is probably the easiest option, unless
you have a huge volume of text, and the volume of phrases to
be indexed can be significantly lower.
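
For example, with the full text indexed into an ordinary tokenised field (say
a hypothetical field called "content"), a phrase query such as

q=content:"live near mountains"

only matches documents where those terms occur as a contiguous phrase.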

> The documents to index come from pdf (books scanned, other pdf docs) or other 
> binary docs that the /update/extract handler can manipulate.

Text PDFs should be fine if you are using Tika with Solr.
However, the PDF of scanned books will typically be a
set of images, and you would need to pre-process these
with some kind of OCR.

Regards,
Gora


RE: index full text as possible

2011-08-29 Thread Rode González
Hi Gora.

The phrases are separated by dots or commas (I think it's the easiest way to do 
this).

The documents to index come from pdf (books scanned, other pdf docs) or other 
binary docs that the /update/extract handler can manipulate.

Regards,
Rode.

-Mensaje original-
De: Gora Mohanty [mailto:g...@mimirtech.com] 
Enviado el: lunes, 29 de agosto de 2011 11:52
Para: solr-user@lucene.apache.org
Asunto: Re: index full text as possible

2011/8/29 Rode González :
> Hi all.
>
> I want to index/search from/about text books.
>
> p.e. i have the phrase "i live near mountains, but i also like the beach.".
>
> and the costumers could ask about "live near mountains". I want the exact 
> match, not the tokens matches separately.
>
> In the seach is easy, i search for the phrase... but in index time, how could 
> i index it (the complete text) like a text field in mysql? Is it possible?
>
> Possible solutions:
>
> the full text perhaps isn't possible, but index phrases using ...? what?
[...]

Your question is not completely clear: How are phrases to
be identified in the input text that is to be indexed? Also,
depending on the volume of text you have, and the available
hardware, it is quite possible to index the full text, but maybe
I am not following your intent here.

Regards,
Gora




Re: index full text as possible

2011-08-29 Thread Gora Mohanty
2011/8/29 Rode González :
> Hi all.
>
> I want to index/search from/about text books.
>
> p.e. i have the phrase "i live near mountains, but i also like the beach.".
>
> and the costumers could ask about "live near mountains". I want the exact 
> match, not the tokens matches separately.
>
> In the seach is easy, i search for the phrase... but in index time, how could 
> i index it (the complete text) like a text field in mysql? Is it possible?
>
> Possible solutions:
>
> the full text perhaps isn't possible, but index phrases using ...? what?
[...]

Your question is not completely clear: How are phrases to
be identified in the input text that is to be indexed? Also,
depending on the volume of text you have, and the available
hardware, it is quite possible to index the full text, but maybe
I am not following your intent here.

Regards,
Gora


RE: MongoDB and Solr Integration

2011-08-29 Thread Jagdish Kumar


 Hi Anshum
 
thanks for the link.. I could see some snippets in there .. can you help me out 
with a more elaborate explanation?
 
Thanks and regards
Jagdish

> From: ansh...@gmail.com
> Date: Mon, 29 Aug 2011 14:54:15 +0530
> Subject: Re: MongoDB and Solr Integration
> To: solr-user@lucene.apache.org
> 
> Hi Jagdeesh,
> 
> I wouldn't say that it will completely solve your problem, but I guess this
> blog post would help you get started.
> http://ai-cafe.blogspot.com/2011/08/indexing-mongodb-data-for-solr.html
> 
> --
> Anshum Gupta
> http://ai-cafe.blogspot.com
> 
> 
> On Mon, Aug 29, 2011 at 2:16 PM, Jagdish Kumar <
> jagdish.thapar...@hotmail.com> wrote:
> 
> >
> > Thanks Gora for ur reply .. I will follow these links and see if they work
> > for me ...
> >
> > Thanks and regards
> > Jagdish
> >
> >
> > > From: g...@mimirtech.com
> > > Date: Mon, 29 Aug 2011 13:23:19 +0530
> > > Subject: Re: MongoDB and Solr Integration
> > > To: solr-user@lucene.apache.org
> > >
> > > On Mon, Aug 29, 2011 at 1:01 PM, Jagdish Kumar
> > >  wrote:
> > > >
> > > > Hi Gora
> > > >
> > > > Nothing concreate I was able to get out of this query .. have you done
> > any such stuff on ur own?
> > > [...]
> > >
> > > Hmm, maybe there is nothing ready-made in the language that
> > > you want (which is that?), but surely this thread
> > >
> > http://markmail.org/message/6aqwjrqnwgn6whpw#query:mongodb%20oplog%20solr+page:1+mid:fir2ebruipdpxwrv+state:results
> > > which points to a thread discussing a Ruby implementation
> > >
> > http://groups.google.com/group/mongodb-user/browse_thread/thread/b4a2417288dabe97/49241abdcb4fc677
> > > is of help. There is also someone who has a PHP implementation,
> > > though they do say that it is slow:
> > > http://blog.knuthaugen.no/2010/04/cooking-with-mongodb-and-solr.html
> > >
> > > As the threads suggest, using Mongo's oplog to trigger Solr
> > > indexing seems to be the way to go:
> > > http://www.mongodb.org/display/DOCS/Replication+Internals
> > >
> > > We were supposed to do this for a client, but that project did
> > > not work out. Let me see if I can find some time to look at this.
> > >
> > > Regards,
> > > Gora
> >
> >
  

Re: solr UIMA exception

2011-08-29 Thread Tommaso Teofili
The UIMA AlchemyAPI annotator is failing for you due to an error on the server
side, and I think you should look at your Solr UIMA configuration, as it seems
you wanted to extract entities from this text:
"Senator Dick Durbin (D-IL)  Chicago , March
3,2007."
while the error says
"org.apache.solr.uima.processor.UIMAUpdateRequestProcessor processAdd
WARNING: skip the text processing due to null. id=12351,  text="TRIPOLI,
Libya - Fresh fighting erupted in Tripoli on Tuesday hours after Moammar
Gadhafi's son turn..."
Tommaso

2011/8/25 chanhangfai 

> Hi,
>
> I have followed this solr UIMA config, using AlchemyAPIAnnotator and
> OpenCalaisAnnotator.
>
> https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/README.txt
> http://wiki.apache.org/solr/SolrUIMA
>
>
> so, I got the AlchemyAPI key and OpenCalais key.
>
> and I can successfully hit
>
> http://access.alchemyapi.com/calls/url/URLGetRankedNamedEntities?apikey=my_alchemy_key&url=www.cnn.com
>
>
> but somehow, i got the following exception when I run
> *java -jar post.jar entity.xml*
>
> -
> this is entity.xml
>
> 
> 
> 12345
> Senator Dick Durbin (D-IL)  Chicago , March
> 3,2007.
> Entity Extraction
> 
> 
> -
>
> any suggestion, really appreciated..please..
>
>
>
> Aug 25, 2011 5:11:50 PM WhitespaceTokenizer process
> INFO: "Whitespace tokenizer starts processing"
> Aug 25, 2011 5:11:50 PM WhitespaceTokenizer process
> INFO: "Whitespace tokenizer finished processing"
> Aug 25, 2011 5:11:54 PM
> *org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl
> callAnalysisComponentProcess(405)*
> SEVERE: Exception occurred
> org.apache.uima.analysis_engine.AnalysisEngineProcessException
>at
>
> org.apache.uima.alchemy.annotator.AbstractAlchemyAnnotator.process(AbstractAlchemyAnnotator.java:138)
>at
>
> org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
>at
>
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
>at
>
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
>at
>
> org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
>at
>
> org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.(ASB_impl.java:409)
>at
>
> org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
>at
>
> org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
>at
>
> org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
>at
>
> org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:280)
>at
>
> org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processText(UIMAUpdateRequestProcessor.java:151)
>at
>
> org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:77)
>at
> org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:147)
>at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
>at
>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
>at
>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
>at
>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>at
>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>at
>
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>at
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>at
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>at
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>at
>
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>at
>
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>at
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>at org.mortbay.jetty.Server.handle(Server.java:326)
>at
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>at
>
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
>at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
>at org.mortbay.je

index full text as possible

2011-08-29 Thread Rode González
Hi all.

I want to index/search from/about text books.

e.g. i have the phrase "i live near mountains, but i also like the beach.".

and the customers could ask about "live near mountains". I want the exact 
match, not the tokens matched separately. 

In the search it is easy, i search for the phrase... but at index time, how could i 
index it (the complete text) like a text field in mysql? Is it possible? 
 
Possible solutions:

the full text perhaps isn't possible, but index phrases using ...? what?

thanks in advance.

Rode.




Re: MongoDB and Solr Integration

2011-08-29 Thread Anshum
Hi Jagdeesh,

I wouldn't say that it will completely solve your problem, but I guess this
blog post would help you get started.
http://ai-cafe.blogspot.com/2011/08/indexing-mongodb-data-for-solr.html

--
Anshum Gupta
http://ai-cafe.blogspot.com


On Mon, Aug 29, 2011 at 2:16 PM, Jagdish Kumar <
jagdish.thapar...@hotmail.com> wrote:

>
> Thanks Gora for ur reply .. I will follow these links and see if they work
> for me ...
>
> Thanks and regards
> Jagdish
>
>
> > From: g...@mimirtech.com
> > Date: Mon, 29 Aug 2011 13:23:19 +0530
> > Subject: Re: MongoDB and Solr Integration
> > To: solr-user@lucene.apache.org
> >
> > On Mon, Aug 29, 2011 at 1:01 PM, Jagdish Kumar
> >  wrote:
> > >
> > > Hi Gora
> > >
> > > Nothing concreate I was able to get out of this query .. have you done
> any such stuff on ur own?
> > [...]
> >
> > Hmm, maybe there is nothing ready-made in the language that
> > you want (which is that?), but surely this thread
> >
> http://markmail.org/message/6aqwjrqnwgn6whpw#query:mongodb%20oplog%20solr+page:1+mid:fir2ebruipdpxwrv+state:results
> > which points to a thread discussing a Ruby implementation
> >
> http://groups.google.com/group/mongodb-user/browse_thread/thread/b4a2417288dabe97/49241abdcb4fc677
> > is of help. There is also someone who has a PHP implementation,
> > though they do say that it is slow:
> > http://blog.knuthaugen.no/2010/04/cooking-with-mongodb-and-solr.html
> >
> > As the threads suggest, using Mongo's oplog to trigger Solr
> > indexing seems to be the way to go:
> > http://www.mongodb.org/display/DOCS/Replication+Internals
> >
> > We were supposed to do this for a client, but that project did
> > not work out. Let me see if I can find some time to look at this.
> >
> > Regards,
> > Gora
>
>


Re: what is scheduling ? why should we do this?how to achieve this ?

2011-08-29 Thread nagarjuna
Hi pravesh...
 i already saw the wiki page that you have given... from that i got
the points about collection distribution etc...
but i didn't get any link which explains the cron job process step by
step for the windows OS ..
can you please tell me how to do it for windows?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/what-is-scheduling-why-should-we-do-this-how-to-achieve-this-tp3287115p3292221.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: what is scheduling ? why should we do this?how to achieve this ?

2011-08-29 Thread pravesh
The Wiki link that you referred to is quite old and is not in active
development.
I would prefer OS-based scheduling using cron jobs. You can check the link
below.

http://wiki.apache.org/solr/CollectionDistribution

Thanx
Pravesh

--
View this message in context: 
http://lucene.472066.n3.nabble.com/what-is-scheduling-why-should-we-do-this-how-to-achieve-this-tp3287115p3292212.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to update solr cache when i delete records from remote database?

2011-08-29 Thread nagarjuna
hi pravesh.
 "You would have to delete them from SOLR also"...i did'nt get clearly this
can u please make sure it
i just need to update my solr cache with the database updation thts it...i
dont want delete anything from solrif i delete the data everytime when i
got changes in my DB then what is the use of powerful search server solr?  
am i making sense or not?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-update-solr-cache-when-i-delete-records-from-remote-database-tp3291879p3292203.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: MongoDB and Solr Integration

2011-08-29 Thread Jagdish Kumar

Thanks Gora for ur reply .. I will follow these links and see if they work for 
me ...
 
Thanks and regards
Jagdish
 

> From: g...@mimirtech.com
> Date: Mon, 29 Aug 2011 13:23:19 +0530
> Subject: Re: MongoDB and Solr Integration
> To: solr-user@lucene.apache.org
> 
> On Mon, Aug 29, 2011 at 1:01 PM, Jagdish Kumar
>  wrote:
> >
> > Hi Gora
> >
> > Nothing concreate I was able to get out of this query .. have you done any 
> > such stuff on ur own?
> [...]
> 
> Hmm, maybe there is nothing ready-made in the language that
> you want (which is that?), but surely this thread
> http://markmail.org/message/6aqwjrqnwgn6whpw#query:mongodb%20oplog%20solr+page:1+mid:fir2ebruipdpxwrv+state:results
> which points to a thread discussing a Ruby implementation
> http://groups.google.com/group/mongodb-user/browse_thread/thread/b4a2417288dabe97/49241abdcb4fc677
> is of help. There is also someone who has a PHP implementation,
> though they do say that it is slow:
> http://blog.knuthaugen.no/2010/04/cooking-with-mongodb-and-solr.html
> 
> As the threads suggest, using Mongo's oplog to trigger Solr
> indexing seems to be the way to go:
> http://www.mongodb.org/display/DOCS/Replication+Internals
> 
> We were supposed to do this for a client, but that project did
> not work out. Let me see if I can find some time to look at this.
> 
> Regards,
> Gora
  

Re: how to update solr cache when i delete records from remote database?

2011-08-29 Thread vighnesh

thanx for giving response.

which one should i delete from solr?

please explain clearly

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-update-solr-cache-when-i-delete-records-from-remote-database-tp3291879p3292194.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: what is scheduling ? why should we do this?how to achieve this ?

2011-08-29 Thread nagarjuna
Hi pravesh

   Thank you very much for your reply; now i am very clear about
scheduling... but i am unable to perform scheduling in solr... can you please
provide some sample of solr scheduling?



Thank u very much.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/what-is-scheduling-why-should-we-do-this-how-to-achieve-this-tp3287115p3292173.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to implement Spell Checker using Solr?

2011-08-29 Thread anupamxyz
The error I have been receiving after crawling using Solr is as mentioned
below: 

2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Basic Summarizer
Plug-in (summary-basic)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Site Query 
Filter
(query-site)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Http / Https
Protocol Plug-in (protocol-httpclient)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - HTTP Framework
(lib-http)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Pass-through URL
Normalizer (urlnormalizer-pass)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Regex URL Filter
(urlfilter-regex)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - XML Response 
Writer
Plug-in (response-xml)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Anchor Indexing
Filter (index-anchor)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - URL Query Filter
(query-url)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Regex URL Filter
Framework (lib-regex-filter)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - JSON Response
Writer Plug-in (response-json)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Registered
Extension-Points:
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Nutch Field 
Filter
(org.apache.nutch.indexer.field.FieldFilter)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - HTML Parse 
Filter
(org.apache.nutch.parse.HtmlParseFilter)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Nutch Query 
Filter
(org.apache.nutch.searcher.QueryFilter)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Nutch Search
Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Nutch Online 
Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2011-08-24 15:47:56,225 INFO  plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2011-08-24 15:47:56,241 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2011-08-24 15:47:56,241 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2011-08-24 15:47:57,366 WARN  mapred.LocalJobRunner - job_local_0001
org.apache.solr.common.SolrException: Internal Server Error

Internal Server Error

request: http://localhost:7001/solr/update?wt=javabin&version=2.2
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:343)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
at
org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:217)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:48)
at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:69)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:447)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2011-08-24 15:47:57,882 FATAL solr.SolrIndexer - SolrIndexer:
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
at 
org.apache.nutch.indexer.solr.SolrIndexer.indexSolr(SolrIndexer.java:73)

Re: Viewing the complete document from within the index

2011-08-29 Thread pravesh
Reconstructing the document might not be possible, since only the stored
fields are actually stored document-wise (un-inverted), whereas the
indexed-only fields are kept in inverted form.
I don't think SOLR/Lucene currently provides any way one can
re-construct a document in the way you desire. (It's a sort of reverse
engineering that isn't supported.)

Thanx
Pravesh

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Viewing-the-complete-document-from-within-the-index-tp3288076p3292111.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: MongoDB and Solr Integration

2011-08-29 Thread Gora Mohanty
On Mon, Aug 29, 2011 at 1:01 PM, Jagdish Kumar
 wrote:
>
> Hi Gora
>
> Nothing concreate I was able to get out of this query .. have you done any 
> such stuff on ur own?
[...]

Hmm, maybe there is nothing ready-made in the language that
you want (which is that?), but surely this thread
http://markmail.org/message/6aqwjrqnwgn6whpw#query:mongodb%20oplog%20solr+page:1+mid:fir2ebruipdpxwrv+state:results
which points to a thread discussing a Ruby implementation
http://groups.google.com/group/mongodb-user/browse_thread/thread/b4a2417288dabe97/49241abdcb4fc677
is of help. There is also someone who has a PHP implementation,
though they do say that it is slow:
http://blog.knuthaugen.no/2010/04/cooking-with-mongodb-and-solr.html

As the threads suggest, using Mongo's oplog to trigger Solr
indexing seems to be the way to go:
http://www.mongodb.org/display/DOCS/Replication+Internals

We were supposed to do this for a client, but that project did
not work out. Let me see if I can find some time to look at this.
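
In the meantime, the simplest thing that works is a one-shot bulk pass with
the MongoDB Java driver plus SolrJ (database, collection and field names below
are made up; this is the brute-force variant, not the oplog-tailing one):

import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.Mongo;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MongoToSolr {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017);
        DBCollection articles = mongo.getDB("mydb").getCollection("articles");
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        DBCursor cursor = articles.find();
        while (cursor.hasNext()) {
            DBObject o = cursor.next();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", o.get("_id").toString()); // ObjectId as Solr uniqueKey
            doc.addField("title", o.get("title"));
            doc.addField("body", o.get("body"));
            solr.add(doc);
        }
        solr.commit();
    }
}

Tailing the oplog, as in the links above, is what turns this into incremental
indexing.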

Regards,
Gora


Re: how i am getting data in my search field eventhough i removed data in my remote database?

2011-08-29 Thread pravesh
http://lucene.472066.n3.nabble.com/how-to-update-solr-cache-when-i-delete-records-from-remote-database-td3291879.html
 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-i-am-getting-data-in-my-search-field-eventhough-i-removed-data-in-my-remote-database-tp3289008p3292095.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to update solr cache when i delete records from remote database?

2011-08-29 Thread pravesh
You would have to delete them from SOLR also, and then commit it (commit will
automatically refresh your caches).
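
In SolrJ terms that is just (assuming solr is an already-built SolrServer for
the same core, and the ids are hypothetical):

solr.deleteById("101");                 // delete one document by its uniqueKey
// or for a whole set: solr.deleteByQuery("id:(101 OR 102)");
solr.commit();                          // opens a new searcher, so the caches are rebuilt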

Thanx
Pravesh

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-update-solr-cache-when-i-delete-records-from-remote-database-tp3291879p3292074.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: MongoDB and Solr Integration

2011-08-29 Thread Jagdish Kumar

Hi Gora
 
Nothing concrete I was able to get out of this query .. have you done any such 
stuff on your own?
 
Thanks and regards
Jagdish
 

> From: g...@mimirtech.com
> Date: Mon, 29 Aug 2011 12:47:21 +0530
> Subject: Re: MongoDB and Solr Integration
> To: solr-user@lucene.apache.org
> 
> On Mon, Aug 29, 2011 at 12:38 PM, Jagdish Kumar
>  wrote:
> >
> > Hi
> >
> > I need to integrate MongoDB with Solr, can anyone please help me out with 
> > this as I m not able to find any relevant information on net.
> [...]
> 
> The links from http://lmgtfy.com/?q=mongodb+solr do not help?
> 
> Regards,
> Gora
  

Re: what is scheduling ? why should we do this?how to achieve this ?

2011-08-29 Thread pravesh
SCHEDULING, in OS terminology, is when you specify cron jobs on linux/unix
machines (and scheduled tasks on windows machines).
Whatever task you schedule, along with a time/date or interval, will
be automatically invoked, so you don't have to manually log into the
machine and call the script/batch.

SOLR scheduling is the same idea, but with an internal mechanism provided by SOLR
to set a schedule that automatically invokes delta-import, full-import, commit,
etc. This would help because you're not dependent on the OS level, where
different OS's have to be scheduled differently (cron/scheduled tasks).
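
For example, an OS-level schedule usually just hits the DataImportHandler URL
on a timer. On linux/unix that is a single crontab line (host and core name
below are made up); on Windows the same command goes into a Task Scheduler
entry:

# every 15 minutes, fire a DIH delta-import
*/15 * * * * curl -s "http://localhost:8983/solr/mycore/dataimport?command=delta-import" > /dev/null 2>&1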

Thanx
Pravesh

--
View this message in context: 
http://lucene.472066.n3.nabble.com/what-is-scheduling-why-should-we-do-this-how-to-achieve-this-tp3287115p3292068.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: MongoDB and Solr Integration

2011-08-29 Thread Gora Mohanty
On Mon, Aug 29, 2011 at 12:38 PM, Jagdish Kumar
 wrote:
>
> Hi
>
> I need to integrate MongoDB with Solr, can anyone please help me out with 
> this as I m not able to find any relevant information on net.
[...]

The links from http://lmgtfy.com/?q=mongodb+solr do not help?

Regards,
Gora


MongoDB and Solr Integration

2011-08-29 Thread Jagdish Kumar

Hi
 
I need to integrate MongoDB with Solr, can anyone please help me out with this 
as I m not able to find any relevant information on net.
 
Thanks and regards
Jagdish