Re: Search results after importing from Dih

2010-08-26 Thread hemant.verma

Check your index folder: does it contain files other than segment files?
If yes, your data is in the index, and you need to commit it.
Try restarting your Solr.
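(For reference, a commit can be issued over HTTP by posting a `<commit/>` message to the update handler; the URL below assumes the default example setup, so adjust host/port/path for your install:

```xml
<commit/>
```

e.g. sent as the body of a POST to http://localhost:8983/solr/update with Content-Type text/xml.)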
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-results-after-importing-from-Dih-tp1365720p1366104.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Search results after importing from Dih

2010-08-26 Thread Grijesh.singh

Have you committed the data?

Use a *:* query to see whether the data has been committed yet.
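(For example, with the default example setup -- adjust core/host as needed:

```
http://localhost:8983/solr/select?q=*:*&rows=0
```

the numFound attribute in the response tells you how many documents are visible to searchers after the last commit.)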
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-results-after-importing-from-Dih-tp1365720p1365927.html
Sent from the Solr - User mailing list archive at Nabble.com.


Broken links in Solr FAQ's "Why don't International Characters Work?"

2010-08-26 Thread Teruhiko Kurosaka
In
http://wiki.apache.org/solr/FAQ#Why_don.27t_International_Characters_Work.3F
These three links are broken.

http://www.nabble.com/International-Charsets-in-embedded-XML-tf1780147.html#a4897795
 (International Charsets in embedded XML  for Jetty 5.1)

http://www.nabble.com/Problem-with-surrogate-characters-in-utf-8-tf3920744.html
 (Problem with surrogate characters in utf-8  for Jetty 6)

http://lucene.apache.org/solr/api/org/apache/solr/util/AbstractSolrTestCase.html
 (AbstractSolrTestCase)

Does anyone care to fix these?

I am guessing that AbstractSolrTestCase is obsolete in the dev version.
Is there any other method to isolate a Solr bug from an application server bug?

(I think Solr API doc should be versioned like Lucene(Java) API docs, BTW.)


T. "Kuro" Kurosaka



Search results after importing from Dih

2010-08-26 Thread Pavan Gupta
Hi,
I was able to successfully index the rows of a simple MySQL table using
DIH. However, when I tried searching for the indexed data using the Solr admin
interface, no results based on the table data were displayed. Any idea why?
Regards,
Pavan


Re: Doing Shingle but also keep special single word

2010-08-26 Thread 朱炎詹

Thanks! It seems that I really went in the wrong direction.

- Original Message - 
From: "Ahmet Arslan" 

To: 
Sent: Tuesday, August 24, 2010 4:21 PM
Subject: Re: Doing Shingle but also keep special single word



The request is from our business
team, they wish user of our product can
type in partial string of a word that exists in title or
body field. But now
I also doubt if this request is really necessary?


"partial string of a word"? I think there is a misunderstanding here. 
ShingleFilter operates at the token level.


please divide this text => "please divide", "divide this", "this text"

If you want partial strings of a single word, then EdgeNGramFilter and 
NGramFilter are used for that purpose.
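(As a sketch, an EdgeNGram-based analyzer for matching a leading partial string of each word might look like this in schema.xml -- the field type name and gram sizes here are just illustrative choices:

```xml
<fieldType name="text_prefix" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index every leading 2..15 character prefix of each token -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- query side stays un-grammed so the typed prefix matches indexed grams -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```
)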
















Re: solr working...

2010-08-26 Thread satya swaroop
Hi all,

  Thanks for your response and information. I used slf4j logging and put a
log.info call in every class of the Solr module to see which classes get
invoked for a particular request handler or on Solr startup. I was able to
do this only in the Solr module, not in the Lucene module... I get an error when I
use it in that module. Can anyone suggest other ways like this to trace the
path through Solr?

Regards,
  satya


Re: Search Results optimization

2010-08-26 Thread Chris Hostetter
: 
: if user searches for "swingline red stapler hammer hand rigid", then
: documents that matches max number of words written in query should come
: first
: e.g a document with name field as "swingline stapler" should come later than
: the document with "swingline red stapler"

at a fundamental level, Lucene searches already work this way.

A "BooleanQuery" containing 6 optional clauses from "swingline red stapler 
hammer hand rigid" will have a coord factor of 3/6 against a document with 
the value "swingline red stapler" and a coord factor of 2/6 against a 
document with the value "swingline stapler".

that said: other factors come into play, notably the fieldNorm, which 
by default gives higher scores to shorter documents.  IDF and TF also come 
into play typically, but in your example they shouldn't affect much, since 
the query doesn't change and the documents contain each term at most 
once.

Bottom line: you need to look at the score explanations to understand why 
documents aren't showing up in the order that you want, and then you need 
to consider what settings you can change to get the behavior you want 
(ie: perhaps you need to omitNorms so shorter fields don't get a boost?)

Without seeing your score explanations, we're just guessing at things 
that might help.
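(For reference, norms can be turned off per field in schema.xml, e.g. -- the field name here is illustrative:

```xml
<field name="name" type="text" indexed="true" stored="true" omitNorms="true"/>
```

note that documents must be reindexed for a norms change to take effect.)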

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Delete by query issue

2010-08-26 Thread Chris Hostetter
: Here's the problem: the standard Solr parser is a little weird about
: negative queries. The way to make this work is to say
: *:* AND -field:[* TO *]

the default parser actually works ok ... it's a bug specific to 
deletion...
https://issues.apache.org/jira/browse/SOLR-381
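(So until that bug is fixed, the workaround is to phrase the delete query so it doesn't start with a pure negative clause, e.g. -- substitute your real field name for "field":

```xml
<delete><query>*:* AND -field:[* TO *]</query></delete>
```
)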


-Hoss




Re: Is there any stress test tool for testing Solr?

2010-08-26 Thread Chris Hostetter

: References: 
: 
: In-Reply-To: 
: Subject: Is there any strss test tool for testing Solr?

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss




Re: XSL import/include relative to app server home directory...

2010-08-26 Thread Chris Hostetter

Brian: I think the problem you are encountering is similar to this 
issue...

  https://issues.apache.org/jira/browse/SOLR-1656

...if you have any thoughts on whether the patch/ideas in that issue would 
also solve the problem you are looking at, please post a comment.


-Hoss




Re: Field getting tokenized prior to charFilter on select query

2010-08-26 Thread Chris Hostetter

You are seeing the effects of the default QueryParser.

whitespace (like '+', '-', '"', '*', etc...) is a "special character" to the 
Lucene QueryParser.  Un-escaped/unquoted whitespace tells the query parser 
to construct a BooleanQuery containing multiple clauses -- each clause is 
analyzed separately.

To have the entire input passed to the Analyzer as a single string, you 
would either quote it, or use a different QParser such as the "field" 
QParser...

http://wiki.apache.org/solr/SolrQuerySyntax#Other_built-in_useful_query_parsers
http://lucene.apache.org/solr/api/org/apache/solr/search/FieldQParserPlugin.html
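(For example, the field QParser can be selected with local-params syntax, which passes the whole value to the field's analyzer as one string -- the field name here is illustrative:

```
q={!field f=mytextfield}A & B
```
)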

: entire field was provided in a single call.  When the query invoked 
: PatternReplaceCharFilterFactory.create(CharStream) it invoked it 3 times 
: with 3 seperate tokens (A, &, B).  Because of this the regex won't ever 
: locate the full string in the field.

-Hoss



Re: Query speed decreased dramatically, not sure why though?

2010-08-26 Thread Chris Hostetter

: ".../select?q=*&sort=evalDate+desc,score+desc&start=0&rows=10"
: 
: This query takes around 5 seconds to complete.
:
: I changed the query to the following;
: 
: ".../select?q=[* TO NOW]&sort=evalDate+desc,score+desc&start=0&rows=10"
: 
: The query now returns in around 600 milliseconds.
: 
: Can any one help with explaining why [* TO NOW] has had such a big effect?

Short answer: because these are completely different queries, and match 
completely different sets of documents, using completely different logic.

"q=*" means "a prefix query against the default search field where the 
prefix is the empty string" -- so assuming your default search 
field is called "text", q=* is equivalent to q=text:*

that will match all docs that have some value indexed in the text field.

finding all of those docs means iterating over every term in that field, 
and tracking every document associated with those values -- the default 
search field typically has *lots* of indexed terms.

conversely, "q=[* TO NOW]" means "a range query of the default search 
field for all terms that are less than 'NOW'".   for a date field "NOW" 
is converted to the current time, but unless your default search field is 
a date field that's not relevant.  more than likely "NOW" is just 
interpreted as the string "NOW" (or maybe "now" if your default search 
field is lowercased) and that query just matches all the terms that sort 
alphabetically before it.

the terms sorting before "NOW" are only a subset of all terms, so the 
second query is basically guaranteed to be faster than the first ... and 
most likely matches a much smaller set of documents.
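(as an aside: if the intent was simply to match all documents, the dedicated match-all syntax is

```
q=*:*
```

which parses to a MatchAllDocsQuery and doesn't need to iterate terms, so it is typically much cheaper than q=* )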


-Hoss




Re: Solr Admin Schema Browser and field named "keywords"

2010-08-26 Thread Chris Hostetter

:  I have a field named "keywords" in my index.  The schema browser page is not
: able to deal with this, so I have trouble getting statistical information on
: this field.  When I click on the field, Firefox hangs for a minute and then
: gives the "unresponsive script" warning.  I assume (without actually checking)
: that this is due to "keywords" being already used for something in the
: javascript code.

doubtful.

I suspect it has more to do with the amount of data in your keywords 
field, and the underlying request to the LukeRequestHandler timing out.

Have you tried using it with a test index where the "keywords" 
field has only a few words in it?

: Related to this, would it be difficult to make this feature display something
: like a status bar when it is first grabbing information, indicating how many
: fields there are and which one it's working on at the moment?  It takes a few
: minutes for it to load on my indexes, so some indication of how far along it
: is would be very nice.

Yep Yep, there is a bug tracking these kinds of improvements...

https://issues.apache.org/jira/browse/SOLR-1931

...if you have some javascript expertise and would like to help out, 
patches would be welcome. (the LukeRequestHandler used under the covers 
can return stats about each field one at a time, but the javascript 
in the schema browser doesn't currently use it that way)
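(For example, stats for just one field can be requested directly from the LukeRequestHandler -- the URL assumes the default example setup:

```
http://localhost:8983/solr/admin/luke?fl=keywords&numTerms=10
```
)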


-Hoss




Re: Creating new Solr cores using relative paths

2010-08-26 Thread Chris Hostetter

: http://localhost:8080/solr/admin/cores
: ?action=CREATE
: &name=core1
: &instanceDir=core1
: &config=core0/conf/solrconfig.xml
: &schema=core0/conf/schema.xml
: (core1 is the name for the new core to be created, and I want to use the
: config and schema from core0 to create the new core).
: 
: but the error is always due to the servlet container thinking
: $TOMCAT_HOME/bin is the current working directory:
: Caused by: java.lang.RuntimeException: *Can't find resource
: 'core0/conf/solrconfig.xml'* in classpath or '/opt/solr/core1/conf/', *
: cwd=/opt/tomcat/bin


A major pain point in the Solr code base is that various pieces of code 
assume paths are relative to various other paths -- frequently the CWD -- 
and it's not always clear what the right "fix" is (ie: when should it be 
the SolrHome dir? when should it be the instanceDir for the current core? 
when should it be the conf dir for the current core? etc...)

Even if we were starting from scratch right now, it's not clear how paths 
like the "schema" and "config" params should be interpreted in a CREATE 
command (should they be relative to the SolrHome? or relative to the 
"instanceDir" param?)

Even in cases where it should be obvious, and it would be nice to "fix" 
these paths, it could easily break things for existing users with 
code/configs that have come to expect the particular eccentricities (in 
"read only" cases, we can check multiple paths, but that doesn't work for 
files/dirs that Solr is being asked to create)

The easiest workaround: configure tomcat to use your SolrHome dir (ie: 
"/opt/solr" in your example) as its working directory.
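(one way to do that -- a sketch, using the paths from the example above -- is simply to cd to the Solr home before launching tomcat in whatever startup script you use:

```
cd /opt/solr
$TOMCAT_HOME/bin/startup.sh
```
)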

-Hoss




Re: spellcheck index blown away during rebuild

2010-08-26 Thread Chris Hostetter

: What you're talking about is effectively promoting the spellcheck
: index to a first-class Solr index, instead of an appendage bolted on
: the side of an existing core. Given sharding and distributed search,
: this may be a better design.

even w/o promoting the spell index to be a "main" index, it still seems 
like the "rebuild" aspect of the SpellCheck component could be improved to 
take advantage of regular Lucene IndexReader semantics: don't reopen the 
reader used to serve SpellCheckComponent requests until the "new" index is 
completely built.

I'm actually really surprised that it doesn't work that way right now -- 
but i imagine this has to do with the way the SpellCheckComponent deals 
with the SpellChecker abstraction that hides the index -- still, it seems 
like there's room for improvement there.


-Hoss




Status of Solr in the cloud?

2010-08-26 Thread Charlie Jackson
There seem to be a few parallel efforts at putting Solr in a cloud
configuration. See http://wiki.apache.org/solr/KattaIntegration, which
is based off of https://issues.apache.org/jira/browse/SOLR-1395. Also
http://wiki.apache.org/solr/SolrCloud which is
https://issues.apache.org/jira/browse/SOLR-1873. And another JIRA:
https://issues.apache.org/jira/browse/SOLR-1301. 

 

These all seem aimed at the same goal, correct? I'm interested in
evaluating one of these solutions for my company; which is the most
stable or most likely to eventually be part of the Solr distribution?

 

 

Thanks,

Charlie



Re: Search Results optimization

2010-08-26 Thread Rob Casson
you might find these helpful...a similar question came up last week:

 http://ln-s.net/7WpX
 
http://robotlibrarian.billdueber.com/solr-forcing-items-with-all-query-terms-to-the-top-of-a-solr-search/

not exactly the same, as that case wanted to boost if *every* term
matched, but a similar tactic may work. Hope that helps,
rob

On Thu, Aug 26, 2010 at 4:02 PM, Hasnain  wrote:
>
> perhaps i wasnt clear in my earlier post
>
> if user searches for "swingline red stapler hammer hand rigid", then
> documents that matches max number of words written in query should come
> first
> e.g a document with name field as "swingline stapler" should come later than
> the document with "swingline red stapler"
>
> any suggestions how to achieve this functionality?
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Search-Results-optimization-tp1129374p1359916.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Search Results optimization

2010-08-26 Thread Hasnain

perhaps I wasn't clear in my earlier post

if a user searches for "swingline red stapler hammer hand rigid", then
documents that match the max number of words written in the query should come
first, e.g. a document with a name field of "swingline stapler" should come
later than the document with "swingline red stapler"

any suggestions on how to achieve this functionality?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-Results-optimization-tp1129374p1359916.html
Sent from the Solr - User mailing list archive at Nabble.com.


Custom filter implementation, advice needed

2010-08-26 Thread Ingo Renner
Hi *,

I implemented a custom filter and am using it through a QParserPlugin. I'm 
wondering however, whether my implementation is that clever yet...

Here's my QParser; I'm wondering whether I should apply the filter to all 
documents in the index (I already guess it's a bad idea) or whether I should 
use the query as provided by the already available query parser, see the 
parse() method.



    public AccessFilterQParser(String qstr, SolrParams localParams,
            SolrParams params, SolrQueryRequest req) {
        super(qstr, localParams, params, req);

        QParserPlugin parserPlugin =
            req.getCore().getQueryPlugin(QParserPlugin.DEFAULT_QTYPE);
        QParser parser = parserPlugin.createParser(qstr, localParams,
            params, req);

        try {
            preConstructedQuery = parser.getQuery();
        } catch (ParseException e) {
            throw new RuntimeException(e);
        }

        String fieldName = localParams.get(AccessParams.ACCESS_FIELD,
            "access");

        this.accessFilter = new AccessFilter(fieldName, qstr);
    }

    @Override
    public Query parse() throws ParseException {

        Query allDocs = new MatchAllDocsQuery();

        // return new FilteredQuery(allDocs, accessFilter);
        return new FilteredQuery(preConstructedQuery, accessFilter);
    }


I'd be happy about any advice...


best
Ingo


-- 
Ingo Renner
TYPO3 Core Developer, Release Manager TYPO3 4.2, Admin Google Summer of Code

TYPO3 Enterprise Content Management System
http://typo3.org

Apache Solr for TYPO3 - Enterprise Search meets Enterprise Content Management
http://www.typo3-solr.com








RE: how to deal with virtual collection in solr?

2010-08-26 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much for your help! I will try it.


-Original Message-
From: Thomas Joiner [mailto:thomas.b.joi...@gmail.com] 
Sent: Thursday, August 26, 2010 2:36 PM
To: solr-user@lucene.apache.org
Subject: Re: how to deal with virtual collection in solr?

I don't know about the shards, etc.

However I recently encountered that exception while indexing pdfs as well.
 The way that I resolved it was to upgrade to a nightly build of Solr. (You
can find them https://hudson.apache.org/hudson/view/Solr/job/Solr-trunk/).

The problem is that the version of Tika that 1.4.1 is using is a very old
version of Tika, which uses an old version of PDFBox to do its parsing.  (You
might be able to fix the problem just by replacing the Tika jars...however I
don't know if there have been any API changes, so I can't really suggest
that.)

We didn't upgrade to trunk for that functionality, but it was nice
that it started working. (The PDFs we'll be indexing won't be of later
versions, but a test file was.)

On Thu, Aug 26, 2010 at 1:27 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] <
xiao...@mail.nlm.nih.gov> wrote:

> Thanks so much for your help, Jan Høydahl!
>
> I made multiple cores (aa public, aa private, bb public and bb private). I
> knew how to query them individually. Please tell me if I can do a
> combinations through shards parameter now. If yes, I tried to append
> &shards=aapub,bbpub after query string. Unfortunately it didn't work.
>
> Actually all of content is the same. I don't have "collection" field in xml
> files. Please tell me how I can set a "collection" field in schema and
> simply search collection through filter.
>
> I used curl to index pdf files. I use Solr 1.4.1. I got the following error
> when I index pdf with version 1.5 and 1.6.
>
> *
> 
> 
> 
> Error 500 
> 
> HTTP ERROR: 500org.apache.tika.exception.TikaException:
> Unexpected RuntimeException from
> org.apache.tika.parser.pdf.pdfpar...@134ae32
>
> org.apache.solr.common.SolrException:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.pdf.pdfpar...@134ae32
>at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>at
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>at
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>at
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>at
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>at
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>at
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>at org.mortbay.jetty.Server.handle(Server.java:285)
>at
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>at
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
>at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>at
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>at
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: org.apache.tika.exception.TikaException: Unexpected
> RuntimeException from org.apache.tika.parser.pdf.pdfpar...@134ae32
>at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
>at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>... 22 more
> Caused by: java.lang.NullPointerException
>at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
>at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
>at
> org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
>at
> org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper

Re: how to deal with virtual collection in solr?

2010-08-26 Thread Thomas Joiner
I don't know about the shards, etc.

However I recently encountered that exception while indexing pdfs as well.
 The way that I resolved it was to upgrade to a nightly build of Solr. (You
can find them https://hudson.apache.org/hudson/view/Solr/job/Solr-trunk/).

The problem is that the version of Tika that 1.4.1 is using is a very old
version of Tika, which uses an old version of PDFBox to do its parsing.  (You
might be able to fix the problem just by replacing the Tika jars...however I
don't know if there have been any API changes, so I can't really suggest
that.)

We didn't upgrade to trunk for that functionality, but it was nice
that it started working. (The PDFs we'll be indexing won't be of later
versions, but a test file was.)

On Thu, Aug 26, 2010 at 1:27 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] <
xiao...@mail.nlm.nih.gov> wrote:

> Thanks so much for your help, Jan Høydahl!
>
> I made multiple cores (aa public, aa private, bb public and bb private). I
> knew how to query them individually. Please tell me if I can do a
> combinations through shards parameter now. If yes, I tried to append
> &shards=aapub,bbpub after query string. Unfortunately it didn't work.
>
> Actually all of content is the same. I don't have "collection" field in xml
> files. Please tell me how I can set a "collection" field in schema and
> simply search collection through filter.
>
> I used curl to index pdf files. I use Solr 1.4.1. I got the following error
> when I index pdf with version 1.5 and 1.6.
>
> *
> 
> 
> 
> Error 500 
> 
> HTTP ERROR: 500org.apache.tika.exception.TikaException:
> Unexpected RuntimeException from
> org.apache.tika.parser.pdf.pdfpar...@134ae32
>
> org.apache.solr.common.SolrException:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.pdf.pdfpar...@134ae32
>at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>at
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>at
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>at
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>at
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>at
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>at
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>at org.mortbay.jetty.Server.handle(Server.java:285)
>at
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>at
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
>at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>at
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>at
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: org.apache.tika.exception.TikaException: Unexpected
> RuntimeException from org.apache.tika.parser.pdf.pdfpar...@134ae32
>at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
>at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>... 22 more
> Caused by: java.lang.NullPointerException
>at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
>at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
>at
> org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
>at
> org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
>at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
>at
> org.apac

A few query issues with solr

2010-08-26 Thread David Yang
Hi,

 

I'm new to using Solr, and I have started an index with it and it works
great. I have encountered a few minor issues that I currently solve by
modifying the query beforehand -- however I feel like there is a much more
configuration-oriented and Solr-correct way of achieving this.

 

Current manual modifications

* Searching for "car" actually means buying a car so it should
look for "car -rent", whereas searching for "car rent" should still look
for "car rent"

* Searching for "macy's" and searching for "macys" is different
- currently I force macy's to macys

* Searching for "at&t" gets converted to "at", "t" which are
both stop worded - I am forced to convert at&t=>att before indexing and
querying

 

Is there a nice way to handle these or will I always need to resort to
manual fixes for these?

 

Cheers

David



RE: how to deal with virtual collection in solr?

2010-08-26 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much for your help, Jan Høydahl!

I made multiple cores (aa public, aa private, bb public and bb private). I know 
how to query them individually. Please tell me if I can do combinations 
through the shards parameter now. If yes: I tried to append &shards=aapub,bbpub 
after the query string, but unfortunately it didn't work.
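(one thing worth checking: the shards parameter expects full host:port/path entries rather than bare core names -- the host, port, and core names below are illustrative:

```
http://localhost:8080/solr/aapub/select?q=test&shards=localhost:8080/solr/aapub,localhost:8080/solr/bbpub
```
)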

Actually all of the content is the same. I don't have a "collection" field in the xml 
files. Please tell me how I can set a "collection" field in the schema and simply 
search a collection through a filter.

I used curl to index pdf files. I use Solr 1.4.1. I got the following error 
when I index pdf with version 1.5 and 1.6.

*



Error 500 

HTTP ERROR: 500org.apache.tika.exception.TikaException: 
Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@134ae32

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: 
Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@134ae32
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@134ae32
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
... 22 more
Caused by: java.lang.NullPointerException
at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
... 24 more

RequestURI=/solr/lhcpdf/update/extract
***


-Original Message-
From: Jan Høydahl / Cominvent [mailto:jan@cominvent.com] 
Sent: Wednesday, August 25, 2010 4:34 PM
To: solr-user@lucene.apache.org
Subject: Re: how to deal with virtual collection in solr? 

> 1. Currently we use Verity and have more than 20 collections, each collection 
> has a index for public items and a index for private items. So there are 
> virtual collections which point to each collection and a virtual collection 
> which points to all. For example, we have AA and BB collections.
> 
> AA virtual collection --> (AA index for public items and AA index for private 
> items).
> BB virtual collection --> (BB index for public

Re: Is there any stress test tool for testing Solr?

2010-08-26 Thread Gora Mohanty
On Wed, 25 Aug 2010 19:58:36 -0700
Amit Nithian  wrote:

> I recommend JMeter. We use that to do load testing on a search
> server.
[...]

JMeter is certainly good, but we have also found Apache bench
to be of much use. Maybe it is just us, and what we are
familiar with, but Apache bench seemed easier to automate. Also,
it is much easier to get up and running with, at least IMHO.

> Be careful though.. as silly as this may sound.. do NOT just
> issue random queries because that won't exercise your caches...
[...]

Conversely, we are still trying to figure out how to make real-life
measurements, without having the Solr cache coming into the picture.
For querying on a known keyword, every hit after the first, with
Apache bench, is strongly affected by the Solr cache. We tried using
random strings, but at least with Apache bench, the query string is
fixed for each invocation of Apache bench. Have to investigate
whether one can do otherwise with JMeter plugins. Also, a query
that returns no result (as a random query string typically would)
seems to be significantly faster than a real query. So, I think that
in the long run, the best way is to build information about
*typical* queries that your users run; using the Solr logs, and
then use a set of such queries for benchmarking.
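One way to build that set of *typical* queries is to mine the q= parameters out of the request logs and replay them. A rough Python sketch of the idea; the log-line format and the Solr URL below are assumptions, so adjust them to your own setup:

```python
# Sketch: pull "typical" q= parameters out of Solr request logs and turn
# them into benchmark URLs to feed to ab/JMeter/curl. The log format and
# endpoint here are illustrative assumptions.
import re
from urllib.parse import parse_qs, urlencode

SOLR_URL = "http://localhost:8983/solr/select"  # hypothetical endpoint

sample_log_lines = [
    "INFO: [] webapp=/solr path=/select params={q=windows&rows=10} hits=42 status=0 QTime=5",
    "INFO: [] webapp=/solr path=/select params={q=solr+faceting&rows=10} hits=7 status=0 QTime=12",
]

def extract_queries(lines):
    """Return the q= values found in params={...} blocks of Solr log lines."""
    queries = []
    for line in lines:
        m = re.search(r"params=\{([^}]*)\}", line)
        if not m:
            continue
        params = parse_qs(m.group(1))
        queries.extend(params.get("q", []))
    return queries

def benchmark_urls(queries):
    """Build one URL per recorded query, ready for a load-testing tool."""
    return [SOLR_URL + "?" + urlencode({"q": q}) for q in queries]

urls = benchmark_urls(extract_queries(sample_log_lines))
```

Replaying real queries this way exercises the caches the same way production traffic does, unlike random strings.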

Regards,
Gora


Re: Matching exact words

2010-08-26 Thread Erick Erickson
See below:

On Thu, Aug 26, 2010 at 10:24 AM, ahammad  wrote:

>
> Hello Erick,
>
> Thanks for the reply. I am a little confused by this whole stemming thing.
> What exactly does it refer to?
>

In your schema file, for the "text" field type, you'll see a line like:

<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
Which inserts a stemmer in your filter chain. Stemmers
algorithmically reduce words to their root, e.g. running,
runs, etc. all reduce to run. The reduced term is all that's
put in your index. And when you search, assuming it goes
through the same analysis chain, your query will look for
run too. The analysis admin page is your friend here for
understanding how all this goes together. See:
/solr/admin/analysis.jsp
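As a toy illustration of what a stemming filter does at index time; this is a crude suffix stripper written for this example, NOT the actual Porter algorithm Solr uses:

```python
# Crude stand-in for a stemming filter: reduce word variants to a shared
# root so they are indexed (and matched) as the same term. This is only
# enough to show why a search for "windows" also matches "window".
def toy_stem(word):
    if word.endswith("ning"):              # running -> run (very crude)
        return word[:-4]
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = ["running", "runs", "run", "windows"]
stemmed = [toy_stem(t) for t in tokens]
# the first three inflections collapse to one indexed token, "run"
```

Because the query side goes through the same chain, both document and query end up comparing stemmed tokens.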


>
> Basically, I already have a field which is essentially a collection of many
> other fields (done using copyField). This field is a text field. So what
> you're saying is to have a duplicate of this field with different
> properties
> such that it does not stem?
>
This is pretty much what I was suggesting, but whether it's appropriate
for your situation is up to you. Making a duplicate field may be
prohibitive; can't tell without knowing more about your problem space.




> When querying, I assume that I will have to explicitly specify which field
> to search against...is this correct?
>
Yep, or use the dismax request handler, it lets you do this automagically.
The dismax request handler is probably the thing you should look at first;
it lets you configure searches to look at multiple fields with different
boosts...

Best
Erick


> I'm a little rusty on the solr stuff to be honest so please bear with me.
>
> Thanks
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Matching-exact-words-tp1353350p1357027.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Matching exact words

2010-08-26 Thread ahammad

Hello Erick,

Thanks for the reply. I am a little confused by this whole stemming thing.
What exactly does it refer to?

Basically, I already have a field which is essentially a collection of many
other fields (done using copyField). This field is a text field. So what
you're saying is to have a duplicate of this field with different properties
such that it does not stem?

When querying, I assume that I will have to explicitly specify which field
to search against...is this correct?

I'm a little rusty on the solr stuff to be honest so please bear with me.

Thanks
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Matching-exact-words-tp1353350p1357027.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Matching exact words

2010-08-26 Thread Erick Erickson
You'll have to change your index I'm afraid. The problem is
that all the index sees is the stemmed version (assuming
you're stemming at index time). There's no information in
the index about what the original version was, so it's impossible
to back this out.

One solution is to use copyfield to make a copy of the
input that does NOT stem, and search against (or boost)
that field when you care about stemmed/unstemmed.
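Sketched as schema.xml config, that copyField setup looks something like the following; the field and type names here are made up for illustration, not taken from any poster's schema:

```xml
<!-- hypothetical names: "content_exact" uses a fieldType whose analyzer
     chain omits the stemming filter, so it keeps the original tokens -->
<field name="content" type="text" indexed="true" stored="true"/>
<field name="content_exact" type="text_unstemmed" indexed="true" stored="false"/>
<copyField source="content" dest="content_exact"/>
```

Searching (or boosting) content_exact then matches only the unstemmed form.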

And a minor clarification. The "types" you refer to aren't
really a SOLR entity. They are just a convenient collection
of tokenizers and stemmers that are provided in the schema
file. You can freely create your own types by simply mixing and
matching various varieties of these (you probably already know
this, but the phrasing of your question caused me to wonder).

See:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Best
Erick

On Thu, Aug 26, 2010 at 7:24 AM, ahammad  wrote:

>
> Hello,
>
> I have a case where if I search for the word "windows", I get results
> containing both "windows" and "window" (and probably other things like
> "windowing" etc.). Is there a way to find exact matches only?
>
> The field in which I am searching is a text field, which as I understand
> causes this behaviour. I cannot use a string field because it is very
> restricted, but what else can be done? I understand there are other types
> of
> text fields that are more strict than the standard field.
>
> Ideally I would like to keep my index the way it is, with the ability to
> force exact matches. For example, if I can search "windows -window" or
> something like that, that would be great. Or if I can wrap my query in a
> set
> of quotes to tell it to match exactly. I've seen that done before but I
> cannot get it to work.
>
> As a reference, here is my query:
>
> q={!boost b=$db v=$qq
>
> defType=$sh}&qq=windows&db=recip(ms(NOW,lastModifiedLong),3.16e-11,1,1)&sh=dismax
>
> To be quite frank, I am not very familiar with this syntax. I am just using
> whatever my old coworker left behind.
>
> Any tips on how to find exact matches or improve the above query will be
> greatly appreciated.
>
> Thanks
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Matching-exact-words-tp1353350p1353350.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: How to do ? Articles and Its Associated Comments Indexing , One to Many relationship

2010-08-26 Thread Erick Erickson
See below...

On Thu, Aug 26, 2010 at 4:31 AM, Sumit Arora  wrote:

> Thanks Ephraim for your response.
>
> If I use MultiValued for Comments Field then While Picking data from Solr,
> Should I use following Logic :
>
> /*  Sample PseudoCode */
>
> Get Rows from Article and Article-Comments Table ;  *// It will retrieve -
> 1
> Article and 20 Comments*
>
> Begin;
>
> Include 'Article Fields Value' in 'Solr Fields Value' Defined in Schema.Xml
>  */* One Article in this Case, So it will generate one document id for Solr
> - */*
>
> Comments = 0;
>
> While (Comments != 20)
>
> {
>   Include this Comment;
>
>   ++Comments;
> }
>
> End;
>
> Result : One Article with MultipleComments as MultiValued indexed in Solr,
> Finally Solr will have only one document or multiple document ?
>
>
A multi-valued field is just what it says, a field within a single
document. So you'd have one document with 20 values for
your comment field.

However, note that SOLR doesn't have partial updates of a document,
it deletes and re-adds a document when you update. This is handled
automatically for you if you have a uniquekey defined. That is, if
you add a new document with the SAME unique key as a previous
document, the previous one will be removed and the new one
will replace it (with a new internal document id).
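A minimal sketch of that delete-and-re-add behaviour, modelled with a plain dict keyed by the uniqueKey; the field names ("id", "comments") are illustrative, not from the original post:

```python
# Model of Solr's update semantics: adding a doc whose uniqueKey already
# exists replaces the old doc wholesale. There is no partial update, so
# all fields (including every value of a multiValued field) must be resent.
index = {}

def add_document(doc):
    """Insert or fully replace the document with this uniqueKey."""
    index[doc["id"]] = doc

add_document({"id": "article-1", "title": "v1", "comments": ["first!"]})
add_document({"id": "article-1", "title": "v2",
              "comments": ["first!", "nice article"]})
# only one document remains, and it is entirely the second version
```

So when a 21st comment arrives, the whole article document (with all 21 comment values) is re-sent.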


> If I suppose to use HighLight Text in this case, and Search - Keyword exist
> in more than one Comments ? How I can achieve below result where it has
> found 'web' keyword exist in two comments.
>
> ... 1.The *web* portal will connect a lot of people for some specific
> domain, and then people can post their interesting story, upload files
>
>  ... 2.1 accessing multiple sites will slow down the user experience - try
> not to do it. *web* hosting is not too expensive as compared to the other
> components ...
>
>
>
I believe this is controlled by the hl.fragsize, see:
http://wiki.apache.org/solr/HighlightingParameters#hl.fragsize

The other thing you should be aware of is "increment gap". This
is useful if you want, say, phrase queries to NOT work across
two comments. I.e.
comment 1: comments are very nice
comment 2: day in and day out

If you don't want a phrase query "nice day" to match the
enclosing document, you probably want to work with the
positionIncrementGap. See:
http://lucene.472066.n3.nabble.com/positionIncrementGap-in-schema-xml-td488338.html
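In schema.xml this is the positionIncrementGap attribute on the field type; an illustrative fragment (100 is the value the example schema ships with, adjust to taste):

```xml
<!-- a gap of 100 positions is left between successive values of a
     multiValued field, so a phrase query cannot match across two values
     (e.g. across two different comments) -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```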

Best
Erick


>
>
> On Thu, Aug 26, 2010 at 4:32 PM, Ephraim Ofir  wrote:
>
> > Why not define the comment field as multiValued? That way you only index
> > each document once and you don't need to collapse anything...
> >
> > Ephraim Ofir
> >
> >
> > -Original Message-
> > From: Sumit Arora [mailto:sumit1...@gmail.com]
> > Sent: Thursday, August 26, 2010 12:54 PM
> > To: solr-user@lucene.apache.org
> > Subject: How to do ? Articles and Its Associated Comments Indexing , One
> > to Many relationship
> >
> > I have set of Articles and then Comments on it, so in database I have
> > two
> > major tables one for Articles and one for Comments, but each Article
> > could
> > have many comments (One to Many).
> >
> >
> > If One Article will have 20 Comments, then on DB to SOLR - Index - Sync
> > :
> > Solr will index 20 Similar Documents with a difference of each Comment.
> >
> >
> > Use Case :
> >
> > On Search: If keyword would be a fit to more than one comment, then it
> > will
> > return duplicate documents.
> >
> >
> > One Possible solution I thought to Apply:
> >
> > **
> >
> > I should go for Indexing 20 Similar Documents with a difference of each
> > Comment.
> >
> >
> > While retrieving results from Query: I could use: collapse.field = By
> > Article Id
> >
> >
> > Am I following right approach?
> >
>


Re: sort by field length

2010-08-26 Thread Shawn Heisey

 On 5/24/2010 6:30 AM, Sascha Szott wrote:

Hi folks,

is it possible to sort by field length without having to (redundantly) 
save the length information in a seperate index field? At first, I 
thought to accomplish this using a function query, but I couldn't find 
an appropriate one.




I have a slightly different need related to this, though it may turn out 
that what Sascha wants is similar.  I would like to understand my data 
better so I can improve my schema.  I need to do some data mining that 
is (to my knowledge) difficult or impossible with the source database.  
Performance is irrelevant, as long as it finishes eventually.  
Completing in less than an hour would be nice.


I would do this on a test system with much lower performance and memory 
(4GB) than my production servers, as a single index instead of multiple 
shards.  When it finishes building, the entire test index is likely to 
be about 75GB.


What I'm after is an output that would look very much like faceting, but 
I want it to show document counts associated with field length (for a 
simple string) and number of terms (for a tokenized field) instead of 
field value.  Can Solr do that, and if so, what do I need to have 
enabled in the schema to get it?  Would branch_3x be enough, or would 
trunk be better?


Thanks,
Shawn



Multiple passes with WordDelimiterFilterFactory

2010-08-26 Thread Shawn Heisey
 Can I pass my data through WordDelimiterFilterFactory more than once?  
It occurs to me that I might get better results if I can do some of the 
filters separately and use preserveOriginal on some of them but not others.


Currently I am using the following definition on both indexing and 
querying.  Would it make sense to do the two differently?
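One common shape for that split is an index/query analyzer pair; the parameter values below are examples only, not a recommendation and not the stripped original definition:

```xml
<!-- illustrative: index-time WDF preserves originals so more variants
     are searchable; query-time WDF does not, keeping queries tighter -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" catenateWords="0" preserveOriginal="0"/>
</analyzer>
```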




Thanks,
Shawn



Re: Slow facet sorting - lex vs count

2010-08-26 Thread Eric Grobler
Hi Yonik,

Thanks for your help.

I will check the memory.

It might also be related to patch SOLR-792 tree faceting I installed.
I will remove it and try the same query tomorrow again.

Regards
Eric


On Wed, Aug 25, 2010 at 10:25 PM, Yonik Seeley
wrote:

> On Wed, Aug 25, 2010 at 7:22 AM, Eric Grobler 
> wrote:
> > Hi Solr experts,
> >
> > There is a huge difference doing facet sorting on lex vs count
> > The strange thing is that count sorting is fast when setting a small
> limit.
> > I realize I can do sorting in the client, but I am just curious why this
> is.
> >
> > FAST - 16ms
> > facet.field=city
> > f.city.facet.limit=5000
> > f.city.facet.sort=lex
> >
> > FAST - 20 ms
> > facet.field=city
> > f.city.facet.limit=50
> > f.city.facet.sort=count
> >
> > SLOW - over 1 second
> > facet.field=city
> > f.city.facet.limit=5000
> > f.city.facet.sort=count
>
> FYI, I just tried my own single-valued faceting test:
> 10M documents, query matches 1M docs, faceting on a field that has
> 100,000 unique values:
>
> facet.limit=100 -> 35ms
> facet.limit=5000 -> 44ms
> facet.limit=5 -> 100ms
>
> The times are reported via QTime (i.e. they do not include the time to
> write out the response to the client).
> Maybe you're running into memory issues because of the size of the
> BoundedTreeSet, response size, etc, and garbage collection is taking
> up a lot of time?
>
> -Yonik
> http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
>
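The count-sorted path keeps a bounded top-k structure while scanning every term's count, which is one reason a large facet.limit costs more. A rough Python sketch of that cost shape, with heapq standing in for Lucene's BoundedTreeSet and entirely synthetic counts:

```python
# Sketch of count-sorted facet selection: every unique term's count is
# scanned, but only the top `limit` entries are kept in a bounded
# structure. A bigger limit means a bigger heap and a bigger response
# to build. Counts here are synthetic, not real facet data.
import heapq
import random

random.seed(0)
term_counts = {f"city_{i}": random.randint(1, 10_000) for i in range(100_000)}

def top_facets(counts, limit):
    """Return the `limit` (count, term) pairs with the highest counts."""
    return heapq.nlargest(limit, ((c, t) for t, c in counts.items()))

small = top_facets(term_counts, 50)
large = top_facets(term_counts, 5000)
```

Lex-sorted faceting, by contrast, can just walk the term dictionary in order and stop at the limit.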


Matching exact words

2010-08-26 Thread ahammad

Hello,

I have a case where if I search for the word "windows", I get results
containing both "windows" and "window" (and probably other things like
"windowing" etc.). Is there a way to find exact matches only?

The field in which I am searching is a text field, which as I understand
causes this behaviour. I cannot use a string field because it is very
restricted, but what else can be done? I understand there are other types of
text fields that are more strict than the standard field.

Ideally I would like to keep my index the way it is, with the ability to
force exact matches. For example, if I can search "windows -window" or
something like that, that would be great. Or if I can wrap my query in a set
of quotes to tell it to match exactly. I've seen that done before but I
cannot get it to work.

As a reference, here is my query:

q={!boost b=$db v=$qq
defType=$sh}&qq=windows&db=recip(ms(NOW,lastModifiedLong),3.16e-11,1,1)&sh=dismax

To be quite frank, I am not very familiar with this syntax. I am just using
whatever my old coworker left behind. 

Any tips on how to find exact matches or improve the above query will be
greatly appreciated.

Thanks
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Matching-exact-words-tp1353350p1353350.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: TurkishLowerCaseFilterFactory

2010-08-26 Thread Robert Muir
On Thu, Aug 26, 2010 at 7:28 AM, Yavuz Selim YILMAZ  wrote:

> I downloaded latest jars except snowball 3-1.jar. I can't find it any
> place?
> --
>
> Yavuz Selim YILMAZ
>
>
Hello,

in 3.1 the contrib/snowball is now integrated with contrib/analyzers, so you
just need the analyzers jar!

This way, in a single jar you have the TurkishLowerCaseFilter, but also the
Turkish stemmer from snowball, a set of Turkish stopwords in resources/, and
a Lucene TurkishAnalyzer that puts it all together.

-- 
Robert Muir
rcm...@gmail.com


Re: solr working...

2010-08-26 Thread Geert-Jan Brits
Check out Drew Farris' explantion for remote debugging Solr with Eclipse
posted a couple of days ago:
http://lucene.472066.n3.nabble.com/How-to-Debug-Sol-Code-in-Eclipse-td1262050.html

Geert-Jan

2010/8/26 Michael Griffiths 

> Take a look at the code? It _is_ open source. Open it up in Eclipse and
> debug it.
>
> -Original Message-
> From: satya swaroop [mailto:sswaro...@gmail.com]
> Sent: Thursday, August 26, 2010 8:24 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr working...
>
> Hi peter,
>I am already working on Solr and it is working well. But I want
> to understand the code and know where the actual work is going on, how
> indexing is done, how the requests are parsed, and how it responds.
> To understand the code, I asked how to start.
>
> Regards,
> satya
>


RE: solr working...

2010-08-26 Thread Michael Griffiths
Take a look at the code? It _is_ open source. Open it up in Eclipse and debug 
it.

-Original Message-
From: satya swaroop [mailto:sswaro...@gmail.com] 
Sent: Thursday, August 26, 2010 8:24 AM
To: solr-user@lucene.apache.org
Subject: Re: solr working...

Hi peter,
I am already working on Solr and it is working well. But I want to 
understand the code and know where the actual work is going on, how 
indexing is done, how the requests are parsed, and how it responds. 
To understand the code, I asked how to start.

Regards,
satya


Re: solr working...

2010-08-26 Thread satya swaroop
Hi peter,
I am already working on Solr and it is working well. But I want
to understand the code and know where the actual work is going on, how
indexing is done, how the requests are parsed, and how it responds.
To understand the code, I asked how to start.

Regards,
satya


Re: Candidate Profile Search which have multiple employers and Educations.

2010-08-26 Thread Sumit Arora
Thanks Ephraim for your response.

Actually I am not using DIH to sync the data from the DB; I wrote the DB
sync myself, and I am directly retrieving rows from the MySQL DB and
indexing them into Solr.

In my earlier cases, I picked rows with a column label from the DB, and with
a similar column defined in my sync program it picks the data and indexes
it, one-to-one.



So in that case (one to many), I have to use an inner loop (based on the
number of candidate educations or employers) to index candidate education
and candidate employer.
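That inner-loop approach can be sketched as follows: one Solr document per candidate, with the one-to-many rows folded into multiValued fields. Field and column names here are illustrative stand-ins:

```python
# Sketch of the inner-loop sync: fold each candidate's child rows into
# multiValued fields so one candidate yields ONE Solr document, not a
# document per (education x employer) combination.
def build_solr_doc(profile_row, education_rows, employer_rows):
    doc = {"id": profile_row["id"], "name": profile_row["name"]}
    # inner loops: one list entry per child row -> one multiValued value
    doc["education"] = [e["school"] for e in education_rows]
    doc["employer"] = [e["company"] for e in employer_rows]
    return doc

doc = build_solr_doc(
    {"id": "c1", "name": "Sumit"},
    [{"school": "IIT"}, {"school": "MIT"}],
    [{"company": "Acme"}],
)
```

The resulting dict maps directly onto a schema where education and employer are multiValued fields.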

On Thu, Aug 26, 2010 at 4:59 PM, Ephraim Ofir  wrote:

> As far as I can tell you should use multiValued for these fields:
>
>   <field name="education" type="text" indexed="true" stored="true" multiValued="true"/>
>   <field name="employer" type="text" indexed="true" stored="true" multiValued="true"/>
>
> In order to get the data from the DB you should either create a sub
> entity with its own query or (the better performance option) use
> something like:
>
> SELECT cp.name,
>GROUP_CONCAT(ce.CandidateEducation SEPARATOR '|') AS
> multiple_educations,
>GROUP_CONCAT(e.Employer SEPARATOR '|') AS multiple_employers
> FROM CandidateProfile_Table cp
> LEFT JOIN CandidateEducation_Table ce ON cp.name = ce.name
> LEFT JOIN Employers_Table e ON cp.name = e.name
> GROUP BY cp.name
>
> This creates one line with the educations and employers concatenated
> into pipe (|) delimited fields.  Then you'd have to break up the
> multiple fields using a RegexTransformer - use something like:
>
> <entity name="..." query="...see above..."
>         transformer="RegexTransformer" >
>     <field column="multiple_educations" splitBy="\|"/>
>     <field column="multiple_employers" splitBy="\|"/>
> </entity>
>
> The SQL probably doesn't fit your DB schema, but it's just to clarify
> the idea.  You might have to pick a different field separator if pipe
> (|) might be in your data...
>
> Ephraim Ofir
>
>
> -Original Message-
> From: Sumit Arora [mailto:sumit1...@gmail.com]
> Sent: Thursday, August 26, 2010 1:36 PM
> To: solr-user@lucene.apache.org
> Subject: Candidate Profile Search which have multiple employers and
> Educations.
>
> I have to search candidate's profile , on which I have following Tables
> :
>
> Candidate Profile Record : CandidateProfile_Table
>
> CandidateEducation : CandidateEducation_Table  //  EducationIn Different
> Institutes or Colleges  :
>
> Employers :  Employers_Table //More than One Employers :
>
> If I denormalize this all three Table :
>
> CandidateProfile_Table  - 1 Row for Sumit
>
> CandidateEducation_Table - 5 Rows for Sumit
>
> Employers_Table - 5 Rows for Sumit
>
> If these three tables will go to Index in Solr , It will create 25
> Documents
> for one row.
>
>
> In this Case What Should be My Approach :
>
> DeNormalize all three tables and while querying from Solr use Field
> Collapse
> parameter by CandidateProfile Id, So It will return one record.
>
> Or
>
> I should use CandidateEducation_Table,CandidateEducation_Table as
> MultiValued in Solr ?
>
>
> If that is the case, then How I can apply Solr way to use MultiValue
> e.g;
>
> I need to use  Following Configuration in Scehma.xml :
>
>
>  
>  
>
>
> After this :
>
>
> I should pick all education values(from MySql Education Database Table)
> concerned to one profile
>
> and keep this in a one variable - EducationValuesForSolr
>
> and then EducationValuesForSolr's value need to assign to Schema.XML
> defined
> variable education ?
>
>
> Please let me know If I am using right approach and Comments?
>
> /Sumit
>


Re: solr working...

2010-08-26 Thread Peter Karich
Hi!

What do you mean? You want a quickstart?
Then see
http://lucene.apache.org/solr/tutorial.html

(But I thought you already got solr working (from previous threads)!?)

Or do you want to know if solr is running? Then try the admin view:
http://localhost:8080/solr/admin/

Regards,
Peter.

> Hi all,
>   I am interested to see the working of Solr.
> 1) Can anyone tell me how to start to learn how it works?
>
> Regards,
> satya
>   



Re: How to do ? Articles and Its Associated Comments Indexing , One to Many relationship

2010-08-26 Thread Sumit Arora
Thanks Ephraim for your response.

If I use multiValued for the comments field, then while picking data from
Solr, should I use the following logic:

/*  Sample PseudoCode */

Get Rows from Article and Article-Comments Table ;  *// It will retrieve - 1
Article and 20 Comments*

Begin;

Include 'Article Fields Value' in 'Solr Fields Value' Defined in Schema.Xml
 */* One Article in this Case, So it will generate one document id for Solr
- */*

Comments = 0;

While (Comments != 20)

{
   Include this Comment;

   ++Comments;
}

End;

Result: one article with multiple comments indexed as multiValued in Solr.
Finally, will Solr have only one document or multiple documents?

If I am supposed to use highlighted text in this case, and the search
keyword exists in more than one comment, how can I achieve the result below,
where it has found the 'web' keyword in two comments?

... 1.The *web* portal will connect a lot of people for some specific
domain, and then people can post their interesting story, upload files

 ... 2.1 accessing multiple sites will slow down the user experience - try
not to do it. *web* hosting is not too expensive as compared to the other
components ...




On Thu, Aug 26, 2010 at 4:32 PM, Ephraim Ofir  wrote:

> Why not define the comment field as multiValued? That way you only index
> each document once and you don't need to collapse anything...
>
> Ephraim Ofir
>
>
> -Original Message-
> From: Sumit Arora [mailto:sumit1...@gmail.com]
> Sent: Thursday, August 26, 2010 12:54 PM
> To: solr-user@lucene.apache.org
> Subject: How to do ? Articles and Its Associated Comments Indexing , One
> to Many relationship
>
> I have set of Articles and then Comments on it, so in database I have
> two
> major tables one for Articles and one for Comments, but each Article
> could
> have many comments (One to Many).
>
>
> If One Article will have 20 Comments, then on DB to SOLR - Index - Sync
> :
> Solr will index 20 Similar Documents with a difference of each Comment.
>
>
> Use Case :
>
> On Search: If keyword would be a fit to more than one comment, then it
> will
> return duplicate documents.
>
>
> One Possible solution I thought to Apply:
>
> **
>
> I should go for Indexing 20 Similar Documents with a difference of each
> Comment.
>
>
> While retrieving results from Query: I could use: collapse.field = By
> Article Id
>
>
> Am I following right approach?
>


RE: Candidate Profile Search which have multiple employers and Educations.

2010-08-26 Thread Ephraim Ofir
As far as I can tell you should use multiValued for these fields:

  <field name="education" type="text" indexed="true" stored="true" multiValued="true"/>
  <field name="employer" type="text" indexed="true" stored="true" multiValued="true"/>

In order to get the data from the DB you should either create a sub
entity with its own query or (the better performance option) use
something like:

SELECT cp.name,
GROUP_CONCAT(ce.CandidateEducation SEPARATOR '|') AS
multiple_educations,
GROUP_CONCAT(e.Employer SEPARATOR '|') AS multiple_employers
FROM CandidateProfile_Table cp
LEFT JOIN CandidateEducation_Table ce ON cp.name = ce.name
LEFT JOIN Employers_Table e ON cp.name = e.name
GROUP BY cp.name

This creates one line with the educations and employers concatenated
into pipe (|) delimited fields.  Then you'd have to break up the
multiple fields using a RegexTransformer - use something like:

<entity name="..." query="...see above..."
        transformer="RegexTransformer" >
    <field column="multiple_educations" splitBy="\|"/>
    <field column="multiple_employers" splitBy="\|"/>
</entity>

The SQL probably doesn't fit your DB schema, but it's just to clarify
the idea.  You might have to pick a different field separator if pipe
(|) might be in your data...
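The same GROUP_CONCAT-then-split idea can be tried end to end with SQLite, which also has GROUP_CONCAT (with the separator as its second argument). Table and column names below are simplified stand-ins for the real schema:

```python
# Runnable sketch of the concatenate-then-split approach, using SQLite
# (Python stdlib) in place of MySQL. One row comes back per profile, and
# splitting on the delimiter mirrors RegexTransformer's splitBy="\|".
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profile (name TEXT);
    CREATE TABLE education (name TEXT, school TEXT);
    INSERT INTO profile VALUES ('sumit');
    INSERT INTO education VALUES ('sumit', 'IIT'), ('sumit', 'MIT');
""")
row = conn.execute("""
    SELECT p.name, GROUP_CONCAT(e.school, '|') AS multiple_educations
    FROM profile p LEFT JOIN education e ON p.name = e.name
    GROUP BY p.name
""").fetchone()

educations = row[1].split("|")  # the "transformer" side of the pipeline
```

This keeps the DB round-trips to one query per import instead of one sub-query per profile.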

Ephraim Ofir


-Original Message-
From: Sumit Arora [mailto:sumit1...@gmail.com] 
Sent: Thursday, August 26, 2010 1:36 PM
To: solr-user@lucene.apache.org
Subject: Candidate Profile Search which have multiple employers and
Educations.

I have to search candidate's profile , on which I have following Tables
:

Candidate Profile Record : CandidateProfile_Table

CandidateEducation : CandidateEducation_Table  //  EducationIn Different
Institutes or Colleges  :

Employers :  Employers_Table //More than One Employers :

If I denormalize this all three Table :

CandidateProfile_Table  - 1 Row for Sumit

CandidateEducation_Table - 5 Rows for Sumit

Employers_Table - 5 Rows for Sumit

If these three tables will go to Index in Solr , It will create 25
Documents
for one row.


In this Case What Should be My Approach :

Denormalize all three tables and, while querying from Solr, use the Field
Collapse parameter by CandidateProfile Id, so it will return one record.

Or

I should use CandidateEducation_Table,CandidateEducation_Table as
MultiValued in Solr ?


If that is the case, then How I can apply Solr way to use MultiValue
e.g;

I need to use  Following Configuration in Scehma.xml :


  
  


After this :


I should pick all education values(from MySql Education Database Table)
concerned to one profile

and keep this in a one variable - EducationValuesForSolr

and then EducationValuesForSolr's value need to assign to Schema.XML
defined
variable education ?


Please let me know If I am using right approach and Comments?

/Sumit


Re: TurkishLowerCaseFilterFactory

2010-08-26 Thread Yavuz Selim YILMAZ
I downloaded the latest jars except snowball 3-1.jar. I can't find it anywhere?
--

Yavuz Selim YILMAZ


2010/8/26 Ahmet Arslan 

> > Is there a version of solr which has
> > TurkishLowerCaseFilterFactory.java
> > I downloaded 1.4.1 version of solr , but it hasn't it.
>
> According to wiki that filter will be available in solr 3.1
> http://wiki.apache.org/solr/LanguageAnalysis#Turkish
>
> You can checkout branch 3.1
>
> http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/solr/src/java/org/apache/solr/analysis/TurkishLowerCaseFilterFactory.java?view=log
>
>
>
>
>
>


RE: How to do ? Articles and Its Associated Comments Indexing , One to Many relationship

2010-08-26 Thread Ephraim Ofir
Why not define the comment field as multiValued? That way you only index
each document once and you don't need to collapse anything...

Ephraim Ofir


-Original Message-
From: Sumit Arora [mailto:sumit1...@gmail.com] 
Sent: Thursday, August 26, 2010 12:54 PM
To: solr-user@lucene.apache.org
Subject: How to do ? Articles and Its Associated Comments Indexing , One
to Many relationship

I have set of Articles and then Comments on it, so in database I have
two
major tables one for Articles and one for Comments, but each Article
could
have many comments (One to Many).


If One Article will have 20 Comments, then on DB to SOLR - Index - Sync
:
Solr will index 20 Similar Documents with a difference of each Comment.


Use Case :

On Search: If keyword would be a fit to more than one comment, then it
will
return duplicate documents.


One Possible solution I thought to Apply:

**

I should go for Indexing 20 Similar Documents with a difference of each
Comment.


While retrieving results from Query: I could use: collapse.field = By
Article Id


Am I following right approach?


Candidate Profile Search which have multiple employers and Educations.

2010-08-26 Thread Sumit Arora
I have to search candidates' profiles, for which I have the following tables:

Candidate Profile Record : CandidateProfile_Table

CandidateEducation : CandidateEducation_Table  //  Education in different
institutes or colleges

Employers :  Employers_Table //More than One Employers :

If I denormalize this all three Table :

CandidateProfile_Table  - 1 Row for Sumit

CandidateEducation_Table - 5 Rows for Sumit

Employers_Table - 5 Rows for Sumit

If these three tables will go to Index in Solr , It will create 25 Documents
for one row.


In this Case What Should be My Approach :

Denormalize all three tables and, while querying from Solr, use the Field
Collapse parameter by CandidateProfile Id, so it will return one record.

Or

I should use CandidateEducation_Table and Employers_Table as
multiValued in Solr?


If that is the case, then How I can apply Solr way to use MultiValue e.g;

I need to use the following configuration in Schema.xml:

  <field name="education" type="text" indexed="true" stored="true" multiValued="true"/>
  <field name="employer" type="text" indexed="true" stored="true" multiValued="true"/>


After this :


I should pick all education values (from the MySQL education database table)
concerning one profile,

and keep them in one variable, EducationValuesForSolr,

and then EducationValuesForSolr's value needs to be assigned to the
Schema.xml-defined field "education"?


Please let me know if I am using the right approach, and any comments?

/Sumit


solr working...

2010-08-26 Thread satya swaroop
Hi all,
  I am interested to see the working of Solr.
1) Can anyone tell me how to start to learn how it works?

Regards,
satya


How to do ? Articles and Its Associated Comments Indexing , One to Many relationship

2010-08-26 Thread Sumit Arora
I have a set of articles and then comments on them, so in the database I have
two major tables, one for articles and one for comments, but each article
could have many comments (one to many).


If one article has 20 comments, then on a DB-to-Solr index sync, Solr will
index 20 similar documents, each differing only by its comment.


Use Case :

On search: if the keyword matches more than one comment, it will
return duplicate documents.


One Possible solution I thought to Apply:


I should go for Indexing 20 Similar Documents with a difference of each
Comment.


While retrieving results from Query: I could use: collapse.field = By
Article Id


Am I following the right approach?


FieldCache.DEFAULT.getInts vs FieldCache.DEFAULT.getStringIndex. Memory usage

2010-08-26 Thread Marc Sturlese

I need to load a FieldCache for a field which is a Solr "integer" type and
has at most 3 digits. Let's say my index has 10M docs.
I am wondering which is more optimal and less memory consuming to load: a
FieldCache.DEFAULT.getInts or a FieldCache.DEFAULT.getStringIndex.

The second one will have an int[] with as many entries as the index has
docs. Additionally, it will have a String[] with as many entries as there
are unique terms. As I am dealing with numbers, I will have to cast the
values of the String[] to work with them.

If I load a FieldCache.DEFAULT.getInts, I will have just an int[] with the
value of the doc's field at each array position, and I will be able to work
with the ints directly. In this case, will it be more optimal to use this?
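A rough back-of-the-envelope comparison for the two caches, assuming 4-byte ints, about 1,000 distinct 3-digit values, and a guessed per-String JVM overhead (actual numbers vary by JVM and Lucene version):

```python
# Rough memory estimate for the two FieldCache variants (illustrative only).
num_docs = 10_000_000
num_unique_terms = 1_000           # at most 3 digits -> roughly 1000 distinct values
bytes_per_int = 4
bytes_per_small_string = 48        # assumed JVM String overhead, not measured

# getInts: one int per document
get_ints = num_docs * bytes_per_int

# getStringIndex: an order[] int per document plus a lookup String[] per unique term
get_string_index = num_docs * bytes_per_int + num_unique_terms * bytes_per_small_string

print(get_ints)          # ~40 MB
print(get_string_index)  # only slightly larger, dominated by the same order[] array
```

Either way the 10M-entry int array dominates, so getInts is no worse on memory and spares you the String-to-int casting.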

Thanks in advance
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/FieldCache-DEFAULT-getInts-vs-FieldCache-DEFAULT-getStringIndex-Memory-usage-tp1348480p1348480.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: TurkishLowerCaseFilterFactory

2010-08-26 Thread Ahmet Arslan
> Is there a version of solr which has
> TurkishLowerCaseFilterFactory.java
> I downloaded 1.4.1 version of solr , but it hasn't it.

According to the wiki, that filter will be available in Solr 3.1:
http://wiki.apache.org/solr/LanguageAnalysis#Turkish

You can check it out from the 3.x branch:
http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/solr/src/java/org/apache/solr/analysis/TurkishLowerCaseFilterFactory.java?view=log
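Once you are on a build that includes it, the filter would be declared in schema.xml roughly like the sketch below; the field type name and the rest of the analyzer chain are illustrative:

```xml
<fieldType name="text_tr" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Turkish-aware lowercasing (handles dotted vs. dotless i) -->
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```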




  


Re: JVM GC is very frequent.

2010-08-26 Thread Marc Sturlese

http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/JVM-GC-is-very-frequent-tp1345760p1348065.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: How to delete documents from SOLR index using DIH

2010-08-26 Thread Ephraim Ofir
You have several options here:
1. Use the deletedPkQuery in delta import - you'll need to make a DB
query which generates the IDs to be deleted (something like: SELECT id
FROM your_table WHERE deleted = 1).
2. Add the $deleteDocById special command to your full/delta import.
3. Use preImportDeleteQuery/postImportDeleteQuery in your full/delta
query

If you want to use any of these separately from your import, you can put
them in a separate entity and do a full/delta import just on that
entity.
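A minimal data-config.xml entity sketch for option 1; the table and column names (your_table, deleted, last_modified) are assumptions:

```xml
<!-- Hypothetical DIH entity: delta import plus delete detection -->
<entity name="item" pk="id"
        query="SELECT id, title FROM your_table WHERE deleted = 0"
        deltaQuery="SELECT id FROM your_table
                    WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT id, title FROM your_table
                          WHERE id = '${dataimporter.delta.id}'"
        deletedPkQuery="SELECT id FROM your_table WHERE deleted = 1"/>
```

On each delta-import, deletedPkQuery returns the primary keys to remove from the index while deltaQuery/deltaImportQuery handle adds and updates.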

Ephraim Ofir


-Original Message-
From: Pawan Darira [mailto:pawan.dar...@gmail.com] 
Sent: Thursday, August 26, 2010 9:01 AM
To: solr-user@lucene.apache.org
Subject: Re: How to delete documents from SOLR index using DIH

Thanks Erick. Your solution does make sense. Actually, I wanted to know how
to use delete via query or unique id through DIH.

Is there any specific query to be mentioned in data-config.xml? Also, is
there any separate command, like "full-import" or "delta-import", for
deleting documents from the index?



On Thu, Aug 26, 2010 at 12:03 AM, Erick Erickson
wrote:

> I'm not sure what you mean here. You can delete via query or unique
id. But
> DIH really isn't relevant here.
>
> If you've defined a unique key, simply re-adding any changed documents
will
> delete the old one and insert the new document.
>
> If this makes no sense, could you explain what the underlying problem
> you're
> trying to solve is?
>
> HTH
> Erick
>
> On Tue, Aug 24, 2010 at 8:56 PM, Pawan Darira wrote:
>
> > Hi
> >
> > I am using data import handler to build index. How can i delete
documents
> > from my index using DIH.
> >
> > --
> > Thanks,
> > Pawan Darira
> >
>



-- 
Thanks,
Pawan Darira


TurkishLowerCaseFilterFactory

2010-08-26 Thread Yavuz Selim YILMAZ
Is there a version of Solr which has TurkishLowerCaseFilterFactory.java?
I downloaded the 1.4.1 version of Solr, but it doesn't have it.
--

Yavuz Selim YILMAZ


Re: How to delete documents from SOLR index using DIH

2010-08-26 Thread Grijesh.singh

DIH is basically not for deletion; it is for inserting data into the index.
It does have a parameter "clean", which is true by default and cleans the
index every time the full-import command is issued, i.e., it creates the
index from scratch.

If your requirement is to delete the whole index, you can also use:
http://localhost:8080/solr/update?stream.body=<delete><query>*:*</query></delete>
http://localhost:8080/solr/update?stream.body=<commit/>

If your requirement is to delete data from the index selectively, change
the above query accordingly:
http://localhost:8080/solr/update?stream.body=<delete><query>adId:1002</query></delete>
http://localhost:8080/solr/update?stream.body=<commit/>
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-delete-documents-from-SOLR-index-using-DIH-tp1323794p1346743.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Regd WSTX EOFException

2010-08-26 Thread Pooja Verlani
Hi,
The client being used is php curl.
Could that be a problem?
On Wed, Aug 25, 2010 at 7:10 PM, Yonik Seeley
 wrote:
> On Wed, Aug 25, 2010 at 6:41 AM, Pooja Verlani  
> wrote:
>> Hi,
>> Sometimes while indexing to solr, I am getting  the following exception.
>> "com.ctc.wstx.exc.WstxEOFException: Unexpected end of input block in end tag"
>> I think its some configuration issue. Kindly suggest.
>>
>> I have a solr working with Tomcat 6
>
> Sounds like the input is sometimes being truncated (or corrupted) when
> it's sent to solr.
> What client are you using?
>
> -Yonik
> http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
>