Re: Index

2011-07-29 Thread Pranav Prakash
Every indexed document has to have a unique ID associated with it. You can
do a search by ID, something like:

http://localhost:/solr/select?q=id:X

If you see a result, then the document has been indexed and is searchable.

You might also want to check Luke (http://code.google.com/p/luke) to gain
more insight about the index.

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Fri, Jul 29, 2011 at 03:40, GAURAV PAREEK gauravpareek2...@gmail.com wrote:

 Yes Nick, you are correct:
 how can you check whether it has been indexed by Solr, and is searchable?

 On Fri, Jul 29, 2011 at 3:27 AM, Nicholas Chase nch...@earthlink.net
 wrote:

  Do you mean, how can you check whether it has been indexed by solr, and
 is
  searchable?
 
    Nick
 
 
  On 7/28/2011 5:45 PM, GAURAV PAREEK wrote:
 
  Hi All,
 
  How can we check that a particular file is not indexed in Solr?
 
  Regards,
  Gaurav
 
 



Solr 3.2.0 is not writing log

2011-07-29 Thread Ruixiang Zhang
I'm using Solr 1.4 with Jetty for my site; it writes logs into files in
example/logs.

Now I'm testing Solr 3.2.0 with Jetty on another server, but no log is
written into that folder: example/logs. It is always empty.

Do I need to do something to turn on the log? Any hint will be appreciated.


Ruixiang


Re: convert date format at indexing time

2011-07-29 Thread PacoPeralta
Please, is there any suggestion on this?

Thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/convert-date-format-at-indexing-time-tp3191078p3208989.html
Sent from the Solr - User mailing list archive at Nabble.com.


Auto-Commit and failures / schema violations

2011-07-29 Thread Dirk Högemann
Hello,

we are running a large CMS with multiple customers and we are now going to use
Solr for our search and indexing tasks.
As we have a lot of users working simultaneously on the CMS, we decided not to
commit our changes programmatically (we use StreamingUpdateSolrServer) on each
add. Instead we are using the autocommit functions in solrconfig.xml.

To be reliable we write timestamp files on each add of a document to the
StreamingUpdateSolrServer. (In case of a crash we could restart indexing from
that timestamp.)
Unfortunately we don't know how to be sure that the add was successful, as
(for example) schema violations seem to be detected on commit, which is
therefore too late, as the timestamp is usually already overwritten by then.

So: are there any valid approaches to be sure that an add of a document has
been processed successfully?
Maybe: is it better to collect a list of documents to add and commit these,
instead of using the auto-commit function?
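
A sketch of that batching idea (plain SolrJ, 1.4/3.x API assumed; the URL and
field names are placeholders). Unlike StreamingUpdateSolrServer, a synchronous
server like CommonsHttpSolrServer should report failures on the add request
itself, per batch:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchAdd {
  public static void main(String[] args) throws Exception {
    // placeholder URL
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1"); // illustrative field
    batch.add(doc);

    try {
      server.add(batch);  // schema violations should surface here, per batch...
      server.commit();    // ...so the commit is no longer the first failure point
      // safe to advance the timestamp file for this batch now
    } catch (Exception e) { // SolrServerException / SolrException
      // do not advance the timestamp; re-queue or inspect the batch
    }
  }
}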

Thanks in advance for any help!
Dirk Högemann


slow highlighting because of stemming

2011-07-29 Thread Orosz György
Dear all,

I am quite new to using Solr, but would like to ask for your help.
I am developing an application which should be able to highlight the results
of a query. For this I am using the regex fragmenter:
<highlighting>
  <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">500</int>
      <float name="hl.regex.slop">0.5</float>
      <str name="hl.pre"><![CDATA[<b>]]></str>
      <str name="hl.post"><![CDATA[</b>]]></str>
      <str name="hl.useFastVectorHighlighter">true</str>
      <str name="hl.regex.pattern">[-\w ,/\n\']{20,300}[.?!]</str>
      <str name="hl.fl">dokumentum_syn_query</str>
    </lst>
  </fragmenter>
</highlighting>
The field is indexed with term vectors and offsets:
<field name="dokumentum_syn_query" type="huntext_syn" indexed="true"
       stored="true" multiValued="true" termVectors="on" termPositions="on"
       termOffsets="on"/>
<fieldType name="huntext_syn" class="solr.TextField" stored="true"
           indexed="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.morphologic.solr.huntoken.HunTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_query.txt" enablePositionIncrements="true"/>
    <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
            lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex"
            cache="alma"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_query.txt" enablePositionIncrements="true"/>
    <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
            lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex"
            cache="alma"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms_query.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The highlighting works well, except that it's really slow. I realized that
this is because the highlighter/fragmenter does stemming for all the result
documents again.

Could you please help me understand why this happens and how I can avoid it?
(I thought that using FastVectorHighlighter would solve my problem, but it
didn't.)

Thanks in advance!
Gyuri Orosz


Re: [WARNING] Index corruption and crashes in Apache Lucene Core / Apache Solr with Java 7

2011-07-29 Thread Sanne Grinovero
Hello,
thanks for the warning, that's a pretty nasty bug.

A patch was made for OpenJDK, if anybody is interested to try it out
that would be great:
http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/rev/4e761e7e6e12

Regards,
Sanne

2011/7/28 Uwe Schindler uschind...@apache.org:
 Hello Apache Lucene & Apache Solr users,
 Hello users of other Java-based Apache projects,

 Oracle released Java 7 today. Unfortunately it contains hotspot compiler
 optimizations, which miscompile some loops. This can affect code of several
 Apache projects. Sometimes JVMs only crash, but in several cases, results
 calculated can be incorrect, leading to bugs in applications (see Hotspot
 bugs 7070134 [1], 7044738 [2], 7068051 [3]).

 Apache Lucene Core and Apache Solr are two Apache projects, which are
 affected by these bugs, namely all versions released until today. Solr users
 with the default configuration will have Java crashing with SIGSEGV as soon
 as they start to index documents, as one affected part is the well-known
 Porter stemmer (see LUCENE-3335 [4]). Other loops in Lucene may be
 miscompiled, too, leading to index corruption (especially on Lucene trunk
 with pulsing codec; other loops may be affected, too - LUCENE-3346 [5]).

 These problems were detected only 5 days before the official Java 7 release,
 so Oracle had no time to fix those bugs, affecting also many more
 applications. In response to our questions, they proposed to include the
 fixes into service release u2 (eventually into service release u1, see [6]).
 This means you cannot use Apache Lucene/Solr with Java 7 releases before
 Update 2! If you do, please don't open bug reports, it is not the
 committers' fault! At least disable loop optimizations using the
 -XX:-UseLoopPredicate JVM option to not risk index corruptions.

 Please note: Also Java 6 users are affected, if they use one of those JVM
 options, which are not enabled by default: -XX:+OptimizeStringConcat or
 -XX:+AggressiveOpts

 It is strongly recommended not to use any hotspot optimization switches in
 any Java version without extensive testing!

 In case you upgrade to Java 7, remember that you may have to reindex, as the
 unicode version shipped with Java 7 changed and tokenization behaves
 differently (e.g. lowercasing). For more information, read
 JRE_VERSION_MIGRATION.txt in your distribution package!

 On behalf of the Lucene project,
 Uwe

 [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7070134
 [2] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7044738
 [3] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7068051
 [4] https://issues.apache.org/jira/browse/LUCENE-3335
 [5] https://issues.apache.org/jira/browse/LUCENE-3346
 [6] http://s.apache.org/StQ

 -
 Uwe Schindler
 uschind...@apache.org
 Apache Lucene PMC Member / Committer
 Bremen, Germany
 http://lucene.apache.org/



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




AUTO: Ryan J Minniear is out of the office. (returning 08/01/2011)

2011-07-29 Thread Ryan J Minniear


I am out of the office until 08/01/2011.

I will respond to your message when I return.

Please contact Robert Guthrie for any urgent issues.


Note: This is an automated response to your message "Solr 3.2.0 is not
writing log" sent on 7/29/11 2:08:07.

This is the only notification you will receive while this person is away.

Updating opinion

2011-07-29 Thread roySolr
Hello,

I want some opinions on the updating process of my application.

Users can edit their own data. This data will be validated and must
be updated every 24 hours. I want to do this at night (0:00).

Now let's say 50,000 documents are edited. The delta import will
take ~20 minutes, so the indexing process is ready at 0:20. Some
data depends on the day, so the index has wrong data for 20 minutes.

Now I thought I could fix this problem this way:

I can do a delta import every hour without a commit. I do this 24 times, and
at the end of the day I do a commit and optimize the index. Is this possible?
Is it faster to do the updates in parts?
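
For reference, DIH accepts a commit flag on the request, so an hourly delta
pass without an immediate commit might look like this (host/port assumed):

http://localhost:8983/solr/dataimport?command=delta-import&commit=false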



 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Updating-opinion-tp3209251p3209251.html
Sent from the Solr - User mailing list archive at Nabble.com.


Query on multi valued field

2011-07-29 Thread rajini maski
Hi All,

I have a specific requirement for a multi-valued field. The requirement is
as follows:

There is a multivalued field in each document which can have multiple
elements or a single element.

For example, consider that the following are the documents matched for, say,
q=*:*

*DOC1*

<doc>
  <arr name="multi">
    <str>1</str>
  </arr>
</doc>

*DOC2*

<doc>
  <arr name="multi">
    <str>1</str>
    <str>3</str>
    <str>4</str>
  </arr>
</doc>

*DOC3*

<doc>
  <arr name="multi">
    <str>1</str>
    <str>2</str>
  </arr>
</doc>

The query is: get only those documents which have multiple elements for
that multivalued field.

I.e., docs 2 and 3 should be returned from the above set.

Is there any way to achieve this?


Awaiting reply,

Thanks & Regards,
Rajani


Combine XML data with DIH

2011-07-29 Thread O. Klein
I have folder with XML files

1.xml contains:
<id>http://www.site.com/1.html</id>
<link>http://www.othersite.com/2.html</link>
<content>bla1</content>

2.xml contains:
<id>http://www.othersite.com/2.html</id>
<content>bla2</content>

I want to create this document in Solr:

<id>http://www.site.com/1.html</id>
<content>bla2</content>

Can this be done with DIH? And how?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Combine-XML-data-with-DIH-tp3209413p3209413.html
Sent from the Solr - User mailing list archive at Nabble.com.


segment.gen file is not replicated

2011-07-29 Thread Bernd Fehling

Dear list,

is there a deeper logic behind why the segment.gen file is not
replicated with solr 3.2?

Is it obsolete because I have a single segment?

Regards,
Bernd



Re: Dealing with keyword stuffing

2011-07-29 Thread Pranav Prakash
Cool. So I used SweetSpotSimilarity with default params and I see some
improvements. However, I can still see some of the 'stuffed' documents
coming up in the results. I feel that SweetSpotSimilarity alone is not
enough. Going through
http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf I figured out
that there are other things (pivoted length normalization and term
frequency normalization) that need fine tuning too.

Should I create a custom Similarity class that overrides all the default
behavior? I guess that should help me get more relevant results. Where
should I start? Please don't assume the less obvious things, I
am still learning !! :-)
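
A starting point for such a class might look like this (a minimal sketch,
assuming Lucene 3.x contrib-misc's SweetSpotSimilarity and its setter
signatures; every tuning number below is a placeholder to calibrate against
your own field lengths):

import org.apache.lucene.misc.SweetSpotSimilarity;

public class AntiStuffingSimilarity extends SweetSpotSimilarity {
  public AntiStuffingSimilarity() {
    // cap the payoff from repeating a term over and over
    setBaselineTfFactors(1.0f, 0.5f);
    // fields of 3-20 terms score as "normal"; longer (stuffed) fields
    // get an increasing length penalty
    setLengthNormFactors(3, 20, 0.5f, true);
  }
}

It would be registered in schema.xml the same way as below, just with this
class name. Note that norms are baked in at index time, so a reindex is
needed after changing the length-norm settings.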

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Thu, Jul 28, 2011 at 17:03, Gora Mohanty g...@mimirtech.com wrote:

 On Thu, Jul 28, 2011 at 3:48 PM, Pranav Prakash pra...@gmail.com wrote:
 [...]
  I am not sure how to use SweetSpotSimilarity. I am googling on this, but
  any useful insights are so much appreciated.

 Replace the existing DefaultSimilarity class in schema.xml (look towards
 the bottom of the file) with the SweetSpotSimilarity class, e.g., have a
 line
 like:
  <similarity class="org.apache.lucene.search.SweetSpotSimilarity"/>

 Regards,
 Gora



Re: Combine XML data with DIH

2011-07-29 Thread O. Klein
To make it easier, I included an example config:

<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <entity name="file" rootEntity="false" dataSource="null"
            processor="FileListEntityProcessor" fileName="^.*\.xml$"
            recursive="false" baseDir="/srv/www/servers/crawler/files">
      <entity name="crawl" pk="id" dataSource="file"
              url="${file.fileAbsolutePath}" processor="XPathEntityProcessor"
              forEach="/doc" transformer="RegexTransformer">
        <field column="id" xpath="/doc/id"/>
        <field column="link" xpath="/doc/link"/>
        <field column="content" xpath="/doc/content"/>
      </entity>
    </entity>
  </document>
</dataConfig>


O. Klein wrote:
 
  I have a folder with XML files
  
  1.xml contains:
  <id>http://www.site.com/1.html</id>
  <link>http://www.othersite.com/2.html</link>
  <content>bla1</content>
  
  2.xml contains:
  <id>http://www.othersite.com/2.html</id>
  <content>bla2</content>
  
  I want to create this document in Solr:
  
  <id>http://www.site.com/1.html</id>
  <content>bla2</content>
  
  Can this be done with DIH? And how?
 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Combine-XML-data-with-DIH-tp3209413p3209664.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Updating opinion

2011-07-29 Thread Dyer, James
I would imagine if you're doing updates all day the commit might take a long 
time.  You could try it, though, and see if it works for you.  Another option, 
which will use more disk & memory, is to replicate all your data to another core 
just after midnight.  Then update the data all day long as you please (and 
commit) on the new core.  At the stroke of midnight the next day, swap cores.  
This way you can control (nearly) the exact moment the new data becomes public.

See http://wiki.apache.org/solr/CoreAdmin#SWAP
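
For reference, a swap through the CoreAdmin handler looks roughly like this
(host/port and core names assumed):

http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=ondeck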

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: roySolr [mailto:royrutten1...@gmail.com] 
Sent: Friday, July 29, 2011 5:36 AM
To: solr-user@lucene.apache.org
Subject: Updating opinion

Hello,

I want some opinions on the updating process of my application.

Users can edit their own data. This data will be validated and must
be updated every 24 hours. I want to do this at night (0:00).

Now let's say 50,000 documents are edited. The delta import will
take ~20 minutes, so the indexing process is ready at 0:20. Some
data depends on the day, so the index has wrong data for 20 minutes.

Now I thought I could fix this problem this way:

I can do a delta import every hour without a commit. I do this 24 times, and
at the end of the day I do a commit and optimize the index. Is this possible?
Is it faster to do the updates in parts?



 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Updating-opinion-tp3209251p3209251.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Updating opinion

2011-07-29 Thread Dyer, James
Although, now that I think more, you could probably get away with the 
commit-at-midnight option provided it doesn't take much time to warm a new 
searcher.  Another thing is that if you set a low merge factor you likely won't 
need to optimize.  The optimize usually takes a lot longer than the commit, so 
you want to avoid doing one if you can.  You still won't be able to guarantee 
the new documents are available right at the stroke of midnight, but you can 
probably usually be close.  If you need to be precise, you'll probably want to 
use 2 cores.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: Dyer, James 
Sent: Friday, July 29, 2011 8:58 AM
To: solr-user@lucene.apache.org
Subject: RE: Updating opinion

I would imagine if you're doing updates all day the commit might take a long 
time.  You could try it, though, and see if it works for you.  Another option, 
which will use more disk & memory, is to replicate all your data to another core 
just after midnight.  Then update the data all day long as you please (and 
commit) on the new core.  At the stroke of midnight the next day, swap cores.  
This way you can control (nearly) the exact moment the new data becomes public.

See http://wiki.apache.org/solr/CoreAdmin#SWAP

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: roySolr [mailto:royrutten1...@gmail.com] 
Sent: Friday, July 29, 2011 5:36 AM
To: solr-user@lucene.apache.org
Subject: Updating opinion

Hello,

I want some opinions on the updating process of my application.

Users can edit their own data. This data will be validated and must
be updated every 24 hours. I want to do this at night (0:00).

Now let's say 50,000 documents are edited. The delta import will
take ~20 minutes, so the indexing process is ready at 0:20. Some
data depends on the day, so the index has wrong data for 20 minutes.

Now I thought I could fix this problem this way:

I can do a delta import every hour without a commit. I do this 24 times, and
at the end of the day I do a commit and optimize the index. Is this possible?
Is it faster to do the updates in parts?



 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Updating-opinion-tp3209251p3209251.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Index time boosting with DIH

2011-07-29 Thread Bürkle, David
Thanks for the answer.
I want to share the configuration that worked for me (see the follow-up
question at the end): boosting a document on the basis of a field value at
index time.

It took me some time to figure out that, for row.get to work, I had to use
the column name (the one in the SELECT list), whereas for a put the field
name (or pseudo field name) works.

<dataConfig>
  <dataSource .../>
  <script><![CDATA[
    function BoostDoc(row) {
      if (row.get('SOME_COLUMN') == 'someValue') {
        row.put('$docBoost', 20);
      }
      return row;
    }
  ]]></script>

  <document name="mydoc">
    <entity name="myentity"
            transformer="script:BoostDoc"
            query="select ...">

      <field column="SOME_COLUMN" name="someField"/>
      ...

A follow-up question:
This is only working for non-wildcard queries for me (StandardRequestHandler
as well as edismax); for wildcard queries a constant score is returned.

Is there any way to get this setting working for wildcard queries as well?




-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] 
Sent: Donnerstag, 28. Juli 2011 12:37
To: solr-user@lucene.apache.org
Subject: Re: Index time boosting with DIH

On Thu, Jul 28, 2011 at 3:56 PM, Bürkle, David david.buer...@irix.ch wrote:

 Can someone point me to an example for using index time boosting with 
 the DataImportHandler.


You can use the special flag variable $docBoost to add an index time boost.

http://wiki.apache.org/solr/DataImportHandler#Special_Commands

--
Regards,
Shalin Shekhar Mangar.


Re: convert date format at indexing time

2011-07-29 Thread O. Klein
If you use DIH with TikaEntityProcessor you get the dates in a Solr-compatible
format, provided you use the dates stored in the metadata.

<dataSource type="BinURLDataSource" name="bin"/>
<entity name="tika" processor="TikaEntityProcessor" url="${crawl.id}"
        dataSource="bin" onError="continue" format="text">
  <field column="created" meta="true" name="creation_date"/>
</entity>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/convert-date-format-at-indexing-time-tp3191078p3209881.html
Sent from the Solr - User mailing list archive at Nabble.com.


combining xml and nutch index in solr

2011-07-29 Thread abhayd
hi 

I have an XML file which has url, category, subcategory, title kind of
details.

We crawl the URLs in the XML using Nutch. Is there any way for us to merge
both?

The schema will look like:

url
category
subcategory
title
crawl_data_summary_from_nutch
crawl_data_body_content_from_nutch

Any solution for this?

thanks
abhay


--
View this message in context: 
http://lucene.472066.n3.nabble.com/combining-xml-and-nutch-index-in-solr-tp3209911p3209911.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Combine XML data with DIH

2011-07-29 Thread abhayd
hi

I have never done this with XML files, but you can have multiple data sources
in a DIH config:

http://wiki.apache.org/solr/DataImportHandler#multipleds
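
Roughly like this (a sketch; the names and the JDBC attributes are
placeholders):

<dataConfig>
  <dataSource name="ds-files" type="FileDataSource"/>
  <dataSource name="ds-db" type="JdbcDataSource" driver="..." url="..."/>
  <document>
    <!-- each entity picks its source via the dataSource attribute -->
    <entity name="fromFiles" dataSource="ds-files" ...>...</entity>
    <entity name="fromDb" dataSource="ds-db" ...>...</entity>
  </document>
</dataConfig>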

abhay


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Combine-XML-data-with-DIH-tp3209413p3209933.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Combine XML data with DIH

2011-07-29 Thread O. Klein
Yeah, but how do I combine the two based on the value in link?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Combine-XML-data-with-DIH-tp3209413p3209983.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr Incremental Indexing

2011-07-29 Thread Mohammed Lateef Hussain
Hi

Need some help with a Solr incremental indexing approach.

I have built my Solr index using the SolrJ API and now want to update the
index whenever any change has been made in the database. My requirement is
not to use DB triggers to call any update events.

I want to update my index on the fly whenever my application updates any
record in the database.

Note: my indexing logic to get the required data from the DB is somewhat
complex and involves many tables.

Please suggest how I can proceed here.
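
One possible shape for that (a minimal SolrJ sketch, 1.4/3.x API assumed; the
server URL and the buildDocFromDb() helper are placeholders for your own
multi-table logic):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexUpdateHook {
  private final CommonsHttpSolrServer server;

  public IndexUpdateHook(String solrUrl) throws Exception {
    server = new CommonsHttpSolrServer(solrUrl); // e.g. "http://localhost:8983/solr"
  }

  // call after the application commits a record to the database
  public void onRecordUpdated(long recordId) throws Exception {
    SolrInputDocument doc = buildDocFromDb(recordId); // your complex DB query
    server.add(doc);   // replaces the old doc with the same uniqueKey
    server.commit();   // or batch/autoCommit to keep commit cost down
  }

  // placeholder: rebuild the full document for one record
  private SolrInputDocument buildDocFromDb(long recordId) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", recordId);
    // ... add the other fields your schema requires
    return doc;
  }
}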

Thanks
Lateef


RE: embeded solrj doesn't refresh index

2011-07-29 Thread Jianbin Dai
Thanks Marc.  
Guess I was not clear in my previous statement, so let me rephrase.

I use DIH to import data into Solr and do the indexing. Everything works fine.

I have another embedded Solr server pointing to the same index files. I use
embedded SolrJ to search the index.

So the first Solr is for indexing purposes; it can be turned off once the
indexing is done.

However, the changes in the index files do not show up through embedded SolrJ;
that is, once the new index is built, embedded SolrJ still returns the old
results. Only after I restart the embedded Solr server are the new changes
reflected through SolrJ. Embedded SolrJ behaves as if there were a cache that
it always consults first.

Thanks.

JB


-Original Message-
From: Marc Sturlese [mailto:marc.sturl...@gmail.com] 
Sent: Friday, July 22, 2011 1:57 AM
To: solr-user@lucene.apache.org
Subject: RE: embeded solrj doesn't refresh index

Are you indexing with full-import? If yes, and the resultant index has a
similar number of docs to the one you had before, try setting reopenReaders
to false in solrconfig.xml.
* You have to send the commit, of course.
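
For reference, in solrconfig.xml that setting lives in the mainIndex section,
e.g. (sketch): <reopenReaders>false</reopenReaders>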

--
View this message in context:
http://lucene.472066.n3.nabble.com/embeded-solrj-doesn-t-refresh-index-tp3184321p3190892.html
Sent from the Solr - User mailing list archive at Nabble.com.



dealing with so many different sorting options

2011-07-29 Thread Jason Toy
As I'm using Solr more and more, I'm finding that I need to do searches and
then order by new criteria. So I am constantly adding new fields into Solr
and then reindexing everything.

I want to know if adding all this data into Solr is the normal way to
deal with sorting. I'm finding that I have almost a whole copy of my
database in Solr.

Should I be pulling all the data out of Solr and then sorting in my database?
That solution seems like it would take too long.
Could/should I just move to Solr as my primary store, so I can query directly
against it without having to reindex all the time?


Right now we store about 50 million docs, but the size is growing pretty
fast and it is a pain to reindex everything every time I add a new column to
sort by.


Re: Exact match not the first result returned

2011-07-29 Thread Brian Lamb
I implemented both solutions Hoss suggested and was able to achieve the
desired results. I would like to go with

 defType=dismax & qf=myname & pf=myname_str^100 & q=Frank

but that doesn't seem to work if I have a query like myname:Frank
otherfield:something. So I think I will go with

q=+myname:Frank myname_str:Frank^100
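
For reference, the myname_str field used here was set up along these lines
(field/type names assumed):

<field name="myname_str" type="string" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="myname" dest="myname_str"/>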

Thanks for the help everyone!

Brian Lamb

On Wed, Jul 27, 2011 at 10:55 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : With your solution, RECORD 1 does appear at the top but I think that's
 : just blind luck more than anything else because RECORD 3 shows as having
 : the same score. So what more can I do to push RECORD 1 up to the top?
 : Ideally, I'd like all three records returned with RECORD 1 being the
 : first listing.

 with omitNorms RECORD1 and RECORD3 have the same score because only the
 tf() matters, and both docs contain the term frank exactly twice.

 the reason RECORD1 isn't scoring higher even though it contains (as you
 put it) a match for 'Frank' exactly is that, from a term perspective,
 RECORD1 doesn't actually match myname:Frank exactly, because there are in
 fact other terms in that field, since it's multivalued.

 one way to indicate that you *only* want documents where the entire field
 value matches your input (ie: RECORD1 but no other records) would be to
 use a StrField instead of a TextField, or an analyzer that doesn't split
 up tokens (ie: something using KeywordTokenizer).  that way a query on
 myname:Frank would not match a document where you had indexed the value
 Frank Stalone, but a query for myname:Frank Stalone would.

 in your case, you don't want *only* the exact field value matches, but you
 want them boosted, so you could do something like copyField myname into
 myname_str and then do...

  q=+myname:Frank myname_str:Frank^100

 ...in which case a match on myname is required, but a match on
 myname_str will greatly increase the score.

 dismax (and edismax) are really designed for situations like this...

  defType=dismax & qf=myname & pf=myname_str^100 & q=Frank



 -Hoss



Re: slow highlighting because of stemming

2011-07-29 Thread Mike Sokolov

I'm not sure I would identify stemming as the culprit here.

Do you have very large documents?  If so, there is a patch for FVH 
committed to limit the number of phrases it looks at; see 
hl.phraseLimit, but this won't be available until 3.4 is released.


You can also limit the amount of each document that is analyzed by the 
regular Highlighter using maxDocCharsToAnalyze (and maybe this applies 
to FVH? not sure)
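
(In Solr that limit is exposed as a request parameter, e.g.
&hl.maxAnalyzedChars=51200, 51200 being the usual default.)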


Using RegexFragmenter is also probably slower than something like 
SimpleFragmenter.


There is work to implement faster highlighting for Solr/Lucene, but it 
depends on some basic changes to the search architecture so it might be 
a while before that becomes available.  See 
https://issues.apache.org/jira/browse/LUCENE-3318 if you're interested 
in following that development.


-Mike

On 07/29/2011 04:55 AM, Orosz György wrote:

Dear all,

I am quite new to using Solr, but would like to ask for your help.
I am developing an application which should be able to highlight the results
of a query. For this I am using the regex fragmenter:

<highlighting>
  <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">500</int>
      <float name="hl.regex.slop">0.5</float>
      <str name="hl.pre"><![CDATA[<b>]]></str>
      <str name="hl.post"><![CDATA[</b>]]></str>
      <str name="hl.useFastVectorHighlighter">true</str>
      <str name="hl.regex.pattern">[-\w ,/\n\']{20,300}[.?!]</str>
      <str name="hl.fl">dokumentum_syn_query</str>
    </lst>
  </fragmenter>
</highlighting>

The field is indexed with term vectors and offsets:

<field name="dokumentum_syn_query" type="huntext_syn" indexed="true"
       stored="true" multiValued="true" termVectors="on" termPositions="on"
       termOffsets="on"/>
<fieldType name="huntext_syn" class="solr.TextField" stored="true"
           indexed="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.morphologic.solr.huntoken.HunTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_query.txt" enablePositionIncrements="true"/>
    <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
            lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex"
            cache="alma"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_query.txt" enablePositionIncrements="true"/>
    <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
            lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex"
            cache="alma"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms_query.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The highlighting works well, except that it's really slow. I realized that
this is because the highlighter/fragmenter does stemming for all the result
documents again.

Could you please help me understand why this happens and how I can avoid it?
(I thought that using FastVectorHighlighter would solve my problem, but it
didn't.)

Thanks in advance!
Gyuri Orosz

   


Error with Extracting PDF metadata

2011-07-29 Thread sabman
I am using Solr 3.3 and I am trying to extract and index meta data from PDF
files. I am using the DataImportHandler with the TikaEntityProcessor to add
the documents. Here is are the fields as defined in my schema.xml file:


<field name="title" type="text" indexed="true" stored="true"
       multiValued="false"/>
<field name="description" type="text" indexed="true" stored="true"
       multiValued="false"/>
<field name="date_published" type="string" indexed="false" stored="true"
       multiValued="false"/>
<field name="link" type="string" indexed="true" stored="true"
       multiValued="false" required="false"/>
<field name="imgName" type="string" indexed="false" stored="true"
       multiValued="false" required="false"/>
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true"
       multiValued="false"/>

So I suppose the metadata should be indexed and stored in fields prefixed
with attr_.

Here is how my data config file looks. It takes a source directory path from
a database and passes it to a FileListEntityProcessor, which passes each of
the PDF files found in the directory to the TikaEntityProcessor to extract
and index the content.

<entity onError="skip" name="fileSourcePaths" rootEntity="false"
        dataSource="dbSource" fileName=".*pdf"
        query="select path from file_sources">
  <entity name="fileSource" processor="FileListEntityProcessor"
          transformer="ThumbnailTransformer" baseDir="${fileSourcePaths.path}"
          recursive="true" rootEntity="false">
    <field name="link" column="fileAbsolutePath" thumbnail="true"/>
    <field name="imgName" column="imgName"/>
    <entity rootEntity="true" onError="abort" name="file"
            processor="TikaEntityProcessor" url="${fileSource.fileAbsolutePath}"
            dataSource="fileSource" format="text">
      <field column="resourceName" name="title" meta="true"/>
      <field column="Creation-Date" name="date_published" meta="true"/>
      <field column="text" name="description"/>
    </entity>
  </entity>
</entity>

It extracts the description and Creation-Date just fine, but it doesn't seem
to be extracting resourceName, and so there is no title field for the
documents when I query the index. This is weird because both Creation-Date
and resourceName are metadata. Also, none of the other possible metadata was
being stored under the attr_ fields. I came across some threads which said
there are known problems with using Tika 0.8, so I downloaded Tika 0.9 and
replaced 0.8 with it. I also upgraded pdfbox, jempbox and fontbox from 1.3
to 1.4.

I tested one of the PDFs separately with just Tika to see what metadata is
stored with the file. This is what I found:

Content-Length: 546459
Content-Type: application/pdf
Creation-Date: 2010-06-09T12:11:12Z
Last-Modified: 2010-06-09T14:53:38Z
created: Wed Jun 09 08:11:12 EDT 2010
creator: XSL Formatter V4.3 MR9a (4,3,2009,1022) for Windows
producer: Antenna House PDF Output Library 2.6.0 (Windows)
resourceName: Argentina.pdf
trapped: False
xmpTPg:NPages: 2


As you can see, it does have resourceName metadata. I tried indexing
again, but I got the same result: Creation-Date extracts and indexes just
fine, but not resourceName. Also, the rest of the attributes are not being
indexed under the attr_ fields.

What's going wrong?


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Error-with-Extracting-PDF-metadata-tp3210813p3210813.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Auto-Commit and failures / schema violations

2011-07-29 Thread Chris Hostetter

: sure that the add was successful, as (for example) schema violations 
: seem to be detected on commit, which is therefore too late, as the 

I have no idea what that statement means -- if you are getting an error, 
can you be specific as to what type of error you are getting? (ie: what is 
returned to the client, and what do you see in the logs)


-Hoss


Re: I can't pass the unit test when compile from apache-solr-3.3.0-src

2011-07-29 Thread Chris Hostetter

: I find that the junit tests always fail, and tell me 'BUILD FAILED',
: 
: but if I type 'ant dist', I can get an apache-solr-3.3-SNAPSHOT.war
: with no warning.
: 
: Is this a problem just for me?

Can you please be specific...
 * which test(s) fail for you?
 * what are the failures?

Any time a test fails, that info appears in the ant test output, and the 
full details of all tests are written to build/test-results.

You can run 'ant test-reports' from the solr directory to generate an HTML 
report of all the success/failure info.



-Hoss

Re: Solr versioning policy

2011-07-29 Thread Chris Hostetter

: 1. Is this the plan moving forward (to aim for a new minor release
: approximately every couple of months)?

The goal is to release minor versions more frequently as features and 
low-priority bug fixes become available.  If there is a high-priority bug fix 
available, and no likelihood of a near-term minor release, then bug-fix 
releases (ie: 3.4.1) will be done (as has always been the case).

This new accelerated minor-release approach is possible because 
of the parallel development branches approach that was instituted a while 
back, but once those branches were created it took some time to get the 
test/build/release processes automated enough that devs felt comfortable 
releasing more frequently.

There's no hard and fast rule about how often releases will happen.  Anyone 
can step up and push for a release if they feel the features are ready.

: 2. Will minor version increases always be backwards compatible (so I could
: upgrade from 3.x to 3.y where y > x without having to update the
: schema/config or rebuild the indexes)?

That has always been the goal, yes.  Sometimes the mechanism for dealing 
with new bugs/features requires making changes to config files and when 
known this is noted in the Upgrading section of CHANGES.txt for the 
affected release.




-Hoss


Re: omitNorms

2011-07-29 Thread Chris Hostetter

: my field category (string) has omitNorms=true and
: omitTermFreqAndPositions=true.
: I have indexed all docs, but when I do a search like:
: http://xxx:xxx/solr/select/?q=category:A&debugQuery=on
: I see there's normalization and idf and tf. Why? I can't understand the
: reason.

those options ensure that that information isn't calculated and stored in 
your index, so they don't affect searches, but the debugging code still 
shows where the norms/tf (which don't exist for those fields) are part of 
the score calculation.

You'll note that they are always 1 in this debug info, making them 
no-ops in the multiplication...

: 8.676225 = (MATCH) fieldWeight(category:A in 826), product of:
:   1.0 = tf(termFreq(category:A)=1)
:   8.676225 = idf(docFreq=6978, maxDocs=15049953)
:   1.0 = fieldNorm(field=category, doc=826)


-Hoss


Re: Display term frequency / phrase freqency for documents

2011-07-29 Thread Chris Hostetter

: I'd like to expose the termFrequency / phraseFrequency to the end user in my
: application. For example I would like to be able to say Your search term
: appears X times in this document.
: 
: I can see these figures exposed via debugQuery=on, where I get output like
...
: Is there any way to expose these figures in XML nodes though? I could parse
: them from the debug output but that feels very hacky!

http://wiki.apache.org/solr/CommonQueryParameters#debug.explain.structured
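
For example (host/port assumed):

http://localhost:8983/solr/select?q=your+term&debugQuery=on&debug.explain.structured=true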

-Hoss


Re: Disabling Coord on Solr queries

2011-07-29 Thread Chris Hostetter

: I am looking for the simplest way to disable coord in Solr queries.  I have
: found out that Lucene allows this via construction of a BooleanQuery with
: disableCoord=true:
: public *BooleanQuery*(boolean disableCoord)
: 
: Is there any way to activate this functionality directly from a Solr query?

Not that I know of, but if you'd like to open a jira issue it would 
probably be fairly easy to add this to the LuceneQParser so you could do 
something like...

  q={!lucene coord=false}my boolean query


-Hoss


Looking for a senior search engineer

2011-07-29 Thread Michael Economy
Hi,

Sorry if this isn't the right place for this message, but it's a very
specific role we're looking for and I'm not sure where else to find
solr experts!


I was wondering if anyone would be interested, or knew anyone who
would be interested in working on goodreads.com's search:


We're using Solr, and we'd like someone with experience doing:
solr-replication
faceted search
more cool stuff

We run ruby on rails for the website.  Potential applicants don't need
to know ruby or rails, but they'd be expected to pick it up after
starting.

More info on our website:
http://www.goodreads.com/about/us



Michael Economy
Director Engineering, Goodreads Inc.


Re: dih fetching but not adding records to index

2011-07-29 Thread abhayd
quick question

if I want to load just the document with id=2, how would that work?

I tried an xpath expression that works with xpath tools, but not in Solr. How
would I do this?

<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="c:\temp" fileName="promotions.xml"
            recursive="false" rootEntity="false" dataSource="null">
      <entity name="x" processor="XPathEntityProcessor"
              forEach="/add/doc" url="${f.fileAbsolutePath}" pk="id">
        <field column="id" xpath="/add/doc/[id=2]/id"/>
      </entity>
    </entity>
  </document>
</dataConfig>

--
View this message in context: 
http://lucene.472066.n3.nabble.com/dih-fetching-but-not-adding-records-to-index-tp3189438p3211083.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: I found a sorting bug in solr/lucene

2011-07-29 Thread Chris Hostetter

: According to that bug list, there are other characters that break the
: sorting function.  Is there a list of safe characters I can use as a
: delimiter?

the safest field names to use (and most efficient to parse when sorting) 
are things that follow the identifier semantics in Java (not including the 
$ character at the beginning) ...

http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Character.html#isJavaIdentifierStart%28char%29
http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Character.html#isJavaIdentifierPart%28char%29

So sorts like foo_bar_baz asc will definitely work, and are heavily 
tested.

I've just posted a patch to SOLR-2606 that should fix the foo:bar asc 
and foo-bar asc situations, but because of the function query sort 
parsing that happens first, they will always be slightly slower to parse.


-Hoss