Indexing Failed rolled back

2011-01-25 Thread Dinesh

i did some research on the schema and the DIH config file and created my own
DIH config; i'm getting this error when i run a full-import:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">try.xml</str>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Time Elapsed">0:0:0.163</str>
    <str name="Total Requests made to DataSource">0</str>
    <str name="Total Rows Fetched">1</str>
    <str name="Total Documents Processed">0</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2011-01-25 13:56:48</str>
    <str name="">Indexing failed. Rolled back all changes.</str>
    <str name="Rolledback">2011-01-25 13:56:48</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>

-
DINESHKUMAR . M
I am neither especially clever nor especially gifted. I am only very, very
curious.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-Failed-rolled-back-tp2327412p2327412.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH serialize

2011-01-25 Thread Stefan Matheis
Rich,

i played around for a few minutes with Script-Transformers, but i don't have
enough knowledge to get anything done right now :/
My idea was: loop over the given row, which should be a Java HashMap or
something like that, and do sth like this (pseudo-code):

var row_data = [];
for( var key in row )
{
  row_data.push( '"' + key + '" : "' + row[key] + '"' );
}
row.put( 'whatever_field', '{' + row_data.join( ',' ) + '}' );

Which should result in a json-object like {'key1':'value1', 'key2':'value2'}
- and that should be okay to work with?
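
A rough, untested sketch of how that could look in a DIH config with a
ScriptTransformer - the entity name, query and target field are placeholders
(main_timetable is just the field from your schema), and the values aren't
escaped, so it's only JSON-ish, not a real PHP serialize():

<dataConfig>
  <script><![CDATA[
    // row is a Java Map; build a JSON-ish string from all of its columns
    function serializeRow(row) {
      var keys = row.keySet().toArray();
      var parts = [];
      for (var i = 0; i < keys.length; i++) {
        parts.push('"' + keys[i] + '":"' + row.get(keys[i]) + '"');
      }
      row.put('whatever_field', '{' + parts.join(',') + '}');
      return row;
    }
  ]]></script>
  <!-- dataSource etc. omitted -->
  <document>
    <entity name="item" query="SELECT * FROM your_table"
            transformer="script:serializeRow">
      <field column="whatever_field" name="main_timetable"/>
    </entity>
  </document>
</dataConfig>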

Regards
Stefan

On Mon, Jan 24, 2011 at 7:53 PM, Papp Richard ccode...@gmail.com wrote:

 Hi Stefan,

  yes, this is exactly what I intend - I don't want to search in this field
 - just quickly return me the result in a serialized form (the search criteria
 is on other fields). Well, if I could serialize the data exactly like
 PHP serialize() does I would be maximally satisfied, but any other form in
 which I could compact the data easily into one field I would be pleased.
  Can anyone help me? I guess the script is quite a good way, but I don't
 know which function should I use there to compact the data to be easily
 usable in PHP. Or any other method?

 thanks,
  Rich

 -Original Message-
 From: Stefan Matheis [mailto:matheis.ste...@googlemail.com]
 Sent: Monday, January 24, 2011 18:23
 To: solr-user@lucene.apache.org
 Subject: Re: DIH serialize

 Hi Rich,

 i'm a bit confused after reading your post .. what exactly you wanna try to
 achieve? Serializing (like http://php.net/serialize) your complete row
 into
 one field? Don't wanna search in them, just store and deliver them in your
 results? Does that make sense? Sounds a bit strange :)

 Regards
 Stefan

 On Mon, Jan 24, 2011 at 10:03 AM, Papp Richard ccode...@gmail.com wrote:

  Hi Dennis,
 
   thank you for your answer, but didn't understand why you say it doesn't
  need serialization. I'm with the option C.
   but the main question is, how to put into one field a result of many
  fields: SELECT * FROM.
 
  thanks,
   Rich
 
  -Original Message-
  From: Dennis Gearon [mailto:gear...@sbcglobal.net]
  Sent: Monday, January 24, 2011 02:07
  To: solr-user@lucene.apache.org
  Subject: Re: DIH serialize
 
  Depends on your process chain to the eventual viewer/consumer of the
 data.
 
  The questions to ask are:
   A/ Is the data IN Solr going to be viewed or processed in its original form?
      --set stored=true
      --no serialization needed.
   B/ If it's going to be analyzed and searched for separately from any other field,
      the analyzing will put it into an unreadable form. If you need to see it, then
      --set indexed=true and stored=true
      --no serialization needed.
   C/ If it's NOT going to be viewed AS IS, and it's not going to be searched for AS IS
      (i.e. other columns will be how the data is found), and you have another,
      serializable format:
      --set indexed=false and stored=true
      --serialize AS PER THE INTENDED APPLICATION,
        not sure that Solr can do that at all.
   D/ If it's NOT going to be viewed AS IS, BUT it's going to be searched for AS IS
      (this column will be how the data is found), and you have another,
      serializable format:
      --you need to put it into TWO columns
      --A SERIALIZED FIELD
        --set indexed=false and stored=true
      --AN UNSERIALIZED FIELD
        --set indexed=true and stored=true
      --serialize AS PER THE INTENDED APPLICATION,
        not sure that Solr can do that at all.
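
   As a rough illustration of the two-field idea in D/ (field names made up;
   the serializing itself would still happen in your indexing code, not
   inside Solr):

     <field name="timetable_search" type="text"   indexed="true"  stored="true"/>
     <field name="timetable_blob"   type="string" indexed="false" stored="true"/>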
 
  Hope that helps!
 
 
  Dennis Gearon
 
 
  Signature Warning
  
  It is always a good idea to learn from your own mistakes. It is usually a
  better
  idea to learn from others' mistakes, so you do not have to make them
  yourself.
  from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'
 
 
  EARTH has a Right To Life,
  otherwise we all die.
 
 
 
  - Original Message 
  From: Papp Richard ccode...@gmail.com
  To: solr-user@lucene.apache.org
  Sent: Sun, January 23, 2011 2:02:05 PM
  Subject: DIH serialize
 
  Hi all,
 
 
 
   I wasted the last few hours trying to serialize some column values (from
  mysql) into a Solr column, but I just can't find such a function. I'll
 use
  the value in PHP - I don't know if it is possible to serialize in PHP
 style
  at all. This is what I tried and works with a given factor:
 
 
 
  in schema.xml:
 
    <field name="main_timetable" type="text" indexed="false"
           stored="true" multiValued="true" />
 
 
 
  in DIH xml:
 
 
 
  <dataConfig>
 
   

Re: synonyms file, and example cases

2011-01-25 Thread Stefan Matheis
Cam,

the examples with the provided inline-documentation should help you, no?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

The backslash \ in that context looks like an escape character, to avoid
the => being interpreted as the mapping separator.
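
For example (as I read the wiki page), a mapping like

  a\=>a => b\=>b

maps the literal token a=>a onto b=>b, instead of being split on the inner =>.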

Regards
Stefan

On Tue, Jan 25, 2011 at 2:31 AM, Cam Bazz camb...@gmail.com wrote:

 Hello,

 I have been looking at the solr synonym file that was an example, I
 did not understand some notation:

 aaa =>

 bbb => 1 2

 ccc => 1,2

 a\=>a => b\=>b

 a\,a => b\,b

 fooaaa,baraaa,bazaaa

 The first one says search for  when query is aaa. am I correct?
 the second one finds 1 2 when query is bbb
 the third one is find 1 or 2 when query is ccc

 the fourth, and fifth one I have not understood.

 the last one, i assume is a group, bidirectional mapping between
 fooaaa,baraaa,bazaaa

 I am especially interested with this last one, if I do aaa,bbb it will
 find aaa and bbb when either aaa or bbb is queryied?

 am I correct in those assumptions?

 Best regards,
 C.B.



Performance optimization of Proximity/Wildcard searches

2011-01-25 Thread Salman Akram
Hi,

I am facing performance issues in three types of queries (and their
combination). Some of the queries take more than 2-3 mins. Index size is
around 150GB.


   - Wildcard
   - Proximity
   - Phrases (with common words)

I know CommonGrams and Stop words are a good way to resolve such issues but
they don't fulfill our functional requirements (Common Grams seem to have
issues with phrase proximity, stop words have issues with exact match etc).

Sharding is an option too but that too comes with limitations so want to
keep that as a last resort but I think there must be other things coz 150GB
is not too big for one drive/server with 32GB Ram.

Cache warming is a good option too but the index get updated every hour so
not sure how much would that help.

What are the other main tips that can help in performance optimization of
the above queries?

Thanks

-- 
Regards,

Salman Akram


Re: please help Problem with dataImportHandler

2011-01-25 Thread Stefan Matheis
Caused by: org.xml.sax.SAXParseException: Element type "field" must be
followed by either attribute specifications, ">" or "/>".

Sounds like invalid XML in your .. dataimport-config?
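
i.e. every field element has to be closed properly, something like (a generic
example, not your actual columns):

  <field column="some_column" name="some_field" />

a missing ">" or "/>" at the end of such an element produces exactly that
SAXParseException.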

On Tue, Jan 25, 2011 at 5:41 AM, Dinesh mdineshkuma...@karunya.edu.inwrote:


 http://pastebin.com/tjCs5dHm

 this is the log produced by the solr server

 -
 DINESHKUMAR . M
 I am neither especially clever nor especially gifted. I am only very, very
 curious.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/please-help-Problem-with-dataImportHandler-tp2318585p2326659.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: please help Problem with dataImportHandler

2011-01-25 Thread Dinesh

ya after correcting it also it is throwing an exception

-
DINESHKUMAR . M
I am neither especially clever nor especially gifted. I am only very, very
curious.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/please-help-Problem-with-dataImportHandler-tp2318585p2327662.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Getting started with writing parser

2011-01-25 Thread Gora Mohanty
On Tue, Jan 25, 2011 at 10:05 AM, Dinesh mdineshkuma...@karunya.edu.in wrote:

 http://pastebin.com/CkxrEh6h

 this is my sample log
[...]

And, which portions of the log text do you want to preserve?
Does it go into Solr as a single error message, or do you want
to separate out parts of it?

Regards,
Gora


Re: Getting started with writing parser

2011-01-25 Thread Dinesh

i want to take the month, time, DHCPMESSAGE, from_mac, gateway_ip, net_ADDR

-
DINESHKUMAR . M
I am neither especially clever nor especially gifted. I am only very, very
curious.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-started-with-writing-parser-tp2278092p2327738.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: please help Problem with dataImportHandler

2011-01-25 Thread Dinesh

http://lucene.472066.n3.nabble.com/Getting-started-with-writing-parser-tp2278092p2327738.html

this thread explains my problem

-
DINESHKUMAR . M
I am neither especially clever nor especially gifted. I am only very, very
curious.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/please-help-Problem-with-dataImportHandler-tp2318585p2327745.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Getting started with writing parser

2011-01-25 Thread Gora Mohanty
On Tue, Jan 25, 2011 at 11:44 AM, Dinesh mdineshkuma...@karunya.edu.in wrote:

 i don't even know whether the regex expression that i'm using for my log is
 correct or no..

If it is the same try.xml that you posted earlier, it is very likely not
going to work. You seem to have just cut and pasted entries from
the Hathi Trust blog, without understanding how they work.

Could you take a fresh look at http://wiki.apache.org/solr/DataImportHandler
and explain in words the following:
* What is your directory structure for storing the log files?
* What parts of the log file do you want to keep (you have already explained
  this in another message)?
* How would the above translate into:
  - A Solr schema
  - Setting up (a) a data source, (b) processor(s), and (c) transformers.

i very much worried i couldn't proceed in my 
 project already
 1/3 rd of the timing is over.. please help.. this is just the first stage..
 after this i have ti setup up all the log to be redirected to SYSLOG and
 from there i'll send it to SOLR server.. then i have to analyse all the
 data's that i obtained from DNS, DHCP, WIFI, SWITCES.. and i have to prepare
 a user based report on his actions.. please help me cause the day's i have
 keeps reducing.. my project leader is questioning me a lot.. pls..
[...]

Well, I am sorry, but at least I strongly feel that we should
not be doing your work for you, and especially not if it is a
student project, as seems to be the case.

If you can address the above points one by one (stay on
this thread, please), people should be able to help you.
However, it is up to you to get to understand Solr well
enough.

Regards,
Gora


Re: Getting started with writing parser

2011-01-25 Thread Dinesh

no i actually changed the directory to mine where i stored the log files.. it
is /home/exam/apa..solr/example/exampledocs

i specified it in a solr schema.. i created an DataImportHandler for that in
try.xml.. then in that i changed that file name to sample.txt

that new try.xml is
http://pastebin.com/pfVVA7Hs

i changed the log into one word per line thinking there might be error in my
regex expression.. now i'm completely stuck..

-
DINESHKUMAR . M
I am neither especially clever nor especially gifted. I am only very, very
curious.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-started-with-writing-parser-tp2278092p2327920.html
Sent from the Solr - User mailing list archive at Nabble.com.


Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor

Hi,

I posted a question in November last year about indexing content from 
multiple binary files into a single Solr document and Jayendra responded 
with a simple solution to zip them up and send that single file to Solr.


I understand that the Tika 0.4 JARs supplied with Solr 1.4.1 don't 
currently allow this to work and only the file names of the zipped files 
are indexed (and not their contents).


I've tried downloading and building the latest Tika (0.8) and replacing 
the tika-parsers and tika-core JARS in 
solr-root\contrib\extraction\lib but this still isn't indexing the 
file contents, and now doesn't even index the file names!


Is there a version of Tika that works with the Solr 1.4.1 released 
distribution which does index the contents of the zipped files?


Thanks and kind regards,
Gary



DIH From various File system locations

2011-01-25 Thread pankaj bhatt
Hi All,
 I need to index the documents presents in my file system at various
locations (e.g. C:\docs , d:\docs ).
Is there any way through which i can specify this in my DIH
Configuration.
Here is my configuration:-

<document>
  <entity name="sd"
          processor="FileListEntityProcessor"
          fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
          baseDir="G:\\Desktop\\"
          recursive="false"
          rootEntity="true"
          transformer="DateFormatTransformer"
          onerror="continue">
    <entity name="tikatest"
            processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
            url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
      <field column="Author" name="author" meta="true"/>
      <field column="Content-Type" name="title" meta="true"/>
      <!-- <field column="title" name="title" meta="true"/> -->
      <field column="text" name="all_text"/>
    </entity>

    <!-- <field column="fileLastModified" name="date"
         dateTimeFormat="-MM-dd'T'hh:mm:ss" /> -->
    <field column="fileSize" name="size"/>
    <field column="file" name="filename"/>
  </entity>
  <!-- baseDir="../site" -->
</document>

/ Pankaj Bhatt.


Re: Performance optimization of Proximity/Wildcard searches

2011-01-25 Thread Toke Eskildsen
On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:
 Cache warming is a good option too but the index get updated every hour so
 not sure how much would that help.

What is the time difference between queries with a warmed index and a
cold one? If the warmed index performs satisfactory, then one answer is
to upgrade your underlying storage. As always for IO-caused performance
problem in Lucene/Solr-land, SSD is the answer.



Recommendation on RAM-/Cache configuration

2011-01-25 Thread Martin Grotzke
Hi,

recently we're experiencing OOMEs (GC overhead limit exceeded) in our
searches. Therefore I want to get some clarification on heap and cache
configuration.

This is the situation:
- Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit
- JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G
-XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC
-XX:+UseParallelGC
- The machine has 32 GB RAM
- Currently there are 4 processors/cores in the machine, this shall be
changed to 2 cores in the future.
- The index size in the filesystem is ~9.5 GB
- The index contains ~ 5.500.000 documents
- 1.500.000 of those docs are available for searches/queries, the rest are
inactive docs that are excluded from searches (via a flag/field), but
they're still stored in the index as need to be available by id (solr is the
main document store in this app)
- Caches are configured with a big size (the idea was to prevent filesystem
access / disk i/o as much as possible):
  - filterCache (solr.LRUCache): size=20, initialSize=3,
autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99
  - documentCache (solr.LRUCache): size=20, initialSize=10,
autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74
  - queryResultCache (solr.LRUCache): size=20, initialSize=3,
autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71
- Searches are performed using a catchall text field using standard request
handler, all fields are fetched (no fl specified)
- Normally ~ 5 concurrent requests, peaks up to 30 or 40 (mostly during GC)
- Recently we also added a feature that adds weighted search for special
fields, so that the query might become s.th. like this
  q=(some query) OR name_weighted:(some query)^2.0 OR brand_weighted:(some
query)^4.0 OR longDescription_weighted:(some query)^0.5
  (it seemed as if this was the cause of the OOMEs, but IMHO it only
increased RAM usage so that now GC could not free enough RAM)

The OOMEs that we get are of type GC overhead limit exceeded, one of the
OOMEs was thrown during auto-warming.

I checked two different heapdumps, the first one autogenerated
(by -XX:+HeapDumpOnOutOfMemoryError) the second one generated manually via
jmap.
These show the following distribution of used memory - the autogenerated
dump:
 - documentCache: 56% (size ~ 195.000)
- filterCache: 15% (size ~ 60.000)
- queryResultCache: 8% (size ~ 61.000)
- fieldCache: 6% (fieldCache referenced  by WebappClassLoader)
- SolrIndexSearcher: 2%

The manually generated dump:
- documentCache: 48% (size ~ 195.000)
- filterCache: 20% (size ~ 60.000)
- fieldCache: 11% (fieldCache referenced by the WebappClassLoader)
- queryResultCache: 7% (size ~ 61.000)
- fieldValueCache: 3%

We are also running two search engines with 17GB heap, these don't run into
OOMEs. Though, with these bigger heap sizes the longest requests are even
longer due to longer stop-the-world gc cycles.
Therefore my goal is to run with a smaller heap, IMHO even smaller than 8GB
would be good to reduce the time needed for full gc.

So what's the right path to follow now? What would you recommend to change
on the configuration (solr/jvm)?

Would you say it is ok to reduce the cache sizes? Would this increase disk
i/o, or would the index be hold in the OS's disk cache?

Do have other recommendations to follow / questions?

Thanx  cheers,
Martin


Re: Specifying an AnalyzerFactory in the schema

2011-01-25 Thread Renaud Delbru

Hi Chris,

On 24/01/11 21:18, Chris Hostetter wrote:

: I notice that in the schema, it is only possible to specify a Analyzer class,
: but not a Factory class as for the other elements (Tokenizer, Fitler, etc.).
: This limits the use of this feature, as it is impossible to specify parameters
: for the Analyzer.
: I have looked at the IndexSchema implementation, and I think this requires a
: simple fix. Do I open an issue about it ?

Support for constructing Analyzers directly is very crude, and primarily
existed for making it easy for people with old indexes and analyzers to
keep working.

moving forward, Lucene/Solr eventually won't ship concrete Analyzer
implementations at all (at least, that's the last consensus i remember) so
enhancing support for loading Analyzers (or AnalyzerFactories) doesn't
make much sense.

Practically speaking, if you have an existing Analyzer that you want to
use in Solr, instead of writing an AnalyzerFactory for it, you could
just write a TokenizerFactory that wraps it instead -- functionally that
would let you achieve everything an AnalyzerFactory would, except that
Solr would already handle letting the schema.xml specify the
positionIncrementGap (which you could happily ignore if you wanted)

Thanks for the trick, I hadn't thought about doing that. This should
indeed work.
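
As a rough sketch, the schema.xml side of that approach might look like this
(the factory class name is hypothetical - it would be my own wrapper around
the existing Analyzer):

<fieldType name="wrapped_analyzer_type" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.example.MyAnalyzerWrappingTokenizerFactory" someParam="value"/>
  </analyzer>
</fieldType>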


cheers
--
Renaud Delbru


Use terracotta bigmemory for solr-caches

2011-01-25 Thread Martin Grotzke
Hi,

as the biggest parts of our jvm heap are used by solr caches I asked myself
if it wouldn't make sense to run solr caches backed by terracotta's
bigmemory (http://www.terracotta.org/bigmemory).
The goal is to reduce the time needed for full / stop-the-world GC cycles,
as with our 8GB heap the longest requests take up to several minutes.

What do you think?

Cheers,
Martin


Re: Performance optimization of Proximity/Wildcard searches

2011-01-25 Thread Salman Akram
By warmed index you only mean warming the SOLR cache or OS cache? As I said
our index is updated every hour so I am not sure how much SOLR cache would
be helpful but OS cache should still be helpful, right?

I haven't compared the results with a proper script but from manual testing
here are some of the observations.

'Recent' queries which are in cache of course return immediately (only if
they are exactly same - even if they took 3-4 mins first time). I will need
to test how many recent queries stay in cache but still this would work only
for very common queries. User can run different queries and I want at least
them to be at 'acceptable' level (5-10 secs) even if not very fast.

Our warm up script currently executes all distinct queries in our logs
having count > 5. It was run yesterday (with all the indexing updates every
hour after that) and today when I executed some of the same queries again
their time seemed a little less (around 15-20%), I am not sure if this means
anything. However, still their time is not acceptable.

What do you think is the best way to compare results? First run all the warm
up queries and then execute same randomly and compare?

We are using Windows server, would it make a big difference if we move to
Linux? Our load is not high but some queries are really complex.

Also I was hoping to move to SSD in last after trying out all software
options. Is that an agreed fact that on large indexes (which don't fit in
RAM) proximity/wildcard/phrase queries (on common words) would be slow and
it can be only improved by cache warm up and better hardware? Otherwise with
an index of around 150GB such queries will take more than a min?

If that's the case I know this question is very subjective but if a single
query takes 2 min on SAS 10K RPM what would its approx time be on a good SSD
(everything else same)?

Thanks!


On Tue, Jan 25, 2011 at 3:44 PM, Toke Eskildsen t...@statsbiblioteket.dkwrote:

 On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:
  Cache warming is a good option too but the index get updated every hour
 so
  not sure how much would that help.

 What is the time difference between queries with a warmed index and a
 cold one? If the warmed index performs satisfactory, then one answer is
 to upgrade your underlying storage. As always for IO-caused performance
 problem in Lucene/Solr-land, SSD is the answer.




-- 
Regards,

Salman Akram


Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-01-25 Thread Markus Jelsma
Hi,

Are you sure you need CMS incremental mode? It's only advised when running on 
a machine with one or two processors. If you have more you should consider 
disabling the incremental flags.
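
In other words, keep something like

  -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

and drop the -XX:+CMSIncrementalMode and -XX:+CMSIncrementalPacing switches,
assuming the rest of your settings stay as they are.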

Cheers,

On Monday 24 January 2011 19:32:38 Simon Wistow wrote:
 We have two slaves replicating off one master every 2 minutes.
 
 Both using the CMS + ParNew Garbage collector. Specifically
 
 -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
 -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
 
 but periodically they both get into a GC storm and just keel over.
 
 Looking through the GC logs the amount of memory reclaimed in each GC
 run gets less and less until we get a concurrent mode failure and then
 Solr effectively dies.
 
 Is it possible there's a memory leak? I note that later versions of
 Lucene have fixed a few leaks. Our current versions are relatively old
 
   Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17
 18:06:42
 
   Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
 
 so I'm wondering if upgrading to later version of Lucene might help (of
 course it might not but I'm trying to investigate all options at this
 point). If so what's the best way to go about this? Can I just grab the
 Lucene jars and drop them somewhere (or unpack and then repack the solr
 war file?). Or should I use a nightly solr 1.4?
 
 Or am I barking up completely the wrong tree? I'm trawling through heap
 logs and gc logs at the moment trying to to see what other tuning I can
 do but any other hints, tips, tricks or cluebats gratefully received.
 Even if it's just Yeah, we had that problem and we added more slaves
 and periodically restarted them
 
 thanks,
 
 Simon

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Weird behaviour with phrase queries

2011-01-25 Thread Erick Erickson
Frankly, this puzzles me. It *looks* like it should be OK. One warning, the
analysis page sometimes is a bit misleading, so beware of that.

But the output of your queries make it look like the query is parsing as you
expect, which leaves the question of whether your index contains what
you think it does. You might get a copy of Luke, which allows you to examine
what's actually in your index instead of what you think is in there.
Sometimes
there are surprises here!

I didn't mean to re-index your whole corpus, I was thinking that you could
just index a few documents in a test index so you have something small to
look at.

Sorry I can't spot what's happening right away.

Good luck!
Erick

On Tue, Jan 25, 2011 at 2:45 AM, Jerome Renard jerome.ren...@gmail.comwrote:

 Erick,

 On Mon, Jan 24, 2011 at 9:57 PM, Erick Erickson 
 erickerick...@gmail.comwrote:

 Hmmm, I don't see any screen shots. Several things:
 1 If your stopword file has comments, I'm not sure what the effect would
 be.


 Ha, I thought comments were supported in stopwords.txt


 2 Something's not right here, or I'm being fooled again. Your withresults
 xml has this line:
 <str name="parsedquery">+DisjunctionMaxQuery((meta_text:"ecol d ingenieur")~0.01) ()</str>
 and your noresults has this line:
 <str name="parsedquery">+DisjunctionMaxQuery((meta_text:"academi charpenti")~0.01)
 DisjunctionMaxQuery((meta_text:"academi charpenti"~100)~0.01)</str>

 the empty () in the first one often means you're NOT going to your
 configured dismax parser in solrconfig.xml. Yet that doesn't square with
 your custom qt, so I'm puzzled.

 Could we see your raw query string on the way in? It's almost as if you
 defined qt in one and defType in the other, which are not equivalent.


 You are right I fixed this problem (my bad).

 3 It may take 12 hours to index, but you could experiment with a smaller
 subset. You say you know that the noresults one should return documents,
 what proof do
 you have? If there's a single document that you know should match this,
 just
 index it and a few others and you should be able to make many runs until
 you
 get
 to the bottom of this...


 I could but I always thought I had to fully re-index after updating
 schema.xml. If
 I update only few documents will that take the changes into account without
 breaking
 the rest ?


 And obviously your stemming is happening on the query, are you sure it's
 happening at index time too?


 Since you did not get the screenshots you will find attached the full
 output of the analysis
 for a phrase that works and for another that does not.

 Thanks for your support

 Best Regards,

 --
 Jérôme



Re: Recommendation on RAM-/Cache configuration

2011-01-25 Thread Markus Jelsma
On Tuesday 25 January 2011 11:54:55 Martin Grotzke wrote:
 Hi,
 
 recently we're experiencing OOMEs (GC overhead limit exceeded) in our
 searches. Therefore I want to get some clarification on heap and cache
 configuration.
 
 This is the situation:
 - Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit
 - JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G
 -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC
 -XX:+UseParallelGC

Consider switching to HotSpot JVM, use the -server as the first switch.

 - The machine has 32 GB RAM
 - Currently there are 4 processors/cores in the machine, this shall be
 changed to 2 cores in the future.
 - The index size in the filesystem is ~9.5 GB
 - The index contains ~ 5.500.000 documents
 - 1.500.000 of those docs are available for searches/queries, the rest are
 inactive docs that are excluded from searches (via a flag/field), but
 they're still stored in the index as need to be available by id (solr is
 the main document store in this app)

How do you exclude them? It should use filter queries. I also remember (but i 
just cannot find it again, so please correct me if i'm wrong) that in 1.4.x 
sorting is done before filtering. It should be an improvement if filtering is 
done before sorting.
If you use sorting, it takes up a huge amount of RAM if filtering is not done 
first.
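
For example, assuming a boolean flag field (here called findable just for
illustration), the exclusion would go into a filter query rather than into q:

  q=(some query)&fq=findable:true

so the filter can be computed once, cached in the filterCache and reused
across queries.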

 - Caches are configured with a big size (the idea was to prevent filesystem
 access / disk i/o as much as possible):

There is only disk I/O if the kernel can't keep the index (or parts) in its 
page cache.

   - filterCache (solr.LRUCache): size=20, initialSize=3,
 autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99
   - documentCache (solr.LRUCache): size=20, initialSize=10,
 autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74
   - queryResultCache (solr.LRUCache): size=20, initialSize=3,
 autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71

You should decrease the initialSize values. But your hit ratios seem very 
nice.

 - Searches are performed using a catchall text field using standard request
 handler, all fields are fetched (no fl specified)
 - Normally ~ 5 concurrent requests, peaks up to 30 or 40 (mostly during GC)
 - Recently we also added a feature that adds weighted search for special
 fields, so that the query might become s.th. like this
   q=(some query) OR name_weighted:(some query)^2.0 OR brand_weighted:(some
 query)^4.0 OR longDescription_weighted:(some query)^0.5
   (it seemed as if this was the cause of the OOMEs, but IMHO it only
 increased RAM usage so that now GC could not free enough RAM)
 
 The OOMEs that we get are of type GC overhead limit exceeded, one of the
 OOMEs was thrown during auto-warming.

Warming takes additional RAM. The current searcher still has its caches full 
and newSearcher is getting filled up. Decreasing sizes might help.

 
 I checked two different heapdumps, the first one autogenerated
 (by -XX:+HeapDumpOnOutOfMemoryError) the second one generated manually via
 jmap.
 These show the following distribution of used memory - the autogenerated
 dump:
  - documentCache: 56% (size ~ 195.000)
 - filterCache: 15% (size ~ 60.000)
 - queryResultCache: 8% (size ~ 61.000)
 - fieldCache: 6% (fieldCache referenced  by WebappClassLoader)
 - SolrIndexSearcher: 2%
 
 The manually generated dump:
 - documentCache: 48% (size ~ 195.000)
 - filterCache: 20% (size ~ 60.000)
 - fieldCache: 11% (fieldCache referenced by the WebappClassLoader)
 - queryResultCache: 7% (size ~ 61.000)
 - fieldValueCache: 3%
 
 We are also running two search engines with 17GB heap, these don't run into
 OOMEs. Though, with these bigger heap sizes the longest requests are even
 longer due to longer stop-the-world gc cycles.
 Therefore my goal is to run with a smaller heap, IMHO even smaller than 8GB
 would be good to reduce the time needed for full gc.
 
 So what's the right path to follow now? What would you recommend to change
 on the configuration (solr/jvm)?

Try tuning the GC
http://java.sun.com/performance/reference/whitepapers/tuning.html
http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html

 
 Would you say it is ok to reduce the cache sizes? Would this increase disk
 i/o, or would the index be hold in the OS's disk cache?

Yes! If you also allocate less RAM to the JVM then there is more for the OS to 
cache.

 
 Do have other recommendations to follow / questions?
 
 Thanx  cheers,
 Martin

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Adding weightage to the facets count

2011-01-25 Thread Johannes Goll
Hi Siva,

try using the Solr Stats Component
http://wiki.apache.org/solr/StatsComponent

similar to
select/?q=*:*&stats=true&stats.field={your-weight-field}&stats.facet={your-facet-field}

and get the sum field from the response. You may need to resort the weighted
facet counts to get a descending list of facet counts.
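
The per-facet-value sums show up in the stats section of the response,
roughly like this (abbreviated, from memory - the exact layout may differ by
version; "weight"/"tag" and the 120 just follow the Light Weight example
quoted below):

<lst name="stats">
  <lst name="stats_fields">
    <lst name="weight">
      <double name="sum">120.0</double>
      ...
      <lst name="facets">
        <lst name="tag">
          <lst name="Light Weight">
            <double name="sum">120.0</double>
            ...
          </lst>
        </lst>
      </lst>
    </lst>
  </lst>
</lst>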

Note, there is a bug for using the Stats Component with multi-valued facet
fields.

For details see
https://issues.apache.org/jira/browse/SOLR-1782

Johannes

2011/1/24 Chris Hostetter hossman_luc...@fucit.org


 : prod1 has tag called “Light Weight” with weightage 20,
 : prod2 has tag called “Light Weight” with weightage 100,
 :
 : If i get facet for “Light Weight” , i will get Light Weight (2) ,
 : here i need to consider the weightage in to account, and the result will
 be
 : Light Weight (120)
 :
 : How can we achieve this?Any ideas are really helpful.


 It's not really possible with Solr out of the box.  Faceting is fast and
 efficient in Solr because it's all done using set intersections (and most
 of the sets can be kept in ram very compactly and reused).  For what you
 are describing you'd need to not only associate a weighted payload with
 every TermPosition, but also factor that weight in when doing the
 faceting, which means efficient set operations are now out the window.

 If you know java it would probably be possible to write a custom
 SolrPlugin (a SearchComponent) to do this type of faceting in special
 cases (assuming you indexed in a particular way) but i'm not sure off the
 top of my head how well it would scale -- the basic algo i'm thinking of
 is (after indexing each facet term with a weight payload) to iterate over
 the DocSet of all matching documents in parallel with an iteration over
 a TermPositions, skipping ahead to only the docs that match the query, and
 recording the sum of the payloads for each term.

 Hmmm...

 except TermPositions iterates over <term, doc, freq, position> tuples,
 so you would have to iterate over every term, and for every term then loop
 over all matching docs ... like i said, not sure how efficient it would
 wind up being.

 You might be happier all around if you just do some sampling -- store the
 tag+weight pairs so that they can be retrieved with each doc, and then
 when you get your top facet constraints back, look at the first page of
 results, and figure out what the sum weight is for each of those
 constraints based solely on the page#1 results.

 i've had happy users using a similar approach in the past.

 -Hoss




-- 
Johannes Goll
211 Curry Ford Lane
Gaithersburg, Maryland 20878


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread johnnyisrael

Hi Eric,

You are right, there is a copy field to EdgeNgram; I tried the configuration
but it is not working as expected.

Configuration I tried:



<fieldType name="query" class="solr.TextField" positionIncrementGap="100"
           termVectors="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="edgytext" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
            maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="user_query" type="query" indexed="true" stored="true"
       omitNorms="true" omitTermFreqAndPositions="true" />
<field name="edgy_user_query" type="edgytext" indexed="true" stored="true"
       omitNorms="true" omitTermFreqAndPositions="true" />

<defaultSearchField>edgy_user_query</defaultSearchField>
<copyField source="user_query" dest="edgy_user_query"/>

==

When I search for the term apple, it is returning results for pineapple
vers apple, milk with apple, apple milk shake ...

Is there any other way to overcome this problem?

Thanks,

Johnny


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2329370.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH From various File system locations

2011-01-25 Thread Estrada Groups
I would just use Nutch and specify the -solr param on the command line. That 
will add the extracted content to your instance of Solr.

Adam

Sent from my iPhone

On Jan 25, 2011, at 5:29 AM, pankaj bhatt panbh...@gmail.com wrote:

 Hi All,
 I need to index the documents presents in my file system at various
 locations (e.g. C:\docs , d:\docs ).
Is there any way through which i can specify this in my DIH
 Configuration.
Here is my configuration:-
 
 <document>
   <entity name="sd"
           processor="FileListEntityProcessor"
           fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
           baseDir="G:\\Desktop\\"
           recursive="false"
           rootEntity="true"
           transformer="DateFormatTransformer"
           onerror="continue">
     <entity name="tikatest"
             processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
             url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
       <field column="Author" name="author" meta="true"/>
       <field column="Content-Type" name="title" meta="true"/>
       <!-- <field column="title" name="title" meta="true"/> -->
       <field column="text" name="all_text"/>
     </entity>

     <!-- <field column="fileLastModified" name="date"
          dateTimeFormat="-MM-dd'T'hh:mm:ss" /> -->
     <field column="fileSize" name="size"/>
     <field column="file" name="filename"/>
   </entity>
   <!-- baseDir="../site" -->
 </document>
 
 / Pankaj Bhatt.


Re: Recommendation on RAM-/Cache configuration

2011-01-25 Thread Martin Grotzke
On Tue, Jan 25, 2011 at 2:06 PM, Markus Jelsma
markus.jel...@openindex.iowrote:

 On Tuesday 25 January 2011 11:54:55 Martin Grotzke wrote:
  Hi,
 
  recently we're experiencing OOMEs (GC overhead limit exceeded) in our
  searches. Therefore I want to get some clarification on heap and cache
  configuration.
 
  This is the situation:
  - Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit
  - JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G
  -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC
  -XX:+UseParallelGC

 Consider switching to HotSpot JVM, use the -server as the first switch.

The jvm options I mentioned were not all, we're running the jvm with -server
(of course).



  - The machine has 32 GB RAM
  - Currently there are 4 processors/cores in the machine, this shall be
  changed to 2 cores in the future.
  - The index size in the filesystem is ~9.5 GB
  - The index contains ~ 5.500.000 documents
  - 1.500.000 of those docs are available for searches/queries, the rest
 are
  inactive docs that are excluded from searches (via a flag/field), but
  they're still stored in the index as need to be available by id (solr is
  the main document store in this app)

 How do you exclude them? It should use filter queries.

The docs are indexed with a field findable on which we do a filter query.


 I also remember (but i
 just cannot find it back so please correct my if i'm wrong) that in 1.4.x
 sorting is done before filtering. It should be an improvement if filtering
 is
 done before sorting.

Hmm, I cannot imagine a case where it makes sense to sort before filtering.
Can't believe that solr does it like this.
Can anyone shed a light on this?


 If you use sorting, it takes up a huge amount of RAM if filtering is not
 done
 first.

  - Caches are configured with a big size (the idea was to prevent
 filesystem
  access / disk i/o as much as possible):

 There is only disk I/O if the kernel can't keep the index (or parts) in its
 page cache.

Yes, I'll keep an eye on disk I/O.



- filterCache (solr.LRUCache): size=20, initialSize=3,
  autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99
- documentCache (solr.LRUCache): size=20, initialSize=10,
  autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74
- queryResultCache (solr.LRUCache): size=20, initialSize=3,
  autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71

 You should decrease the initialSize values. But your hitratio's seem very
 nice.

Does the initialSize have a real impact? According to
http://wiki.apache.org/solr/SolrCaching#initialSize it's the initial size of
the HashMap backing the cache.
What would you say are reasonable values for size/initialSize/autowarmCount?

Cheers,
Martin


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Erlend Garåsen

On 25.01.11 11.30, Erlend Garåsen wrote:


Tika version 0.8 is not included in the latest release/trunk from SVN.


Ouch, I wrote not instead of now. Sorry, I replied in a hurry.

And to clarify, by content I mean the main content of a Word file. 
Title and other kinds of metadata are successfully extracted by the old 
0.4 version of Tika, but you need a newer Tika version (0.8) in order to 
fetch the main content as well. So try the newest Solr version from trunk.


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Getting started with writing parser

2011-01-25 Thread Gora Mohanty
On Tue, Jan 25, 2011 at 3:46 PM, Dinesh mdineshkuma...@karunya.edu.in wrote:

 no i actually changed the directory to mine where i stored the log files.. it
 is /home/exam/apa..solr/example/exampledocs

 i specified it in a solr schema.. i created an DataImportHandler for that in
 try.xml.. then in that i changed that file name to sample.txt

 that new try.xml is
 http://pastebin.com/pfVVA7Hs
[...]

Let us take this one part at a time.

In your inner nested entity,
  entity name=tryli...
what do you expect the attribute
  url=${hathifile.fileAbsolutePath}
to resolve to?
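
For reference, the prefix inside ${...} has to match the name attribute of
the enclosing entity - a minimal, abbreviated sketch with made-up names:

  <entity name="logfiles" processor="FileListEntityProcessor"
          baseDir="/home/exam/logs" fileName=".*\.txt" rootEntity="false">
    <entity name="logline" processor="LineEntityProcessor"
            url="${logfiles.fileAbsolutePath}" dataSource="fileReader">
      <field column="rawLine" name="content"/>
    </entity>
  </entity>

(here fileReader would be a FileDataSource declared in the same config, and
the processor/field choices are only an example, not a recommendation for
your logs).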

Regards,
Gora


Re: Use terracotta bigmemory for solr-caches

2011-01-25 Thread Em

Hi Martin,

are you sure that your GC is well tuned?
A request that needs more than a minute isn't the standard, even when I
consider all the other postings about response-performance...

Regards
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Use-terracotta-bigmemory-for-solr-caches-tp2328257p2330652.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor

Thanks Erlend.

Not used SVN before, but have managed to download and build latest trunk 
code.


Now I'm getting an error when trying to access the admin page (via 
Jetty) because I specify HTMLStripStandardTokenizerFactory in my 
schema.xml, but this appears to be no-longer supplied as part of the 
build so I get an exception cos it can't find that class.  I've checked 
the CHANGES.txt and found the following in the change list to 1.4.0 (!?) :


66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, 
HTMLStripWhitespaceTokenizerFactory and
HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, 
HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)


Unfortunately, I can't seem to get that to work correctly.  Does anyone 
have an example fieldType stanza (for schema.xml) for stripping out HTML ?


Thanks and kind regards,
Gary.



On 25/01/2011 14:17, Erlend Garåsen wrote:

On 25.01.11 11.30, Erlend Garåsen wrote:


Tika version 0.8 is not included in the latest release/trunk from SVN.


Ouch, I wrote not instead of now. Sorry, I replied in a hurry.

And to clarify, by content I mean the main content of a Word file. 
Title and other kinds of metadata are successfully extracted by the 
old 0.4 version of Tika, but you need a newer Tika version (0.8) in 
order to fetch the main content as well. So try the newest Solr 
version from trunk.


Erlend






List of indexed or stored fields

2011-01-25 Thread kenf_nc

I use a lot of dynamic fields, so looking at my schema isn't a good way to
see all the field names that may be indexed across all documents. Is there a
way to query solr for that information? All field names that are indexed, or
stored? Possibly a count by field name? Is there any other metadata about a
field that can be queried?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/List-of-indexed-or-stored-fields-tp2330986p2330986.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor

OK, got past the schema.xml problem, but now I'm back to square one.

I can index the contents of binary files (Word, PDF etc...), as well as 
text files, but it won't index the content of files inside a zip.


As an example, I have two txt files - doc1.txt and doc2.txt.  If I index 
either of them individually using:


curl 
"http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" 
-F "file=@doc1.txt"


and commit, Solr will index the contents and searches will match.

If I zip those two files up into solr1.zip, and index that using:

curl 
"http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" 
-F "file=@solr1.zip"


and commit, the file names are indexed, but not their contents.

I have checked that Tika can correctly process the zip file when used 
standalone with the tika-app jar - it outputs both the filenames and 
contents.  Should I be able to index the contents of files stored in a 
zip by using extract ?


Thanks and kind regards,
Gary.


On 25/01/2011 15:32, Gary Taylor wrote:

Thanks Erlend.

Not used SVN before, but have managed to download and build latest 
trunk code.


Now I'm getting an error when trying to access the admin page (via 
Jetty) because I specify HTMLStripStandardTokenizerFactory in my 
schema.xml, but this appears to be no-longer supplied as part of the 
build so I get an exception cos it can't find that class.  I've 
checked the CHANGES.txt and found the following in the change list to 
1.4.0 (!?) :


66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, 
HTMLStripWhitespaceTokenizerFactory and
HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, 
HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)


Unfortunately, I can't seem to get that to work correctly.  Does 
anyone have an example fieldType stanza (for schema.xml) for stripping 
out HTML ?


Thanks and kind regards,
Gary.



On 25/01/2011 14:17, Erlend Garåsen wrote:

On 25.01.11 11.30, Erlend Garåsen wrote:


Tika version 0.8 is not included in the latest release/trunk from SVN.


Ouch, I wrote not instead of now. Sorry, I replied in a hurry.

And to clarify, by content I mean the main content of a Word file. 
Title and other kinds of metadata are successfully extracted by the 
old 0.4 version of Tika, but you need a newer Tika version (0.8) in 
order to fetch the main content as well. So try the newest Solr 
version from trunk.


Erlend








Re: List of indexed or stored fields

2011-01-25 Thread Juan Grande
You can query all the indexed or stored fields (including dynamic fields)
using the LukeRequestHandler: http://localhost:8983/solr/example/admin/luke

See also: http://wiki.apache.org/solr/LukeRequestHandler
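
If the default output is too verbose, it also takes parameters - from memory,
so double-check the wiki page:

  http://localhost:8983/solr/example/admin/luke?numTerms=0          (field list only, no top terms)
  http://localhost:8983/solr/example/admin/luke?fl=price&numTerms=10   (details for a single field)

Each indexed field in the output also carries a docs count, which should give
you the per-field document counts you asked about.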

Regards,
*
**Juan G. Grande*
-- Solr Consultant @ http://www.plugtree.com
-- Blog @ http://juanggrande.wordpress.com

On Tue, Jan 25, 2011 at 12:39 PM, kenf_nc ken.fos...@realestate.com wrote:


 I use a lot of dynamic fields, so looking at my schema isn't a good way to
 see all the field names that may be indexed across all documents. Is there
 a
 way to query solr for that information? All field names that are indexed,
 or
 stored? Possibly a count by field name? Is there any other metadata about a
 field that can be queried?
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/List-of-indexed-or-stored-fields-tp2330986p2330986.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: DIH From various File system locations

2011-01-25 Thread pankaj bhatt
Thanks Adam, it seems like Nutch would solve most of my concerns.
It would be great if you could share some resources for Nutch with us.

/ Pankaj Bhatt.

On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups 
estrada.adam.gro...@gmail.com wrote:

 I would just use Nutch and specify the -solr param on the command line.
 That will add the extracted content your instance of solr.

 Adam

 Sent from my iPhone

 On Jan 25, 2011, at 5:29 AM, pankaj bhatt panbh...@gmail.com wrote:

  Hi All,
  I need to index the documents presents in my file system at
 various
  locations (e.g. C:\docs , d:\docs ).
 Is there any way through which i can specify this in my DIH
  Configuration.
 Here is my configuration:-
 
  <document>
    <entity name="sd"
            processor="FileListEntityProcessor"
            fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
            baseDir="G:\\Desktop\\"
            recursive="false"
            rootEntity="true"
            transformer="DateFormatTransformer"
            onerror="continue">
      <entity name="tikatest"
              processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
              url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="Author" name="author" meta="true"/>
        <field column="Content-Type" name="title" meta="true"/>
        <!-- <field column="title" name="title" meta="true"/> -->
        <field column="text" name="all_text"/>
      </entity>

      <!-- <field column="fileLastModified" name="date"
           dateTimeFormat="-MM-dd'T'hh:mm:ss" /> -->
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>
    <!-- baseDir="../site" -->
  </document>
 
  / Pankaj Bhatt.



Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Jayendra Patil
Hi Gary,

The latest Solr Trunk was able to extract and index the contents of the zip
file using the ExtractingRequestHandler.
The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
worked pretty well.

Tested again with sample url and works fine -
curl 
"http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"


You would probably need to drill down to the Tika Jars and
the apache-solr-cell-4.0-dev.jar used for Rich documents indexing.

Regards,
Jayendra

On Tue, Jan 25, 2011 at 11:08 AM, Gary Taylor g...@inovem.com wrote:

 OK, got past the schema.xml problem, but now I'm back to square one.

 I can index the contents of binary files (Word, PDF etc...), as well as
 text files, but it won't index the content of files inside a zip.

 As an example, I have two txt files - doc1.txt and doc2.txt.  If I index
 either of them individually using:

 curl 
 "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5"
 -F "file=@doc1.txt"

 and commit, Solr will index the contents and searches will match.

 If I zip those two files up into solr1.zip, and index that using:

 curl 
 "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5"
 -F "file=@solr1.zip"

 and commit, the file names are indexed, but not their contents.

 I have checked that Tika can correctly process the zip file when used
 standalone with the tika-app jar - it outputs both the filenames and
 contents.  Should I be able to index the contents of files stored in a zip
 by using extract ?


 Thanks and kind regards,
 Gary.


 On 25/01/2011 15:32, Gary Taylor wrote:

 Thanks Erlend.

 Not used SVN before, but have managed to download and build latest trunk
 code.

 Now I'm getting an error when trying to access the admin page (via Jetty)
 because I specify HTMLStripStandardTokenizerFactory in my schema.xml, but
 this appears to be no-longer supplied as part of the build so I get an
 exception cos it can't find that class.  I've checked the CHANGES.txt and
 found the following in the change list to 1.4.0 (!?) :

 66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader,
 HTMLStripWhitespaceTokenizerFactory andHTMLStripStandardTokenizerFactory
 deprecated. To strip HTML tags, HTMLStripCharFilter can be used with an
 arbitrary Tokenizer. (koji)

 Unfortunately, I can't seem to get that to work correctly.  Does anyone
 have an example fieldType stanza (for schema.xml) for stripping out HTML ?

 Thanks and kind regards,
 Gary.



 On 25/01/2011 14:17, Erlend Garåsen wrote:

 On 25.01.11 11.30, Erlend Garåsen wrote:

  Tika version 0.8 is not included in the latest release/trunk from SVN.


 Ouch, I wrote not instead of now. Sorry, I replied in a hurry.

 And to clarify, by content I mean the main content of a Word file.
 Title and other kinds of metadata are successfully extracted by the old 0.4
 version of Tika, but you need a newer Tika version (0.8) in order to fetch
 the main content as well. So try the newest Solr version from trunk.

 Erlend







How to Configure Solr to pick my lucene custom filter

2011-01-25 Thread Valiveti

Hi ,

I have written a lucene custom filter.
I could not figure out how to configure Solr to pick this custom filter
for search.

How to configure Solr to pick my custom filter?
Will the Solr standard search handler pick this custom filter?

Thanks,
Valiveti

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2331928.html
Sent from the Solr - User mailing list archive at Nabble.com.


in-index representaton of tokens

2011-01-25 Thread Dennis Gearon
So, the index is a list of tokens per column, right?

There's a table per column that lists the analyzed tokens?

And the tokens per column are represented as what, system integers? 32/64 bit 
unsigned ints?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: in-index representaton of tokens

2011-01-25 Thread Jonathan Rochkind

Why does it matter?  You can't really get at them unless you store them.

I don't know what "table per column" means; there's nothing in Solr 
architecture called a "table" or a "column", although by "column" you 
probably mean more or less a Solr "field".  There is nothing like a 
"table" in Solr.


Solr is still not an rdbms.

On 1/25/2011 12:26 PM, Dennis Gearon wrote:

So, the index is a list of tokens per column, right?

There's a table per column that lists the analyzed tokens?

And the tokens per column are represented as what, system integers? 32/64 bit
unsigned ints?

  Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Erick Erickson
Let's back up here because now I'm not clear what you actually want.
EdgeNGrams
are a way of matching substrings, which is what's happening here. Of course
searching for apple matches any of the three examples, just as searching for
apple without grams would; that's the expected behavior.

So, we need a clear problem definition of what you're trying to do, along
with
example queries (please post the results of adding debugQuery=on).

Best
Erick

On Tue, Jan 25, 2011 at 8:29 AM, johnnyisrael johnnyi.john...@gmail.comwrote:


 Hi Eric,

 You are right, there is a copy field to EdgeNgram, I tried the
 configuration
 but it not working as expected.

 Configuration I tried:

 

  <fieldType name="query" class="solr.TextField" positionIncrementGap="100"
             termVectors="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="edgytext" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
              maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="user_query" type="query" indexed="true" stored="true"
         omitNorms="true" omitTermFreqAndPositions="true" />
  <field name="edgy_user_query" type="edgytext" indexed="true" stored="true"
         omitNorms="true" omitTermFreqAndPositions="true" />

  <defaultSearchField>edgy_user_query</defaultSearchField>
  <copyField source="user_query" dest="edgy_user_query"/>

 ==

 When I search for the term apple.

 It is returning results for pineapple vers apple, milk with apple,
 apple milk shake ...

 Is there any other way to overcome this problem?

 Thanks,

 Johnny


 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2329370.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Highlighting with/without Term Vectors

2011-01-25 Thread Salman Akram
Anyone?

On Tue, Jan 25, 2011 at 12:57 AM, Salman Akram 
salman.ak...@northbaysolutions.net wrote:

 Just to add one thing, in case it makes a difference.

 The maximum size of a document on which highlighting needs to be done is a few
 hundred KB (in the file system). In the index it's compressed, so it should be much smaller.


 On Tue, Jan 25, 2011 at 12:42 AM, Salman Akram 
 salman.ak...@northbaysolutions.net wrote:

 Hi,

 Does anyone have any benchmarks on how much highlighting speeds up with Term
 Vectors (compared to without them)? E.g. if highlighting on 20 documents takes
 1 sec with Term Vectors, any idea how long it would take without them?

 I need to know since the index used for highlighting has a TVF file of
 around 450GB (approx 65% of total index size), so I am trying to see whether
 decreasing the index size by dropping the TVF would be more helpful for
 performance (less RAM, should be good for I/O too I guess) or whether keeping
 it is still better.

 I know the best way is to try it out, but indexing takes a very long time, so I am
 trying to see whether it's even worthwhile.

 --
 Regards,

 Salman Akram




 --
 Regards,

 Salman Akram




-- 
Regards,

Salman Akram


Re: How to Configure Solr to pick my lucene custom filter

2011-01-25 Thread Erick Erickson
Presumably your custom filter is in a jar file. Drop that jar file in
solr_home/lib
and reference it in your schema.xml file by its full class name
(e.g. com.yourcompany.filter.yourcustomfilter) just like the other filters,
and it should work fine.

You can also put your jar anywhere you'd like and alter solrconfig.xml with
an additional <lib .../> tag in the config section (see the example
solrconfig.xml).

Best
Erick
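
For concreteness, a minimal sketch of what that might look like, assuming the
custom filter is exposed through a TokenFilterFactory subclass (the package,
class, and type names below are made up for illustration):

  <!-- solrconfig.xml: only needed if the jar is NOT already in solr_home/lib -->
  <lib dir="/path/to/your/jars" />

  <!-- schema.xml: hook the custom filter into a field type's analysis chain -->
  <fieldType name="text_custom" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="com.yourcompany.filter.YourCustomFilterFactory"/>
    </analyzer>
  </fieldType>

Any field declared with type="text_custom" would then run the custom filter at
index and query time.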

On Tue, Jan 25, 2011 at 12:07 PM, Valiveti narasimha.valiv...@gmail.comwrote:


 Hi ,

 I have written a lucene custom filter.
 I could not figure out on how to configure Solr to pick this custom filter
 for search.

 How to configure Solr to pick my custom filter?
 Will the Solr standard search handler pick this custom filter?

 Thanks,
 Valiveti

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2331928.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: List of indexed or stored fields

2011-01-25 Thread kenf_nc

That's exactly what I wanted, thanks. Any idea what

  <long name="version">1294513299077</long>

refers to under the index section? I have 2 cores on one Tomcat instance,
and 1 on a second instance (different server) and all 3 have different
numbers for version, so I don't think it's the version of Luke.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/List-of-indexed-or-stored-fields-tp2330986p2333281.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: List of indexed or stored fields

2011-01-25 Thread Markus Jelsma
The index version. Can be used in replication to determine whether to 
replicate or not.

On Tuesday 25 January 2011 20:30:21 kenf_nc wrote:
 refers to under the index section? I have 2 cores on one Tomcat instance,
 and 1 on a second instance (different server) and all 3 have different
 numbers for version, so I don't think it's the version of Luke.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
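
For reference, that value comes from the Luke request handler; a request like
the following (host and port assumed, values illustrative, exact field names
vary slightly between Solr versions) returns an index section roughly like this:

  http://localhost:8983/solr/admin/luke?numTerms=0

  <lst name="index">
    <int name="numDocs">17432</int>
    <int name="maxDoc">17432</int>
    <long name="version">1294513299077</long>
    <date name="lastModified">2011-01-25T19:15:02Z</date>
  </lst>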


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread johnnyisrael

Hi Eric,

What I want here is, let's say I have 3 documents like 

[pineapple vers apple, milk with apple, apple milk shake ]

and if I search for apple, it should return only apple milk shake,
because that entry alone starts with the word apple which I typed in. It
should not bring the others, and if I type milk it should return only milk
with apple.

I want output similar to Google auto suggest.

Is there a way to achieve this without encapsulating with double quotes?

Thanks,

Johnny
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2333602.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH From various File system locations

2011-01-25 Thread Adam Estrada
There are a few tutorials out there.

1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical)
2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.)
3. Build the latest from branch
http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ and read
this one.

http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/

but add the solr parameter at the end bin/nutch crawl urls -depth 5
-topN 100 -solr http://localhost:8983/solr

This will automatically add the data Nutch collected to Solr. For
larger files I would also increase your JAVA_OPTS env to something
like JAVA_OPTS='-Xmx2048m'

Adam




On Tue, Jan 25, 2011 at 11:41 AM, pankaj bhatt panbh...@gmail.com wrote:
 Thanks Adam, it seems like Nutch would solve most of my concerns.
 It would be great if you could share resources for Nutch with us.

 / Pankaj Bhatt.

 On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups 
 estrada.adam.gro...@gmail.com wrote:

 I would just use Nutch and specify the -solr param on the command line.
 That will add the extracted content your instance of solr.

 Adam

 Sent from my iPhone

 On Jan 25, 2011, at 5:29 AM, pankaj bhatt panbh...@gmail.com wrote:

  Hi All,
          I need to index the documents presents in my file system at
 various
  locations (e.g. C:\docs , d:\docs ).
     Is there any way through which i can specify this in my DIH
  Configuration.
     Here is my configuration:-
 
  <document>
    <entity name="sd"
        processor="FileListEntityProcessor"
        fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
        baseDir="G:\\Desktop\\"
        recursive="false"
        rootEntity="true"
        transformer="DateFormatTransformer"
        onerror="continue">
      <entity name="tikatest"
          processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
          url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="Author" name="author" meta="true"/>
        <field column="Content-Type" name="title" meta="true"/>
        <!-- <field column="title" name="title" meta="true"/> -->
        <field column="text" name="all_text"/>
      </entity>

      <!-- <field column="fileLastModified" name="date"
           dateTimeFormat="-MM-dd'T'hh:mm:ss" /> -->
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>
    <!-- baseDir="../site" -->
  </document>
 
  / Pankaj Bhatt.
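
As for the original question about multiple locations (quoted above), one
approach that should work is to declare one FileListEntityProcessor entity per
base directory inside the same document. A rough sketch, reusing the field
mapping from the quoted config; the entity names here are made up, and a
binary dataSource named "bin" is assumed to be defined elsewhere in the file:

  <document>
    <entity name="sd_c" processor="FileListEntityProcessor" baseDir="C:\docs"
            fileName="docx$|doc$|pdf$" recursive="false" rootEntity="true" onerror="continue">
      <entity name="tika_c" processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
              url="${sd_c.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="text" name="all_text"/>
      </entity>
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>

    <entity name="sd_d" processor="FileListEntityProcessor" baseDir="D:\docs"
            fileName="docx$|doc$|pdf$" recursive="false" rootEntity="true" onerror="continue">
      <entity name="tika_d" processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
              url="${sd_d.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="text" name="all_text"/>
      </entity>
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>
  </document>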




Re: DIH From various File system locations

2011-01-25 Thread Adam Estrada
I take that back... I am currently using version 1.2; make sure
that the latest versions of Tika and PDFBox are in the contrib folder.
1.3 is structured a bit differently and it doesn't look like there is
a contrib directory. Maybe one of the Nutch contributors can comment
on this?

Adam

On Tue, Jan 25, 2011 at 3:21 PM, Adam Estrada
estrada.adam.gro...@gmail.com wrote:
 There are a few tutorials out there.

 1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical)
 2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.)
 3. Build the latest from branch
 http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ and read
 this one.

 http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/

 but add the solr parameter at the end bin/nutch crawl urls -depth 5
 -topN 100 -solr http://localhost:8983/solr

 This will automatically add the data nutch collected to Solr. For
 larger files I would also increase your JAVA_OPTS env to something
 like JAVA_OPTS='-Xmx2048m'

 Adam




 On Tue, Jan 25, 2011 at 11:41 AM, pankaj bhatt panbh...@gmail.com wrote:
 Thanks Adam, It seems like Nutch use to solve most of my concerns.
 i would be great if you can have share resources for Nutch with us.

 / Pankaj Bhatt.

 On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups 
 estrada.adam.gro...@gmail.com wrote:

 I would just use Nutch and specify the -solr param on the command line.
 That will add the extracted content your instance of solr.

 Adam

 Sent from my iPhone

 On Jan 25, 2011, at 5:29 AM, pankaj bhatt panbh...@gmail.com wrote:

  Hi All,
          I need to index the documents presents in my file system at
 various
  locations (e.g. C:\docs , d:\docs ).
     Is there any way through which i can specify this in my DIH
  Configuration.
     Here is my configuration:-
 
  document
       entity name=sd
         processor=FileListEntityProcessor
         fileName=docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$
  *baseDir=G:\\Desktop\\*
         recursive=false
         rootEntity=true
         transformer=DateFormatTransformer
  onerror=continue
         entity name=tikatest
  processor=org.apache.solr.handler.dataimport.TikaEntityProcessor
  url=${sd.fileAbsolutePath} format=text dataSource=bin
           field column=Author name=author meta=true/
           field column=Content-Type name=title meta=true/
           !-- field column=title name=title meta=true/ --
           field column=text name=all_text/
         /entity
 
         !-- field column=fileLastModified name=date
  dateTimeFormat=-MM-dd'T'hh:mm:ss / --
         field column=fileSize name=size/
         field column=file name=filename/
     /entity
  !--baseDir=../site--
   /document
 
  / Pankaj Bhatt.





CFP - Berlin Buzzwords 2011 - Search, Score, Scale

2011-01-25 Thread Isabel Drost
This is to announce the Berlin Buzzwords 2011. The second edition of the 
successful conference on scalable and open search, data processing and data 
storage in Germany, taking place in Berlin.

Call for Presentations Berlin Buzzwords
   http://berlinbuzzwords.de
  Berlin Buzzwords 2011 - Search, Store, Scale
6/7 June 2011

The event will comprise presentations on scalable data processing. We invite 
you 
to submit talks on the topics:

   * IR / Search - Lucene, Solr, katta or comparable solutions
   * NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others
   * Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives
   * Closely related topics not explicitly listed above are welcome. We are
 looking for presentations on the implementation of the systems themselves,
 real world applications and case studies.

Important Dates (all dates in GMT +2)
   * Submission deadline: March 1st 2011, 23:59 MEZ
   * Notification of accepted speakers: March 22nd, 2011, MEZ.
   * Publication of final schedule: April 5th, 2011.
   * Conference: June 6/7. 2011

High quality, technical submissions are called for, ranging from principles to 
practice. We are looking for real world use cases, background on the 
architecture of specific projects and a deep dive into architectures built on 
top of e.g. Hadoop clusters.

Proposals should be submitted at http://berlinbuzzwords.de/content/cfp-0 no 
later than March 1st, 2011. Acceptance notifications will be sent out soon 
after 
the submission deadline. Please include your name, bio and email, the title of 
the talk, a brief abstract in English language. Please indicate whether you 
want 
to give a lightning (10min), short (20min) or long (40min) presentation and 
indicate the level of experience with the topic your audience should have (e.g. 
whether your talk will be suitable for newbies or is targeted for experienced 
users.) If you'd like to pitch your brand new product in your talk, please let 
us know as well - there will be extra space for presenting new ideas, awesome 
products and great new projects.

The presentation format is short. We will be enforcing the schedule rigorously.

If you are interested in sponsoring the event (e.g. we would be happy to 
provide 
videos after the event, free drinks for attendees as well as an after-show 
party), please contact us.

Follow @hadoopberlin on Twitter for updates. Tickets, news on the conference, 
and the final schedule will be published at http://berlinbuzzwords.de.

Program Chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer.

Please re-distribute this CfP to people who might be interested.

If you are local and wish to meet us earlier, please note that this Thursday 
evening there will be an Apache Hadoop Get Together (videos kindly sponsored by 
Cloudera, venue kindly provided for free by Zanox) featuring talks on Apache 
Hadoop in production as well as news on current Apache Lucene developments.

Contact us at:

newthinking communications 
GmbH Schönhauser Allee 6/7 
10119 Berlin, 
Germany 

Julia Gemählich
Isabel Drost 

+49(0)30-9210 596




Re: How to Configure Solr to pick my lucene custom filter

2011-01-25 Thread Valiveti

Hi Eric,

Thanks for the reply.

I did see some entries in solrconfig.xml for adding custom
requestHandlers, queryParsers and queryResponseWriters,

but could not find the one for adding a custom filter.

Could you point to the exact location or syntax to be used?

Thanks,
Valiveti


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2334120.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Jonathan Rochkind
I haven't figured out any way to achieve that AT ALL without making a 
seperate Solr index just to serve autosuggest queries. At least when you 
want to auto-suggest on a multi-value field. Someone posted a crazy 
tricky way to do it with a single-valued field a while ago.  If you 
can/are willing to make a seperate Solr index with a schema set up for 
auto-suggest specifically, it's easy. But from an existing schema, where 
you want to auto-suggest just based on the values in one field, it's a 
multi-valued field, and you want to allow matches in the middle of the 
field -- I don't think there's a way to do it.


On 1/25/2011 3:03 PM, johnnyisrael wrote:

Hi Eric,

What I want here is, lets say I have 3 documents like

[pineapple vers apple, milk with apple, apple milk shake ]

and If i search for apple, it should return only apple milk shake
because that term alone starts with the letter apple which I typed in. It
should not bring others and if I type milk it should return only milk
with apple

I want an output Similar like a Google auto suggest.

Is there a way to achieve  this without encapsulating with double quotes.

Thanks,

Johnny


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Markus Jelsma
Then you don't need NGrams at all. A wildcard will suffice or you can use the 
TermsComponent.

If these strings are indexed as single tokens (KeywordTokenizer with 
LowercaseFilter) you can simply do field:app* to retrieve the apple milk 
shake. You can also use the string field type but then you must make sure the 
values are already lowercased before indexing.

Be careful though, there is no query time analysis for wildcard (and fuzzy)
queries, so make sure the prefix you query with is already lowercased as well.
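
A minimal sketch of the field type described above (the type and field names
here are made up for illustration):

  <fieldType name="suggest_kw" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="suggest" type="suggest_kw" indexed="true" stored="true"/>

With that in place, a query such as suggest:app* matches apple milk shake but
not milk with apple, because each value is indexed as a single lowercased
token and the wildcard only matches from the start of that token.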

 Hi Eric,
 
 What I want here is, lets say I have 3 documents like
 
 [pineapple vers apple, milk with apple, apple milk shake ]
 
 and If i search for apple, it should return only apple milk shake
 because that term alone starts with the letter apple which I typed in. It
 should not bring others and if I type milk it should return only milk
 with apple
 
 I want an output Similar like a Google auto suggest.
 
 Is there a way to achieve  this without encapsulating with double quotes.
 
 Thanks,
 
 Johnny


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Markus Jelsma
Oh, I should perhaps mention that EdgeNGrams will yield results a lot quicker 
than wildcards, at the cost of a larger index. You should, of course, use 
EdgeNGrams if you worry about performance and have a huge index and a high number 
of queries per second.

 Then you don't need NGrams at all. A wildcard will suffice or you can use
 the TermsComponent.
 
 If these strings are indexed as single tokens (KeywordTokenizer with
 LowercaseFilter) you can simply do field:app* to retrieve the apple milk
 shake. You can also use the string field type but then you must make sure
 the values are already lowercased before indexing.
 
 Be careful though, there is no query time analysis for wildcard (and fuzzy)
 queries so make sure
 
  Hi Eric,
  
  What I want here is, lets say I have 3 documents like
  
  [pineapple vers apple, milk with apple, apple milk shake ]
  
  and If i search for apple, it should return only apple milk shake
  because that term alone starts with the letter apple which I typed in.
  It should not bring others and if I type milk it should return only
  milk with apple
  
  I want an output Similar like a Google auto suggest.
  
  Is there a way to achieve  this without encapsulating with double quotes.
  
  Thanks,
  
  Johnny


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread mesenthil

The index contains around 1.5 million documents. As this is used for an
autosuggest feature, performance is an important factor. 

So it looks like, using edgeNgram, it is difficult to achieve the
following: 

The result should return only those terms where the search letter matches
the first word. For example, when we type M, it should return
Mumford and Sons and not Jackson Michael. 


Jonathan,

Is it possible to achieve this when we have separate index using edgeNgram?
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2334538.html
Sent from the Solr - User mailing list archive at Nabble.com.


Specifying optional terms with standard (lucene) request handler?

2011-01-25 Thread Daniel Pötzinger
Hi

I am searching for a way to specify optional terms in a query (terms that don't
need to match, but that should influence the scoring if they do).

Using the dismax parser, a query like this:
<str name="mm">2</str>
<str name="debugQuery">on</str>
<str name="q">+lorem ipsum dolor amet</str>
<str name="qf">content</str>
<str name="hl.fl"/>
<str name="qt">dismax</str>
will be parsed into something like this:
<str name="parsedquery_toString">
+((+(content:lor) (content:ipsum) (content:dolor) (content:amet))~2) ()
</str>
which means that only 2 of the 3 optional terms need to match.


How can optional terms be specified using the standard request handler?
My concrete requirement is that a certain term must match but another is
optional. But if the optional part matches, it should give the document an
extra score.
Something like :-)
<str name="q">content:lorem #optional#content:optionalboostword^10</str>

An idea would be to use a function query to boost the document:
<str name="q">
content:lorem _val_:query({!lucene v='optionalword^20'})
</str>
which will result in:
<str name="parsedquery_toString">
+content:forum +query(content:optionalword^20.0,def=0.0)
</str>
Is this a good way, or are there other suggestions?

Thanks for any opinion and tips on this

Daniel




Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Jonathan Rochkind
Ah, sorry, I got confused about your requirements, if you just want to 
match at the beginning of the field, it may be more possible.  Using 
edgegrams or wildcard. If you have a single-valued field. Do you have a 
single-valued or a multi-valued field?  That is, does each document have 
just one value, or multiple?   I still get confused about how to do it 
with edgegrams, even with single-valued field, but I think maybe it's 
possible.


_Definitely_ possible, with or without edgegrams, if you are 
willing/able to make a completely seperate Solr index where each term 
for auto-suggest is a document.  Yes.


The problem lies in what results are. In general, Solr's results are 
the documents you have in the Solr index. Thus it makes everything a lot 
easier to deal with if you have an index where each document in the 
index is a term for auto-suggest.   But that doesnt' always meet 
requirements if you need to auto-suggest within existing fq's and such, 
and of course it takes more resources to run an additional solr index.


On 1/25/2011 5:03 PM, mesenthil wrote:

The index contains around 1.5 million documents. As this is used for
autosuggest feature, performance is an important factor.

So it looks like, using edgeNgram it is difficult to achieve the the
following

Result should return only those terms where search letter is matching with
the first word only. For example, when we type M,  it should return
Mumford and Sons and not jackson Michael.


Jonathan,

Is it possible to achieve this when we have separate index using edgeNgram?



Re: Specifying optional terms with standard (lucene) request handler?

2011-01-25 Thread Jonathan Rochkind

With the 'lucene' query parser?

include q.op=OR and then put a + (mandatory) in front of every term 
in the 'q' that is NOT optional; the rest will be optional.  I think 
that will do what you want.


Jonathan
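
In other words, with the standard request handler the parameters would look
something like this (field and term names taken from the example above):

  <str name="q">+content:lorem content:optionalboostword^10</str>
  <str name="q.op">OR</str>

The +content:lorem clause must match; content:optionalboostword^10 is optional
but adds to the score of documents where it does match.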

On 1/25/2011 5:07 PM, Daniel Pötzinger wrote:

Hi

I am searching for a way to specify optional terms in a query ( that dont need 
to match (But if they match should influence the scoring) )

Using the dismax parser a query like this:
str name=mm2/str
str name=debugQueryon/str
str name=q+lorem ipsum dolor amet/str
str name=qfcontent/str
str name=hl.fl/
str name=qtdismax/str
Will be parsed into something like this:
str name=parsedquery_toString
+((+(content:lor) (content:ipsum) (content:dolor) (content:amet))~2) ()
/str
Which will result that only 2 of the 3 optional terms need to match?


How can optional terms be specified using the standard request handler?
My concrete requirement is that a certain term should match but another is 
optional. But if the optional part matches - it should give the document an 
extra score.
Something like :-)
str name=qcontent:lorem #optional#content:optionalboostword^10/str

An idea would be to use a function query to boost the document:
str name=q
content:lorem _val_:query({!lucene v='optionalword^20'})
/str
Which will result in:
str name=parsedquery_toString
+content:forum +query(content:optionalword^20.0,def=0.0)
/str
Is this a good way or are there other suggestions?

Thanks for any opinion and tips on this

Daniel




Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread mesenthil

Right now our configuration says multiValued=true, but that need not be
true in our case. I will make it false, try it, and update this thread with
more details.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2334627.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr set up issues with Magento

2011-01-25 Thread Sandhya Padala
Thank you Markus. I have added few more fields to schema.xml.

Now it looks like the products are getting indexed, but there are no search results.

In Magento, if I configure Solr as the search engine, search does not
return any results.  If I change the search engine to Magento's
inbuilt MySQL, search results are returned.  Can you please direct me on
where/how I should start the debug process?

If I use the Solr admin and enter the search query, that doesn't return any
results either.

Thank you,
Sandhya
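
One quick sanity check here (host, port, and the numFound value below are only
illustrative) is to confirm that documents are actually in the index, and then
compare the fields Magento queries against what the schema defines:

  http://localhost:8983/solr/select?q=*:*&rows=0

  <result name="response" numFound="1234" start="0"/>

If numFound is 0, the indexing side still needs attention; if it is non-zero,
the mismatch is more likely on the query side (field names, default search
field, or analysis).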

On Mon, Jan 24, 2011 at 4:11 PM, Markus Jelsma
markus.jel...@openindex.iowrote:

 Hi,

 You haven't defined the field in Solr's schema.xml configuration so it
 needs to
 be added first. Perhaps following the tutorial might be a good idea.

 http://lucene.apache.org/solr/tutorial.html

 Cheers.

  Hello Team:
 
 
I am in the process of setting up Solr 1.4 with Magento ENterprise
  Edition 1.9.
 
  When I try to index the products I get the following error message.
 
  Jan 24, 2011 3:30:14 PM
 org.apache.solr.update.processor.LogUpdateProcessor
  fini
  sh
  INFO: {} 0 0
  Jan 24, 2011 3:30:14 PM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field
  'in_stock' at
  org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.jav
  a:289)
  at
  org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpd
  ateProcessorFactory.java:60)
  at
  org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
  at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
  at
  org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Co
  ntentStreamHandlerBase.java:54)
  at
  org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandl
  erBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
  at
  org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter
  .java:338)
  at
  org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilte
  r.java:241)
  at
  org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appl
  icationFilterChain.java:244)
  at
  org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationF
  ilterChain.java:210)
  at
  org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperV
  alve.java:240)
  at
  org.apache.catalina.core.StandardContextValve.invoke(StandardContextV
  alve.java:161)
  at
  org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.j
  ava:164)
  at
  org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.j
  ava:100)
  at
  org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:
  550)
  at
  org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineVal
  ve.java:118)
  at
  org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.jav
  a:380)
  at
  org.apache.coyote.http11.Http11Processor.process(Http11Processor.java
 
  :243)
 
  at
  org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce
  ss(Http11Protocol.java:188)
  at
  org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce
  ss(Http11Protocol.java:166)
  at
  org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoin
  t.java:288)
  at
  java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExec
  utor.java:886)
  at
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
  .java:908)
  at java.lang.Thread.run(Thread.java:662)
 
  Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute
  INFO: [] webapp=/solr path=/update params={wt=json} status=400 QTime=0
  Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2
  rollback INFO: start rollback
  Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2
  rollback INFO: end_rollback
  Jan 24, 2011 3:30:14 PM
 org.apache.solr.update.processor.LogUpdateProcessor
  fini
  sh
  INFO: {rollback=} 0 16
  Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute
 
  I am a new to both Magento and SOlr. I could have done some thing stupid
  during installation. I really look forward for your help.
 
  Thank you,
  Sandhya



Best way to build a solr-based m2 project

2011-01-25 Thread Paul Libbrecht

Hello list,

Apologies if this was already asked; I haven't found the answer in the archive,
as I've been off this list for quite some time now.

I am looking for a good way to package a maven2 project that produces a
Solr-based webapp.
I would expect projects such as the velocity contrib, or even the default Solr,
to include everything needed for this, but I don't see it organized that way and, in
particular, I see nothing with a packaging of type war.

Have I missed something?
Should I simply copy some bits into my source tree and then make sure they
get copied to the right place?

I found a Solr archetype, but it only delivers a standalone Solr, which does
not interest me.

thanks in advance

paul
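
One possible approach is a Maven war overlay over the stock Solr webapp. A
sketch, assuming the Solr war artifact (org.apache.solr:solr, type war) is
available in your repository for the version you target; the groupId,
artifactId and version of the project itself are made up:

  <project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>my-solr-webapp</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>war</packaging>
    <dependencies>
      <dependency>
        <groupId>org.apache.solr</groupId>
        <artifactId>solr</artifactId>
        <version>1.4.1</version>
        <type>war</type>
      </dependency>
    </dependencies>
    <!-- your own solrconfig.xml/schema.xml and extra jars go under src/main/webapp
         (or a separate Solr home) and are merged over the stock war at package time -->
  </project>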

Re: in-index representaton of tokens

2011-01-25 Thread Dennis Gearon
I am asking: is there a list of tokens that have been parsed (a table of them)
for each column? Or one for the whole index?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Jonathan Rochkind rochk...@jhu.edu
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Tue, January 25, 2011 9:29:36 AM
Subject: Re: in-index representaton of tokens

Why does it matter?  You can't really get at them unless you store them.

I don't know what table per column means, there's nothing in Solr 
architecture called a table or a column. Although by column you 
probably mean more or less Solr field.  There is nothing like a 
table in Solr.

Solr is still not an rdbms.

On 1/25/2011 12:26 PM, Dennis Gearon wrote:
 So, the index is a list of tokens per column, right?

 There's a table per column that lists the analyzed tokens?

 And the tokens per column are represented as what, system integers? 32/64 bit
 unsigned ints?

   Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a 
better
 idea to learn from others’ mistakes, so you do not have to make them yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.




Re: How to Configure Solr to pick my lucene custom filter

2011-01-25 Thread Erick Erickson
First, let's be sure we're talking about the same thing. My response was for
adding
a filter to your analysis chain for a field in Schema.xml. Are you talking
about a different
sort of filter?

Best
Erick

On Tue, Jan 25, 2011 at 4:09 PM, Valiveti narasimha.valiv...@gmail.comwrote:


 Hi Eric,

 Thanks for the reply.

 I Did see some entries in the solrconfig.xml for adding custom
 reposneHandlers, queryParsers and queryResponseWriters.

 Bit could not find the one for adding the custom filter.

 Could you point to the exact location or syntax to be used.

 Thanks,
 Valiveti


 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2334120.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: in-index representaton of tokens

2011-01-25 Thread Markus Jelsma
This should shed some light on the matter
http://lucene.apache.org/java/2_9_0/fileformats.html

 I am saying there is a list of tokens that have been parsed (a table of
 them) for each column? Or one for the whole index?
 
  Dennis Gearon
 
 
 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a
 better idea to learn from others’ mistakes, so you do not have to make
 them yourself. from
 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'
 
 
 EARTH has a Right To Life,
 otherwise we all die.
 
 
 
 - Original Message 
 From: Jonathan Rochkind rochk...@jhu.edu
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Sent: Tue, January 25, 2011 9:29:36 AM
 Subject: Re: in-index representaton of tokens
 
 Why does it matter?  You can't really get at them unless you store them.
 
 I don't know what table per column means, there's nothing in Solr
 architecture called a table or a column. Although by column you
 probably mean more or less Solr field.  There is nothing like a
 table in Solr.
 
 Solr is still not an rdbms.
 
 On 1/25/2011 12:26 PM, Dennis Gearon wrote:
  So, the index is a list of tokens per column, right?
  
  There's a table per column that lists the analyzed tokens?
  
  And the tokens per column are represented as what, system integers? 32/64
  bit unsigned ints?
  
Dennis Gearon
  
  Signature Warning
  
  It is always a good idea to learn from your own mistakes. It is usually a
 
 better
 
  idea to learn from others’ mistakes, so you do not have to make them
  yourself. from
  'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'
  
  
  EARTH has a Right To Life,
  otherwise we all die.


RE: DIH serialize

2011-01-25 Thread Papp Richard
Dear Stefan,

  thank you for your help! 
  Well, I wrote a small script; it's not JSON, but it works:

  <script><![CDATA[
function my_serialize(row)
{
  // concatenate the timetable columns into one "||"-delimited string
  var st = "";

  st = row.get('stt_id') + "||" +
       row.get('stt_name') + "||" +
       row.get('stt_date_from') + "||" +
       row.get('stt_date_to') + "||" +
       row.get('stt_monday') + "||" +
       row.get('stt_tuesday') + "||" +
       row.get('stt_wednesday') + "||" +
       row.get('stt_thursday') + "||" +
       row.get('stt_friday') + "||" +
       row.get('stt_saturday') + "||" +
       row.get('stt_sunday');

  // put the serialized value into a column that the entity maps to a field
  var ret = new java.util.HashMap();
  ret.put('main_timetable', st);

  return ret;
}
  ]]></script>

regards,
  Rich
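
For completeness, a function like this is wired in via DIH's ScriptTransformer.
A sketch of the surrounding data-config.xml, with the dataSource attributes,
entity name and SQL left as placeholders/assumptions:

  <dataConfig>
    <dataSource driver="..." url="..." user="..." password="..."/>
    <script><![CDATA[
      /* my_serialize() as above */
    ]]></script>
    <document>
      <entity name="timetable" transformer="script:my_serialize"
              query="SELECT * FROM stt_timetable">
        <field column="main_timetable" name="main_timetable"/>
      </entity>
    </document>
  </dataConfig>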

-Original Message-
From: Stefan Matheis [mailto:matheis.ste...@googlemail.com] 
Sent: Tuesday, January 25, 2011 11:13
To: solr-user@lucene.apache.org
Subject: Re: DIH serialize

Rich,

i played around for a few minutes with Script-Transformers, but i have not
enough knowledge to get anything done right know :/
My Idea was: looping over the given row, which should be a Java HashMap or
something like that? and do sth like this (pseudo-code):

var row_data = []
for( var key in row )
{
  row_data.push( '' + key + ' : ' + row[key] + '' );
}
row.put( 'whatever_field', '{' + row_data.join( ',' ) + '}' );

Which should result in a json-object like {'key1':'value1', 'key2':'value2'}
- and that should be okay to work with?

Regards
Stefan

On Mon, Jan 24, 2011 at 7:53 PM, Papp Richard ccode...@gmail.com wrote:

 Hi Stefan,

  yes, this is exactly what I intend - I don't want to search in this field
 - just quicly return me the result in a serialized form (the search
 criteria
 is on other fields). Well, if I could serialize the data exactly as like
 the
 PHP serialize() does I would be maximally satisfied, but any other form in
 which I could compact the data easily into one field I would be pleased.
  Can anyone help me? I guess the script is quite a good way, but I don't
 know which function should I use there to compact the data to be easily
 usable in PHP. Or any other method?

 thanks,
  Rich

 -Original Message-
 From: Stefan Matheis [mailto:matheis.ste...@googlemail.com]
 Sent: Monday, January 24, 2011 18:23
 To: solr-user@lucene.apache.org
 Subject: Re: DIH serialize

 Hi Rich,

 i'm a bit confused after reading your post .. what exactly you wanna try
to
 achieve? Serializing (like http://php.net/serialize) your complete row
 into
 one field? Don't wanna search in them, just store and deliver them in your
 results? Does that make sense? Sounds a bit strange :)

 Regards
 Stefan

 On Mon, Jan 24, 2011 at 10:03 AM, Papp Richard ccode...@gmail.com wrote:

  Hi Dennis,
 
   thank you for your answer, but didn't understand why you say it doesn't
  need serialization. I'm with the option C.
   but the main question is, how to put into one field a result of many
  fields: SELECT * FROM.
 
  thanks,
   Rich
 
  -Original Message-
  From: Dennis Gearon [mailto:gear...@sbcglobal.net]
  Sent: Monday, January 24, 2011 02:07
  To: solr-user@lucene.apache.org
  Subject: Re: DIH serialize
 
  Depends on your process chain to the eventual viewer/consumer of the
 data.
 
  The questions to ask are:
   A/ Is the data IN Solr going to be viewed or processed in its original form:
      --set stored='true'
      --no serialization needed.
   B/ If it's going to be analyzed and searched for separate from any other field,
      the analyzing will put it into an unreadable form. If you need to see it, then
      --set indexed=true and stored=true
      --no serialization needed.
   C/ If it's NOT going to be viewed AS IS, and it's not going to be searched for AS IS,
      (i.e. other columns will be how the data is found), and you have another,
      serializable format:
      --set indexed=false and stored=true
      --serialize AS PER THE INTENDED APPLICATION,
        not sure that Solr can do that at all.
   D/ If it's NOT going to be viewed AS IS, BUT it's going to be searched for AS IS,
      (this column will be how the data is found), and you have another,
      serializable format:
      --you need to put it into TWO columns
      --A SERIALIZED FIELD
        --set indexed=false and stored=true
      --AN UNSERIALIZED FIELD
        --set indexed=false and stored=true
        --serialize AS PER THE INTENDED APPLICATION,
          not sure that Solr can do that at all.

  Hope that helps!
 
 
  Dennis Gearon
 
 
  Signature Warning
  

RE: in-index representaton of tokens

2011-01-25 Thread Jonathan Rochkind
There aren't any tables involved. There's basically one list (per field) of 
unique tokens for the entire index, and also, a list for each token of which 
documents contain that token. Which is efficiently encoded, but I don't know 
the details of that encoding, maybe someone who does can tell you, or you can 
look at the lucene source, or get one of the several good books on lucene.  
These 'lists' are set up so you can efficiently look up a token, and see what 
documents contain that token.  That's basically what lucene does, the purpose 
of lucene. Oh, and then there's term positions and such too, so not only can 
you see what documents contain that token but you can do proximity searches and 
stuff. 

This all gets into lucene implementation details I am not familiar with though. 
 

Why do you want to know?  If you have specific concerns about disk space or RAM 
usage or something and how different schema choices effect it, ask them, and 
someone can probably tell you more easily than someone can explain the total 
architecture of lucene in a short listserv message. But, hey, maybe someone 
other than me can do that too!

From: Dennis Gearon [gear...@sbcglobal.net]
Sent: Tuesday, January 25, 2011 7:02 PM
To: solr-user@lucene.apache.org
Subject: Re: in-index representaton of tokens

I am saying there is a list of tokens that have been parsed (a table of them)
for each column? Or one for the whole index?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Jonathan Rochkind rochk...@jhu.edu
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Tue, January 25, 2011 9:29:36 AM
Subject: Re: in-index representaton of tokens

Why does it matter?  You can't really get at them unless you store them.

I don't know what table per column means, there's nothing in Solr
architecture called a table or a column. Although by column you
probably mean more or less Solr field.  There is nothing like a
table in Solr.

Solr is still not an rdbms.

On 1/25/2011 12:26 PM, Dennis Gearon wrote:
 So, the index is a list of tokens per column, right?

 There's a table per column that lists the analyzed tokens?

 And the tokens per column are represented as what, system integers? 32/64 bit
 unsigned ints?

   Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a
better
 idea to learn from others’ mistakes, so you do not have to make them yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.




Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Erick Erickson
OK, try this.

Use some analysis chain for your field like:

<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory" />
</analyzer>

This can be a multiValued field, BTW.

now use the TermsComponent to fetch your data. See:
http://wiki.apache.org/solr/TermsComponent

and specify terms.prefix=apple e.g.
http://localhost:8983/solr/terms?terms.prefix=app&terms.fl=blivet

The return list should be what you want. Note that the returned
values will be lower cased, and you can only specify
lower case in your search term (all because of specifying
the lowercase filter in my example).

This should be very fast no matter what your index size, as the
return list size defaults to 10 (though you can specify different
numbers).

Best
Erick
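
With the KeywordTokenizer analysis above, each stored value becomes a single
term, so the TermsComponent response for terms.prefix=apple would look roughly
like this (field name and frequency are illustrative):

  <lst name="terms">
    <lst name="blivet">
      <int name="apple milk shake">1</int>
    </lst>
  </lst>

milk with apple and pineapple vers apple do not appear, because their
single-token values do not start with the requested prefix.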

On Tue, Jan 25, 2011 at 3:03 PM, johnnyisrael johnnyi.john...@gmail.comwrote:


 Hi Eric,

 What I want here is, lets say I have 3 documents like

 [pineapple vers apple, milk with apple, apple milk shake ]

 and If i search for apple, it should return only apple milk shake
 because that term alone starts with the letter apple which I typed in. It
 should not bring others and if I type milk it should return only milk
 with apple

 I want an output Similar like a Google auto suggest.

 Is there a way to achieve  this without encapsulating with double quotes.

 Thanks,

 Johnny
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2333602.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr set up issues with Magento

2011-01-25 Thread Erick Erickson
There's almost no information to go on here. Please review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Tue, Jan 25, 2011 at 6:13 PM, Sandhya Padala geend...@gmail.com wrote:

 Thank you Markus. I have added few more fields to schema.xml.

 Now looks like the products are getting indexed. But no search results.

 In Magento if I configure to use SOlr as the search engine. Search is not
 returning any results.  If I change the search engine to use Magento's
 inbuilt MYSQL , Search results are returned.  Can you please direct me on
 where/how I  should start debug process.

 If I use Solr admin and enter the search query that doesn't return any
 results either.

 Thank you,
 Sandhya

 On Mon, Jan 24, 2011 at 4:11 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:

  Hi,
 
  You haven't defined the field in Solr's schema.xml configuration so it
  needs to
  be added first. Perhaps following the tutorial might be a good idea.
 
  http://lucene.apache.org/solr/tutorial.html
 
  Cheers.
 
   Hello Team:
  
  
 I am in the process of setting up Solr 1.4 with Magento ENterprise
   Edition 1.9.
  
   When I try to index the products I get the following error message.
  
   Jan 24, 2011 3:30:14 PM
  org.apache.solr.update.processor.LogUpdateProcessor
   fini
   sh
   INFO: {} 0 0
   Jan 24, 2011 3:30:14 PM org.apache.solr.common.SolrException log
   SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field
   'in_stock' at
   org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.jav
   a:289)
   at
   org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpd
   ateProcessorFactory.java:60)
   at
   org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
   at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
   at
   org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Co
   ntentStreamHandlerBase.java:54)
   at
   org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandl
   erBase.java:131)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
   at
   org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter
   .java:338)
   at
   org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilte
   r.java:241)
   at
   org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appl
   icationFilterChain.java:244)
   at
   org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationF
   ilterChain.java:210)
   at
   org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperV
   alve.java:240)
   at
   org.apache.catalina.core.StandardContextValve.invoke(StandardContextV
   alve.java:161)
   at
   org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.j
   ava:164)
   at
   org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.j
   ava:100)
   at
   org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:
   550)
   at
   org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineVal
   ve.java:118)
   at
   org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.jav
   a:380)
   at
   org.apache.coyote.http11.Http11Processor.process(Http11Processor.java
  
   :243)
  
   at
   org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce
   ss(Http11Protocol.java:188)
   at
   org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce
   ss(Http11Protocol.java:166)
   at
   org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoin
   t.java:288)
   at
   java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExec
   utor.java:886)
   at
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
   .java:908)
   at java.lang.Thread.run(Thread.java:662)
  
   Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute
   INFO: [] webapp=/solr path=/update params={wt=json} status=400 QTime=0
   Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2
   rollback INFO: start rollback
   Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2
   rollback INFO: end_rollback
   Jan 24, 2011 3:30:14 PM
  org.apache.solr.update.processor.LogUpdateProcessor
   fini
   sh
   INFO: {rollback=} 0 16
   Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute
  
   I am a new to both Magento and SOlr. I could have done some thing
 stupid
   during installation. I really look forward for your help.
  
   Thank you,
   Sandhya
 



Specifying optional terms with standard (lucene) request handler?

2011-01-25 Thread Daniel Pötzinger
Hello 

I am searching for a way to specify optional terms in a query ( that dont need 
to match (But if they match should influence the scoring) )

Using the dismax parser a query like this:
str name=mm2/str
str name=debugQueryon/str
str name=q+lorem ipsum dolor amet/str
str name=qfcontent/str
str name=hl.fl/
str name=qtdismax/str
Will be parsed into something like this:
str name=parsedquery_toString
+((+(content:lor) (content:ipsum) (content:dolor) (content:amet))~2) ()
/str
Which will result that only 2 of the 3 optional terms need to match?


How can optional terms be specified using the standard request handler?
My concrete requirement is that a certain term should match but another is 
optional. But if the optional part matches - it should give the document an 
extra score.
Something like :-)
str name=qcontent:lorem #optional#content:optionalboostword^10/str

An idea would be to use a function query to boost the document:
str name=q
content:lorem _val_:query({!lucene v='optionalword^20'})
/str
Which will result in:
str name=parsedquery_toString
+content:forum +query(content:optionalword^20.0,def=0.0)
/str
Is this a good way or are there other suggestions?

Thanks for any opinion and tips on this

Daniel






DIH clean=false

2011-01-25 Thread cyang2010

I am not sure I really understand what is meant by clean=false.

In my understanding, for a full-import with the default clean=true, it will blow
away all documents in the existing index, then do a full import of data from a table
into the index.  Is that right?

Then for clean=false, my understanding is that it won't blow away the existing
index.  For data that exists in both the index and the db table (by the same uniqueKey),
it will update the index data regardless of whether there is an actual field update.
For data that exists in the index but not in the table (by comparing uniqueKey),
it will leave it in the index.  Is that correct?  Otherwise, what is the
difference from clean=true?

Looking forward to your knowledge on this.  Thanks!
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-clean-false-tp2351120p2351120.html
Sent from the Solr - User mailing list archive at Nabble.com.
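
For reference, the flag is simply passed on the full-import request (host and
handler path assumed here); clean=false skips the initial delete-all step,
while documents sharing a uniqueKey with incoming rows are still replaced and
documents missing from the table are left untouched:

  http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true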