Indexing Failed rolled back

2011-01-25 Thread Dinesh

i did some research on the schema and the DIH config file and created my own
DIH config; i'm getting this error when i run a full-import:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">try.xml</str>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Time Elapsed">0:0:0.163</str>
    <str name="Total Requests made to DataSource">0</str>
    <str name="Total Rows Fetched">1</str>
    <str name="Total Documents Processed">0</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2011-01-25 13:56:48</str>
    <str name="">Indexing failed. Rolled back all changes.</str>
    <str name="Rolledback">2011-01-25 13:56:48</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>

-
DINESHKUMAR . M
I am neither especially clever nor especially gifted. I am only very, very
curious.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-Failed-rolled-back-tp2327412p2327412.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH serialize

2011-01-25 Thread Stefan Matheis
Rich,

i played around for a few minutes with Script-Transformers, but i don't have
enough knowledge to get anything done right now :/
My idea was: loop over the given row, which should be a Java HashMap or
something like that, and do sth like this (pseudo-code):

var row_data = [];
for( var key in row )
{
  row_data.push( '"' + key + '" : "' + row[key] + '"' );
}
row.put( 'whatever_field', '{' + row_data.join( ',' ) + '}' );

Which should result in a json-object like {'key1':'value1', 'key2':'value2'}
- and that should be okay to work with?
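
A rough, untested sketch of how that could look in a DIH config with a
ScriptTransformer - the entity name, query and target field are placeholders
(main_timetable is just the field from your schema), and the values aren't
escaped, so it's only JSON-ish, not a real PHP serialize():

<dataConfig>
  <script><![CDATA[
    // row is a Java Map; build a JSON-ish string from all of its columns
    function serializeRow(row) {
      var keys = row.keySet().toArray();
      var parts = [];
      for (var i = 0; i < keys.length; i++) {
        parts.push('"' + keys[i] + '":"' + row.get(keys[i]) + '"');
      }
      row.put('whatever_field', '{' + parts.join(',') + '}');
      return row;
    }
  ]]></script>
  <!-- dataSource etc. omitted -->
  <document>
    <entity name="item" query="SELECT * FROM your_table"
            transformer="script:serializeRow">
      <field column="whatever_field" name="main_timetable"/>
    </entity>
  </document>
</dataConfig>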

Regards
Stefan

On Mon, Jan 24, 2011 at 7:53 PM, Papp Richard ccode...@gmail.com wrote:

 Hi Stefan,

  yes, this is exactly what I intend - I don't want to search in this field
 - just quickly return me the result in a serialized form (the search criteria
 is on other fields). Well, if I could serialize the data exactly like
 PHP serialize() does I would be maximally satisfied, but any other form in
 which I could compact the data easily into one field I would be pleased.
  Can anyone help me? I guess the script is quite a good way, but I don't
 know which function should I use there to compact the data to be easily
 usable in PHP. Or any other method?

 thanks,
  Rich

 -Original Message-
 From: Stefan Matheis [mailto:matheis.ste...@googlemail.com]
 Sent: Monday, January 24, 2011 18:23
 To: solr-user@lucene.apache.org
 Subject: Re: DIH serialize

 Hi Rich,

 i'm a bit confused after reading your post .. what exactly you wanna try to
 achieve? Serializing (like http://php.net/serialize) your complete row
 into
 one field? Don't wanna search in them, just store and deliver them in your
 results? Does that make sense? Sounds a bit strange :)

 Regards
 Stefan

 On Mon, Jan 24, 2011 at 10:03 AM, Papp Richard ccode...@gmail.com wrote:

  Hi Dennis,
 
   thank you for your answer, but didn't understand why you say it doesn't
  need serialization. I'm with the option C.
   but the main question is, how to put into one field a result of many
  fields: SELECT * FROM.
 
  thanks,
   Rich
 
  -Original Message-
  From: Dennis Gearon [mailto:gear...@sbcglobal.net]
  Sent: Monday, January 24, 2011 02:07
  To: solr-user@lucene.apache.org
  Subject: Re: DIH serialize
 
  Depends on your process chain to the eventual viewer/consumer of the
 data.
 
  The questions to ask are:
   A/ Is the data IN Solr going to be viewed or processed in its original form?
      --set stored=true
      --no serialization needed.
   B/ If it's going to be analyzed and searched for separately from any other field,
      the analyzing will put it into an unreadable form. If you need to see it, then
      --set indexed=true and stored=true
      --no serialization needed.
   C/ If it's NOT going to be viewed AS IS, and it's not going to be searched for AS IS
      (i.e. other columns will be how the data is found), and you have another,
      serializable format:
      --set indexed=false and stored=true
      --serialize AS PER THE INTENDED APPLICATION,
        not sure that Solr can do that at all.
   D/ If it's NOT going to be viewed AS IS, BUT it's going to be searched for AS IS
      (this column will be how the data is found), and you have another,
      serializable format:
      --you need to put it into TWO columns
      --A SERIALIZED FIELD
        --set indexed=false and stored=true
      --AN UNSERIALIZED FIELD
        --set indexed=true and stored=true
      --serialize AS PER THE INTENDED APPLICATION,
        not sure that Solr can do that at all.
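
   As a rough illustration of the two-field idea in D/ (field names made up;
   the serializing itself would still happen in your indexing code, not
   inside Solr):

     <field name="timetable_search" type="text"   indexed="true"  stored="true"/>
     <field name="timetable_blob"   type="string" indexed="false" stored="true"/>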
 
  Hope that helps!
 
 
  Dennis Gearon
 
 
  Signature Warning
  
  It is always a good idea to learn from your own mistakes. It is usually a
  better
  idea to learn from others' mistakes, so you do not have to make them
  yourself.
  from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'
 
 
  EARTH has a Right To Life,
  otherwise we all die.
 
 
 
  - Original Message 
  From: Papp Richard ccode...@gmail.com
  To: solr-user@lucene.apache.org
  Sent: Sun, January 23, 2011 2:02:05 PM
  Subject: DIH serialize
 
  Hi all,
 
 
 
   I wasted the last few hours trying to serialize some column values (from
  mysql) into a Solr column, but I just can't find such a function. I'll
 use
  the value in PHP - I don't know if it is possible to serialize in PHP
 style
  at all. This is what I tried and works with a given factor:
 
 
 
  in schema.xml:
 
    <field name="main_timetable" type="text" indexed="false"
           stored="true" multiValued="true" />
 
 
 
  in DIH xml:
 
 
 
  <dataConfig>
 
   

Re: synonyms file, and example cases

2011-01-25 Thread Stefan Matheis
Cam,

the examples with the provided inline-documentation should help you, no?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

The backslash \ in that context looks like an escape character, to avoid
the => being interpreted as the mapping separator.
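
For example (as I read the wiki page), a mapping like

  a\=>a => b\=>b

maps the literal token a=>a onto b=>b, instead of being split on the inner =>.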

Regards
Stefan

On Tue, Jan 25, 2011 at 2:31 AM, Cam Bazz camb...@gmail.com wrote:

 Hello,

 I have been looking at the solr synonym file that was an example, I
 did not understand some notation:

 aaa =>

 bbb => 1 2

 ccc => 1,2

 a\=>a => b\=>b

 a\,a => b\,b

 fooaaa,baraaa,bazaaa

 The first one says search for  when query is aaa. am I correct?
 the second one finds 1 2 when query is bbb
 the third one is find 1 or 2 when query is ccc

 the fourth, and fifth one I have not understood.

 the last one, i assume is a group, bidirectional mapping between
 fooaaa,baraaa,bazaaa

 I am especially interested with this last one, if I do aaa,bbb it will
 find aaa and bbb when either aaa or bbb is queryied?

 am I correct in those assumptions?

 Best regards,
 C.B.



Performance optimization of Proximity/Wildcard searches

2011-01-25 Thread Salman Akram
Hi,

I am facing performance issues in three types of queries (and their
combination). Some of the queries take more than 2-3 mins. Index size is
around 150GB.


   - Wildcard
   - Proximity
   - Phrases (with common words)

I know CommonGrams and Stop words are a good way to resolve such issues but
they don't fulfill our functional requirements (Common Grams seem to have
issues with phrase proximity, stop words have issues with exact match etc).

Sharding is an option too but that too comes with limitations so want to
keep that as a last resort but I think there must be other things coz 150GB
is not too big for one drive/server with 32GB Ram.

Cache warming is a good option too but the index get updated every hour so
not sure how much would that help.

What are the other main tips that can help in performance optimization of
the above queries?

Thanks

-- 
Regards,

Salman Akram


Re: please help Problem with dataImportHandler

2011-01-25 Thread Stefan Matheis
Caused by: org.xml.sax.SAXParseException: Element type "field" must be
followed by either attribute specifications, ">" or "/>".

Sounds like invalid XML in your .. dataimport-config?
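
i.e. every field element has to be closed properly, something like (a generic
example, not your actual columns):

  <field column="some_column" name="some_field" />

a missing ">" or "/>" at the end of such an element produces exactly that
SAXParseException.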

On Tue, Jan 25, 2011 at 5:41 AM, Dinesh mdineshkuma...@karunya.edu.inwrote:


 http://pastebin.com/tjCs5dHm

 this is the log produced by the solr server

 -
 DINESHKUMAR . M
 I am neither especially clever nor especially gifted. I am only very, very
 curious.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/please-help-Problem-with-dataImportHandler-tp2318585p2326659.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: please help Problem with dataImportHandler

2011-01-25 Thread Dinesh

ya after correcting it also it is throwing an exception

-
DINESHKUMAR . M
I am neither especially clever nor especially gifted. I am only very, very
curious.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/please-help-Problem-with-dataImportHandler-tp2318585p2327662.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Getting started with writing parser

2011-01-25 Thread Gora Mohanty
On Tue, Jan 25, 2011 at 10:05 AM, Dinesh mdineshkuma...@karunya.edu.in wrote:

 http://pastebin.com/CkxrEh6h

 this is my sample log
[...]

And, which portions of the log text do you want to preserve?
Does it go into Solr as a single error message, or do you want
to separate out parts of it?

Regards,
Gora


Re: Getting started with writing parser

2011-01-25 Thread Dinesh

i want to take the month, time, DHCPMESSAGE, from_mac, gateway_ip, net_ADDR

-
DINESHKUMAR . M
I am neither especially clever nor especially gifted. I am only very, very
curious.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-started-with-writing-parser-tp2278092p2327738.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: please help Problem with dataImportHandler

2011-01-25 Thread Dinesh

http://lucene.472066.n3.nabble.com/Getting-started-with-writing-parser-tp2278092p2327738.html

this thread explains my problem

-
DINESHKUMAR . M
I am neither especially clever nor especially gifted. I am only very, very
curious.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/please-help-Problem-with-dataImportHandler-tp2318585p2327745.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Getting started with writing parser

2011-01-25 Thread Gora Mohanty
On Tue, Jan 25, 2011 at 11:44 AM, Dinesh mdineshkuma...@karunya.edu.in wrote:

 i don't even know whether the regex expression that i'm using for my log is
 correct or no..

If it is the same try.xml that you posted earlier, it is very likely not
going to work. You seem to have just cut and pasted entries from
the Hathi Trust blog, without understanding how they work.

Could you take a fresh look at http://wiki.apache.org/solr/DataImportHandler
and explain in words the following:
* What is your directory structure for storing the log files?
* What parts of the log file do you want to keep (you have already explained
  this in another message)?
* How would the above translate into:
  - A Solr schema
  - Setting up (a) a data source, (b) processor(s), and (c) transformers.

i very much worried i couldn't proceed in my 
 project already
 1/3 rd of the timing is over.. please help.. this is just the first stage..
 after this i have ti setup up all the log to be redirected to SYSLOG and
 from there i'll send it to SOLR server.. then i have to analyse all the
 data's that i obtained from DNS, DHCP, WIFI, SWITCES.. and i have to prepare
 a user based report on his actions.. please help me cause the day's i have
 keeps reducing.. my project leader is questioning me a lot.. pls..
[...]

Well, I am sorry, but at least I strongly feel that we should
not be doing your work for you, and especially not if it is a
student project, as seems to be the case.

If you can address the above points one by one (stay on
this thread, please), people should be able to help you.
However, it is up to you to get to understand Solr well
enough.

Regards,
Gora


Re: Getting started with writing parser

2011-01-25 Thread Dinesh

no i actually changed the directory to mine where i stored the log files.. it
is /home/exam/apa..solr/example/exampledocs

i specified it in a solr schema.. i created an DataImportHandler for that in
try.xml.. then in that i changed that file name to sample.txt

that new try.xml is
http://pastebin.com/pfVVA7Hs

i changed the log into one word per line thinking there might be error in my
regex expression.. now i'm completely stuck..

-
DINESHKUMAR . M
I am neither especially clever nor especially gifted. I am only very, very
curious.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-started-with-writing-parser-tp2278092p2327920.html
Sent from the Solr - User mailing list archive at Nabble.com.


Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor

Hi,

I posted a question in November last year about indexing content from 
multiple binary files into a single Solr document and Jayendra responded 
with a simple solution to zip them up and send that single file to Solr.


I understand that the Tika 0.4 JARs supplied with Solr 1.4.1 don't 
currently allow this to work and only the file names of the zipped files 
are indexed (and not their contents).


I've tried downloading and building the latest Tika (0.8) and replacing 
the tika-parsers and tika-core JARS in 
solr-root\contrib\extraction\lib but this still isn't indexing the 
file contents, and now doesn't even index the file names!


Is there a version of Tika that works with the Solr 1.4.1 released 
distribution which does index the contents of the zipped files?


Thanks and kind regards,
Gary



DIH From various File system locations

2011-01-25 Thread pankaj bhatt
Hi All,
 I need to index the documents presents in my file system at various
locations (e.g. C:\docs , d:\docs ).
Is there any way through which i can specify this in my DIH
Configuration.
Here is my configuration:-

<document>
  <entity name="sd"
          processor="FileListEntityProcessor"
          fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
          baseDir="G:\\Desktop\\"
          recursive="false"
          rootEntity="true"
          transformer="DateFormatTransformer"
          onerror="continue">
    <entity name="tikatest"
            processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
            url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
      <field column="Author" name="author" meta="true"/>
      <field column="Content-Type" name="title" meta="true"/>
      <!-- <field column="title" name="title" meta="true"/> -->
      <field column="text" name="all_text"/>
    </entity>

    <!-- <field column="fileLastModified" name="date"
         dateTimeFormat="-MM-dd'T'hh:mm:ss" /> -->
    <field column="fileSize" name="size"/>
    <field column="file" name="filename"/>
  </entity>
  <!-- baseDir="../site" -->
</document>

/ Pankaj Bhatt.


Re: Performance optimization of Proximity/Wildcard searches

2011-01-25 Thread Toke Eskildsen
On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:
 Cache warming is a good option too but the index get updated every hour so
 not sure how much would that help.

What is the time difference between queries with a warmed index and a
cold one? If the warmed index performs satisfactory, then one answer is
to upgrade your underlying storage. As always for IO-caused performance
problem in Lucene/Solr-land, SSD is the answer.



Recommendation on RAM-/Cache configuration

2011-01-25 Thread Martin Grotzke
Hi,

recently we're experiencing OOMEs (GC overhead limit exceeded) in our
searches. Therefore I want to get some clarification on heap and cache
configuration.

This is the situation:
- Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit
- JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G
-XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC
-XX:+UseParallelGC
- The machine has 32 GB RAM
- Currently there are 4 processors/cores in the machine, this shall be
changed to 2 cores in the future.
- The index size in the filesystem is ~9.5 GB
- The index contains ~ 5.500.000 documents
- 1.500.000 of those docs are available for searches/queries, the rest are
inactive docs that are excluded from searches (via a flag/field), but
they're still stored in the index as need to be available by id (solr is the
main document store in this app)
- Caches are configured with a big size (the idea was to prevent filesystem
access / disk i/o as much as possible):
  - filterCache (solr.LRUCache): size=20, initialSize=3,
autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99
  - documentCache (solr.LRUCache): size=20, initialSize=10,
autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74
  - queryResultCache (solr.LRUCache): size=20, initialSize=3,
autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71
- Searches are performed using a catchall text field using standard request
handler, all fields are fetched (no fl specified)
- Normally ~ 5 concurrent requests, peaks up to 30 or 40 (mostly during GC)
- Recently we also added a feature that adds weighted search for special
fields, so that the query might become s.th. like this
  q=(some query) OR name_weighted:(some query)^2.0 OR brand_weighted:(some
query)^4.0 OR longDescription_weighted:(some query)^0.5
  (it seemed as if this was the cause of the OOMEs, but IMHO it only
increased RAM usage so that now GC could not free enough RAM)

The OOMEs that we get are of type GC overhead limit exceeded, one of the
OOMEs was thrown during auto-warming.

I checked two different heapdumps, the first one autogenerated
(by -XX:+HeapDumpOnOutOfMemoryError) the second one generated manually via
jmap.
These show the following distribution of used memory - the autogenerated
dump:
 - documentCache: 56% (size ~ 195.000)
- filterCache: 15% (size ~ 60.000)
- queryResultCache: 8% (size ~ 61.000)
- fieldCache: 6% (fieldCache referenced  by WebappClassLoader)
- SolrIndexSearcher: 2%

The manually generated dump:
- documentCache: 48% (size ~ 195.000)
- filterCache: 20% (size ~ 60.000)
- fieldCache: 11% (fieldCache referenced by the WebappClassLoader)
- queryResultCache: 7% (size ~ 61.000)
- fieldValueCache: 3%

We are also running two search engines with 17GB heap, these don't run into
OOMEs. Though, with these bigger heap sizes the longest requests are even
longer due to longer stop-the-world gc cycles.
Therefore my goal is to run with a smaller heap, IMHO even smaller than 8GB
would be good to reduce the time needed for full gc.

So what's the right path to follow now? What would you recommend to change
on the configuration (solr/jvm)?

Would you say it is ok to reduce the cache sizes? Would this increase disk
i/o, or would the index be hold in the OS's disk cache?

Do have other recommendations to follow / questions?

Thanx  cheers,
Martin


Re: Specifying an AnalyzerFactory in the schema

2011-01-25 Thread Renaud Delbru

Hi Chris,

On 24/01/11 21:18, Chris Hostetter wrote:

: I notice that in the schema, it is only possible to specify a Analyzer class,
: but not a Factory class as for the other elements (Tokenizer, Fitler, etc.).
: This limits the use of this feature, as it is impossible to specify parameters
: for the Analyzer.
: I have looked at the IndexSchema implementation, and I think this requires a
: simple fix. Do I open an issue about it ?

Support for constructing Analyzers directly is very crude, and primarily
existed for making it easy for people with old indexes and analyzers to
keep working.

moving forward, Lucene/Solr eventually won't ship concrete Analyzer
implementations at all (at least, that's the last consensus i remember) so
enhancing support for loading Analyzers (or AnalyzerFactories) doesn't
make much sense.

Practically speaking, if you have an existing Analyzer that you want to
use in Solr, instead of writing an AnalyzerFactory for it, you could
just write a TokenizerFactory that wraps it instead -- functionally that
would let you achieve everything an AnalyzerFactory would, except that
Solr would already handle letting the schema.xml specify the
positionIncrementGap (which you could happily ignore if you wanted)

Thanks for the trick, I hadn't thought about doing that. This should
indeed work.
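
As a rough sketch, the schema.xml side of that approach might look like this
(the factory class name is hypothetical - it would be my own wrapper around
the existing Analyzer):

<fieldType name="wrapped_analyzer_type" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.example.MyAnalyzerWrappingTokenizerFactory" someParam="value"/>
  </analyzer>
</fieldType>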


cheers
--
Renaud Delbru


Use terracotta bigmemory for solr-caches

2011-01-25 Thread Martin Grotzke
Hi,

as the biggest parts of our jvm heap are used by solr caches I asked myself
if it wouldn't make sense to run solr caches backed by terracotta's
bigmemory (http://www.terracotta.org/bigmemory).
The goal is to reduce the time needed for full / stop-the-world GC cycles,
as with our 8GB heap the longest requests take up to several minutes.

What do you think?

Cheers,
Martin


Re: Performance optimization of Proximity/Wildcard searches

2011-01-25 Thread Salman Akram
By warmed index you only mean warming the SOLR cache or OS cache? As I said
our index is updated every hour so I am not sure how much SOLR cache would
be helpful but OS cache should still be helpful, right?

I haven't compared the results with a proper script but from manual testing
here are some of the observations.

'Recent' queries which are in cache of course return immediately (only if
they are exactly same - even if they took 3-4 mins first time). I will need
to test how many recent queries stay in cache but still this would work only
for very common queries. User can run different queries and I want at least
them to be at 'acceptable' level (5-10 secs) even if not very fast.

Our warm up script currently executes all distinct queries in our logs
having count > 5. It was run yesterday (with all the indexing updates every
hour after that) and today when I executed some of the same queries again
their time seemed a little less (around 15-20%), I am not sure if this means
anything. However, still their time is not acceptable.

What do you think is the best way to compare results? First run all the warm
up queries and then execute same randomly and compare?

We are using Windows server, would it make a big difference if we move to
Linux? Our load is not high but some queries are really complex.

Also I was hoping to move to SSD in last after trying out all software
options. Is that an agreed fact that on large indexes (which don't fit in
RAM) proximity/wildcard/phrase queries (on common words) would be slow and
it can be only improved by cache warm up and better hardware? Otherwise with
an index of around 150GB such queries will take more than a min?

If that's the case I know this question is very subjective but if a single
query takes 2 min on SAS 10K RPM what would its approx time be on a good SSD
(everything else same)?

Thanks!


On Tue, Jan 25, 2011 at 3:44 PM, Toke Eskildsen t...@statsbiblioteket.dkwrote:

 On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:
  Cache warming is a good option too but the index get updated every hour
 so
  not sure how much would that help.

 What is the time difference between queries with a warmed index and a
 cold one? If the warmed index performs satisfactory, then one answer is
 to upgrade your underlying storage. As always for IO-caused performance
 problem in Lucene/Solr-land, SSD is the answer.




-- 
Regards,

Salman Akram


Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-01-25 Thread Markus Jelsma
Hi,

Are you sure you need CMS incremental mode? It's only advised when running on 
a machine with one or two processors. If you have more you should consider 
disabling the incremental flags.
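
In other words, keep something like

  -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

and drop the -XX:+CMSIncrementalMode and -XX:+CMSIncrementalPacing switches,
assuming the rest of your settings stay as they are.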

Cheers,

On Monday 24 January 2011 19:32:38 Simon Wistow wrote:
 We have two slaves replicating off one master every 2 minutes.
 
 Both using the CMS + ParNew Garbage collector. Specifically
 
 -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
 -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
 
 but periodically they both get into a GC storm and just keel over.
 
 Looking through the GC logs the amount of memory reclaimed in each GC
 run gets less and less until we get a concurrent mode failure and then
 Solr effectively dies.
 
 Is it possible there's a memory leak? I note that later versions of
 Lucene have fixed a few leaks. Our current versions are relatively old
 
   Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17
 18:06:42
 
   Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
 
 so I'm wondering if upgrading to later version of Lucene might help (of
 course it might not but I'm trying to investigate all options at this
 point). If so what's the best way to go about this? Can I just grab the
 Lucene jars and drop them somewhere (or unpack and then repack the solr
 war file?). Or should I use a nightly solr 1.4?
 
 Or am I barking up completely the wrong tree? I'm trawling through heap
 logs and gc logs at the moment trying to to see what other tuning I can
 do but any other hints, tips, tricks or cluebats gratefully received.
 Even if it's just Yeah, we had that problem and we added more slaves
 and periodically restarted them
 
 thanks,
 
 Simon

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Weird behaviour with phrase queries

2011-01-25 Thread Erick Erickson
Frankly, this puzzles me. It *looks* like it should be OK. One warning, the
analysis page sometimes is a bit misleading, so beware of that.

But the output of your queries make it look like the query is parsing as you
expect, which leaves the question of whether your index contains what
you think it does. You might get a copy of Luke, which allows you to examine
what's actually in your index instead of what you think is in there.
Sometimes
there are surprises here!

I didn't mean to re-index your whole corpus, I was thinking that you could
just index a few documents in a test index so you have something small to
look at.

Sorry I can't spot what's happening right away.

Good luck!
Erick

On Tue, Jan 25, 2011 at 2:45 AM, Jerome Renard jerome.ren...@gmail.comwrote:

 Erick,

 On Mon, Jan 24, 2011 at 9:57 PM, Erick Erickson 
 erickerick...@gmail.comwrote:

 Hmmm, I don't see any screen shots. Several things:
 1 If your stopword file has comments, I'm not sure what the effect would
 be.


 Ha, I thought comments were supported in stopwords.txt


 2 Something's not right here, or I'm being fooled again. Your withresults
 xml has this line:
 <str name="parsedquery">+DisjunctionMaxQuery((meta_text:"ecol d ingenieur")~0.01) ()</str>
 and your noresults has this line:
 <str name="parsedquery">+DisjunctionMaxQuery((meta_text:"academi charpenti")~0.01)
 DisjunctionMaxQuery((meta_text:"academi charpenti"~100)~0.01)</str>

 the empty () in the first one often means you're NOT going to your
 configured dismax parser in solrconfig.xml. Yet that doesn't square with
 your custom qt, so I'm puzzled.

 Could we see your raw query string on the way in? It's almost as if you
 defined qt in one and defType in the other, which are not equivalent.


 You are right I fixed this problem (my bad).

 3 It may take 12 hours to index, but you could experiment with a smaller
 subset. You say you know that the noresults one should return documents,
 what proof do
 you have? If there's a single document that you know should match this,
 just
 index it and a few others and you should be able to make many runs until
 you
 get
 to the bottom of this...


 I could but I always thought I had to fully re-index after updating
 schema.xml. If
 I update only few documents will that take the changes into account without
 breaking
 the rest ?


 And obviously your stemming is happening on the query, are you sure it's
 happening at index time too?


 Since you did not get the screenshots you will find attached the full
 output of the analysis
 for a phrase that works and for another that does not.

 Thanks for your support

 Best Regards,

 --
 Jérôme



Re: Recommendation on RAM-/Cache configuration

2011-01-25 Thread Markus Jelsma
On Tuesday 25 January 2011 11:54:55 Martin Grotzke wrote:
 Hi,
 
 recently we're experiencing OOMEs (GC overhead limit exceeded) in our
 searches. Therefore I want to get some clarification on heap and cache
 configuration.
 
 This is the situation:
 - Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit
 - JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G
 -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC
 -XX:+UseParallelGC

Consider switching to HotSpot JVM, use the -server as the first switch.

 - The machine has 32 GB RAM
 - Currently there are 4 processors/cores in the machine, this shall be
 changed to 2 cores in the future.
 - The index size in the filesystem is ~9.5 GB
 - The index contains ~ 5.500.000 documents
 - 1.500.000 of those docs are available for searches/queries, the rest are
 inactive docs that are excluded from searches (via a flag/field), but
 they're still stored in the index as need to be available by id (solr is
 the main document store in this app)

How do you exclude them? It should use filter queries. I also remember (but i 
just cannot find it again, so please correct me if i'm wrong) that in 1.4.x 
sorting is done before filtering. It should be an improvement if filtering is 
done before sorting.
If you use sorting, it takes up a huge amount of RAM if filtering is not done 
first.
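
For example, assuming a boolean flag field (here called findable just for
illustration), the exclusion would go into a filter query rather than into q:

  q=(some query)&fq=findable:true

so the filter can be computed once, cached in the filterCache and reused
across queries.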

 - Caches are configured with a big size (the idea was to prevent filesystem
 access / disk i/o as much as possible):

There is only disk I/O if the kernel can't keep the index (or parts) in its 
page cache.

   - filterCache (solr.LRUCache): size=20, initialSize=3,
 autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99
   - documentCache (solr.LRUCache): size=20, initialSize=10,
 autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74
   - queryResultCache (solr.LRUCache): size=20, initialSize=3,
 autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71

You should decrease the initialSize values. But your hit ratios seem very 
nice.

 - Searches are performed using a catchall text field using standard request
 handler, all fields are fetched (no fl specified)
 - Normally ~ 5 concurrent requests, peaks up to 30 or 40 (mostly during GC)
 - Recently we also added a feature that adds weighted search for special
 fields, so that the query might become s.th. like this
   q=(some query) OR name_weighted:(some query)^2.0 OR brand_weighted:(some
 query)^4.0 OR longDescription_weighted:(some query)^0.5
   (it seemed as if this was the cause of the OOMEs, but IMHO it only
 increased RAM usage so that now GC could not free enough RAM)
 
 The OOMEs that we get are of type GC overhead limit exceeded, one of the
 OOMEs was thrown during auto-warming.

Warming takes additional RAM. The current searcher still has its caches full 
and newSearcher is getting filled up. Decreasing sizes might help.

 
 I checked two different heapdumps, the first one autogenerated
 (by -XX:+HeapDumpOnOutOfMemoryError) the second one generated manually via
 jmap.
 These show the following distribution of used memory - the autogenerated
 dump:
  - documentCache: 56% (size ~ 195.000)
 - filterCache: 15% (size ~ 60.000)
 - queryResultCache: 8% (size ~ 61.000)
 - fieldCache: 6% (fieldCache referenced  by WebappClassLoader)
 - SolrIndexSearcher: 2%
 
 The manually generated dump:
 - documentCache: 48% (size ~ 195.000)
 - filterCache: 20% (size ~ 60.000)
 - fieldCache: 11% (fieldCache referenced by the WebappClassLoader)
 - queryResultCache: 7% (size ~ 61.000)
 - fieldValueCache: 3%
 
 We are also running two search engines with 17GB heap, these don't run into
 OOMEs. Though, with these bigger heap sizes the longest requests are even
 longer due to longer stop-the-world gc cycles.
 Therefore my goal is to run with a smaller heap, IMHO even smaller than 8GB
 would be good to reduce the time needed for full gc.
 
 So what's the right path to follow now? What would you recommend to change
 on the configuration (solr/jvm)?

Try tuning the GC
http://java.sun.com/performance/reference/whitepapers/tuning.html
http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html

 
 Would you say it is ok to reduce the cache sizes? Would this increase disk
 i/o, or would the index be hold in the OS's disk cache?

Yes! If you also allocate less RAM to the JVM then there is more for the OS to 
cache.

 
 Do have other recommendations to follow / questions?
 
 Thanx  cheers,
 Martin

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Adding weightage to the facets count

2011-01-25 Thread Johannes Goll
Hi Siva,

try using the Solr Stats Component
http://wiki.apache.org/solr/StatsComponent

similar to
select/?q=*:*&stats=true&stats.field={your-weight-field}&stats.facet={your-facet-field}

and get the sum field from the response. You may need to resort the weighted
facet counts to get a descending list of facet counts.
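
The per-facet-value sums show up in the stats section of the response,
roughly like this (abbreviated, from memory - the exact layout may differ by
version; "weight"/"tag" and the 120 just follow the Light Weight example
quoted below):

<lst name="stats">
  <lst name="stats_fields">
    <lst name="weight">
      <double name="sum">120.0</double>
      ...
      <lst name="facets">
        <lst name="tag">
          <lst name="Light Weight">
            <double name="sum">120.0</double>
            ...
          </lst>
        </lst>
      </lst>
    </lst>
  </lst>
</lst>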

Note, there is a bug for using the Stats Component with multi-valued facet
fields.

For details see
https://issues.apache.org/jira/browse/SOLR-1782

Johannes

2011/1/24 Chris Hostetter hossman_luc...@fucit.org


 : prod1 has tag called “Light Weight” with weightage 20,
 : prod2 has tag called “Light Weight” with weightage 100,
 :
 : If i get facet for “Light Weight” , i will get Light Weight (2) ,
 : here i need to consider the weightage in to account, and the result will
 be
 : Light Weight (120)
 :
 : How can we achieve this?Any ideas are really helpful.


 It's not really possible with Solr out of the box.  Faceting is fast and
 efficient in Solr because it's all done using set intersections (and most
 of the sets can be kept in ram very compactly and reused).  For what you
 are describing you'd need to not only associate a weighted payload with
 every TermPosition, but also factor that weight in when doing the
 faceting, which means efficient set operations are now out the window.

 If you know java it would probably be possible to write a custom
 SolrPlugin (a SearchComponent) to do this type of faceting in special
 cases (assuming you indexed in a particular way) but i'm not sure off the
 top of my head how well it would scale -- the basic algo i'm thinking of
 is (after indexing each facet term with a weight payload) to iterate over
 the DocSet of all matching documents in parallel with an iteration over
 a TermPositions, skipping ahead to only the docs that match the query, and
 recording the sum of the payloads for each term.

 Hmmm...

 except TermPositions iterates over <term, doc, freq, position> tuples,
 so you would have to iterate over every term, and for every term then loop
 over all matching docs ... like i said, not sure how efficient it would
 wind up being.

 You might be happier all around if you just do some sampling -- store the
 tag+weight pairs so that they can be retrieved with each doc, and then
 when you get your top facet constraints back, look at the first page of
 results, and figure out what the sum weight is for each of those
 constraints based solely on the page#1 results.

 i've had happy users using a similar approach in the past.

 -Hoss




-- 
Johannes Goll
211 Curry Ford Lane
Gaithersburg, Maryland 20878


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread johnnyisrael

Hi Eric,

You are right, there is a copy field to EdgeNgram; I tried the configuration
but it is not working as expected.

Configuration I tried:



<fieldType name="query" class="solr.TextField" positionIncrementGap="100"
           termVectors="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="edgytext" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
            maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="user_query" type="query" indexed="true" stored="true"
       omitNorms="true" omitTermFreqAndPositions="true" />
<field name="edgy_user_query" type="edgytext" indexed="true" stored="true"
       omitNorms="true" omitTermFreqAndPositions="true" />

<defaultSearchField>edgy_user_query</defaultSearchField>
<copyField source="user_query" dest="edgy_user_query"/>

==

When I search for the term apple, it is returning results for pineapple
vers apple, milk with apple, apple milk shake ...

Is there any other way to overcome this problem?

Thanks,

Johnny


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2329370.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH From various File system locations

2011-01-25 Thread Estrada Groups
I would just use Nutch and specify the -solr param on the command line. That 
will add the extracted content to your instance of Solr.

Adam

Sent from my iPhone

On Jan 25, 2011, at 5:29 AM, pankaj bhatt panbh...@gmail.com wrote:

 Hi All,
 I need to index the documents presents in my file system at various
 locations (e.g. C:\docs , d:\docs ).
Is there any way through which i can specify this in my DIH
 Configuration.
Here is my configuration:-
 
 <document>
   <entity name="sd"
           processor="FileListEntityProcessor"
           fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
           baseDir="G:\\Desktop\\"
           recursive="false"
           rootEntity="true"
           transformer="DateFormatTransformer"
           onerror="continue">
     <entity name="tikatest"
             processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
             url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
       <field column="Author" name="author" meta="true"/>
       <field column="Content-Type" name="title" meta="true"/>
       <!-- <field column="title" name="title" meta="true"/> -->
       <field column="text" name="all_text"/>
     </entity>

     <!-- <field column="fileLastModified" name="date"
          dateTimeFormat="-MM-dd'T'hh:mm:ss" /> -->
     <field column="fileSize" name="size"/>
     <field column="file" name="filename"/>
   </entity>
   <!-- baseDir="../site" -->
 </document>
 
 / Pankaj Bhatt.


Re: Recommendation on RAM-/Cache configuration

2011-01-25 Thread Martin Grotzke
On Tue, Jan 25, 2011 at 2:06 PM, Markus Jelsma
markus.jel...@openindex.iowrote:

 On Tuesday 25 January 2011 11:54:55 Martin Grotzke wrote:
  Hi,
 
  recently we're experiencing OOMEs (GC overhead limit exceeded) in our
  searches. Therefore I want to get some clarification on heap and cache
  configuration.
 
  This is the situation:
  - Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit
  - JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G
  -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC
  -XX:+UseParallelGC

 Consider switching to HotSpot JVM, use the -server as the first switch.

The jvm options I mentioned were not all, we're running the jvm with -server
(of course).



  - The machine has 32 GB RAM
  - Currently there are 4 processors/cores in the machine, this shall be
  changed to 2 cores in the future.
  - The index size in the filesystem is ~9.5 GB
  - The index contains ~ 5.500.000 documents
  - 1.500.000 of those docs are available for searches/queries, the rest
 are
  inactive docs that are excluded from searches (via a flag/field), but
  they're still stored in the index as need to be available by id (solr is
  the main document store in this app)

 How do you exclude them? It should use filter queries.

The docs are indexed with a field findable on which we do a filter query.


 I also remember (but i
 just cannot find it back so please correct my if i'm wrong) that in 1.4.x
 sorting is done before filtering. It should be an improvement if filtering
 is
 done before sorting.

Hmm, I cannot imagine a case where it makes sense to sort before filtering.
Can't believe that solr does it like this.
Can anyone shed a light on this?


 If you use sorting, it takes up a huge amount of RAM if filtering is not
 done
 first.

  - Caches are configured with a big size (the idea was to prevent
 filesystem
  access / disk i/o as much as possible):

 There is only disk I/O if the kernel can't keep the index (or parts) in its
 page cache.

Yes, I'll keep an eye on disk I/O.



- filterCache (solr.LRUCache): size=20, initialSize=3,
  autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99
- documentCache (solr.LRUCache): size=20, initialSize=10,
  autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74
- queryResultCache (solr.LRUCache): size=20, initialSize=3,
  autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71

 You should decrease the initialSize values. But your hitratio's seem very
 nice.

Does the initialSize have a real impact? According to
http://wiki.apache.org/solr/SolrCaching#initialSize it's the initial size of
the HashMap backing the cache.
What would you say are reasonable values for size/initialSize/autowarmCount?

Cheers,
Martin


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Erlend Garåsen

On 25.01.11 11.30, Erlend Garåsen wrote:


Tika version 0.8 is not included in the latest release/trunk from SVN.


Ouch, I wrote not instead of now. Sorry, I replied in a hurry.

And to clarify, by content I mean the main content of a Word file. 
Title and other kinds of metadata are successfully extracted by the old 
0.4 version of Tika, but you need a newer Tika version (0.8) in order to 
fetch the main content as well. So try the newest Solr version from trunk.


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Getting started with writing parser

2011-01-25 Thread Gora Mohanty
On Tue, Jan 25, 2011 at 3:46 PM, Dinesh mdineshkuma...@karunya.edu.in wrote:

 no i actually changed the directory to mine where i stored the log files.. it
 is /home/exam/apa..solr/example/exampledocs

 i specified it in a solr schema.. i created an DataImportHandler for that in
 try.xml.. then in that i changed that file name to sample.txt

 that new try.xml is
 http://pastebin.com/pfVVA7Hs
[...]

Let us take this one part at a time.

In your inner nested entity,
  entity name=tryli...
what do you expect the attribute
  url=${hathifile.fileAbsolutePath}
to resolve to?
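
For reference, the prefix inside ${...} has to match the name attribute of
the enclosing entity - a minimal, abbreviated sketch with made-up names:

  <entity name="logfiles" processor="FileListEntityProcessor"
          baseDir="/home/exam/logs" fileName=".*\.txt" rootEntity="false">
    <entity name="logline" processor="LineEntityProcessor"
            url="${logfiles.fileAbsolutePath}" dataSource="fileReader">
      <field column="rawLine" name="content"/>
    </entity>
  </entity>

(here fileReader would be a FileDataSource declared in the same config, and
the processor/field choices are only an example, not a recommendation for
your logs).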

Regards,
Gora


Re: Use terracotta bigmemory for solr-caches

2011-01-25 Thread Em

Hi Martin,

are you sure that your GC is well tuned?
A request that needs more than a minute isn't the standard, even when I
consider all the other postings about response-performance...

Regards
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Use-terracotta-bigmemory-for-solr-caches-tp2328257p2330652.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor

Thanks Erlend.

Not used SVN before, but have managed to download and build latest trunk 
code.


Now I'm getting an error when trying to access the admin page (via 
Jetty) because I specify HTMLStripStandardTokenizerFactory in my 
schema.xml, but this appears to be no-longer supplied as part of the 
build so I get an exception cos it can't find that class.  I've checked 
the CHANGES.txt and found the following in the change list to 1.4.0 (!?) :


66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, 
HTMLStripWhitespaceTokenizerFactory and
HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, 
HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)


Unfortunately, I can't seem to get that to work correctly.  Does anyone 
have an example fieldType stanza (for schema.xml) for stripping out HTML ?


Thanks and kind regards,
Gary.



On 25/01/2011 14:17, Erlend Garåsen wrote:

On 25.01.11 11.30, Erlend Garåsen wrote:


Tika version 0.8 is not included in the latest release/trunk from SVN.


Ouch, I wrote not instead of now. Sorry, I replied in a hurry.

And to clarify, by content I mean the main content of a Word file. 
Title and other kinds of metadata are successfully extracted by the 
old 0.4 version of Tika, but you need a newer Tika version (0.8) in 
order to fetch the main content as well. So try the newest Solr 
version from trunk.


Erlend






List of indexed or stored fields

2011-01-25 Thread kenf_nc

I use a lot of dynamic fields, so looking at my schema isn't a good way to
see all the field names that may be indexed across all documents. Is there a
way to query solr for that information? All field names that are indexed, or
stored? Possibly a count by field name? Is there any other metadata about a
field that can be queried?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/List-of-indexed-or-stored-fields-tp2330986p2330986.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor

OK, got past the schema.xml problem, but now I'm back to square one.

I can index the contents of binary files (Word, PDF etc...), as well as 
text files, but it won't index the content of files inside a zip.


As an example, I have two txt files - doc1.txt and doc2.txt.  If I index 
either of them individually using:


curl 
"http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" 
-F "file=@doc1.txt"


and commit, Solr will index the contents and searches will match.

If I zip those two files up into solr1.zip, and index that using:

curl 
"http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" 
-F "file=@solr1.zip"


and commit, the file names are indexed, but not their contents.

I have checked that Tika can correctly process the zip file when used 
standalone with the tika-app jar - it outputs both the filenames and 
contents.  Should I be able to index the contents of files stored in a 
zip by using extract ?


Thanks and kind regards,
Gary.


On 25/01/2011 15:32, Gary Taylor wrote:

Thanks Erlend.

Not used SVN before, but have managed to download and build latest 
trunk code.


Now I'm getting an error when trying to access the admin page (via 
Jetty) because I specify HTMLStripStandardTokenizerFactory in my 
schema.xml, but this appears to be no-longer supplied as part of the 
build so I get an exception cos it can't find that class.  I've 
checked the CHANGES.txt and found the following in the change list to 
1.4.0 (!?) :


66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, 
HTMLStripWhitespaceTokenizerFactory and
HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, 
HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)


Unfortunately, I can't seem to get that to work correctly.  Does 
anyone have an example fieldType stanza (for schema.xml) for stripping 
out HTML ?


Thanks and kind regards,
Gary.



On 25/01/2011 14:17, Erlend Garåsen wrote:

On 25.01.11 11.30, Erlend Garåsen wrote:


Tika version 0.8 is not included in the latest release/trunk from SVN.


Ouch, I wrote not instead of now. Sorry, I replied in a hurry.

And to clarify, by content I mean the main content of a Word file. 
Title and other kinds of metadata are successfully extracted by the 
old 0.4 version of Tika, but you need a newer Tika version (0.8) in 
order to fetch the main content as well. So try the newest Solr 
version from trunk.


Erlend








Re: List of indexed or stored fields

2011-01-25 Thread Juan Grande
You can query all the indexed or stored fields (including dynamic fields)
using the LukeRequestHandler: http://localhost:8983/solr/example/admin/luke

See also: http://wiki.apache.org/solr/LukeRequestHandler
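
If the default output is too verbose, it also takes parameters - from memory,
so double-check the wiki page:

  http://localhost:8983/solr/example/admin/luke?numTerms=0          (field list only, no top terms)
  http://localhost:8983/solr/example/admin/luke?fl=price&numTerms=10   (details for a single field)

Each indexed field in the output also carries a docs count, which should give
you the per-field document counts you asked about.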

Regards,
*
**Juan G. Grande*
-- Solr Consultant @ http://www.plugtree.com
-- Blog @ http://juanggrande.wordpress.com

On Tue, Jan 25, 2011 at 12:39 PM, kenf_nc ken.fos...@realestate.com wrote:


 I use a lot of dynamic fields, so looking at my schema isn't a good way to
 see all the field names that may be indexed across all documents. Is there
 a
 way to query solr for that information? All field names that are indexed,
 or
 stored? Possibly a count by field name? Is there any other metadata about a
 field that can be queried?
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/List-of-indexed-or-stored-fields-tp2330986p2330986.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: DIH From various File system locations

2011-01-25 Thread pankaj bhatt
Thanks Adam, it seems like Nutch would solve most of my concerns.
It would be great if you could share some resources for Nutch with us.

/ Pankaj Bhatt.

On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups 
estrada.adam.gro...@gmail.com wrote:

 I would just use Nutch and specify the -solr param on the command line.
 That will add the extracted content your instance of solr.

 Adam

 Sent from my iPhone

 On Jan 25, 2011, at 5:29 AM, pankaj bhatt panbh...@gmail.com wrote:

  Hi All,
  I need to index the documents presents in my file system at
 various
  locations (e.g. C:\docs , d:\docs ).
 Is there any way through which i can specify this in my DIH
  Configuration.
 Here is my configuration:-
 
  <document>
    <entity name="sd"
            processor="FileListEntityProcessor"
            fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
            baseDir="G:\\Desktop\\"
            recursive="false"
            rootEntity="true"
            transformer="DateFormatTransformer"
            onerror="continue">
      <entity name="tikatest"
              processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
              url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="Author" name="author" meta="true"/>
        <field column="Content-Type" name="title" meta="true"/>
        <!-- <field column="title" name="title" meta="true"/> -->
        <field column="text" name="all_text"/>
      </entity>

      <!-- <field column="fileLastModified" name="date"
           dateTimeFormat="-MM-dd'T'hh:mm:ss" /> -->
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>
    <!-- baseDir="../site" -->
  </document>
 
  / Pankaj Bhatt.



Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Jayendra Patil
Hi Gary,

The latest Solr Trunk was able to extract and index the contents of the zip
file using the ExtractingRequestHandler.
The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
worked pretty well.

Tested again with sample url and works fine -
curl 
"http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"


You would probably need to drill down to the Tika Jars and
the apache-solr-cell-4.0-dev.jar used for Rich documents indexing.

Regards,
Jayendra

On Tue, Jan 25, 2011 at 11:08 AM, Gary Taylor g...@inovem.com wrote:

 OK, got past the schema.xml problem, but now I'm back to square one.

 I can index the contents of binary files (Word, PDF etc...), as well as
 text files, but it won't index the content of files inside a zip.

 As an example, I have two txt files - doc1.txt and doc2.txt.  If I index
 either of them individually using:

 curl 
 "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5"
 -F "file=@doc1.txt"

 and commit, Solr will index the contents and searches will match.

 If I zip those two files up into solr1.zip, and index that using:

 curl 
 "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5"
 -F "file=@solr1.zip"

 and commit, the file names are indexed, but not their contents.

 I have checked that Tika can correctly process the zip file when used
 standalone with the tika-app jar - it outputs both the filenames and
 contents.  Should I be able to index the contents of files stored in a zip
 by using extract ?


 Thanks and kind regards,
 Gary.


 On 25/01/2011 15:32, Gary Taylor wrote:

 Thanks Erlend.

 Not used SVN before, but have managed to download and build latest trunk
 code.

 Now I'm getting an error when trying to access the admin page (via Jetty)
 because I specify HTMLStripStandardTokenizerFactory in my schema.xml, but
 this appears to be no-longer supplied as part of the build so I get an
 exception cos it can't find that class.  I've checked the CHANGES.txt and
 found the following in the change list to 1.4.0 (!?) :

 66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader,
 HTMLStripWhitespaceTokenizerFactory andHTMLStripStandardTokenizerFactory
 deprecated. To strip HTML tags, HTMLStripCharFilter can be used with an
 arbitrary Tokenizer. (koji)

 Unfortunately, I can't seem to get that to work correctly.  Does anyone
 have an example fieldType stanza (for schema.xml) for stripping out HTML ?

 Thanks and kind regards,
 Gary.



 On 25/01/2011 14:17, Erlend Garåsen wrote:

 On 25.01.11 11.30, Erlend Garåsen wrote:

  Tika version 0.8 is not included in the latest release/trunk from SVN.


 Ouch, I wrote not instead of now. Sorry, I replied in a hurry.

 And to clarify, by content I mean the main content of a Word file.
 Title and other kinds of metadata are successfully extracted by the old 0.4
 version of Tika, but you need a newer Tika version (0.8) in order to fetch
 the main content as well. So try the newest Solr version from trunk.

 Erlend







How to Configure Solr to pick my lucene custom filter

2011-01-25 Thread Valiveti

Hi ,

I have written a lucene custom filter.
I could not figure out how to configure Solr to pick this custom filter
for search.

How to configure Solr to pick my custom filter?
Will the Solr standard search handler pick this custom filter?

Thanks,
Valiveti

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2331928.html
Sent from the Solr - User mailing list archive at Nabble.com.


in-index representaton of tokens

2011-01-25 Thread Dennis Gearon
So, the index is a list of tokens per column, right?

There's a table per column that lists the analyzed tokens?

And the tokens per column are represented as what, system integers? 32/64 bit 
unsigned ints?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: in-index representaton of tokens

2011-01-25 Thread Jonathan Rochkind

Why does it matter?  You can't really get at them unless you store them.

I don't know what "table per column" means; there's nothing in Solr 
architecture called a "table" or a "column", although by "column" you 
probably mean more or less a Solr "field".  There is nothing like a 
"table" in Solr.


Solr is still not an rdbms.

On 1/25/2011 12:26 PM, Dennis Gearon wrote:

So, the index is a list of tokens per column, right?

There's a table per column that lists the analyzed tokens?

And the tokens per column are represented as what, system integers? 32/64 bit
unsigned ints?

  Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Erick Erickson
Let's back up here because now I'm not clear what you actually want.
EdgeNGrams
are a way of matching substrings, which is what's happening here. Of course
searching for apple matches any of the three examples, just as searching for
apple without grams would; that's the expected behavior.

So, we need a clear problem definition of what you're trying to do, along
with
example queries (please post the results of adding debugQuery=on).

Best
Erick

On Tue, Jan 25, 2011 at 8:29 AM, johnnyisrael johnnyi.john...@gmail.comwrote:


 Hi Eric,

 You are right, there is a copy field to EdgeNgram, I tried the
 configuration
 but it not working as expected.

 Configuration I tried:

 

  <fieldType name="query" class="solr.TextField" positionIncrementGap="100"
             termVectors="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="edgytext" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
              maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="user_query" type="query" indexed="true" stored="true"
         omitNorms="true" omitTermFreqAndPositions="true" />
  <field name="edgy_user_query" type="edgytext" indexed="true" stored="true"
         omitNorms="true" omitTermFreqAndPositions="true" />

  <defaultSearchField>edgy_user_query</defaultSearchField>
  <copyField source="user_query" dest="edgy_user_query"/>

 ==

 When I search for the term apple.

 It is returning results for pineapple vers apple, milk with apple,
 apple milk shake ...

 Is there any other way to overcome this problem?

 Thanks,

 Johnny


 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2329370.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Highlighting with/without Term Vectors

2011-01-25 Thread Salman Akram
Anyone?

On Tue, Jan 25, 2011 at 12:57 AM, Salman Akram 
salman.ak...@northbaysolutions.net wrote:

 Just to add one thing, in case it makes a difference.

 The maximum size of a document on which highlighting needs to be done is a few
 hundred KB (in the file system). In the index it's compressed, so it should be much smaller.


 On Tue, Jan 25, 2011 at 12:42 AM, Salman Akram 
 salman.ak...@northbaysolutions.net wrote:

 Hi,

 Does anyone have any benchmarks on how much highlighting speeds up with Term
 Vectors (compared to without them)? E.g. if highlighting on 20 documents takes
 1 sec with Term Vectors, any idea how long it would take without them?

 I need to know since the index used for highlighting has a TVF file of
 around 450GB (approx 65% of total index size), so I am trying to see whether
 decreasing the index size by dropping the TVF would be more helpful for
 performance (less RAM, should be good for I/O too I guess) or whether keeping
 it is still better.

 I know the best way is to try it out, but indexing takes a very long time, so I am
 trying to see whether it's even worthwhile.

 --
 Regards,

 Salman Akram




 --
 Regards,

 Salman Akram




-- 
Regards,

Salman Akram


Re: How to Configure Solr to pick my lucene custom filter

2011-01-25 Thread Erick Erickson
Presumably your custom filter is in a jar file. Drop that jar file in
solr_home/lib
and reference it in your schema.xml file by its full class name
(e.g. com.yourcompany.filter.yourcustomfilter) just like the other filters,
and it should work fine.

You can also put your jar anywhere you'd like and alter solrconfig.xml with
an additional <lib .../> tag in the config section (see the example
solrconfig.xml).

Best
Erick
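
For concreteness, a minimal sketch of what that might look like, assuming the
custom filter is exposed through a TokenFilterFactory subclass (the package,
class, and type names below are made up for illustration):

  <!-- solrconfig.xml: only needed if the jar is NOT already in solr_home/lib -->
  <lib dir="/path/to/your/jars" />

  <!-- schema.xml: hook the custom filter into a field type's analysis chain -->
  <fieldType name="text_custom" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="com.yourcompany.filter.YourCustomFilterFactory"/>
    </analyzer>
  </fieldType>

Any field declared with type="text_custom" would then run the custom filter at
index and query time.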

On Tue, Jan 25, 2011 at 12:07 PM, Valiveti narasimha.valiv...@gmail.comwrote:


 Hi ,

 I have written a lucene custom filter.
 I could not figure out on how to configure Solr to pick this custom filter
 for search.

 How to configure Solr to pick my custom filter?
 Will the Solr standard search handler pick this custom filter?

 Thanks,
 Valiveti

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2331928.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: List of indexed or stored fields

2011-01-25 Thread kenf_nc

That's exactly what I wanted, thanks. Any idea what

  <long name="version">1294513299077</long>

refers to under the index section? I have 2 cores on one Tomcat instance,
and 1 on a second instance (different server) and all 3 have different
numbers for version, so I don't think it's the version of Luke.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/List-of-indexed-or-stored-fields-tp2330986p2333281.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: List of indexed or stored fields

2011-01-25 Thread Markus Jelsma
The index version. Can be used in replication to determine whether to 
replicate or not.

On Tuesday 25 January 2011 20:30:21 kenf_nc wrote:
 refers to under the index section? I have 2 cores on one Tomcat instance,
 and 1 on a second instance (different server) and all 3 have different
 numbers for version, so I don't think it's the version of Luke.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
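
For reference, that value comes from the Luke request handler; a request like
the following (host and port assumed, values illustrative, exact field names
vary slightly between Solr versions) returns an index section roughly like this:

  http://localhost:8983/solr/admin/luke?numTerms=0

  <lst name="index">
    <int name="numDocs">17432</int>
    <int name="maxDoc">17432</int>
    <long name="version">1294513299077</long>
    <date name="lastModified">2011-01-25T19:15:02Z</date>
  </lst>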


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread johnnyisrael

Hi Eric,

What I want here is, let's say I have 3 documents like 

[pineapple vers apple, milk with apple, apple milk shake ]

and if I search for apple, it should return only apple milk shake,
because that entry alone starts with the word apple which I typed in. It
should not bring the others, and if I type milk it should return only milk
with apple.

I want output similar to Google auto suggest.

Is there a way to achieve this without encapsulating with double quotes?

Thanks,

Johnny
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2333602.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH From various File system locations

2011-01-25 Thread Adam Estrada
There are a few tutorials out there.

1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical)
2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.)
3. Build the latest from branch
http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ and read
this one.

http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/

but add the solr parameter at the end bin/nutch crawl urls -depth 5
-topN 100 -solr http://localhost:8983/solr

This will automatically add the data Nutch collected to Solr. For
larger files I would also increase your JAVA_OPTS env to something
like JAVA_OPTS='-Xmx2048m'

Adam




On Tue, Jan 25, 2011 at 11:41 AM, pankaj bhatt panbh...@gmail.com wrote:
 Thanks Adam, it seems like Nutch would solve most of my concerns.
 It would be great if you could share resources for Nutch with us.

 / Pankaj Bhatt.

 On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups 
 estrada.adam.gro...@gmail.com wrote:

 I would just use Nutch and specify the -solr param on the command line.
 That will add the extracted content your instance of solr.

 Adam

 Sent from my iPhone

 On Jan 25, 2011, at 5:29 AM, pankaj bhatt panbh...@gmail.com wrote:

  Hi All,
          I need to index the documents presents in my file system at
 various
  locations (e.g. C:\docs , d:\docs ).
     Is there any way through which i can specify this in my DIH
  Configuration.
     Here is my configuration:-
 
  <document>
    <entity name="sd"
        processor="FileListEntityProcessor"
        fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
        baseDir="G:\\Desktop\\"
        recursive="false"
        rootEntity="true"
        transformer="DateFormatTransformer"
        onerror="continue">
      <entity name="tikatest"
          processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
          url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="Author" name="author" meta="true"/>
        <field column="Content-Type" name="title" meta="true"/>
        <!-- <field column="title" name="title" meta="true"/> -->
        <field column="text" name="all_text"/>
      </entity>

      <!-- <field column="fileLastModified" name="date"
           dateTimeFormat="-MM-dd'T'hh:mm:ss" /> -->
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>
    <!-- baseDir="../site" -->
  </document>
 
  / Pankaj Bhatt.
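
As for the original question about multiple locations (quoted above), one
approach that should work is to declare one FileListEntityProcessor entity per
base directory inside the same document. A rough sketch, reusing the field
mapping from the quoted config; the entity names here are made up, and a
binary dataSource named "bin" is assumed to be defined elsewhere in the file:

  <document>
    <entity name="sd_c" processor="FileListEntityProcessor" baseDir="C:\docs"
            fileName="docx$|doc$|pdf$" recursive="false" rootEntity="true" onerror="continue">
      <entity name="tika_c" processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
              url="${sd_c.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="text" name="all_text"/>
      </entity>
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>

    <entity name="sd_d" processor="FileListEntityProcessor" baseDir="D:\docs"
            fileName="docx$|doc$|pdf$" recursive="false" rootEntity="true" onerror="continue">
      <entity name="tika_d" processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
              url="${sd_d.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="text" name="all_text"/>
      </entity>
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>
  </document>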




Re: DIH From various File system locations

2011-01-25 Thread Adam Estrada
I take that back... I am currently using version 1.2; make sure
that the latest versions of Tika and PDFBox are in the contrib folder.
1.3 is structured a bit differently and it doesn't look like there is
a contrib directory. Maybe one of the Nutch contributors can comment
on this?

Adam

On Tue, Jan 25, 2011 at 3:21 PM, Adam Estrada
estrada.adam.gro...@gmail.com wrote:
 There are a few tutorials out there.

 1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical)
 2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.)
 3. Build the latest from branch
 http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ and read
 this one.

 http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/

 but add the solr parameter at the end bin/nutch crawl urls -depth 5
 -topN 100 -solr http://localhost:8983/solr

 This will automatically add the data nutch collected to Solr. For
 larger files I would also increase your JAVA_OPTS env to something
 like JAVA_OPTS='-Xmx2048m'

 Adam




 On Tue, Jan 25, 2011 at 11:41 AM, pankaj bhatt panbh...@gmail.com wrote:
 Thanks Adam, It seems like Nutch use to solve most of my concerns.
 i would be great if you can have share resources for Nutch with us.

 / Pankaj Bhatt.

 On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups 
 estrada.adam.gro...@gmail.com wrote:

 I would just use Nutch and specify the -solr param on the command line.
 That will add the extracted content your instance of solr.

 Adam

 Sent from my iPhone

 On Jan 25, 2011, at 5:29 AM, pankaj bhatt panbh...@gmail.com wrote:

  Hi All,
          I need to index the documents presents in my file system at
 various
  locations (e.g. C:\docs , d:\docs ).
     Is there any way through which i can specify this in my DIH
  Configuration.
     Here is my configuration:-
 
  document
       entity name=sd
         processor=FileListEntityProcessor
         fileName=docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$
  *baseDir=G:\\Desktop\\*
         recursive=false
         rootEntity=true
         transformer=DateFormatTransformer
  onerror=continue
         entity name=tikatest
  processor=org.apache.solr.handler.dataimport.TikaEntityProcessor
  url=${sd.fileAbsolutePath} format=text dataSource=bin
           field column=Author name=author meta=true/
           field column=Content-Type name=title meta=true/
           !-- field column=title name=title meta=true/ --
           field column=text name=all_text/
         /entity
 
         !-- field column=fileLastModified name=date
  dateTimeFormat=-MM-dd'T'hh:mm:ss / --
         field column=fileSize name=size/
         field column=file name=filename/
     /entity
  !--baseDir=../site--
   /document
 
  / Pankaj Bhatt.





CFP - Berlin Buzzwords 2011 - Search, Score, Scale

2011-01-25 Thread Isabel Drost
This is to announce the Berlin Buzzwords 2011. The second edition of the 
successful conference on scalable and open search, data processing and data 
storage in Germany, taking place in Berlin.

Call for Presentations Berlin Buzzwords
   http://berlinbuzzwords.de
  Berlin Buzzwords 2011 - Search, Store, Scale
6/7 June 2011

The event will comprise presentations on scalable data processing. We invite 
you 
to submit talks on the topics:

   * IR / Search - Lucene, Solr, katta or comparable solutions
   * NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others
   * Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives
   * Closely related topics not explicitly listed above are welcome. We are
 looking for presentations on the implementation of the systems themselves,
 real world applications and case studies.

Important Dates (all dates in GMT +2)
   * Submission deadline: March 1st 2011, 23:59 MEZ
   * Notification of accepted speakers: March 22nd, 2011, MEZ.
   * Publication of final schedule: April 5th, 2011.
   * Conference: June 6/7. 2011

High quality, technical submissions are called for, ranging from principles to 
practice. We are looking for real world use cases, background on the 
architecture of specific projects and a deep dive into architectures built on 
top of e.g. Hadoop clusters.

Proposals should be submitted at http://berlinbuzzwords.de/content/cfp-0 no 
later than March 1st, 2011. Acceptance notifications will be sent out soon 
after 
the submission deadline. Please include your name, bio and email, the title of 
the talk, a brief abstract in English language. Please indicate whether you 
want 
to give a lightning (10min), short (20min) or long (40min) presentation and 
indicate the level of experience with the topic your audience should have (e.g. 
whether your talk will be suitable for newbies or is targeted for experienced 
users.) If you'd like to pitch your brand new product in your talk, please let 
us know as well - there will be extra space for presenting new ideas, awesome 
products and great new projects.

The presentation format is short. We will be enforcing the schedule rigorously.

If you are interested in sponsoring the event (e.g. we would be happy to 
provide 
videos after the event, free drinks for attendees as well as an after-show 
party), please contact us.

Follow @hadoopberlin on Twitter for updates. Tickets, news on the conference, 
and the final schedule will be published at http://berlinbuzzwords.de.

Program Chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer.

Please re-distribute this CfP to people who might be interested.

If you are local and wish to meet us earlier, please note that this Thursday 
evening there will be an Apache Hadoop Get Together (videos kindly sponsored by 
Cloudera, venue kindly provided for free by Zanox) featuring talks on Apache 
Hadoop in production as well as news on current Apache Lucene developments.

Contact us at:

newthinking communications 
GmbH Schönhauser Allee 6/7 
10119 Berlin, 
Germany 

Julia Gemählich
Isabel Drost 

+49(0)30-9210 596




Re: How to Configure Solr to pick my lucene custom filter

2011-01-25 Thread Valiveti

Hi Eric,

Thanks for the reply.

I did see some entries in solrconfig.xml for adding custom
requestHandlers, queryParsers and queryResponseWriters,

but could not find the one for adding a custom filter.

Could you point to the exact location or syntax to be used?

Thanks,
Valiveti


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2334120.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Jonathan Rochkind
I haven't figured out any way to achieve that AT ALL without making a 
seperate Solr index just to serve autosuggest queries. At least when you 
want to auto-suggest on a multi-value field. Someone posted a crazy 
tricky way to do it with a single-valued field a while ago.  If you 
can/are willing to make a seperate Solr index with a schema set up for 
auto-suggest specifically, it's easy. But from an existing schema, where 
you want to auto-suggest just based on the values in one field, it's a 
multi-valued field, and you want to allow matches in the middle of the 
field -- I don't think there's a way to do it.


On 1/25/2011 3:03 PM, johnnyisrael wrote:

Hi Eric,

What I want here is, lets say I have 3 documents like

[pineapple vers apple, milk with apple, apple milk shake ]

and If i search for apple, it should return only apple milk shake
because that term alone starts with the letter apple which I typed in. It
should not bring others and if I type milk it should return only milk
with apple

I want an output Similar like a Google auto suggest.

Is there a way to achieve  this without encapsulating with double quotes.

Thanks,

Johnny


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Markus Jelsma
Then you don't need NGrams at all. A wildcard will suffice or you can use the 
TermsComponent.

If these strings are indexed as single tokens (KeywordTokenizer with 
LowercaseFilter) you can simply do field:app* to retrieve the apple milk 
shake. You can also use the string field type but then you must make sure the 
values are already lowercased before indexing.

Be careful though, there is no query time analysis for wildcard (and fuzzy)
queries, so make sure the prefix you query with is already lowercased as well.
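
A minimal sketch of the field type described above (the type and field names
here are made up for illustration):

  <fieldType name="suggest_kw" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="suggest" type="suggest_kw" indexed="true" stored="true"/>

With that in place, a query such as suggest:app* matches apple milk shake but
not milk with apple, because each value is indexed as a single lowercased
token and the wildcard only matches from the start of that token.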

 Hi Eric,
 
 What I want here is, lets say I have 3 documents like
 
 [pineapple vers apple, milk with apple, apple milk shake ]
 
 and If i search for apple, it should return only apple milk shake
 because that term alone starts with the letter apple which I typed in. It
 should not bring others and if I type milk it should return only milk
 with apple
 
 I want an output Similar like a Google auto suggest.
 
 Is there a way to achieve  this without encapsulating with double quotes.
 
 Thanks,
 
 Johnny


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Markus Jelsma
Oh, I should perhaps mention that EdgeNGrams will yield results a lot quicker 
than wildcards, at the cost of a larger index. You should, of course, use 
EdgeNGrams if you worry about performance and have a huge index and a high number 
of queries per second.

 Then you don't need NGrams at all. A wildcard will suffice or you can use
 the TermsComponent.
 
 If these strings are indexed as single tokens (KeywordTokenizer with
 LowercaseFilter) you can simply do field:app* to retrieve the apple milk
 shake. You can also use the string field type but then you must make sure
 the values are already lowercased before indexing.
 
 Be careful though, there is no query time analysis for wildcard (and fuzzy)
 queries so make sure
 
  Hi Eric,
  
  What I want here is, lets say I have 3 documents like
  
  [pineapple vers apple, milk with apple, apple milk shake ]
  
  and If i search for apple, it should return only apple milk shake
  because that term alone starts with the letter apple which I typed in.
  It should not bring others and if I type milk it should return only
  milk with apple
  
  I want an output Similar like a Google auto suggest.
  
  Is there a way to achieve  this without encapsulating with double quotes.
  
  Thanks,
  
  Johnny


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread mesenthil

The index contains around 1.5 million documents. As this is used for an
autosuggest feature, performance is an important factor. 

So it looks like, using edgeNgram, it is difficult to achieve the
following: 

The result should return only those terms where the search letter matches
the first word. For example, when we type M, it should return
Mumford and Sons and not Jackson Michael. 


Jonathan,

Is it possible to achieve this when we have separate index using edgeNgram?
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2334538.html
Sent from the Solr - User mailing list archive at Nabble.com.


Specifying optional terms with standard (lucene) request handler?

2011-01-25 Thread Daniel Pötzinger
Hi

I am searching for a way to specify optional terms in a query (terms that don't
need to match, but that should influence the scoring if they do).

Using the dismax parser, a query like this:
<str name="mm">2</str>
<str name="debugQuery">on</str>
<str name="q">+lorem ipsum dolor amet</str>
<str name="qf">content</str>
<str name="hl.fl"/>
<str name="qt">dismax</str>
will be parsed into something like this:
<str name="parsedquery_toString">
+((+(content:lor) (content:ipsum) (content:dolor) (content:amet))~2) ()
</str>
which means that only 2 of the 3 optional terms need to match.


How can optional terms be specified using the standard request handler?
My concrete requirement is that a certain term must match but another is
optional. But if the optional part matches, it should give the document an
extra score.
Something like :-)
<str name="q">content:lorem #optional#content:optionalboostword^10</str>

An idea would be to use a function query to boost the document:
<str name="q">
content:lorem _val_:query({!lucene v='optionalword^20'})
</str>
which will result in:
<str name="parsedquery_toString">
+content:forum +query(content:optionalword^20.0,def=0.0)
</str>
Is this a good way, or are there other suggestions?

Thanks for any opinion and tips on this

Daniel




Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Jonathan Rochkind
Ah, sorry, I got confused about your requirements, if you just want to 
match at the beginning of the field, it may be more possible.  Using 
edgegrams or wildcard. If you have a single-valued field. Do you have a 
single-valued or a multi-valued field?  That is, does each document have 
just one value, or multiple?   I still get confused about how to do it 
with edgegrams, even with single-valued field, but I think maybe it's 
possible.


_Definitely_ possible, with or without edgegrams, if you are 
willing/able to make a completely seperate Solr index where each term 
for auto-suggest is a document.  Yes.


The problem lies in what results are. In general, Solr's results are 
the documents you have in the Solr index. Thus it makes everything a lot 
easier to deal with if you have an index where each document in the 
index is a term for auto-suggest.   But that doesnt' always meet 
requirements if you need to auto-suggest within existing fq's and such, 
and of course it takes more resources to run an additional solr index.


On 1/25/2011 5:03 PM, mesenthil wrote:

The index contains around 1.5 million documents. As this is used for
autosuggest feature, performance is an important factor.

So it looks like, using edgeNgram it is difficult to achieve the the
following

Result should return only those terms where search letter is matching with
the first word only. For example, when we type M,  it should return
Mumford and Sons and not jackson Michael.


Jonathan,

Is it possible to achieve this when we have separate index using edgeNgram?



Re: Specifying optional terms with standard (lucene) request handler?

2011-01-25 Thread Jonathan Rochkind

With the 'lucene' query parser?

include q.op=OR and then put a + (mandatory) in front of every term 
in the 'q' that is NOT optional; the rest will be optional.  I think 
that will do what you want.


Jonathan
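
In other words, with the standard request handler the parameters would look
something like this (field and term names taken from the example above):

  <str name="q">+content:lorem content:optionalboostword^10</str>
  <str name="q.op">OR</str>

The +content:lorem clause must match; content:optionalboostword^10 is optional
but adds to the score of documents where it does match.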

On 1/25/2011 5:07 PM, Daniel Pötzinger wrote:

Hi

I am searching for a way to specify optional terms in a query ( that dont need 
to match (But if they match should influence the scoring) )

Using the dismax parser a query like this:
str name=mm2/str
str name=debugQueryon/str
str name=q+lorem ipsum dolor amet/str
str name=qfcontent/str
str name=hl.fl/
str name=qtdismax/str
Will be parsed into something like this:
str name=parsedquery_toString
+((+(content:lor) (content:ipsum) (content:dolor) (content:amet))~2) ()
/str
Which will result that only 2 of the 3 optional terms need to match?


How can optional terms be specified using the standard request handler?
My concrete requirement is that a certain term should match but another is 
optional. But if the optional part matches - it should give the document an 
extra score.
Something like :-)
str name=qcontent:lorem #optional#content:optionalboostword^10/str

An idea would be to use a function query to boost the document:
str name=q
content:lorem _val_:query({!lucene v='optionalword^20'})
/str
Which will result in:
str name=parsedquery_toString
+content:forum +query(content:optionalword^20.0,def=0.0)
/str
Is this a good way or are there other suggestions?

Thanks for any opinion and tips on this

Daniel




Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread mesenthil

Right now our configuration says multiValued=true, but that need not be
true in our case. I will make it false, try it, and update this thread with
more details.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2334627.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr set up issues with Magento

2011-01-25 Thread Sandhya Padala
Thank you Markus. I have added few more fields to schema.xml.

Now it looks like the products are getting indexed, but there are no search results.

In Magento, if I configure Solr as the search engine, search does not
return any results.  If I change the search engine to Magento's
inbuilt MySQL, search results are returned.  Can you please direct me on
where/how I should start the debug process?

If I use the Solr admin and enter the search query, that doesn't return any
results either.

Thank you,
Sandhya
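
One quick sanity check here (host, port, and the numFound value below are only
illustrative) is to confirm that documents are actually in the index, and then
compare the fields Magento queries against what the schema defines:

  http://localhost:8983/solr/select?q=*:*&rows=0

  <result name="response" numFound="1234" start="0"/>

If numFound is 0, the indexing side still needs attention; if it is non-zero,
the mismatch is more likely on the query side (field names, default search
field, or analysis).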

On Mon, Jan 24, 2011 at 4:11 PM, Markus Jelsma
markus.jel...@openindex.iowrote:

 Hi,

 You haven't defined the field in Solr's schema.xml configuration so it
 needs to
 be added first. Perhaps following the tutorial might be a good idea.

 http://lucene.apache.org/solr/tutorial.html

 Cheers.

  Hello Team:
 
 
I am in the process of setting up Solr 1.4 with Magento ENterprise
  Edition 1.9.
 
  When I try to index the products I get the following error message.
 
  Jan 24, 2011 3:30:14 PM
 org.apache.solr.update.processor.LogUpdateProcessor
  fini
  sh
  INFO: {} 0 0
  Jan 24, 2011 3:30:14 PM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field
  'in_stock' at
  org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.jav
  a:289)
  at
  org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpd
  ateProcessorFactory.java:60)
  at
  org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
  at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
  at
  org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Co
  ntentStreamHandlerBase.java:54)
  at
  org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandl
  erBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
  at
  org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter
  .java:338)
  at
  org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilte
  r.java:241)
  at
  org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appl
  icationFilterChain.java:244)
  at
  org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationF
  ilterChain.java:210)
  at
  org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperV
  alve.java:240)
  at
  org.apache.catalina.core.StandardContextValve.invoke(StandardContextV
  alve.java:161)
  at
  org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.j
  ava:164)
  at
  org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.j
  ava:100)
  at
  org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:
  550)
  at
  org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineVal
  ve.java:118)
  at
  org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.jav
  a:380)
  at
  org.apache.coyote.http11.Http11Processor.process(Http11Processor.java
 
  :243)
 
  at
  org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce
  ss(Http11Protocol.java:188)
  at
  org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce
  ss(Http11Protocol.java:166)
  at
  org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoin
  t.java:288)
  at
  java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExec
  utor.java:886)
  at
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
  .java:908)
  at java.lang.Thread.run(Thread.java:662)
 
  Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute
  INFO: [] webapp=/solr path=/update params={wt=json} status=400 QTime=0
  Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2
  rollback INFO: start rollback
  Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2
  rollback INFO: end_rollback
  Jan 24, 2011 3:30:14 PM
 org.apache.solr.update.processor.LogUpdateProcessor
  fini
  sh
  INFO: {rollback=} 0 16
  Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute
 
  I am a new to both Magento and SOlr. I could have done some thing stupid
  during installation. I really look forward for your help.
 
  Thank you,
  Sandhya



Best way to build a solr-based m2 project

2011-01-25 Thread Paul Libbrecht

Hello list,

Apologies if this was already asked; I haven't found the answer in the archive,
as I've been off this list for quite some time now.

I am looking for a good way to package a maven2 project that produces a
Solr-based webapp.
I would expect projects such as the velocity contrib, or even the default Solr,
to include everything needed for this, but I don't see it organized that way and, in
particular, I see nothing with a packaging of type war.

Have I missed something?
Should I simply copy some bits into my source tree and then make sure they
get copied to the right place?

I found a Solr archetype, but it only delivers a standalone Solr, which does
not interest me.

thanks in advance

paul
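
One possible approach is a Maven war overlay over the stock Solr webapp. A
sketch, assuming the Solr war artifact (org.apache.solr:solr, type war) is
available in your repository for the version you target; the groupId,
artifactId and version of the project itself are made up:

  <project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>my-solr-webapp</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>war</packaging>
    <dependencies>
      <dependency>
        <groupId>org.apache.solr</groupId>
        <artifactId>solr</artifactId>
        <version>1.4.1</version>
        <type>war</type>
      </dependency>
    </dependencies>
    <!-- your own solrconfig.xml/schema.xml and extra jars go under src/main/webapp
         (or a separate Solr home) and are merged over the stock war at package time -->
  </project>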

Re: in-index representaton of tokens

2011-01-25 Thread Dennis Gearon
I am asking: is there a list of tokens that have been parsed (a table of them)
for each column? Or one for the whole index?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Jonathan Rochkind rochk...@jhu.edu
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Tue, January 25, 2011 9:29:36 AM
Subject: Re: in-index representaton of tokens

Why does it matter?  You can't really get at them unless you store them.

I don't know what table per column means, there's nothing in Solr 
architecture called a table or a column. Although by column you 
probably mean more or less Solr field.  There is nothing like a 
table in Solr.

Solr is still not an rdbms.

On 1/25/2011 12:26 PM, Dennis Gearon wrote:
 So, the index is a list of tokens per column, right?

 There's a table per column that lists the analyzed tokens?

 And the tokens per column are represented as what, system integers? 32/64 bit
 unsigned ints?

   Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a 
better
 idea to learn from others’ mistakes, so you do not have to make them yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.




Re: How to Configure Solr to pick my lucene custom filter

2011-01-25 Thread Erick Erickson
First, let's be sure we're talking about the same thing. My response was for
adding
a filter to your analysis chain for a field in Schema.xml. Are you talking
about a different
sort of filter?

Best
Erick

On Tue, Jan 25, 2011 at 4:09 PM, Valiveti narasimha.valiv...@gmail.comwrote:


 Hi Eric,

 Thanks for the reply.

 I Did see some entries in the solrconfig.xml for adding custom
 reposneHandlers, queryParsers and queryResponseWriters.

 Bit could not find the one for adding the custom filter.

 Could you point to the exact location or syntax to be used.

 Thanks,
 Valiveti


 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2334120.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: in-index representaton of tokens

2011-01-25 Thread Markus Jelsma
This should shed some light on the matter
http://lucene.apache.org/java/2_9_0/fileformats.html

 I am saying there is a list of tokens that have been parsed (a table of
 them) for each column? Or one for the whole index?
 
  Dennis Gearon
 
 
 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a
 better idea to learn from others’ mistakes, so you do not have to make
 them yourself. from
 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'
 
 
 EARTH has a Right To Life,
 otherwise we all die.
 
 
 
 - Original Message 
 From: Jonathan Rochkind rochk...@jhu.edu
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Sent: Tue, January 25, 2011 9:29:36 AM
 Subject: Re: in-index representaton of tokens
 
 Why does it matter?  You can't really get at them unless you store them.
 
 I don't know what table per column means, there's nothing in Solr
 architecture called a table or a column. Although by column you
 probably mean more or less Solr field.  There is nothing like a
 table in Solr.
 
 Solr is still not an rdbms.
 
 On 1/25/2011 12:26 PM, Dennis Gearon wrote:
  So, the index is a list of tokens per column, right?
  
  There's a table per column that lists the analyzed tokens?
  
  And the tokens per column are represented as what, system integers? 32/64
  bit unsigned ints?
  
Dennis Gearon
  
  Signature Warning
  
  It is always a good idea to learn from your own mistakes. It is usually a
 
 better
 
  idea to learn from others’ mistakes, so you do not have to make them
  yourself. from
  'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'
  
  
  EARTH has a Right To Life,
  otherwise we all die.


RE: DIH serialize

2011-01-25 Thread Papp Richard
Dear Stefan,

  thank you for your help! 
  Well, I wrote a small script; it's not JSON, but it works:

  <script><![CDATA[
function my_serialize(row)
{
  // concatenate the timetable columns into one "||"-delimited string
  var st = "";

  st = row.get('stt_id') + "||" +
       row.get('stt_name') + "||" +
       row.get('stt_date_from') + "||" +
       row.get('stt_date_to') + "||" +
       row.get('stt_monday') + "||" +
       row.get('stt_tuesday') + "||" +
       row.get('stt_wednesday') + "||" +
       row.get('stt_thursday') + "||" +
       row.get('stt_friday') + "||" +
       row.get('stt_saturday') + "||" +
       row.get('stt_sunday');

  // put the serialized value into a column that the entity maps to a field
  var ret = new java.util.HashMap();
  ret.put('main_timetable', st);

  return ret;
}
  ]]></script>

regards,
  Rich
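
For completeness, a function like this is wired in via DIH's ScriptTransformer.
A sketch of the surrounding data-config.xml, with the dataSource attributes,
entity name and SQL left as placeholders/assumptions:

  <dataConfig>
    <dataSource driver="..." url="..." user="..." password="..."/>
    <script><![CDATA[
      /* my_serialize() as above */
    ]]></script>
    <document>
      <entity name="timetable" transformer="script:my_serialize"
              query="SELECT * FROM stt_timetable">
        <field column="main_timetable" name="main_timetable"/>
      </entity>
    </document>
  </dataConfig>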

-Original Message-
From: Stefan Matheis [mailto:matheis.ste...@googlemail.com] 
Sent: Tuesday, January 25, 2011 11:13
To: solr-user@lucene.apache.org
Subject: Re: DIH serialize

Rich,

i played around for a few minutes with Script-Transformers, but i have not
enough knowledge to get anything done right know :/
My Idea was: looping over the given row, which should be a Java HashMap or
something like that? and do sth like this (pseudo-code):

var row_data = []
for( var key in row )
{
  row_data.push( '' + key + ' : ' + row[key] + '' );
}
row.put( 'whatever_field', '{' + row_data.join( ',' ) + '}' );

Which should result in a json-object like {'key1':'value1', 'key2':'value2'}
- and that should be okay to work with?

Regards
Stefan

On Mon, Jan 24, 2011 at 7:53 PM, Papp Richard ccode...@gmail.com wrote:

 Hi Stefan,

  yes, this is exactly what I intend - I don't want to search in this field
 - just quicly return me the result in a serialized form (the search
 criteria
 is on other fields). Well, if I could serialize the data exactly as like
 the
 PHP serialize() does I would be maximally satisfied, but any other form in
 which I could compact the data easily into one field I would be pleased.
  Can anyone help me? I guess the script is quite a good way, but I don't
 know which function should I use there to compact the data to be easily
 usable in PHP. Or any other method?

 thanks,
  Rich

 -Original Message-
 From: Stefan Matheis [mailto:matheis.ste...@googlemail.com]
 Sent: Monday, January 24, 2011 18:23
 To: solr-user@lucene.apache.org
 Subject: Re: DIH serialize

 Hi Rich,

 i'm a bit confused after reading your post .. what exactly you wanna try
to
 achieve? Serializing (like http://php.net/serialize) your complete row
 into
 one field? Don't wanna search in them, just store and deliver them in your
 results? Does that make sense? Sounds a bit strange :)

 Regards
 Stefan

 On Mon, Jan 24, 2011 at 10:03 AM, Papp Richard ccode...@gmail.com wrote:

  Hi Dennis,
 
   thank you for your answer, but didn't understand why you say it doesn't
  need serialization. I'm with the option C.
   but the main question is, how to put into one field a result of many
  fields: SELECT * FROM.
 
  thanks,
   Rich
 
  -Original Message-
  From: Dennis Gearon [mailto:gear...@sbcglobal.net]
  Sent: Monday, January 24, 2011 02:07
  To: solr-user@lucene.apache.org
  Subject: Re: DIH serialize
 
  Depends on your process chain to the eventual viewer/consumer of the
 data.
 
  The questions to ask are:
   A/ Is the data IN Solr going to be viewed or processed in its original form:
      --set stored='true'
      --no serialization needed.
   B/ If it's going to be analyzed and searched for separate from any other field,
      the analyzing will put it into an unreadable form. If you need to see it, then
      --set indexed=true and stored=true
      --no serialization needed.
   C/ If it's NOT going to be viewed AS IS, and it's not going to be searched for AS IS,
      (i.e. other columns will be how the data is found), and you have another,
      serializable format:
      --set indexed=false and stored=true
      --serialize AS PER THE INTENDED APPLICATION,
        not sure that Solr can do that at all.
   D/ If it's NOT going to be viewed AS IS, BUT it's going to be searched for AS IS,
      (this column will be how the data is found), and you have another,
      serializable format:
      --you need to put it into TWO columns
      --A SERIALIZED FIELD
        --set indexed=false and stored=true
      --AN UNSERIALIZED FIELD
        --set indexed=false and stored=true
        --serialize AS PER THE INTENDED APPLICATION,
          not sure that Solr can do that at all.

  Hope that helps!
 
 
  Dennis Gearon
 
 
  Signature Warning
  

RE: in-index representaton of tokens

2011-01-25 Thread Jonathan Rochkind
There aren't any tables involved. There's basically one list (per field) of 
unique tokens for the entire index, and also, a list for each token of which 
documents contain that token. Which is efficiently encoded, but I don't know 
the details of that encoding, maybe someone who does can tell you, or you can 
look at the lucene source, or get one of the several good books on lucene.  
These 'lists' are set up so you can efficiently look up a token, and see what 
documents contain that token.  That's basically what lucene does, the purpose 
of lucene. Oh, and then there's term positions and such too, so not only can 
you see what documents contain that token but you can do proximity searches and 
stuff. 

This all gets into lucene implementation details I am not familiar with though. 
 

Why do you want to know?  If you have specific concerns about disk space or RAM 
usage or something and how different schema choices effect it, ask them, and 
someone can probably tell you more easily than someone can explain the total 
architecture of lucene in a short listserv message. But, hey, maybe someone 
other than me can do that too!

From: Dennis Gearon [gear...@sbcglobal.net]
Sent: Tuesday, January 25, 2011 7:02 PM
To: solr-user@lucene.apache.org
Subject: Re: in-index representaton of tokens

I am saying there is a list of tokens that have been parsed (a table of them)
for each column? Or one for the whole index?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Jonathan Rochkind rochk...@jhu.edu
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Tue, January 25, 2011 9:29:36 AM
Subject: Re: in-index representaton of tokens

Why does it matter?  You can't really get at them unless you store them.

I don't know what table per column means, there's nothing in Solr
architecture called a table or a column. Although by column you
probably mean more or less Solr field.  There is nothing like a
table in Solr.

Solr is still not an rdbms.

On 1/25/2011 12:26 PM, Dennis Gearon wrote:
 So, the index is a list of tokens per column, right?

 There's a table per column that lists the analyzed tokens?

 And the tokens per column are represented as what, system integers? 32/64 bit
 unsigned ints?

   Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a
better
 idea to learn from others’ mistakes, so you do not have to make them yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.




Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Erick Erickson
OK, try this.

Use some analysis chain for your field like:

<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory" />
</analyzer>

This can be a multiValued field, BTW.

now use the TermsComponent to fetch your data. See:
http://wiki.apache.org/solr/TermsComponent

and specify terms.prefix=apple e.g.
http://localhost:8983/solr/terms?terms.prefix=app&terms.fl=blivet

The return list should be what you want. Note that the returned
values will be lower cased, and you can only specify
lower case in your search term (all because of specifying
the lowercase filter in my example).

This should be very fast no matter what your index size, as the
return list size defaults to 10 (though you can specify different
numbers).

Best
Erick
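
With the KeywordTokenizer analysis above, each stored value becomes a single
term, so the TermsComponent response for terms.prefix=apple would look roughly
like this (field name and frequency are illustrative):

  <lst name="terms">
    <lst name="blivet">
      <int name="apple milk shake">1</int>
    </lst>
  </lst>

milk with apple and pineapple vers apple do not appear, because their
single-token values do not start with the requested prefix.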

On Tue, Jan 25, 2011 at 3:03 PM, johnnyisrael johnnyi.john...@gmail.comwrote:


 Hi Eric,

 What I want here is, lets say I have 3 documents like

 [pineapple vers apple, milk with apple, apple milk shake ]

 and If i search for apple, it should return only apple milk shake
 because that term alone starts with the letter apple which I typed in. It
 should not bring others and if I type milk it should return only milk
 with apple

 I want an output Similar like a Google auto suggest.

 Is there a way to achieve  this without encapsulating with double quotes.

 Thanks,

 Johnny
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2333602.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr set up issues with Magento

2011-01-25 Thread Erick Erickson
There's almost no information to go on here. Please review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Tue, Jan 25, 2011 at 6:13 PM, Sandhya Padala geend...@gmail.com wrote:

 Thank you Markus. I have added few more fields to schema.xml.

 Now looks like the products are getting indexed. But no search results.

 In Magento if I configure to use SOlr as the search engine. Search is not
 returning any results.  If I change the search engine to use Magento's
 inbuilt MYSQL , Search results are returned.  Can you please direct me on
 where/how I  should start debug process.

 If I use Solr admin and enter the search query that doesn't return any
 results either.

 Thank you,
 Sandhya

 On Mon, Jan 24, 2011 at 4:11 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:

  Hi,
 
  You haven't defined the field in Solr's schema.xml configuration so it
  needs to
  be added first. Perhaps following the tutorial might be a good idea.
 
  http://lucene.apache.org/solr/tutorial.html
 
  Cheers.
 
   Hello Team:
  
  
 I am in the process of setting up Solr 1.4 with Magento ENterprise
   Edition 1.9.
  
   When I try to index the products I get the following error message.
  
   Jan 24, 2011 3:30:14 PM
  org.apache.solr.update.processor.LogUpdateProcessor
   fini
   sh
   INFO: {} 0 0
   Jan 24, 2011 3:30:14 PM org.apache.solr.common.SolrException log
   SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field
   'in_stock' at
   org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.jav
   a:289)
   at
   org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpd
   ateProcessorFactory.java:60)
   at
   org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
   at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
   at
   org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Co
   ntentStreamHandlerBase.java:54)
   at
   org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandl
   erBase.java:131)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
   at
   org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter
   .java:338)
   at
   org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilte
   r.java:241)
   at
   org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appl
   icationFilterChain.java:244)
   at
   org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationF
   ilterChain.java:210)
   at
   org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperV
   alve.java:240)
   at
   org.apache.catalina.core.StandardContextValve.invoke(StandardContextV
   alve.java:161)
   at
   org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.j
   ava:164)
   at
   org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.j
   ava:100)
   at
   org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:
   550)
   at
   org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineVal
   ve.java:118)
   at
   org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.jav
   a:380)
   at
   org.apache.coyote.http11.Http11Processor.process(Http11Processor.java
  
   :243)
  
   at
   org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce
   ss(Http11Protocol.java:188)
   at
   org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce
   ss(Http11Protocol.java:166)
   at
   org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoin
   t.java:288)
   at
   java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExec
   utor.java:886)
   at
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
   .java:908)
   at java.lang.Thread.run(Thread.java:662)
  
   Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute
   INFO: [] webapp=/solr path=/update params={wt=json} status=400 QTime=0
   Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2
   rollback INFO: start rollback
   Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2
   rollback INFO: end_rollback
   Jan 24, 2011 3:30:14 PM
  org.apache.solr.update.processor.LogUpdateProcessor
   fini
   sh
   INFO: {rollback=} 0 16
   Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute
  
   I am a new to both Magento and SOlr. I could have done some thing
 stupid
   during installation. I really look forward for your help.
  
   Thank you,
   Sandhya
 



Specifying optional terms with standard (lucene) request handler?

2011-01-25 Thread Daniel Pötzinger
Hello 

I am searching for a way to specify optional terms in a query ( that dont need 
to match (But if they match should influence the scoring) )

Using the dismax parser a query like this:
str name=mm2/str
str name=debugQueryon/str
str name=q+lorem ipsum dolor amet/str
str name=qfcontent/str
str name=hl.fl/
str name=qtdismax/str
Will be parsed into something like this:
str name=parsedquery_toString
+((+(content:lor) (content:ipsum) (content:dolor) (content:amet))~2) ()
/str
Which will result that only 2 of the 3 optional terms need to match?


How can optional terms be specified using the standard request handler?
My concrete requirement is that a certain term should match but another is 
optional. But if the optional part matches - it should give the document an 
extra score.
Something like :-)
str name=qcontent:lorem #optional#content:optionalboostword^10/str

An idea would be to use a function query to boost the document:
str name=q
content:lorem _val_:query({!lucene v='optionalword^20'})
/str
Which will result in:
str name=parsedquery_toString
+content:forum +query(content:optionalword^20.0,def=0.0)
/str
Is this a good way or are there other suggestions?

Thanks for any opinion and tips on this

Daniel






DIH clean=false

2011-01-25 Thread cyang2010

I am not sure I really understand what is meant by clean=false.

In my understanding, for a full-import with the default clean=true, it will blow
away all documents in the existing index, then do a full import of data from a table
into the index.  Is that right?

Then for clean=false, my understanding is that it won't blow away the existing
index.  For data that exists in both the index and the db table (by the same uniqueKey),
it will update the index data regardless of whether there is an actual field update.
For data that exists in the index but not in the table (by comparing uniqueKey),
it will leave it in the index.  Is that correct?  Otherwise, what is the
difference from clean=true?

Looking forward to your knowledge on this.  Thanks!
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-clean-false-tp2351120p2351120.html
Sent from the Solr - User mailing list archive at Nabble.com.
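
For reference, the flag is simply passed on the full-import request (host and
handler path assumed here); clean=false skips the initial delete-all step,
while documents sharing a uniqueKey with incoming rows are still replaced and
documents missing from the table are left untouched:

  http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true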