Indexing Chinese language
Hi, When I index Chinese content using the Chinese tokenizer and analyzer in Solr 1.3, some of the Chinese text files are getting indexed but others are not. Since Chinese has many different language subtypes, such as Standard Chinese, Simplified Chinese, etc., which of these does the Chinese tokenizer support, and is there any method to find the type of Chinese language from the file? Rgds
DIH transformers
Hello. I have been beating my head against the data-config.xml listed at the end of this message. It breaks in a few different ways.

1) I have bodged TemplateTransformer to allow it to return when one of the variables is undefined. This ensures my uniqueKey is always defined. But thinking more on Noble's comments, there is use in having it work both ways, i.e. leaving the column undefined or replacing the variable with an empty string. I still like my idea about using the default value of a Solr field from schema.xml, but I can't figure out how/where to best implement it.

2) Having used TemplateTransformer to assign a value to an entity column, that column cannot be used in other TemplateTransformer operations. In my project I am attempting to reuse x.fileWebPath. To fix this, the last line of transformRow() in TemplateTransformer.java needs to be replaced with the following, which as well as 'putting' the templated string into 'row' also saves it into the 'resolver'.

    **originally**
    row.put(column, resolver.replaceTokens(expr));
    }

    **new**
    String columnName = map.get(DataImporter.COLUMN);
    expr = resolver.replaceTokens(expr);
    row.put(columnName, expr);
    resolverMapCopy.put(columnName, expr);
    }

As an aside, I think I ran into the issues covered by SOLR-993. It took a while to figure out that I could not add a single columnname/value to the resolver; I had instead to add to the map that was already stored within the resolver.

3) No entity column names can be used within RegexTransformer. I guess all the stuff that was added to TemplateTransformer to allow column names to be used in templates needs to be re-added to RegexTransformer. I am doing that now... but am confused by the fragment of code which copies from resolverMap into resolverMapCopy. As best I can see, resolverMap is always empty; but I am barely able to follow the code! Can somebody explain when/why resolverMap would be populated. Also, I begin to understand the comments made by Noble in SOLR-1001 about resolving entity attributes in ContextImpl.getEntityAttribute, and I guess Shalin was right as well. However it also seems wrong that at the top of every transformer we are going to repeat the same code to load the resolver with information about the entity.

4) In that I am reusing template output within other templates, the order of execution becomes important. Can I assume that the explicitly listed columns in an entity are processed by the various transformers in the order they appear within data-config.xml? I *think* that the list of columns within an entity as returned by getAllEntityFields() is actually an ArrayList, which I think is order dependent. Is this correct?

5) Should I raise this as a single JIRA issue?

6) Having played with this stuff, I was going to add a bit more to the wiki highlighting some of the possibilities and issues with transformers. But want to check with the list first!
<dataConfig>
  <dataSource name="myfilereader" type="FileDataSource"/>
  <document>
    <entity name="jc" processor="FileListEntityProcessor" fileName="^.*\.xml$" newerThan="'NOW-1000DAYS'" recursive="true" rootEntity="false" dataSource="null" baseDir="/Volumes/spare/ts/solr/content">
      <entity name="x" dataSource="myfilereader" processor="XPathEntityProcessor" url="${jc.fileAbsolutePath}" rootEntity="true" stream="false" forEach="/record | /record/mediaBlock" transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer">
        <field column="fileAbsolutePath" template="${jc.fileAbsolutePath}" />
        <field column="fileWebPath" regex="${x.test}(.*)" replaceWith="/ford$1" sourceColName="fileAbsolutePath"/>
        <field column="title" xpath="/record/title" />
        <field column="para1" name="para" xpath="/record/sect1/para" />
        <field column="para2" name="para" xpath="/record/list/listitem/para" />
        <field column="pubdate" xpath="/record/metadata/da...@qualifier='pubDate']" dateTimeFormat="MMdd" />
        <field column="vurl" xpath="/record/mediaBlock/mediaObject/@vurl" />
        <field column="imgSrcArticle" template="${dataimporter.request.fordinstalldir}" />
        <field column="imgCpation" xpath="/record/mediaBlock/caption" />
        <field column="test" template="${dataimporter.request.contentinstalldir}" />
        <!-- **problem is that vurl is just a fragment of the info needed to access the picture. -->
        <field column="imgWebPathICON" regex="(.*)/.*" replaceWith="$1/imagery/${x.vurl}s.jpg" sourceColName="fileWebPath"/>
        <field column="imgWebPathFULL" regex="(.*)/.*"
Re: spellcheck.onlyMorePopular
Shalin Shekhar Mangar wrote:
The implementation is a bit more complicated.
1. Read all tokens from the specified field in the Solr index.
2. Create n-grams of the terms read in #1 and index them into a separate Lucene index (the spellcheck index).
3. When asked for suggestions, create n-grams of the query terms, search the spellcheck index and collect the top (by Lucene score) 10*spellcheck.count results.
4. If onlyMorePopular=true, determine the frequency of each result in the Solr index and remove terms which have a lower frequency.
5. Compute the edit distance between the result and the query token.
6. Return the top spellcheck.count results (sorted by edit distance descending) which are greater than the specified accuracy.

Thanks, I think this makes things clear(er) now. I do agree that the documentation needs improvement on this point, as you said later in this thread. :)

Your primary use-case is not spellcheck at all but this might work with some hacking. Fuzzy queries may be a better solution, as Walter said. Storing all successful search queries may be hard to scale.

This is certainly true. The drawback of fuzzy searching is that you get back exact and fuzzy hits together in one result set (correct me if I'm wrong). One could filter out the exact/fuzzy hits, but this would make paging impossible. The approach using KeywordTokenizer as you suggested before seems more promising to me. Unfortunately there seems to be no documentation for this (at least in conjunction with spell checking). If I understand this rightly, the tokenizer must be applied to the field in the search index (not the spell checking index). Is that correct? Thanks, Marcus
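For reference, the steps described above correspond closely to Lucene's contrib SpellChecker, which Solr's spellcheck component builds on. A minimal sketch of that underlying API; the index paths and the field name "title" are made up for illustration:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.spell.LuceneDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.FSDirectory;

    public class SpellcheckSketch {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.getDirectory("/path/to/solr/data/index"));
            // Steps 1-2: read the terms of the field and build the n-gram spellcheck index.
            SpellChecker spell = new SpellChecker(FSDirectory.getDirectory("/path/to/spellcheck-index"));
            spell.indexDictionary(new LuceneDictionary(reader, "title"));
            // Steps 3-6: the last argument is the onlyMorePopular flag; when true,
            // only terms that are more frequent in the main index than the query
            // term are returned.
            String[] suggestions = spell.suggestSimilar("beleive", 10, reader, "title", true);
            for (String s : suggestions) {
                System.out.println(s);
            }
            reader.close();
        }
    }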
Re: almost realtime updates with replication
Hi Hoss, Is it a problem if the snappuller miss one snapshot before the last one ?? Cheer, Have a nice day, hossman wrote: : : There are a couple queries that we would like to run almost realtime so : I would like to have it so our client sends an update on every new : document and then have solr configured to do an autocommit every 5-10 : seconds. : : reading the Wiki, it seems like this isn't possible because of the : strain of snapshotting and pulling to the slaves at such a high rate. : What I was thinking was for these few queries to just query the master : and the rest can query the slave with the not realtime data, although : I'm assuming this wouldn't work either because since a snapshot is : created on every commit, we would still impact the performance too much? there is no reason why a commit has to trigger a snapshot, that happens only if you configure a postCommit hook to do so in your solrconfig.xml you can absolutely commit every 5 seconds, but have a seperate cron task that runs snapshooter ever 5 minutes -- you could even continue to run snapshooter on every commit, and get a new snapshot ever 5 seconds, but only run snappuller on your slave machines ever 5 minutes (the snapshots are hardlinks and don't take up a lot of space, and snappuller only needs to fetch the most recent snapshot) your idea of querying the msater directly for these queries seems perfectly fine to me ... just make sure the auto warm count on the caches on your master is very tiny so the new searchers are ready quickly after each commit. -Hoss -- View this message in context: http://www.nabble.com/almost-realtime-updates-with-replication-tp12276614p22034406.html Sent from the Solr - User mailing list archive at Nabble.com.
snapshot created if there is no document updated/new?
Hi, I would like to know if a snapshot is automatically created even if there is no document updated or added? Thanks a lot, -- View this message in context: http://www.nabble.com/snapshot-created-if-there-is-no-documente-updated-new--tp22034462p22034462.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: almost realtime updates with replication
I guess , it should not be a problem --Noble On Mon, Feb 16, 2009 at 3:28 PM, sunnyfr johanna...@gmail.com wrote: Hi Hoss, Is it a problem if the snappuller miss one snapshot before the last one ?? Cheer, Have a nice day, hossman wrote: : : There are a couple queries that we would like to run almost realtime so : I would like to have it so our client sends an update on every new : document and then have solr configured to do an autocommit every 5-10 : seconds. : : reading the Wiki, it seems like this isn't possible because of the : strain of snapshotting and pulling to the slaves at such a high rate. : What I was thinking was for these few queries to just query the master : and the rest can query the slave with the not realtime data, although : I'm assuming this wouldn't work either because since a snapshot is : created on every commit, we would still impact the performance too much? there is no reason why a commit has to trigger a snapshot, that happens only if you configure a postCommit hook to do so in your solrconfig.xml you can absolutely commit every 5 seconds, but have a seperate cron task that runs snapshooter ever 5 minutes -- you could even continue to run snapshooter on every commit, and get a new snapshot ever 5 seconds, but only run snappuller on your slave machines ever 5 minutes (the snapshots are hardlinks and don't take up a lot of space, and snappuller only needs to fetch the most recent snapshot) your idea of querying the msater directly for these queries seems perfectly fine to me ... just make sure the auto warm count on the caches on your master is very tiny so the new searchers are ready quickly after each commit. -Hoss -- View this message in context: http://www.nabble.com/almost-realtime-updates-with-replication-tp12276614p22034406.html Sent from the Solr - User mailing list archive at Nabble.com. -- --Noble Paul
Re: facet count on partial results
On 15 Feb 2009, at 20:15, Yonik Seeley wrote: On Sat, Feb 14, 2009 at 6:45 AM, karl wettin karl.wet...@gmail.com wrote: Also, as my threadshold is based on the distance in score between the first result it sounds like using a result start position greater than 0 is something I have to look out for. Or? Hmmm - this isn't that easy in general as it requires knowledge of the max score, right?

Hmmm indeed. Does Solr not collect 0-20 even though the request is for 10-20? Wouldn't it then be possible to inject some code that limits the DocSet at that layer?

There is more. Not important, but a nice thing to get: I create multiple documents per entity from my primary data source (e.g. each entity a book and each document a paragraph from the book) but I only want to present the top scoring document per entity. I handle this with client-side post-processing of the results. This means that I potentially get facet counts from documents that I actually don't present to the user. It would be nice to handle this in the same layer as my score threshold restriction, but it would require loading the primary key from the document rather early. And it would also mean that even though I might get 2000 results within the threshold, the actual number of results I want to pass on to the client is a lot less than that. I.e. I'll have to request more results than I want in order to ensure I get enough even after filtering out documents that point at an entity already a member of the result list but with a greater score.

The question is if I can fit all this stuff in the same layer as the by-score threshold result set limiter. I'm rather lost in the Solr code. Pointers to class and method names are most welcome. karl
Re: DIH transformers
On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie fer...@twig.me.uk wrote:

Hello. I have been beating my head against the data-config.xml listed at the end of this message. It breaks in a few different ways. 1) I have bodged TemplateTransformer to allow it to return when one of the variables is undefined. This ensures my uniqueKey is always defined. But thinking more on Noble's comments, there is use in having it work both ways, i.e. leaving the column undefined or replacing the variable with an empty string. I still like my idea about using the default value of a Solr field from schema.xml, but I can't figure out how/where to best implement it.

When a value is missing from the template, we may end up constructing a partial string, which may not be desired. If we leave it out as empty, then Solr would automatically put in the default value and it should be solved. Just in case you wish to know the default value from schema.xml, you can get it from the API:

    fields = context.getAllEntityFields();
    String defval = fields.get(0).get("defaultvalue");

2) Having used TemplateTransformer to assign a value to an entity column, that column cannot be used in other TemplateTransformer operations. In my project I am attempting to reuse x.fileWebPath. To fix this, the last line of transformRow() in TemplateTransformer.java needs to be replaced with the following, which as well as 'putting' the templated string into 'row' also saves it into the 'resolver'.

    **originally**
    row.put(column, resolver.replaceTokens(expr));
    }

    **new**
    String columnName = map.get(DataImporter.COLUMN);
    expr = resolver.replaceTokens(expr);
    row.put(columnName, expr);
    resolverMapCopy.put(columnName, expr);
    }

Isn't it better to write a custom transformer to achieve this? I did not want a standard component to change the state of the VariableResolver. I am not sure what the best way is.

As an aside, I think I ran into the issues covered by SOLR-993. It took a while to figure out that I could not add a single columnname/value to the resolver; I had instead to add to the map that was already stored within the resolver.

3) No entity column names can be used within RegexTransformer. I guess all the stuff that was added to TemplateTransformer to allow column names to be used in templates needs to be re-added to RegexTransformer. I am doing that now... but am confused by the fragment of code which copies from resolverMap into resolverMapCopy. As best I can see, resolverMap is always empty; but I am barely able to follow the code! Can somebody explain when/why resolverMap would be populated.

The behavior is like this: the expression ${currentEntity.colName} does not work automatically, because the row is not added to the VariableResolver. TemplateTransformer has hacked the stuff to make it work. We can think of modifying this behavior.

Also, I begin to understand the comments made by Noble in SOLR-1001 about resolving entity attributes in ContextImpl.getEntityAttribute, and I guess Shalin was right as well. However it also seems wrong that at the top of every transformer we are going to repeat the same code to load the resolver with information about the entity.

4) In that I am reusing template output within other templates, the order of execution becomes important. Can I assume that the explicitly listed columns in an entity are processed by the various transformers in the order they appear within data-config.xml? I *think* that the list of columns within an entity as returned by getAllEntityFields() is actually an ArrayList, which I think is order dependent. Is this correct?

IT IS CORRECT.

5) Should I raise this as a single JIRA issue?

Do not add ONE issue for all. If they are logically connected, put all of them into one. If not, split them into as many issues as possible.

6) Having played with this stuff, I was going to add a bit more to the wiki highlighting some of the possibilities and issues with transformers. But want to check with the list first!

    <dataConfig>
      <dataSource name="myfilereader" type="FileDataSource"/>
      <document>
        <entity name="jc" processor="FileListEntityProcessor" fileName="^.*\.xml$" newerThan="'NOW-1000DAYS'" recursive="true" rootEntity="false" dataSource="null" baseDir="/Volumes/spare/ts/solr/content">
          <entity name="x" dataSource="myfilereader" processor="XPathEntityProcessor" url="${jc.fileAbsolutePath}" rootEntity="true" stream="false" forEach="/record | /record/mediaBlock" transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer">
            <field
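To illustrate the custom-transformer suggestion above, a transformer that copies a derived value back into the row (so later fields in the same entity can refer to it) could look roughly like the sketch below. This is only an outline against the Solr 1.3 DIH Transformer API; the class name and column names are made up:

    import java.util.Map;
    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.Transformer;

    public class FileWebPathTransformer extends Transformer {
        @Override
        public Object transformRow(Map<String, Object> row, Context context) {
            // Build the derived column from values already present in the row,
            // so that later fields in this entity can reuse it.
            Object absolutePath = row.get("fileAbsolutePath");
            Object installDir = row.get("test");
            if (absolutePath != null && installDir != null) {
                String webPath = absolutePath.toString().replace(installDir.toString(), "/ford");
                row.put("fileWebPath", webPath);
            }
            return row;
        }
    }

The transformer would then be listed in the entity's transformer attribute alongside the standard ones.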
Distributed search
Hi, Can we use multicore to have several indexes per webapp and use distributed search to merge the indexes? For example, if we have 3 cores - core0, core1 and core2 - for 3 different languages, can we search across all 3 indexes by using the shards parameter, as in shards=localhost:8080/solr/core0,localhost:8080/solr/core1,localhost:8080/solr/core2 ? Regards Sujatha
Re: Release of solr 1.4 autosuggest
On Feb 16, 2009, at 12:05 AM, Pooja Verlani wrote: Hi All, I am interested in TermComponent addition in solr 1.4 ( http://wiki.apache.org/solr/TermsComponent). When should we expect solr 1.4 to be available for use? Also, can this Termcomponent be made available as a plugin for solr 1.3? I'm guessing the TermComponent patch would apply to the 1.3 source, but I haven't tried it. Kindly reply if you have any idea. Regards, Pooja -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: Word Locations Search Components
On Feb 15, 2009, at 10:33 PM, Johnny X wrote: Hi there, I was told before that I'd need to create a custom search component to do what I want to do, but I'm thinking it might actually be a custom analyzer. Basically, I'm indexing e-mail in XML in Solr and searching the 'content' field which is parsed as 'text'. I want to ignore certain elements of the e-mail (i.e. corporate banners), but also identify the actual content of those e-mails including corporate information. To identify the banners I need something a little more developed than a stop word list. I need to evaluate the frequency of certain words around words like 'privileged' and 'corporate' within a word window of about 100ish words to determine whether they're banners and then remove them from being indexed. I need to do the opposite during the same time to identify, in a similar manner, which e-mails include corporate information in their actual content. I suppose if I'm doing this I don't want what's processed to be indexed as what's returned in a search, because then presumably it won't be the full e-mail, so do I need to store some kind of copy field that keeps the full e-mail and is fully indexed to be returned instead? Storage and indexing are separate things in Lucene/Solr, so setting the Field as stored will keep the original, so no need for a copy field for this particular issue. Can what I'm suggesting be done and can anyone direct me to a guide? Hmm, this kind of stuff may be better off as part of preprocessing, but it could be done as an analyzer, I suppose. How are you determining the words to evaluate? Is it based on collection statistics or just within a document? Or do you just have a list of marker words that indicate the areas of interest? Do you need to keep track of anything beyond the life of one document being analyzed? If you were doing this as an analyzer, you would need to buffer the tokens internally so that you could examine them in a window, and then make a decision as to what tokens to output. I believe the RemoveDuplicatesTokenFilter demonstrates how to do this. Basically, you just need a List to store the tokens in if you see certain conditions met. On another note, is there an easy way to destroy an index...any custom code? Send in a delete by query command with the *:* query. Thanks for any help! -- View this message in context: http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html Sent from the Solr - User mailing list archive at Nabble.com. -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
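On the "destroy an index" point, the delete-by-query mentioned above can also be sent from SolrJ. A minimal sketch, assuming a server running at the usual example URL:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class WipeIndex {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            // Delete every document, then commit so the change becomes visible.
            server.deleteByQuery("*:*");
            server.commit();
        }
    }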
Re: Multilanguage
I recommend that you search both this and the Lucene list. You'll find that this topic has been discussed many times, and several approaches have been outlined. The searchable archives are linked to from here: http://lucene.apache.org/java/docs/mailinglists.html. Best Erick On Mon, Feb 16, 2009 at 12:42 AM, revathy arun revas...@gmail.com wrote: Hi, I have a scenario where ,i need to convert pdf content to text and then index the same at run time .I do not know as to what language the pdf would be ,in this case which is the best soln i have with respect the content field type in the schema where the text content would be indexed to? That is can i use the default tokenizer for all languages and since i would not know the language and hence would not be able to stem the tokens,how would this impact search?Is there any other solution for the same? Rgds
Re: almost realtime updates with replication
Hi Noble, So ok I don't mind really if it miss one, if it get the last one it's good. I've was wondering as well if a snapshot is created even if no document has been update? Thanks a lot Noble, Wish you a very nice day, Noble Paul നോബിള് नोब्ळ् wrote: I guess , it should not be a problem --Noble On Mon, Feb 16, 2009 at 3:28 PM, sunnyfr johanna...@gmail.com wrote: Hi Hoss, Is it a problem if the snappuller miss one snapshot before the last one ?? Cheer, Have a nice day, hossman wrote: : : There are a couple queries that we would like to run almost realtime so : I would like to have it so our client sends an update on every new : document and then have solr configured to do an autocommit every 5-10 : seconds. : : reading the Wiki, it seems like this isn't possible because of the : strain of snapshotting and pulling to the slaves at such a high rate. : What I was thinking was for these few queries to just query the master : and the rest can query the slave with the not realtime data, although : I'm assuming this wouldn't work either because since a snapshot is : created on every commit, we would still impact the performance too much? there is no reason why a commit has to trigger a snapshot, that happens only if you configure a postCommit hook to do so in your solrconfig.xml you can absolutely commit every 5 seconds, but have a seperate cron task that runs snapshooter ever 5 minutes -- you could even continue to run snapshooter on every commit, and get a new snapshot ever 5 seconds, but only run snappuller on your slave machines ever 5 minutes (the snapshots are hardlinks and don't take up a lot of space, and snappuller only needs to fetch the most recent snapshot) your idea of querying the msater directly for these queries seems perfectly fine to me ... just make sure the auto warm count on the caches on your master is very tiny so the new searchers are ready quickly after each commit. -Hoss -- View this message in context: http://www.nabble.com/almost-realtime-updates-with-replication-tp12276614p22034406.html Sent from the Solr - User mailing list archive at Nabble.com. -- --Noble Paul -- View this message in context: http://www.nabble.com/almost-realtime-updates-with-replication-tp12276614p22037977.html Sent from the Solr - User mailing list archive at Nabble.com.
snapshot as big as the index folder?
Hi, Is it normal or did I miss something ?? 5.8Gbook/data/snapshot.20090216153346 12K book/data/spellchecker2 4.0Kbook/data/index 12K book/data/spellcheckerFile 12K book/data/spellchecker1 5.8Gbook/data/ Last update ? str name=Total Requests made to DataSource92562/str str name=Total Rows Fetched45492/str str name=Total Documents Skipped0/str str name=Delta Dump started2009-02-16 15:20:01/str str name=Identifying Delta2009-02-16 15:20:01/str str name=Deltas Obtained2009-02-16 15:20:42/str str name=Building documents2009-02-16 15:20:42/str str name=Total Changed Documents13223/str − str name= Indexing completed. Added/Updated: 13223 documents. Deleted 0 documents. /str str name=Committed2009-02-16 15:33:50/str str name=Optimized2009-02-16 15:33:50/str str name=Time taken 0:13:48.853/str Thanks a lot, -- View this message in context: http://www.nabble.com/snapshot-as-big-as-the-index-folder--tp22038427p22038427.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: snapshot as big as the index folder?
It change a lot in few minute ?? is it normal ? thanks 5.8Gbook/data/snapshot.20090216153346 4.0Kbook/data/index 5.8Gbook/data/ r...@search-07:/data/solr# du -h book/data/ 5.8Gbook/data/snapshot.20090216153346 3.7Gbook/data/index 4.0Kbook/data/snapshot.20090216153759 9.4Gbook/data/ r...@search-07:/data/solr# du -h book/data/ 5.8Gvideo/data/snapshot.20090216153346 4.4Gbook/data/index 4.0Kbook/data/snapshot.20090216153759 11G book/data/ r...@search-07:/data/solr# du -h book/data/ 5.8Gbook/data/snapshot.20090216153346 5.8Gbook/data/index 4.0Kbook/data/snapshot.20090216154819 4.0Kbook/data/snapshot.20090216154820 15M book/data/snapshot.20090216153759 12G book/data/ sunnyfr wrote: Hi, Is it normal or did I miss something ?? 5.8G book/data/snapshot.20090216153346 12K book/data/spellchecker2 4.0K book/data/index 12K book/data/spellcheckerFile 12K book/data/spellchecker1 5.8G book/data/ Last update ? str name=Total Requests made to DataSource92562/str str name=Total Rows Fetched45492/str str name=Total Documents Skipped0/str str name=Delta Dump started2009-02-16 15:20:01/str str name=Identifying Delta2009-02-16 15:20:01/str str name=Deltas Obtained2009-02-16 15:20:42/str str name=Building documents2009-02-16 15:20:42/str str name=Total Changed Documents13223/str − str name= Indexing completed. Added/Updated: 13223 documents. Deleted 0 documents. /str str name=Committed2009-02-16 15:33:50/str str name=Optimized2009-02-16 15:33:50/str str name=Time taken 0:13:48.853/str Thanks a lot, -- View this message in context: http://www.nabble.com/snapshot-as-big-as-the-index-folder--tp22038427p22038656.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Word Locations Search Components
I would go for a business logic solution and not a Solr customization in this case, as you need to filter information that you actually would like to see in diferent fields on your index. Did you already tried to split the email in several fields like subject, from, to, content, signature, etc etc etc ? 2009/2/16 Johnny X jonathanwel...@gmail.com Hi there, I was told before that I'd need to create a custom search component to do what I want to do, but I'm thinking it might actually be a custom analyzer. Basically, I'm indexing e-mail in XML in Solr and searching the 'content' field which is parsed as 'text'. I want to ignore certain elements of the e-mail (i.e. corporate banners), but also identify the actual content of those e-mails including corporate information. To identify the banners I need something a little more developed than a stop word list. I need to evaluate the frequency of certain words around words like 'privileged' and 'corporate' within a word window of about 100ish words to determine whether they're banners and then remove them from being indexed. I need to do the opposite during the same time to identify, in a similar manner, which e-mails include corporate information in their actual content. I suppose if I'm doing this I don't want what's processed to be indexed as what's returned in a search, because then presumably it won't be the full e-mail, so do I need to store some kind of copy field that keeps the full e-mail and is fully indexed to be returned instead? Can what I'm suggesting be done and can anyone direct me to a guide? On another note, is there an easy way to destroy an index...any custom code? Thanks for any help! -- View this message in context: http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html Sent from the Solr - User mailing list archive at Nabble.com. -- Alexander Ramos Jardim
Re: Word Locations Search Components
Basically I'm working on the Enron dataset, and I've already de-duplicated the collection and applied a spam filter. All the e-mails after this have been parsed to XML and each field (so To, From, Date etc) has been separated, along with one large field for the remaining e-mail content (called Content). So yes, to answer your question. Bearing in mind though this still represents around 240, 000ish files to compute. I have no idea about Solr analyzers/search components, but my theory was that I'd need an analyzer to remove 'banner-like' content from being indexed and a search component to identify 'corporate-like' information in the content of the e-mails. What is a business logical solution and how will that work? Thanks. zayhen wrote: I would go for a business logic solution and not a Solr customization in this case, as you need to filter information that you actually would like to see in diferent fields on your index. Did you already tried to split the email in several fields like subject, from, to, content, signature, etc etc etc ? 2009/2/16 Johnny X jonathanwel...@gmail.com Hi there, I was told before that I'd need to create a custom search component to do what I want to do, but I'm thinking it might actually be a custom analyzer. Basically, I'm indexing e-mail in XML in Solr and searching the 'content' field which is parsed as 'text'. I want to ignore certain elements of the e-mail (i.e. corporate banners), but also identify the actual content of those e-mails including corporate information. To identify the banners I need something a little more developed than a stop word list. I need to evaluate the frequency of certain words around words like 'privileged' and 'corporate' within a word window of about 100ish words to determine whether they're banners and then remove them from being indexed. I need to do the opposite during the same time to identify, in a similar manner, which e-mails include corporate information in their actual content. I suppose if I'm doing this I don't want what's processed to be indexed as what's returned in a search, because then presumably it won't be the full e-mail, so do I need to store some kind of copy field that keeps the full e-mail and is fully indexed to be returned instead? Can what I'm suggesting be done and can anyone direct me to a guide? On another note, is there an easy way to destroy an index...any custom code? Thanks for any help! -- View this message in context: http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html Sent from the Solr - User mailing list archive at Nabble.com. -- Alexander Ramos Jardim - RPG da Ilha -- View this message in context: http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22038912.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Word Locations Search Components
I think you essentially have to do much of the same work either way, so take whatever comes easiest. Personally, I think that pre-processing the data (and using two fields) would be easiest, but it's up to you. Using a custom analyzer would involve collecting all the contents, deciding what is relevant and emitting those tokens one by one. The advantage here (and it's not very important) is that you'd only need one field as Grant said. The other approach would be to read the contents into a buffer, apply whatever business logic you determine to remove the irrelevant text, and then submitting this to the normal analyzers. The advantage here is that it's a simpler flow. Analyzers are usually just used for breaking up an incoming stream and doing specific transformations (stop words, stemming, etc). These transformations are pretty context-less. Extending that process to handle complex rules about what's relevant is a bit of a stretch. But if you do pre-process the data, storing the input won't be what you want and you'll need to store the original text in a separate field. Best Erick On Mon, Feb 16, 2009 at 10:05 AM, Johnny X jonathanwel...@gmail.com wrote: Basically I'm working on the Enron dataset, and I've already de-duplicated the collection and applied a spam filter. All the e-mails after this have been parsed to XML and each field (so To, From, Date etc) has been separated, along with one large field for the remaining e-mail content (called Content). So yes, to answer your question. Bearing in mind though this still represents around 240, 000ish files to compute. I have no idea about Solr analyzers/search components, but my theory was that I'd need an analyzer to remove 'banner-like' content from being indexed and a search component to identify 'corporate-like' information in the content of the e-mails. What is a business logical solution and how will that work? Thanks. zayhen wrote: I would go for a business logic solution and not a Solr customization in this case, as you need to filter information that you actually would like to see in diferent fields on your index. Did you already tried to split the email in several fields like subject, from, to, content, signature, etc etc etc ? 2009/2/16 Johnny X jonathanwel...@gmail.com Hi there, I was told before that I'd need to create a custom search component to do what I want to do, but I'm thinking it might actually be a custom analyzer. Basically, I'm indexing e-mail in XML in Solr and searching the 'content' field which is parsed as 'text'. I want to ignore certain elements of the e-mail (i.e. corporate banners), but also identify the actual content of those e-mails including corporate information. To identify the banners I need something a little more developed than a stop word list. I need to evaluate the frequency of certain words around words like 'privileged' and 'corporate' within a word window of about 100ish words to determine whether they're banners and then remove them from being indexed. I need to do the opposite during the same time to identify, in a similar manner, which e-mails include corporate information in their actual content. I suppose if I'm doing this I don't want what's processed to be indexed as what's returned in a search, because then presumably it won't be the full e-mail, so do I need to store some kind of copy field that keeps the full e-mail and is fully indexed to be returned instead? Can what I'm suggesting be done and can anyone direct me to a guide? 
On another note, is there an easy way to destroy an index...any custom code? Thanks for any help! -- View this message in context: http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html Sent from the Solr - User mailing list archive at Nabble.com. -- Alexander Ramos Jardim - RPG da Ilha -- View this message in context: http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22038912.html Sent from the Solr - User mailing list archive at Nabble.com.
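A rough sketch of the pre-processing route described in the reply above, using SolrJ: strip the banner-like text before indexing, and keep the untouched e-mail in a second, stored field. The field names, the banner pattern and the document id are only placeholders:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexEmail {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            String original = "...full e-mail content...";
            // Placeholder business logic: drop lines that look like corporate banners.
            String cleaned = original.replaceAll("(?im)^.*privileged.*$", "");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "email-1");
            doc.addField("content", cleaned);       // field that gets indexed and searched
            doc.addField("content_raw", original);  // stored copy returned to users
            server.add(doc);
            server.commit();
        }
    }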
Re: facet count on partial results
On Sat, Feb 14, 2009 at 6:45 AM, karl wettin karl.wet...@gmail.com wrote: Also, as my threadshold is based on the distance in score between the first result it sounds like using a result start position greater than 0 is something I have to look out for. Or? Hmmm - this isn't that easy in general as it requires knowledge of the max score, right? Hmmm indeed. Does Solr not collect 0-20 even though the request is for 10-20? Wouldn't it then be possible to inject some code that limits the DocSet at that layer? Yes, Solr would actually collect 0-20, but the entire set of matching documents must still be scored to find the maximum score. So if the threshold will be a function of maxScore, it still requires two passes, no? There is more. Not important but a nice thing to get: I create multiple documents per entity from my primary data source (e.g. each entity a book and each document a paragraph from the book) but I only want to present the top scoring document per entity. This sounds like field collapsing. There's is a patch that's still in the works: http://wiki.apache.org/solr/FieldCollapsing -Yonik http://www.lucidimagination.com
Re: Release of solr 1.4 autosuggest
the logging used is changed j.u.l to slf4j . That is the only problem I can see. If you drop in that jar as well it should just work On Mon, Feb 16, 2009 at 6:49 PM, Grant Ingersoll gsing...@apache.org wrote: On Feb 16, 2009, at 12:05 AM, Pooja Verlani wrote: Hi All, I am interested in TermComponent addition in solr 1.4 ( http://wiki.apache.org/solr/TermsComponent). When should we expect solr 1.4 to be available for use? Also, can this Termcomponent be made available as a plugin for solr 1.3? I'm guessing the TermComponent patch would apply to the 1.3 source, but I haven't tried it. Kindly reply if you have any idea. Regards, Pooja -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search -- --Noble Paul
Re: almost realtime updates with replication
yes , it does . it just blindly creates hard links irrespective of a document is added or not. but no snappull will happen because there is no new file to be downloaded On Mon, Feb 16, 2009 at 7:40 PM, sunnyfr johanna...@gmail.com wrote: Hi Noble, So ok I don't mind really if it miss one, if it get the last one it's good. I've was wondering as well if a snapshot is created even if no document has been update? Thanks a lot Noble, Wish you a very nice day, Noble Paul നോബിള് नोब्ळ् wrote: I guess , it should not be a problem --Noble On Mon, Feb 16, 2009 at 3:28 PM, sunnyfr johanna...@gmail.com wrote: Hi Hoss, Is it a problem if the snappuller miss one snapshot before the last one ?? Cheer, Have a nice day, hossman wrote: : : There are a couple queries that we would like to run almost realtime so : I would like to have it so our client sends an update on every new : document and then have solr configured to do an autocommit every 5-10 : seconds. : : reading the Wiki, it seems like this isn't possible because of the : strain of snapshotting and pulling to the slaves at such a high rate. : What I was thinking was for these few queries to just query the master : and the rest can query the slave with the not realtime data, although : I'm assuming this wouldn't work either because since a snapshot is : created on every commit, we would still impact the performance too much? there is no reason why a commit has to trigger a snapshot, that happens only if you configure a postCommit hook to do so in your solrconfig.xml you can absolutely commit every 5 seconds, but have a seperate cron task that runs snapshooter ever 5 minutes -- you could even continue to run snapshooter on every commit, and get a new snapshot ever 5 seconds, but only run snappuller on your slave machines ever 5 minutes (the snapshots are hardlinks and don't take up a lot of space, and snappuller only needs to fetch the most recent snapshot) your idea of querying the msater directly for these queries seems perfectly fine to me ... just make sure the auto warm count on the caches on your master is very tiny so the new searchers are ready quickly after each commit. -Hoss -- View this message in context: http://www.nabble.com/almost-realtime-updates-with-replication-tp12276614p22034406.html Sent from the Solr - User mailing list archive at Nabble.com. -- --Noble Paul -- View this message in context: http://www.nabble.com/almost-realtime-updates-with-replication-tp12276614p22037977.html Sent from the Solr - User mailing list archive at Nabble.com. -- --Noble Paul
Re: delete snapshot??
Hi, Ok but can I use it more often then every day like every three hours, because snapshot are quite big. Thanks a lot, Bill Au wrote: The --delete option of the rsync command deletes extraneous files from the destination directory. It does not delete Solr snapshots. To do that you can use the snapcleaner on the master and/or slave. Bill On Fri, Feb 13, 2009 at 10:15 AM, sunnyfr johanna...@gmail.com wrote: root 26834 16.2 0.0 19412 824 ?S16:05 0:08 rsync -Wa --delete rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ /data/solr/books/data/snapshot.20090213160051-wip Hi obviously it can't delete them because the adress is bad it shouldnt be : rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ but: rsync://##.##.##.##:18180/solr/books/snapshot.20090213160051/ Where should I change this, I checked my script.conf on the slave server but it seems good. Because files can be very big and my server in few hours is getting full. So actually snapcleaner is not necessary on the master ? what about the slave? Thanks a lot, Sunny -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p21998333.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p22041332.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: delete snapshot??
they are just hardlinks. they do not consume space on disk On Mon, Feb 16, 2009 at 10:34 PM, sunnyfr johanna...@gmail.com wrote: Hi, Ok but can I use it more often then every day like every three hours, because snapshot are quite big. Thanks a lot, Bill Au wrote: The --delete option of the rsync command deletes extraneous files from the destination directory. It does not delete Solr snapshots. To do that you can use the snapcleaner on the master and/or slave. Bill On Fri, Feb 13, 2009 at 10:15 AM, sunnyfr johanna...@gmail.com wrote: root 26834 16.2 0.0 19412 824 ?S16:05 0:08 rsync -Wa --delete rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ /data/solr/books/data/snapshot.20090213160051-wip Hi obviously it can't delete them because the adress is bad it shouldnt be : rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ but: rsync://##.##.##.##:18180/solr/books/snapshot.20090213160051/ Where should I change this, I checked my script.conf on the slave server but it seems good. Because files can be very big and my server in few hours is getting full. So actually snapcleaner is not necessary on the master ? what about the slave? Thanks a lot, Sunny -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p21998333.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p22041332.html Sent from the Solr - User mailing list archive at Nabble.com. -- --Noble Paul
Input XML duplicate fields uniqueness
Hi, I have an Input XML as

  <rec id="1" updt="12-Feb-2009">
    <updated_rec>
      <account id="1" loc="NJ" pass="safsafsd#sf08" type="Dev" active="1">
        <updated_item>
          <loc new="NJ" old="CP"/>
        </updated_item>
      </account>
      <account id="2" loc="KL" pass="080jnkdfhjwf" type="Int" active="0">
        <updated_item>
          <pass new="080jnkdfhjwf" old="08dedf"/>
        </updated_item>
      </account>
    </updated_rec>
  </rec>

now for SOLR indexing converted it to

  <add><doc>
    <field name="rec.id">1</field>
    <field name="rec.updt">12-Feb-2009</field>
    <field name="rec.updated_rec.account.id">1</field>
    <field name="rec.updated_rec.account.loc">NJ</field>
    <field name="rec.updated_rec.account.pass">safsafsd#sf08</field>
    <field name="rec.updated_rec.account.type">Dev</field>
    <field name="rec.updated_rec.account.active">1</field>
    <field name="rec.updated_rec.account.updated_item.loc.new">NJ</field>
    <field name="rec.updated_rec.account.updated_item.loc.old">CP</field>
    <field name="rec.updated_rec.account.id">2</field>
    <field name="rec.updated_rec.account.loc">KL</field>
    <field name="rec.updated_rec.account.pass">080jnkdfhjwf</field>
    <field name="rec.updated_rec.account.type">Int</field>
    <field name="rec.updated_rec.account.active">0</field>
    <field name="rec.updated_rec.account.updated_item.pass.new">080jnkdfhjwf</field>
    <field name="rec.updated_rec.account.updated_item.pass.old">08dedf</field>
  </doc></add>

I was able to index it. Just put this single xml and searched based on rec.id and response xml returned however input xml tag order was not maintained. So I was unable to identify which attributes of account belongs to which account. Is there any way out to maintain order? or tokenize the field name so that primary key can be appended (rec.updated_rec.account.1.loc) however still be able to search rec.updated_rec.account.loc field... Need some suggestion.. maybe my approach is totally wrong in dealing with this problem. -- View this message in context: http://www.nabble.com/Input-XML-duplicate-fields-uniqueness-tp22042765p22042765.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: delete snapshot??
Hi Noble, But how come I've got a "no space left on device" error?? :( Thanks a lot,

Feb 16 18:28:34 search-07 jsvc.exec[8872]: ataImporter.java:361) Caused by: java.io.IOException: No space left on device
    at java.io.RandomAccessFile.writeBytes(Native Method)
    at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
    at org.apache.lucene.store.FSDirectory$FSIndexOutput.flushBuffer(FSDirectory.java:679)
    at org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:96)
    at org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:85)
    at org.apache.lucene.store.BufferedIndexOutput.seek(BufferedIndexOutput.java:124)
    at org.apache.lucene.store.FSDirectory$FSIndexOutput.seek(FSDirectory.java:704)
    at org.apache.lucene.index.TermInfosWriter.close(TermInfosWriter.java:220)
    at org.apache.lucene.index.FormatPostingsFieldsWriter.finish(FormatPostingsFieldsWriter.java:70)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:494)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:141)
    at org.apache.lucene.index.IndexW

Noble Paul നോബിള് नोब्ळ् wrote: they are just hardlinks. they do not consume space on disk On Mon, Feb 16, 2009 at 10:34 PM, sunnyfr johanna...@gmail.com wrote: Hi, Ok but can I use it more often then every day like every three hours, because snapshot are quite big. Thanks a lot, Bill Au wrote: The --delete option of the rsync command deletes extraneous files from the destination directory. It does not delete Solr snapshots. To do that you can use the snapcleaner on the master and/or slave. Bill On Fri, Feb 13, 2009 at 10:15 AM, sunnyfr johanna...@gmail.com wrote: root 26834 16.2 0.0 19412 824 ?S16:05 0:08 rsync -Wa --delete rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ /data/solr/books/data/snapshot.20090213160051-wip Hi obviously it can't delete them because the adress is bad it shouldnt be : rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ but: rsync://##.##.##.##:18180/solr/books/snapshot.20090213160051/ Where should I change this, I checked my script.conf on the slave server but it seems good. Because files can be very big and my server in few hours is getting full. So actually snapcleaner is not necessary on the master ? what about the slave? Thanks a lot, Sunny -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p21998333.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p22041332.html Sent from the Solr - User mailing list archive at Nabble.com. -- --Noble Paul -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p22044788.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Input XML duplicate fields uniqueness
On Mon, Feb 16, 2009 at 11:47 PM, Adi_Jinx rohit_wa...@yahoo.com wrote: I was able to index it. Just put this single xml and searched based on rec.id and response xml returned however input xml tag order was not maintained. So I was unable to identify which attributes of account belongs to which account. Is there any way out to maintain order? or tokenize the field name so that primary key can be appended (rec.updated_rec.account.1.loc) however still be able to search rec.updated_rec.account.loc field... How about creating a Solr document for each account and adding the recid and updt attributes from the record tag? -- Regards, Shalin Shekhar Mangar.
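A minimal SolrJ sketch of that approach -- one document per account, with the record-level attributes copied onto each. The field names follow the original mail, but the schema (and the composite "id" used as the unique key) is just an assumption for illustration:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexAccounts {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // One Solr document per <account>, carrying the parent <rec> attributes.
            SolrInputDocument acc1 = new SolrInputDocument();
            acc1.addField("id", "1-1");            // hypothetical unique key: rec.id + account.id
            acc1.addField("rec.id", "1");
            acc1.addField("rec.updt", "12-Feb-2009");
            acc1.addField("account.id", "1");
            acc1.addField("account.loc", "NJ");
            acc1.addField("account.updated_item.loc.new", "NJ");
            acc1.addField("account.updated_item.loc.old", "CP");

            SolrInputDocument acc2 = new SolrInputDocument();
            acc2.addField("id", "1-2");
            acc2.addField("rec.id", "1");
            acc2.addField("rec.updt", "12-Feb-2009");
            acc2.addField("account.id", "2");
            acc2.addField("account.loc", "KL");

            server.add(acc1);
            server.add(acc2);
            server.commit();
        }
    }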
can the TermsComponent be used in combination with fq?
We have been trying to figure out how to construct, for example, a directory page with an overview of available facets for several fields. Looking at the wiki and the issue:

http://wiki.apache.org/solr/TermsComponent
https://issues.apache.org/jira/browse/SOLR-877

it would seem like this component would be useful for this. However, we often require that some filtering be applied to search results based on which user is searching (e.g. public vs. private content). Is it possible to apply filtering here, or will we need to do something like running q=*:*&fq=status:1 and then getting facets?

Note also: the wiki page references a tutorial including this /autocomplete path, but I cannot find any trace of such. I was able to get results similar to the examples on the wiki page by adding the following to solrconfig.xml:

  <searchComponent name="terms" class="org.apache.solr.handler.component.TermsComponent" />

  <!-- a request handler utilizing the elevator component -->
  <requestHandler name="/autocomplete" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
    </lst>
    <arr name="components">
      <str>terms</str>
    </arr>
  </requestHandler>

Is this the right way to activate this? Thanks, Peter -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
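For what it's worth, the q=*:*&fq=status:1 fall-back mentioned above can be expressed in SolrJ roughly as follows; the field names ("status", "keywords") and the prefix are placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FilteredFacets {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("status:1");       // only content this user may see
            q.setRows(0);                       // we only want the facet counts
            q.setFacet(true);
            q.addFacetField("keywords");
            q.setFacetMinCount(1);
            q.set("facet.prefix", "so");        // what the user has typed so far

            QueryResponse rsp = server.query(q);
            System.out.println(rsp.getFacetField("keywords").getValues());
        }
    }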
Re: term offsets not returned with tv=true
Your request seems to be fine. Have you reindexed after adding the termOffsets definition to the document field? Koji

Jeffrey Baker wrote: I'm trying to exercise the termOffset functions in the nightly build (2009-02-11) but it doesn't seem to do anything. I have an item in my schema like so:

  <field name="document" type="text" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />

And I attempt this query: qt=tvrh tv=true tv.offsets=true indent=true wt=json facet.mincount=1 facet=true hl=on hl.fl=document hl.mergeContiguous=true hl.requireFieldMatch=true fl=document,id,title,doctype,score hl.usePhraseHighlighter=true hl.snippets=3 hl.fragsize=200 hl.maxAnalyzedChars=1048576 hl.simple.pre=[[[hit] hl.simple.post=[[[/hit] rows=20 q=iphone ... where most of those parameters are irrelevant to this question (I think). The response looks like this:

  termVectors:[ doc-51630,[ uniqueKey,streetevents:2012449], doc-19343,[ uniqueKey,streetevents:1904785], doc-22599,[ uniqueKey,streetevents:1873725], doc-52660,[ uniqueKey,streetevents:2029389], doc-37532,[ uniqueKey,streetevents:1665907], doc-49797,[ uniqueKey,streetevents:1996051], doc-21476,[ uniqueKey,streetevents:1885188], doc-24671,[ uniqueKey,streetevents:1820498], doc-25617,[ uniqueKey,streetevents:1794743], doc-48135,[ uniqueKey,streetevents:1981537], doc-47239,[ uniqueKey,streetevents:1940855], doc-54651,[ uniqueKey,streetevents:2069828], doc-48085,[ uniqueKey,streetevents:1979847], doc-28956,[ uniqueKey,streetevents:1766038], doc-47986,[ uniqueKey,streetevents:1978001], doc-32287,[ uniqueKey,streetevents:1740905], doc-41568,[ uniqueKey,streetevents:1599906], doc-44964,[ uniqueKey,streetevents:1782481], doc-43900,[ uniqueKey,streetevents:1748639], doc-45390,[ uniqueKey,streetevents:1811998],

I guess I was expecting to get some lists of term offsets. Am I doing it wrong? -jwb
Re: Release of solr 1.4 autosuggest
Sorry for budding in on this thread but what value is added by TermComponent when you can use faceting for auto-suggest? And with faceting, you can limit the suggestion by existing words before the word the user is typing by using it for q. ~ David Smiley Pooja Verlani wrote: Hi All, I am interested in TermComponent addition in solr 1.4 ( http://wiki.apache.org/solr/TermsComponent). When should we expect solr 1.4 to be available for use? Also, can this Termcomponent be made available as a plugin for solr 1.3? Kindly reply if you have any idea. Regards, Pooja -- View this message in context: http://www.nabble.com/Release-of-solr-1.4---autosuggest-tp22031697p22047806.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: delete snapshot??
Hi Noble, I maybe don't get something Ok if it's hard link but how come i've not space left on device error and 30G shown on the data folder ?? sorry I'm quite new 6.0G/data/solr/book/data/snapshot.20090216214502 35M /data/solr/book/data/snapshot.20090216195003 12M /data/solr/book/data/snapshot.20090216195502 12K /data/solr/book/data/spellchecker2 36M /data/solr/book/data/snapshot.20090216185502 37M /data/solr/book/data/snapshot.20090216203502 6.0M/data/solr/book/data/index 12K /data/solr/book/data/snapshot.20090216204002 5.8G/data/solr/book/data/snapshot.20090216172020 12K /data/solr/book/data/spellcheckerFile 28K /data/solr/book/data/snapshot.20090216200503 40K /data/solr/book/data/snapshot.20090216194002 24K /data/solr/book/data/snapshot.2009021622 32K /data/solr/book/data/snapshot.20090216184502 20K /data/solr/book/data/snapshot.20090216191004 1.1M/data/solr/book/data/snapshot.20090216213502 1.1M/data/solr/book/data/snapshot.20090216201502 1.1M/data/solr/book/data/snapshot.20090216213005 24K /data/solr/book/data/snapshot.20090216191502 1.1M/data/solr/book/data/snapshot.20090216212503 107M/data/solr/book/data/snapshot.20090216212002 14M /data/solr/book/data/snapshot.20090216190502 32K /data/solr/book/data/snapshot.20090216201002 2.3M/data/solr/book/data/snapshot.20090216204502 28K /data/solr/book/data/snapshot.20090216184002 5.8G/data/solr/book/data/snapshot.20090216181425 44K /data/solr/book/data/snapshot.20090216190001 20K /data/solr/book/data/snapshot.20090216183401 1.1M/data/solr/book/data/snapshot.20090216203002 44K /data/solr/book/data/snapshot.20090216194502 36K /data/solr/book/data/snapshot.20090216185004 12K /data/solr/book/data/snapshot.20090216182720 12K /data/solr/book/data/snapshot.20090216214001 5.8G/data/solr/book/data/snapshot.20090216175106 1.1M/data/solr/book/data/snapshot.20090216202003 5.8G/data/solr/book/data/snapshot.20090216173224 12K /data/solr/book/data/spellchecker1 1.1M/data/solr/book/data/snapshot.20090216202502 30G /data/solr/book/data thanks a lot, Noble Paul നോബിള് नोब्ळ् wrote: they are just hardlinks. they do not consume space on disk On Mon, Feb 16, 2009 at 10:34 PM, sunnyfr johanna...@gmail.com wrote: Hi, Ok but can I use it more often then every day like every three hours, because snapshot are quite big. Thanks a lot, Bill Au wrote: The --delete option of the rsync command deletes extraneous files from the destination directory. It does not delete Solr snapshots. To do that you can use the snapcleaner on the master and/or slave. Bill On Fri, Feb 13, 2009 at 10:15 AM, sunnyfr johanna...@gmail.com wrote: root 26834 16.2 0.0 19412 824 ?S16:05 0:08 rsync -Wa --delete rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ /data/solr/books/data/snapshot.20090213160051-wip Hi obviously it can't delete them because the adress is bad it shouldnt be : rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ but: rsync://##.##.##.##:18180/solr/books/snapshot.20090213160051/ Where should I change this, I checked my script.conf on the slave server but it seems good. Because files can be very big and my server in few hours is getting full. So actually snapcleaner is not necessary on the master ? what about the slave? Thanks a lot, Sunny -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p21998333.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p22041332.html Sent from the Solr - User mailing list archive at Nabble.com. 
-- --Noble Paul -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p22048391.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: delete snapshot??
Hi Noble, I maybe don't get something Ok if it's hard link but how come i've not space left on device error and 30G shown on the data folder ?? sorry I'm quite new 6.0G/data/solr/book/data/snapshot.20090216214502 35M /data/solr/book/data/snapshot.20090216195003 12M /data/solr/book/data/snapshot.20090216195502 12K /data/solr/book/data/spellchecker2 36M /data/solr/book/data/snapshot.20090216185502 37M /data/solr/book/data/snapshot.20090216203502 6.0M/data/solr/book/data/index 12K /data/solr/book/data/snapshot.20090216204002 5.8G/data/solr/book/data/snapshot.20090216172020 12K /data/solr/book/data/spellcheckerFile 28K /data/solr/book/data/snapshot.20090216200503 40K /data/solr/book/data/snapshot.20090216194002 24K /data/solr/book/data/snapshot.2009021622 32K /data/solr/book/data/snapshot.20090216184502 20K /data/solr/book/data/snapshot.20090216191004 1.1M/data/solr/book/data/snapshot.20090216213502 1.1M/data/solr/book/data/snapshot.20090216201502 1.1M/data/solr/book/data/snapshot.20090216213005 24K /data/solr/book/data/snapshot.20090216191502 1.1M/data/solr/book/data/snapshot.20090216212503 107M/data/solr/book/data/snapshot.20090216212002 14M /data/solr/book/data/snapshot.20090216190502 32K /data/solr/book/data/snapshot.20090216201002 2.3M/data/solr/book/data/snapshot.20090216204502 28K /data/solr/book/data/snapshot.20090216184002 5.8G/data/solr/book/data/snapshot.20090216181425 44K /data/solr/book/data/snapshot.20090216190001 20K /data/solr/book/data/snapshot.20090216183401 1.1M/data/solr/book/data/snapshot.20090216203002 44K /data/solr/book/data/snapshot.20090216194502 36K /data/solr/book/data/snapshot.20090216185004 12K /data/solr/book/data/snapshot.20090216182720 12K /data/solr/book/data/snapshot.20090216214001 5.8G/data/solr/book/data/snapshot.20090216175106 1.1M/data/solr/book/data/snapshot.20090216202003 5.8G/data/solr/book/data/snapshot.20090216173224 12K /data/solr/book/data/spellchecker1 1.1M/data/solr/book/data/snapshot.20090216202502 30G /data/solr/book/data thanks a lot, Noble Paul നോബിള് नोब्ळ् wrote: they are just hardlinks. they do not consume space on disk On Mon, Feb 16, 2009 at 10:34 PM, sunnyfr johanna...@gmail.com wrote: Hi, Ok but can I use it more often then every day like every three hours, because snapshot are quite big. Thanks a lot, Bill Au wrote: The --delete option of the rsync command deletes extraneous files from the destination directory. It does not delete Solr snapshots. To do that you can use the snapcleaner on the master and/or slave. Bill On Fri, Feb 13, 2009 at 10:15 AM, sunnyfr johanna...@gmail.com wrote: root 26834 16.2 0.0 19412 824 ?S16:05 0:08 rsync -Wa --delete rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ /data/solr/books/data/snapshot.20090213160051-wip Hi obviously it can't delete them because the adress is bad it shouldnt be : rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ but: rsync://##.##.##.##:18180/solr/books/snapshot.20090213160051/ Where should I change this, I checked my script.conf on the slave server but it seems good. Because files can be very big and my server in few hours is getting full. So actually snapcleaner is not necessary on the master ? what about the slave? Thanks a lot, Sunny -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p21998333.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p22041332.html Sent from the Solr - User mailing list archive at Nabble.com. 
Re: Release of solr 1.4 autosuggest
On Feb 16, 2009, at 6:13 PM, David Smiley @MITRE.org wrote: Sorry for butting in on this thread, but what value is added by TermsComponent when you can use faceting for auto-suggest? Yeah, you can do auto-suggest w/ faceting, no doubt. In fact the TermsComponent could just as well be called Term Faceting or something like that. I mostly wrote the TermsComponent to expose Lucene's underlying TermEnum and thought the auto-suggest would be a bonus. And with faceting, you can limit the suggestion by existing words before the word the user is typing by using it for q. Not sure I follow, but the whole point of auto-suggest is to limit by existing words, right? The TermsComponent uses Lucene's internal TermEnum to return results without any of the other machinery related to faceting. And, of course, you would only ask for terms beginning with the word that is being typed. I haven't tested whether it is faster or not, but I do know there is a fair amount less code involved, so it _might_ be. It would be good to do some performance comparisons. -Grant -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
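For anyone comparing the two approaches, the requests look roughly like this. It assumes a Solr 1.4 build with a request handler for the TermsComponent registered at /terms and a field named "suggest" holding the candidate terms; both names are illustrative, not from the thread above.

TermsComponent:
  http://localhost:8983/solr/terms?terms=true&terms.fl=suggest&terms.prefix=ipo&terms.limit=10

Faceting:
  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=suggest&facet.prefix=ipo&facet.limit=10

Both return terms starting with what the user has typed so far; the faceting variant additionally lets you constrain q or fq to the words the user has already entered, which is the point David raises above.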
Re: Input XML duplicate fields uniqueness
Shalin Shekhar Mangar wrote: On Mon, Feb 16, 2009 at 11:47 PM, Adi_Jinx rohit_wa...@yahoo.com wrote: How about creating a Solr document for each account and adding the recid and updt attributes from the record tag? -- Regards, Shalin Shekhar Mangar. However, then I would need to allow duplicates for my unique key, which is rec.id. My purpose is to track account changes, and somebody should be able to query them. The XML posted here only has entity_updated; I have added and deleted as well. In that case I may have to post 4-5 docs with the same rec.id. Is there any other way out?
Re: Distributed search
Hi, That should work, yes, though it may not be a wise thing to do performance-wise if the number of CPU cores that Solr server has is lower than the number of Solr cores. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: revathy arun revas...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, February 16, 2009 8:18:36 PM Subject: Distributed search Hi, Can we use multicore to have several indexes per webapp and use distributed search to merge the indexes? For example, if we have 3 cores - core0, core1 and core2 - for 3 different languages, can we search across all 3 indexes using the shards parameter, as in shards=localhost:8080/solr/core0,localhost:8080/solr/core1,localhost:8080/solr/core2 Regards Sujatha
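Spelled out as a full request (host, port and query string are placeholders), the setup described above looks roughly like this: send the query to any one core and list all of them in the shards parameter. All cores need a compatible schema with the same uniqueKey field for the merged results to make sense.

  http://localhost:8080/solr/core0/select?q=some+query&shards=localhost:8080/solr/core0,localhost:8080/solr/core1,localhost:8080/solr/core2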
Re: indexing Chienese langage
Hi, While some of the characters in simplified and traditional Chinese do differ, the Chinese tokenizer doesn't care - it simply creates ngram tokens. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: revathy arun revas...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, February 16, 2009 4:30:47 PM Subject: indexing Chienese langage Hi, When I index chinese content using chinese tokenizer and analyzer in solr 1.3 ,some of the chinese text files are getting indexed but others are not. Since chinese has got many different language subtypes as in standard chinese,simplified chinese etc which of these does the chinese tokenizer support and is there any method to find the type of chiense language from the file? Rgds
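For reference, a minimal schema.xml field type along these lines might look like the following; the type and field names are made up for illustration, and solr.CJKTokenizerFactory (which emits overlapping bigrams) is just one of the CJK-capable tokenizers that ships with Solr.

  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
  </fieldType>

  <field name="content_cjk" type="text_cjk" indexed="true" stored="true"/>

Because the tokens are character n-grams rather than dictionary words, the same analyzer handles simplified and traditional text alike.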
Re: Multilanguage
Hi, The best option would be to identify the language after parsing the PDF and then index it using an appropriate analyzer defined in schema.xml. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: revathy arun revas...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, February 16, 2009 1:42:07 PM Subject: Multilanguage Hi, I have a scenario where I need to convert PDF content to text and then index it at run time. I do not know what language the PDF will be in. In this case, what is the best choice for the content field type in the schema the text would be indexed into? That is, can I use the default tokenizer for all languages? And since I would not know the language, and hence would not be able to stem the tokens, how would this impact search? Is there any other solution for the same? Rgds
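A minimal sketch of that approach, using Tika's LanguageIdentifier purely as an example of a language detector and assuming per-language fields (text_en, text_de, ...) declared in schema.xml with matching analyzers; the class, field names and server URL are all illustrative:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.language.LanguageIdentifier;

public class LanguageRouter {
    // Detect the language of the extracted PDF text and put it into a
    // language-specific field so the right analyzer is applied at index time.
    public static void index(SolrServer server, String id, String extractedText) throws Exception {
        String lang = new LanguageIdentifier(extractedText).getLanguage(); // e.g. "en", "de", "fr"
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("text_" + lang, extractedText); // schema.xml defines text_en, text_de, ... with per-language analyzers
        server.add(doc);
    }

    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        index(server, "doc1", "Der schnelle braune Fuchs springt über den faulen Hund.");
        server.commit();
    }
}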
Re: Outofmemory error for large files
Siddharth, At the end of your email you said: One option I see is to break the file in chunks, but with this, I won't be able to search with multiple words if they are distributed in different documents. Unless I'm missing something unusual about your application, I don't think the above is technically correct. Have you tried doing this and have you then tried your searches? Everything should still work, even if you index one document at a time. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
From: Gargate, Siddharth sgarg...@ptc.com To: solr-user@lucene.apache.org Sent: Monday, February 16, 2009 2:00:58 PM Subject: Outofmemory error for large files
I am trying to index a 150 MB text file with a 1024 MB max heap, but I get an OutOfMemoryError in the SolrJ code:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2882)
 at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
 at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
 at java.lang.StringBuffer.append(StringBuffer.java:320)
 at java.io.StringWriter.write(StringWriter.java:60)
 at org.apache.solr.common.util.XML.escape(XML.java:206)
 at org.apache.solr.common.util.XML.escapeCharData(XML.java:79)
 at org.apache.solr.common.util.XML.writeXML(XML.java:149)
 at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:115)
 at org.apache.solr.client.solrj.request.UpdateRequest.writeXML(UpdateRequest.java:200)
 at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:178)
 at org.apache.solr.client.solrj.request.UpdateRequest.getContentStreams(UpdateRequest.java:173)
 at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:136)
 at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:243)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)
I modified the UpdateRequest class to initialize the StringWriter object in UpdateRequest.getXML with an initial size, and cleared the SolrInputDocument that holds the reference to the file text. Then I get an OOM as below:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2786)
 at java.lang.StringCoding.safeTrim(StringCoding.java:64)
 at java.lang.StringCoding.access$300(StringCoding.java:34)
 at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:251)
 at java.lang.StringCoding.encode(StringCoding.java:272)
 at java.lang.String.getBytes(String.java:947)
 at org.apache.solr.common.util.ContentStreamBase$StringStream.getStream(ContentStreamBase.java:142)
 at org.apache.solr.common.util.ContentStreamBase$StringStream.getReader(ContentStreamBase.java:154)
 at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:61)
 at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
 at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
 at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:249)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)
After I increase the heap size up to 1250 MB, I get an OOM as:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOfRange(Arrays.java:3209)
 at java.lang.String.<init>(String.java:216)
 at java.lang.StringBuffer.toString(StringBuffer.java:585)
 at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:403)
 at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821)
 at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:276)
 at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
 at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
 at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
 at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
 at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:249)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)
So it looks like I won't be able to get out of these OOMs. Is there any way to avoid them? One option I see is to break the file in chunks, but with this, I won't be able to search with multiple words if they are distributed in different documents. Also, can somebody tell me the
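As a rough back-of-the-envelope check (an assumption about this particular setup, not something stated in the thread), the numbers are consistent with the stack traces above: 150 MB of text held as a single Java String is about 300 MB of UTF-16 char data; ClientUtils.writeXML then builds an escaped copy in a StringWriter (another ~300 MB, plus transient arrays while the underlying StringBuffer doubles its capacity); and StringStream.getStream calls String.getBytes, producing yet another ~150 MB byte array. Peak usage well over 1 GB for one 150 MB document is therefore expected, which is why splitting the file into smaller documents, as discussed later in this thread, is the usual way out.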
Re: Word Locations Search Components
Hi, Wouldn't this be as easy as:
- split each email into paragraphs
- for each paragraph compute a signature (MD5 or something fuzzier, like in SOLR-799)
- for each signature look for other emails with the same signature
- when you find an email with an identical signature, you know you've found the banner
I'd do this in a pre-processing phase. You may have to add special logic for '>' and other email-quoting characters. Perhaps you can make use of the assumption that banners always come at the end of emails. Perhaps you can make use of situations where the banner appears multiple times in a single email (the one with lots of back-and-forth replies, for example). This is similar to MoreLikeThis at the paragraph level. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: Johnny X jonathanwel...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, February 16, 2009 11:05:40 PM Subject: Re: Word Locations Search Components Basically I'm working on the Enron dataset, and I've already de-duplicated the collection and applied a spam filter. All the e-mails after this have been parsed to XML and each field (so To, From, Date etc.) has been separated, along with one large field for the remaining e-mail content (called Content). So yes, to answer your question. Bearing in mind, though, this still represents around 240,000-ish files to compute. I have no idea about Solr analyzers/search components, but my theory was that I'd need an analyzer to remove 'banner-like' content from being indexed and a search component to identify 'corporate-like' information in the content of the e-mails. What is a business logic solution and how would that work? Thanks. zayhen wrote: I would go for a business logic solution and not a Solr customization in this case, as you need to filter information that you actually would like to see in different fields on your index. Did you already try to split the email into several fields like subject, from, to, content, signature, etc.? 2009/2/16 Johnny X jonathanwel...@gmail.com Hi there, I was told before that I'd need to create a custom search component to do what I want to do, but I'm thinking it might actually be a custom analyzer. Basically, I'm indexing e-mail in XML in Solr and searching the 'content' field which is parsed as 'text'. I want to ignore certain elements of the e-mail (i.e. corporate banners), but also identify the actual content of those e-mails including corporate information. To identify the banners I need something a little more developed than a stop word list. I need to evaluate the frequency of certain words around words like 'privileged' and 'corporate' within a word window of about 100-ish words to determine whether they're banners and then remove them from being indexed. I need to do the opposite at the same time to identify, in a similar manner, which e-mails include corporate information in their actual content. I suppose if I'm doing this I don't want what's processed to be indexed as what's returned in a search, because then presumably it won't be the full e-mail, so do I need to store some kind of copy field that keeps the full e-mail and is fully indexed to be returned instead? Can what I'm suggesting be done, and can anyone direct me to a guide? On another note, is there an easy way to destroy an index... any custom code? Thanks for any help!
-- Alexander Ramos Jardim - RPG da Ilha
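A minimal sketch of the signature-per-paragraph idea described above, assuming the plain-text email bodies are already available as strings; splitting paragraphs on blank lines, the whitespace normalization, and the choice of MD5 are all illustrative, not part of the original posts:

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BannerFinder {
    // Count how often each paragraph signature occurs across the corpus.
    // Paragraphs whose signature shows up in many emails are banner candidates.
    public static Map<String, Integer> signatureCounts(List<String> emails) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String email : emails) {
            for (String para : email.split("\n\\s*\n")) {                 // one signature per paragraph
                String normalized = para.toLowerCase().replaceAll("\\s+", " ").trim();
                if (normalized.length() == 0) continue;
                String sig = new BigInteger(1, md5.digest(normalized.getBytes("UTF-8"))).toString(16);
                Integer c = counts.get(sig);
                counts.put(sig, c == null ? 1 : c + 1);                   // high count => likely banner text
            }
        }
        return counts;
    }
}

High-count paragraphs can then be stripped before the documents are posted to Solr, while the untouched email is kept in a stored-only field so searches still return the full message.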
RE: Outofmemory error for large files
Otis, I haven't tried it yet, but what I meant is: if we divide the content into multiple parts, then words will be split across different Solr documents. If the main document contains 'Hello World', those two words might get indexed in two different documents, and searching for 'Hello World' won't give me the required result unless I use OR in the query. Thanks, Siddharth -----Original Message----- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Tuesday, February 17, 2009 9:58 AM To: solr-user@lucene.apache.org Subject: Re: Outofmemory error for large files
Re: Outofmemory error for large files
Siddharth, But does your 150MB file represent a single Document? That doesn't sound right. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: Gargate, Siddharth sgarg...@ptc.com To: solr-user@lucene.apache.org Sent: Tuesday, February 17, 2009 12:39:53 PM Subject: RE: Outofmemory error for large files
How to fetch all matching records :urgent
Hello, I am using the getResults method of the QueryResponse class on a keyword that has more than a hundred matching records, but this method returns only 10 results and then throws an array index out of bounds exception. How can I fetch all the results? It's really important and urgent for me; kindly reply. Neha Bhardwaj | Software Engineer | Persistent Systems Limited. mailto:akshat_maheshw...@persistent.co.in neha_bhard...@persistent.co.in | Cell: +91 9272383082 | Tel: +91 (20) 302 35257
Re: How to fetch all matching records :urgent
Increment the start value by 10 and make another request. wunder On 2/16/09 9:13 PM, Neha Bhardwaj neha_bhard...@persistent.co.in wrote: how can I fetch all the results?
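In SolrJ terms, that paging loop might look something like the sketch below; the server URL, query string and field name are placeholders, and rows can be raised if larger pages are acceptable:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class FetchAll {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        int start = 0;
        int rows = 10;
        long numFound;
        do {
            SolrQuery q = new SolrQuery("keyword").setStart(start).setRows(rows);
            QueryResponse rsp = server.query(q);
            SolrDocumentList page = rsp.getResults();   // holds at most 'rows' docs
            numFound = page.getNumFound();              // total number of matches
            for (SolrDocument doc : page) {
                System.out.println(doc.getFieldValue("id"));
            }
            start += rows;
        } while (start < numFound);
    }
}

Alternatively, a first request with rows=0 returns numFound without any documents, after which a single request with rows set to that value fetches everything, though that can be very memory-hungry for large result sets.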
Re: delete snapshot??
The hard links prevent the unused index files from being cleaned up, so disk space is still consumed by them. You may need to delete unused snapshots from time to time. --Noble
-- --Noble Paul
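For reference, the cleanup Noble describes is what the snapcleaner script in the standard Solr 1.3 collection distribution scripts is for. A hedged example, with illustrative paths and retention values and the data directory taken from the listing earlier in this thread (check snapcleaner's usage output on your installation, since the exact flags may differ):

  # keep only the 4 most recent snapshots
  /opt/solr/bin/snapcleaner -d /data/solr/book/data -N 4

  # or, from cron every three hours, remove snapshots older than 1 day
  0 */3 * * * /opt/solr/bin/snapcleaner -d /data/solr/book/data -D 1

Run it on whichever machines accumulate snapshots: typically the master, and the slaves as well if snapshots are pulled there.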
Re: Outofmemory error for large files
On Tue, Feb 17, 2009 at 10:26 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Siddharth, But does your 150MB file represent a single Document? That doesn't sound right. Otis, SolrJ writes the whole XML in memory before sending it to the server. That may be one reason behind Siddharth's OOME. See https://issues.apache.org/jira/browse/SOLR-973 -- Regards, Shalin Shekhar Mangar.
Re: Outofmemory error for large files
Right. But I was trying to point out that a single 150MB Document is not in fact what the o.p. wants to do. For example, if your 150MB represents, say, a whole book, should that really be a single document? Or should individual chapters be separate documents, for example? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: Shalin Shekhar Mangar shalinman...@gmail.com To: solr-user@lucene.apache.org Sent: Tuesday, February 17, 2009 2:48:08 PM Subject: Re: Outofmemory error for large files On Tue, Feb 17, 2009 at 10:26 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Siddharth, But does your 150MB file represent a single Document? That doesn't sound right. Otis, SolrJ writes the whole XML in memory before sending it to the server. That may be one reason behind Siddharth's OOME. See https://issues.apache.org/jira/browse/SOLR-973 -- Regards, Shalin Shekhar Mangar.
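A rough SolrJ sketch of that approach, splitting one large text file into many smaller Solr documents; the chunk size, field names and the id_chunkNo key scheme are illustrative assumptions, and a real implementation would probably split on chapter or paragraph boundaries so phrases are not cut in half:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ChunkedIndexer {
    private static final int CHUNK_CHARS = 500000; // roughly 1 MB of text per document

    public static void index(SolrServer server, String id, File file) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"));
        try {
            char[] buf = new char[CHUNK_CHARS];
            int chunkNo = 0;
            int read;
            while ((read = in.read(buf)) != -1) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", id + "_" + chunkNo++);           // uniqueKey per chunk
                doc.addField("filename", file.getName());           // lets hits be grouped back to the source file
                doc.addField("content", new String(buf, 0, read));  // naive split; use paragraph boundaries in practice
                server.add(doc);
            }
            server.commit();
        } finally {
            in.close();
        }
    }
}

Phrase and multi-word queries still match within each chunk, and storing the source file name on every chunk makes it easy to group hits back to the original file at query time.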
Re: Need help with DictionaryCompoundWordTokenFilterFactory
Ralf, Not sure if you got this working or not, but perhaps a simple solution is changing the default boolean operator from OR to AND. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: Kraus, Ralf | pixelhouse GmbH r...@pixelhouse.de To: solr-user@lucene.apache.org Sent: Friday, February 6, 2009 6:23:51 PM Subject: Need help with DictionaryCompoundWordTokenFilterFactory Hi, Now I have run into another problem using solr.DictionaryCompoundWordTokenFilterFactory :-( If I search for the German word Spargelcremesuppe, which contains Spargel, Creme and Suppe, Solr finds way too many results. That's because Solr finds EVERY entry with any one of the three words in it :-( Here is my schema.xml:

<fieldType name="text_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>

Any help? Greets, Ralf Kraus
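If changing the default operator is the way to go, the usual places to do it are schema.xml (globally) or the q.op request parameter (per query). The snippet below is a generic illustration, not taken from Ralf's configuration:

  <!-- schema.xml: make multi-term queries require all terms by default -->
  <solrQueryParser defaultOperator="AND"/>

  # or per request
  http://localhost:8983/solr/select?q=Spargelcremesuppe&q.op=AND

With the compound-word filter also applied at query time, the decompounded parts become separate query terms; under the default OR, any one of them is enough to match, which is the behaviour described above, while AND requires all of them.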