Indexing Chinese language
Hi, When I index Chinese content using the Chinese tokenizer and analyzer in Solr 1.3, some of the Chinese text files are getting indexed but others are not. Since Chinese has many different language subtypes, such as Standard Chinese, Simplified Chinese, etc., which of these does the Chinese tokenizer support, and is there any method to find the type of Chinese language from the file? Rgds
DIH transformers
Hello. I have been beating my head against the data-config.xml listed at the end of this message. It breaks in a few different ways.

1) I have bodged TemplateTransformer to allow it to return when one of the variables is undefined. This ensures my uniqueKey is always defined. But thinking more on Noble's comments, there is use in having it work both ways, i.e. leaving the column undefined or replacing the variable with an empty string. I still like my idea about using the default value of a Solr field from schema.xml, but I can't figure out how/where to best implement it.

2) Having used TemplateTransformer to assign a value to an entity column, that column cannot be used in other TemplateTransformer operations. In my project I am attempting to reuse x.fileWebPath. To fix this, the last line of transformRow() in TemplateTransformer.java needs to be replaced with the following, which as well as 'putting' the templated string into 'row' also saves it into the 'resolver'.

    **originally**
    row.put(column, resolver.replaceTokens(expr));
    }

    **new**
    String columnName = map.get(DataImporter.COLUMN);
    expr = resolver.replaceTokens(expr);
    row.put(columnName, expr);
    resolverMapCopy.put(columnName, expr);
    }

As an aside, I think I ran into the issues covered by SOLR-993. It took a while to figure out that I could not add a single columnname/value to the resolver; I had instead to add to the map that was already stored within the resolver.

3) No entity column names can be used within RegexTransformer. I guess all the stuff that was added to TemplateTransformer to allow column names to be used in templates needs to be re-added to RegexTransformer. I am doing that now... but am confused by the fragment of code which copies from resolverMap into resolverMapCopy. As best I can see, resolverMap is always empty; but I am barely able to follow the code! Can somebody explain when/why resolverMap would be populated. Also, I begin to understand the comments made by Noble in SOLR-1001 about resolving entity attributes in ContextImpl.getEntityAttribute, and I guess Shalin was right as well. However it also seems wrong that at the top of every transformer we are going to repeat the same code to load the resolver with information about the entity.

4) In that I am reusing template output within other templates, the order of execution becomes important. Can I assume that the explicitly listed columns in an entity are processed by the various transformers in the order they appear within data-config.xml? I *think* that the list of columns within an entity as returned by getAllEntityFields() is actually an ArrayList, which I think is order dependent. Is this correct?

5) Should I raise this as a single JIRA issue?

6) Having played with this stuff, I was going to add a bit more to the wiki highlighting some of the possibilities and issues with transformers. But want to check with the list first!
<dataConfig>
  <dataSource name="myfilereader" type="FileDataSource"/>
  <document>
    <entity name="jc" processor="FileListEntityProcessor" fileName="^.*\.xml$" newerThan="'NOW-1000DAYS'" recursive="true" rootEntity="false" dataSource="null" baseDir="/Volumes/spare/ts/solr/content">
      <entity name="x" dataSource="myfilereader" processor="XPathEntityProcessor" url="${jc.fileAbsolutePath}" rootEntity="true" stream="false" forEach="/record | /record/mediaBlock" transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer">
        <field column="fileAbsolutePath" template="${jc.fileAbsolutePath}" />
        <field column="fileWebPath" regex="${x.test}(.*)" replaceWith="/ford$1" sourceColName="fileAbsolutePath"/>
        <field column="title" xpath="/record/title" />
        <field column="para1" name="para" xpath="/record/sect1/para" />
        <field column="para2" name="para" xpath="/record/list/listitem/para" />
        <field column="pubdate" xpath="/record/metadata/da...@qualifier='pubDate']" dateTimeFormat="MMdd" />
        <field column="vurl" xpath="/record/mediaBlock/mediaObject/@vurl" />
        <field column="imgSrcArticle" template="${dataimporter.request.fordinstalldir}" />
        <field column="imgCpation" xpath="/record/mediaBlock/caption" />
        <field column="test" template="${dataimporter.request.contentinstalldir}" />
        <!-- **problem is that vurl is just a fragment of the info needed to access the picture. -->
        <field column="imgWebPathICON" regex="(.*)/.*" replaceWith="$1/imagery/${x.vurl}s.jpg" sourceColName="fileWebPath"/>
        <field column="imgWebPathFULL" regex="(.*)/.*"
Re: spellcheck.onlyMorePopular
Shalin Shekhar Mangar wrote:
The implementation is a bit more complicated.
1. Read all tokens from the specified field in the Solr index.
2. Create n-grams of the terms read in #1 and index them into a separate Lucene index (the spellcheck index).
3. When asked for suggestions, create n-grams of the query terms, search the spellcheck index and collect the top (by Lucene score) 10*spellcheck.count results.
4. If onlyMorePopular=true, determine the frequency of each result in the Solr index and remove terms which have a lower frequency.
5. Compute the edit distance between the result and the query token.
6. Return the top spellcheck.count results (sorted by edit distance descending) which are greater than the specified accuracy.

Thanks, I think this makes things clear(er) now. I do agree that the documentation needs improvement on this point, as you said later in this thread. :)

Your primary use-case is not spellcheck at all but this might work with some hacking. Fuzzy queries may be a better solution, as Walter said. Storing all successful search queries may be hard to scale.

This is certainly true. The drawback of fuzzy searching is that you get back exact and fuzzy hits together in one result set (correct me if I'm wrong). One could filter out the exact/fuzzy hits, but this would make paging impossible. The approach using KeywordTokenizer as you suggested before seems more promising to me. Unfortunately there seems to be no documentation for this (at least in conjunction with spell checking). If I understand this rightly, the tokenizer must be applied to the field in the search index (not the spell checking index). Is that correct? Thanks, Marcus
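For reference, the steps described above correspond closely to Lucene's contrib SpellChecker, which Solr's spellcheck component builds on. A minimal sketch of that underlying API; the index paths and the field name "title" are made up for illustration:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.spell.LuceneDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.FSDirectory;

    public class SpellcheckSketch {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.getDirectory("/path/to/solr/data/index"));
            // Steps 1-2: read the terms of the field and build the n-gram spellcheck index.
            SpellChecker spell = new SpellChecker(FSDirectory.getDirectory("/path/to/spellcheck-index"));
            spell.indexDictionary(new LuceneDictionary(reader, "title"));
            // Steps 3-6: the last argument is the onlyMorePopular flag; when true,
            // only terms that are more frequent in the main index than the query
            // term are returned.
            String[] suggestions = spell.suggestSimilar("beleive", 10, reader, "title", true);
            for (String s : suggestions) {
                System.out.println(s);
            }
            reader.close();
        }
    }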
Re: almost realtime updates with replication
Hi Hoss, Is it a problem if the snappuller miss one snapshot before the last one ?? Cheer, Have a nice day, hossman wrote: : : There are a couple queries that we would like to run almost realtime so : I would like to have it so our client sends an update on every new : document and then have solr configured to do an autocommit every 5-10 : seconds. : : reading the Wiki, it seems like this isn't possible because of the : strain of snapshotting and pulling to the slaves at such a high rate. : What I was thinking was for these few queries to just query the master : and the rest can query the slave with the not realtime data, although : I'm assuming this wouldn't work either because since a snapshot is : created on every commit, we would still impact the performance too much? there is no reason why a commit has to trigger a snapshot, that happens only if you configure a postCommit hook to do so in your solrconfig.xml you can absolutely commit every 5 seconds, but have a seperate cron task that runs snapshooter ever 5 minutes -- you could even continue to run snapshooter on every commit, and get a new snapshot ever 5 seconds, but only run snappuller on your slave machines ever 5 minutes (the snapshots are hardlinks and don't take up a lot of space, and snappuller only needs to fetch the most recent snapshot) your idea of querying the msater directly for these queries seems perfectly fine to me ... just make sure the auto warm count on the caches on your master is very tiny so the new searchers are ready quickly after each commit. -Hoss -- View this message in context: http://www.nabble.com/almost-realtime-updates-with-replication-tp12276614p22034406.html Sent from the Solr - User mailing list archive at Nabble.com.
snapshot created if there is no document updated/new?
Hi, I would like to know if a snapshot is automatically created even if there is no document updated or added? Thanks a lot, -- View this message in context: http://www.nabble.com/snapshot-created-if-there-is-no-documente-updated-new--tp22034462p22034462.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: almost realtime updates with replication
I guess , it should not be a problem --Noble On Mon, Feb 16, 2009 at 3:28 PM, sunnyfr johanna...@gmail.com wrote: Hi Hoss, Is it a problem if the snappuller miss one snapshot before the last one ?? Cheer, Have a nice day, hossman wrote: : : There are a couple queries that we would like to run almost realtime so : I would like to have it so our client sends an update on every new : document and then have solr configured to do an autocommit every 5-10 : seconds. : : reading the Wiki, it seems like this isn't possible because of the : strain of snapshotting and pulling to the slaves at such a high rate. : What I was thinking was for these few queries to just query the master : and the rest can query the slave with the not realtime data, although : I'm assuming this wouldn't work either because since a snapshot is : created on every commit, we would still impact the performance too much? there is no reason why a commit has to trigger a snapshot, that happens only if you configure a postCommit hook to do so in your solrconfig.xml you can absolutely commit every 5 seconds, but have a seperate cron task that runs snapshooter ever 5 minutes -- you could even continue to run snapshooter on every commit, and get a new snapshot ever 5 seconds, but only run snappuller on your slave machines ever 5 minutes (the snapshots are hardlinks and don't take up a lot of space, and snappuller only needs to fetch the most recent snapshot) your idea of querying the msater directly for these queries seems perfectly fine to me ... just make sure the auto warm count on the caches on your master is very tiny so the new searchers are ready quickly after each commit. -Hoss -- View this message in context: http://www.nabble.com/almost-realtime-updates-with-replication-tp12276614p22034406.html Sent from the Solr - User mailing list archive at Nabble.com. -- --Noble Paul
Re: facet count on partial results
On 15 Feb 2009, at 20:15, Yonik Seeley wrote: On Sat, Feb 14, 2009 at 6:45 AM, karl wettin karl.wet...@gmail.com wrote: Also, as my threadshold is based on the distance in score between the first result it sounds like using a result start position greater than 0 is something I have to look out for. Or? Hmmm - this isn't that easy in general as it requires knowledge of the max score, right?

Hmmm indeed. Does Solr not collect 0-20 even though the request is for 10-20? Wouldn't it then be possible to inject some code that limits the DocSet at that layer?

There is more. Not important, but a nice thing to get: I create multiple documents per entity from my primary data source (e.g. each entity a book and each document a paragraph from the book) but I only want to present the top scoring document per entity. I handle this with client-side post-processing of the results. This means that I potentially get facet counts from documents that I actually don't present to the user. It would be nice to handle this in the same layer as my score threshold restriction, but it would require loading the primary key from the document rather early. And it would also mean that even though I might get 2000 results within the threshold, the actual number of results I want to pass on to the client is a lot less than that. I.e. I'll have to request more results than I want in order to ensure I get enough even after filtering out documents that point at an entity already a member of the result list but with a greater score.

The question is if I can fit all this stuff in the same layer as the by-score threshold result set limiter. I'm rather lost in the Solr code. Pointers to class and method names are most welcome. karl
Re: DIH transformers
On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie fer...@twig.me.uk wrote:

Hello. I have been beating my head against the data-config.xml listed at the end of this message. It breaks in a few different ways. 1) I have bodged TemplateTransformer to allow it to return when one of the variables is undefined. This ensures my uniqueKey is always defined. But thinking more on Noble's comments, there is use in having it work both ways, i.e. leaving the column undefined or replacing the variable with an empty string. I still like my idea about using the default value of a Solr field from schema.xml, but I can't figure out how/where to best implement it.

When a value is missing from the template, we may end up constructing a partial string, which may not be desired. If we leave it out as empty, then Solr would automatically put in the default value and it should be solved. Just in case you wish to know the default value from schema.xml, you can get it from the API:

    fields = context.getAllEntityFields();
    String defval = fields.get(0).get("defaultvalue");

2) Having used TemplateTransformer to assign a value to an entity column, that column cannot be used in other TemplateTransformer operations. In my project I am attempting to reuse x.fileWebPath. To fix this, the last line of transformRow() in TemplateTransformer.java needs to be replaced with the following, which as well as 'putting' the templated string into 'row' also saves it into the 'resolver'.

    **originally**
    row.put(column, resolver.replaceTokens(expr));
    }

    **new**
    String columnName = map.get(DataImporter.COLUMN);
    expr = resolver.replaceTokens(expr);
    row.put(columnName, expr);
    resolverMapCopy.put(columnName, expr);
    }

Isn't it better to write a custom transformer to achieve this? I did not want a standard component to change the state of the VariableResolver. I am not sure what the best way is.

As an aside, I think I ran into the issues covered by SOLR-993. It took a while to figure out that I could not add a single columnname/value to the resolver; I had instead to add to the map that was already stored within the resolver.

3) No entity column names can be used within RegexTransformer. I guess all the stuff that was added to TemplateTransformer to allow column names to be used in templates needs to be re-added to RegexTransformer. I am doing that now... but am confused by the fragment of code which copies from resolverMap into resolverMapCopy. As best I can see, resolverMap is always empty; but I am barely able to follow the code! Can somebody explain when/why resolverMap would be populated.

The behavior is like this: the expression ${currentEntity.colName} does not work automatically, because the row is not added to the VariableResolver. TemplateTransformer has hacked the stuff to make it work. We can think of modifying this behavior.

Also, I begin to understand the comments made by Noble in SOLR-1001 about resolving entity attributes in ContextImpl.getEntityAttribute, and I guess Shalin was right as well. However it also seems wrong that at the top of every transformer we are going to repeat the same code to load the resolver with information about the entity.

4) In that I am reusing template output within other templates, the order of execution becomes important. Can I assume that the explicitly listed columns in an entity are processed by the various transformers in the order they appear within data-config.xml? I *think* that the list of columns within an entity as returned by getAllEntityFields() is actually an ArrayList, which I think is order dependent. Is this correct?

IT IS CORRECT.

5) Should I raise this as a single JIRA issue?

Do not add ONE issue for all. If they are logically connected, put all of them into one. If not, split them into as many issues as possible.

6) Having played with this stuff, I was going to add a bit more to the wiki highlighting some of the possibilities and issues with transformers. But want to check with the list first!

    <dataConfig>
      <dataSource name="myfilereader" type="FileDataSource"/>
      <document>
        <entity name="jc" processor="FileListEntityProcessor" fileName="^.*\.xml$" newerThan="'NOW-1000DAYS'" recursive="true" rootEntity="false" dataSource="null" baseDir="/Volumes/spare/ts/solr/content">
          <entity name="x" dataSource="myfilereader" processor="XPathEntityProcessor" url="${jc.fileAbsolutePath}" rootEntity="true" stream="false" forEach="/record | /record/mediaBlock" transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer">
            <field
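To illustrate the custom-transformer suggestion above, a transformer that copies a derived value back into the row (so later fields in the same entity can refer to it) could look roughly like the sketch below. This is only an outline against the Solr 1.3 DIH Transformer API; the class name and column names are made up:

    import java.util.Map;
    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.Transformer;

    public class FileWebPathTransformer extends Transformer {
        @Override
        public Object transformRow(Map<String, Object> row, Context context) {
            // Build the derived column from values already present in the row,
            // so that later fields in this entity can reuse it.
            Object absolutePath = row.get("fileAbsolutePath");
            Object installDir = row.get("test");
            if (absolutePath != null && installDir != null) {
                String webPath = absolutePath.toString().replace(installDir.toString(), "/ford");
                row.put("fileWebPath", webPath);
            }
            return row;
        }
    }

The transformer would then be listed in the entity's transformer attribute alongside the standard ones.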
Distributed search
Hi, Can we use multicore to have several indexes per webapp and use distributed search to merge the indexes? For example, if we have 3 cores - core0, core1 and core2 - for 3 different languages, can we search across all 3 indexes by using the shards parameter, as in shards=localhost:8080/solr/core0,localhost:8080/solr/core1,localhost:8080/solr/core2 ? Regards Sujatha
Re: Release of solr 1.4 autosuggest
On Feb 16, 2009, at 12:05 AM, Pooja Verlani wrote: Hi All, I am interested in TermComponent addition in solr 1.4 ( http://wiki.apache.org/solr/TermsComponent). When should we expect solr 1.4 to be available for use? Also, can this Termcomponent be made available as a plugin for solr 1.3? I'm guessing the TermComponent patch would apply to the 1.3 source, but I haven't tried it. Kindly reply if you have any idea. Regards, Pooja -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: Word Locations Search Components
On Feb 15, 2009, at 10:33 PM, Johnny X wrote: Hi there, I was told before that I'd need to create a custom search component to do what I want to do, but I'm thinking it might actually be a custom analyzer. Basically, I'm indexing e-mail in XML in Solr and searching the 'content' field which is parsed as 'text'. I want to ignore certain elements of the e-mail (i.e. corporate banners), but also identify the actual content of those e-mails including corporate information. To identify the banners I need something a little more developed than a stop word list. I need to evaluate the frequency of certain words around words like 'privileged' and 'corporate' within a word window of about 100ish words to determine whether they're banners and then remove them from being indexed. I need to do the opposite during the same time to identify, in a similar manner, which e-mails include corporate information in their actual content. I suppose if I'm doing this I don't want what's processed to be indexed as what's returned in a search, because then presumably it won't be the full e-mail, so do I need to store some kind of copy field that keeps the full e-mail and is fully indexed to be returned instead? Storage and indexing are separate things in Lucene/Solr, so setting the Field as stored will keep the original, so no need for a copy field for this particular issue. Can what I'm suggesting be done and can anyone direct me to a guide? Hmm, this kind of stuff may be better off as part of preprocessing, but it could be done as an analyzer, I suppose. How are you determining the words to evaluate? Is it based on collection statistics or just within a document? Or do you just have a list of marker words that indicate the areas of interest? Do you need to keep track of anything beyond the life of one document being analyzed? If you were doing this as an analyzer, you would need to buffer the tokens internally so that you could examine them in a window, and then make a decision as to what tokens to output. I believe the RemoveDuplicatesTokenFilter demonstrates how to do this. Basically, you just need a List to store the tokens in if you see certain conditions met. On another note, is there an easy way to destroy an index...any custom code? Send in a delete by query command with the *:* query. Thanks for any help! -- View this message in context: http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html Sent from the Solr - User mailing list archive at Nabble.com. -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
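On the "destroy an index" point, the delete-by-query mentioned above can also be sent from SolrJ. A minimal sketch, assuming a server running at the usual example URL:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class WipeIndex {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            // Delete every document, then commit so the change becomes visible.
            server.deleteByQuery("*:*");
            server.commit();
        }
    }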
Re: Multilanguage
I recommend that you search both this and the Lucene list. You'll find that this topic has been discussed many times, and several approaches have been outlined. The searchable archives are linked to from here: http://lucene.apache.org/java/docs/mailinglists.html. Best Erick On Mon, Feb 16, 2009 at 12:42 AM, revathy arun revas...@gmail.com wrote: Hi, I have a scenario where ,i need to convert pdf content to text and then index the same at run time .I do not know as to what language the pdf would be ,in this case which is the best soln i have with respect the content field type in the schema where the text content would be indexed to? That is can i use the default tokenizer for all languages and since i would not know the language and hence would not be able to stem the tokens,how would this impact search?Is there any other solution for the same? Rgds
Re: almost realtime updates with replication
Hi Noble, So ok I don't mind really if it miss one, if it get the last one it's good. I've was wondering as well if a snapshot is created even if no document has been update? Thanks a lot Noble, Wish you a very nice day, Noble Paul നോബിള് नोब्ळ् wrote: I guess , it should not be a problem --Noble On Mon, Feb 16, 2009 at 3:28 PM, sunnyfr johanna...@gmail.com wrote: Hi Hoss, Is it a problem if the snappuller miss one snapshot before the last one ?? Cheer, Have a nice day, hossman wrote: : : There are a couple queries that we would like to run almost realtime so : I would like to have it so our client sends an update on every new : document and then have solr configured to do an autocommit every 5-10 : seconds. : : reading the Wiki, it seems like this isn't possible because of the : strain of snapshotting and pulling to the slaves at such a high rate. : What I was thinking was for these few queries to just query the master : and the rest can query the slave with the not realtime data, although : I'm assuming this wouldn't work either because since a snapshot is : created on every commit, we would still impact the performance too much? there is no reason why a commit has to trigger a snapshot, that happens only if you configure a postCommit hook to do so in your solrconfig.xml you can absolutely commit every 5 seconds, but have a seperate cron task that runs snapshooter ever 5 minutes -- you could even continue to run snapshooter on every commit, and get a new snapshot ever 5 seconds, but only run snappuller on your slave machines ever 5 minutes (the snapshots are hardlinks and don't take up a lot of space, and snappuller only needs to fetch the most recent snapshot) your idea of querying the msater directly for these queries seems perfectly fine to me ... just make sure the auto warm count on the caches on your master is very tiny so the new searchers are ready quickly after each commit. -Hoss -- View this message in context: http://www.nabble.com/almost-realtime-updates-with-replication-tp12276614p22034406.html Sent from the Solr - User mailing list archive at Nabble.com. -- --Noble Paul -- View this message in context: http://www.nabble.com/almost-realtime-updates-with-replication-tp12276614p22037977.html Sent from the Solr - User mailing list archive at Nabble.com.
snapshot as big as the index folder?
Hi, Is it normal or did I miss something ?? 5.8Gbook/data/snapshot.20090216153346 12K book/data/spellchecker2 4.0Kbook/data/index 12K book/data/spellcheckerFile 12K book/data/spellchecker1 5.8Gbook/data/ Last update ? str name=Total Requests made to DataSource92562/str str name=Total Rows Fetched45492/str str name=Total Documents Skipped0/str str name=Delta Dump started2009-02-16 15:20:01/str str name=Identifying Delta2009-02-16 15:20:01/str str name=Deltas Obtained2009-02-16 15:20:42/str str name=Building documents2009-02-16 15:20:42/str str name=Total Changed Documents13223/str − str name= Indexing completed. Added/Updated: 13223 documents. Deleted 0 documents. /str str name=Committed2009-02-16 15:33:50/str str name=Optimized2009-02-16 15:33:50/str str name=Time taken 0:13:48.853/str Thanks a lot, -- View this message in context: http://www.nabble.com/snapshot-as-big-as-the-index-folder--tp22038427p22038427.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: snapshot as big as the index folder?
It change a lot in few minute ?? is it normal ? thanks 5.8Gbook/data/snapshot.20090216153346 4.0Kbook/data/index 5.8Gbook/data/ r...@search-07:/data/solr# du -h book/data/ 5.8Gbook/data/snapshot.20090216153346 3.7Gbook/data/index 4.0Kbook/data/snapshot.20090216153759 9.4Gbook/data/ r...@search-07:/data/solr# du -h book/data/ 5.8Gvideo/data/snapshot.20090216153346 4.4Gbook/data/index 4.0Kbook/data/snapshot.20090216153759 11G book/data/ r...@search-07:/data/solr# du -h book/data/ 5.8Gbook/data/snapshot.20090216153346 5.8Gbook/data/index 4.0Kbook/data/snapshot.20090216154819 4.0Kbook/data/snapshot.20090216154820 15M book/data/snapshot.20090216153759 12G book/data/ sunnyfr wrote: Hi, Is it normal or did I miss something ?? 5.8G book/data/snapshot.20090216153346 12K book/data/spellchecker2 4.0K book/data/index 12K book/data/spellcheckerFile 12K book/data/spellchecker1 5.8G book/data/ Last update ? str name=Total Requests made to DataSource92562/str str name=Total Rows Fetched45492/str str name=Total Documents Skipped0/str str name=Delta Dump started2009-02-16 15:20:01/str str name=Identifying Delta2009-02-16 15:20:01/str str name=Deltas Obtained2009-02-16 15:20:42/str str name=Building documents2009-02-16 15:20:42/str str name=Total Changed Documents13223/str − str name= Indexing completed. Added/Updated: 13223 documents. Deleted 0 documents. /str str name=Committed2009-02-16 15:33:50/str str name=Optimized2009-02-16 15:33:50/str str name=Time taken 0:13:48.853/str Thanks a lot, -- View this message in context: http://www.nabble.com/snapshot-as-big-as-the-index-folder--tp22038427p22038656.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Word Locations Search Components
I would go for a business logic solution and not a Solr customization in this case, as you need to filter information that you actually would like to see in diferent fields on your index. Did you already tried to split the email in several fields like subject, from, to, content, signature, etc etc etc ? 2009/2/16 Johnny X jonathanwel...@gmail.com Hi there, I was told before that I'd need to create a custom search component to do what I want to do, but I'm thinking it might actually be a custom analyzer. Basically, I'm indexing e-mail in XML in Solr and searching the 'content' field which is parsed as 'text'. I want to ignore certain elements of the e-mail (i.e. corporate banners), but also identify the actual content of those e-mails including corporate information. To identify the banners I need something a little more developed than a stop word list. I need to evaluate the frequency of certain words around words like 'privileged' and 'corporate' within a word window of about 100ish words to determine whether they're banners and then remove them from being indexed. I need to do the opposite during the same time to identify, in a similar manner, which e-mails include corporate information in their actual content. I suppose if I'm doing this I don't want what's processed to be indexed as what's returned in a search, because then presumably it won't be the full e-mail, so do I need to store some kind of copy field that keeps the full e-mail and is fully indexed to be returned instead? Can what I'm suggesting be done and can anyone direct me to a guide? On another note, is there an easy way to destroy an index...any custom code? Thanks for any help! -- View this message in context: http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html Sent from the Solr - User mailing list archive at Nabble.com. -- Alexander Ramos Jardim
Re: Word Locations Search Components
Basically I'm working on the Enron dataset, and I've already de-duplicated the collection and applied a spam filter. All the e-mails after this have been parsed to XML and each field (so To, From, Date etc) has been separated, along with one large field for the remaining e-mail content (called Content). So yes, to answer your question. Bearing in mind though this still represents around 240, 000ish files to compute. I have no idea about Solr analyzers/search components, but my theory was that I'd need an analyzer to remove 'banner-like' content from being indexed and a search component to identify 'corporate-like' information in the content of the e-mails. What is a business logical solution and how will that work? Thanks. zayhen wrote: I would go for a business logic solution and not a Solr customization in this case, as you need to filter information that you actually would like to see in diferent fields on your index. Did you already tried to split the email in several fields like subject, from, to, content, signature, etc etc etc ? 2009/2/16 Johnny X jonathanwel...@gmail.com Hi there, I was told before that I'd need to create a custom search component to do what I want to do, but I'm thinking it might actually be a custom analyzer. Basically, I'm indexing e-mail in XML in Solr and searching the 'content' field which is parsed as 'text'. I want to ignore certain elements of the e-mail (i.e. corporate banners), but also identify the actual content of those e-mails including corporate information. To identify the banners I need something a little more developed than a stop word list. I need to evaluate the frequency of certain words around words like 'privileged' and 'corporate' within a word window of about 100ish words to determine whether they're banners and then remove them from being indexed. I need to do the opposite during the same time to identify, in a similar manner, which e-mails include corporate information in their actual content. I suppose if I'm doing this I don't want what's processed to be indexed as what's returned in a search, because then presumably it won't be the full e-mail, so do I need to store some kind of copy field that keeps the full e-mail and is fully indexed to be returned instead? Can what I'm suggesting be done and can anyone direct me to a guide? On another note, is there an easy way to destroy an index...any custom code? Thanks for any help! -- View this message in context: http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html Sent from the Solr - User mailing list archive at Nabble.com. -- Alexander Ramos Jardim - RPG da Ilha -- View this message in context: http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22038912.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Word Locations Search Components
I think you essentially have to do much of the same work either way, so take whatever comes easiest. Personally, I think that pre-processing the data (and using two fields) would be easiest, but it's up to you. Using a custom analyzer would involve collecting all the contents, deciding what is relevant and emitting those tokens one by one. The advantage here (and it's not very important) is that you'd only need one field as Grant said. The other approach would be to read the contents into a buffer, apply whatever business logic you determine to remove the irrelevant text, and then submitting this to the normal analyzers. The advantage here is that it's a simpler flow. Analyzers are usually just used for breaking up an incoming stream and doing specific transformations (stop words, stemming, etc). These transformations are pretty context-less. Extending that process to handle complex rules about what's relevant is a bit of a stretch. But if you do pre-process the data, storing the input won't be what you want and you'll need to store the original text in a separate field. Best Erick On Mon, Feb 16, 2009 at 10:05 AM, Johnny X jonathanwel...@gmail.com wrote: Basically I'm working on the Enron dataset, and I've already de-duplicated the collection and applied a spam filter. All the e-mails after this have been parsed to XML and each field (so To, From, Date etc) has been separated, along with one large field for the remaining e-mail content (called Content). So yes, to answer your question. Bearing in mind though this still represents around 240, 000ish files to compute. I have no idea about Solr analyzers/search components, but my theory was that I'd need an analyzer to remove 'banner-like' content from being indexed and a search component to identify 'corporate-like' information in the content of the e-mails. What is a business logical solution and how will that work? Thanks. zayhen wrote: I would go for a business logic solution and not a Solr customization in this case, as you need to filter information that you actually would like to see in diferent fields on your index. Did you already tried to split the email in several fields like subject, from, to, content, signature, etc etc etc ? 2009/2/16 Johnny X jonathanwel...@gmail.com Hi there, I was told before that I'd need to create a custom search component to do what I want to do, but I'm thinking it might actually be a custom analyzer. Basically, I'm indexing e-mail in XML in Solr and searching the 'content' field which is parsed as 'text'. I want to ignore certain elements of the e-mail (i.e. corporate banners), but also identify the actual content of those e-mails including corporate information. To identify the banners I need something a little more developed than a stop word list. I need to evaluate the frequency of certain words around words like 'privileged' and 'corporate' within a word window of about 100ish words to determine whether they're banners and then remove them from being indexed. I need to do the opposite during the same time to identify, in a similar manner, which e-mails include corporate information in their actual content. I suppose if I'm doing this I don't want what's processed to be indexed as what's returned in a search, because then presumably it won't be the full e-mail, so do I need to store some kind of copy field that keeps the full e-mail and is fully indexed to be returned instead? Can what I'm suggesting be done and can anyone direct me to a guide? 
On another note, is there an easy way to destroy an index...any custom code? Thanks for any help! -- View this message in context: http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html Sent from the Solr - User mailing list archive at Nabble.com. -- Alexander Ramos Jardim - RPG da Ilha -- View this message in context: http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22038912.html Sent from the Solr - User mailing list archive at Nabble.com.
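A rough sketch of the pre-processing route described in the reply above, using SolrJ: strip the banner-like text before indexing, and keep the untouched e-mail in a second, stored field. The field names, the banner pattern and the document id are only placeholders:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexEmail {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            String original = "...full e-mail content...";
            // Placeholder business logic: drop lines that look like corporate banners.
            String cleaned = original.replaceAll("(?im)^.*privileged.*$", "");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "email-1");
            doc.addField("content", cleaned);       // field that gets indexed and searched
            doc.addField("content_raw", original);  // stored copy returned to users
            server.add(doc);
            server.commit();
        }
    }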
Re: facet count on partial results
On Sat, Feb 14, 2009 at 6:45 AM, karl wettin karl.wet...@gmail.com wrote: Also, as my threadshold is based on the distance in score between the first result it sounds like using a result start position greater than 0 is something I have to look out for. Or? Hmmm - this isn't that easy in general as it requires knowledge of the max score, right? Hmmm indeed. Does Solr not collect 0-20 even though the request is for 10-20? Wouldn't it then be possible to inject some code that limits the DocSet at that layer? Yes, Solr would actually collect 0-20, but the entire set of matching documents must still be scored to find the maximum score. So if the threshold will be a function of maxScore, it still requires two passes, no? There is more. Not important but a nice thing to get: I create multiple documents per entity from my primary data source (e.g. each entity a book and each document a paragraph from the book) but I only want to present the top scoring document per entity. This sounds like field collapsing. There's is a patch that's still in the works: http://wiki.apache.org/solr/FieldCollapsing -Yonik http://www.lucidimagination.com
Re: Release of solr 1.4 autosuggest
the logging used is changed j.u.l to slf4j . That is the only problem I can see. If you drop in that jar as well it should just work On Mon, Feb 16, 2009 at 6:49 PM, Grant Ingersoll gsing...@apache.org wrote: On Feb 16, 2009, at 12:05 AM, Pooja Verlani wrote: Hi All, I am interested in TermComponent addition in solr 1.4 ( http://wiki.apache.org/solr/TermsComponent). When should we expect solr 1.4 to be available for use? Also, can this Termcomponent be made available as a plugin for solr 1.3? I'm guessing the TermComponent patch would apply to the 1.3 source, but I haven't tried it. Kindly reply if you have any idea. Regards, Pooja -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search -- --Noble Paul
Re: almost realtime updates with replication
yes , it does . it just blindly creates hard links irrespective of a document is added or not. but no snappull will happen because there is no new file to be downloaded On Mon, Feb 16, 2009 at 7:40 PM, sunnyfr johanna...@gmail.com wrote: Hi Noble, So ok I don't mind really if it miss one, if it get the last one it's good. I've was wondering as well if a snapshot is created even if no document has been update? Thanks a lot Noble, Wish you a very nice day, Noble Paul നോബിള് नोब्ळ् wrote: I guess , it should not be a problem --Noble On Mon, Feb 16, 2009 at 3:28 PM, sunnyfr johanna...@gmail.com wrote: Hi Hoss, Is it a problem if the snappuller miss one snapshot before the last one ?? Cheer, Have a nice day, hossman wrote: : : There are a couple queries that we would like to run almost realtime so : I would like to have it so our client sends an update on every new : document and then have solr configured to do an autocommit every 5-10 : seconds. : : reading the Wiki, it seems like this isn't possible because of the : strain of snapshotting and pulling to the slaves at such a high rate. : What I was thinking was for these few queries to just query the master : and the rest can query the slave with the not realtime data, although : I'm assuming this wouldn't work either because since a snapshot is : created on every commit, we would still impact the performance too much? there is no reason why a commit has to trigger a snapshot, that happens only if you configure a postCommit hook to do so in your solrconfig.xml you can absolutely commit every 5 seconds, but have a seperate cron task that runs snapshooter ever 5 minutes -- you could even continue to run snapshooter on every commit, and get a new snapshot ever 5 seconds, but only run snappuller on your slave machines ever 5 minutes (the snapshots are hardlinks and don't take up a lot of space, and snappuller only needs to fetch the most recent snapshot) your idea of querying the msater directly for these queries seems perfectly fine to me ... just make sure the auto warm count on the caches on your master is very tiny so the new searchers are ready quickly after each commit. -Hoss -- View this message in context: http://www.nabble.com/almost-realtime-updates-with-replication-tp12276614p22034406.html Sent from the Solr - User mailing list archive at Nabble.com. -- --Noble Paul -- View this message in context: http://www.nabble.com/almost-realtime-updates-with-replication-tp12276614p22037977.html Sent from the Solr - User mailing list archive at Nabble.com. -- --Noble Paul
Re: delete snapshot??
Hi, Ok but can I use it more often then every day like every three hours, because snapshot are quite big. Thanks a lot, Bill Au wrote: The --delete option of the rsync command deletes extraneous files from the destination directory. It does not delete Solr snapshots. To do that you can use the snapcleaner on the master and/or slave. Bill On Fri, Feb 13, 2009 at 10:15 AM, sunnyfr johanna...@gmail.com wrote: root 26834 16.2 0.0 19412 824 ?S16:05 0:08 rsync -Wa --delete rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ /data/solr/books/data/snapshot.20090213160051-wip Hi obviously it can't delete them because the adress is bad it shouldnt be : rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ but: rsync://##.##.##.##:18180/solr/books/snapshot.20090213160051/ Where should I change this, I checked my script.conf on the slave server but it seems good. Because files can be very big and my server in few hours is getting full. So actually snapcleaner is not necessary on the master ? what about the slave? Thanks a lot, Sunny -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p21998333.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p22041332.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: delete snapshot??
they are just hardlinks. they do not consume space on disk On Mon, Feb 16, 2009 at 10:34 PM, sunnyfr johanna...@gmail.com wrote: Hi, Ok but can I use it more often then every day like every three hours, because snapshot are quite big. Thanks a lot, Bill Au wrote: The --delete option of the rsync command deletes extraneous files from the destination directory. It does not delete Solr snapshots. To do that you can use the snapcleaner on the master and/or slave. Bill On Fri, Feb 13, 2009 at 10:15 AM, sunnyfr johanna...@gmail.com wrote: root 26834 16.2 0.0 19412 824 ?S16:05 0:08 rsync -Wa --delete rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ /data/solr/books/data/snapshot.20090213160051-wip Hi obviously it can't delete them because the adress is bad it shouldnt be : rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ but: rsync://##.##.##.##:18180/solr/books/snapshot.20090213160051/ Where should I change this, I checked my script.conf on the slave server but it seems good. Because files can be very big and my server in few hours is getting full. So actually snapcleaner is not necessary on the master ? what about the slave? Thanks a lot, Sunny -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p21998333.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p22041332.html Sent from the Solr - User mailing list archive at Nabble.com. -- --Noble Paul
Input XML duplicate fields uniqueness
Hi, I have an Input XML as

  <rec id="1" updt="12-Feb-2009">
    <updated_rec>
      <account id="1" loc="NJ" pass="safsafsd#sf08" type="Dev" active="1">
        <updated_item>
          <loc new="NJ" old="CP"/>
        </updated_item>
      </account>
      <account id="2" loc="KL" pass="080jnkdfhjwf" type="Int" active="0">
        <updated_item>
          <pass new="080jnkdfhjwf" old="08dedf"/>
        </updated_item>
      </account>
    </updated_rec>
  </rec>

now for SOLR indexing converted it to

  <add><doc>
    <field name="rec.id">1</field>
    <field name="rec.updt">12-Feb-2009</field>
    <field name="rec.updated_rec.account.id">1</field>
    <field name="rec.updated_rec.account.loc">NJ</field>
    <field name="rec.updated_rec.account.pass">safsafsd#sf08</field>
    <field name="rec.updated_rec.account.type">Dev</field>
    <field name="rec.updated_rec.account.active">1</field>
    <field name="rec.updated_rec.account.updated_item.loc.new">NJ</field>
    <field name="rec.updated_rec.account.updated_item.loc.old">CP</field>
    <field name="rec.updated_rec.account.id">2</field>
    <field name="rec.updated_rec.account.loc">KL</field>
    <field name="rec.updated_rec.account.pass">080jnkdfhjwf</field>
    <field name="rec.updated_rec.account.type">Int</field>
    <field name="rec.updated_rec.account.active">0</field>
    <field name="rec.updated_rec.account.updated_item.pass.new">080jnkdfhjwf</field>
    <field name="rec.updated_rec.account.updated_item.pass.old">08dedf</field>
  </doc></add>

I was able to index it. Just put this single xml and searched based on rec.id and response xml returned however input xml tag order was not maintained. So I was unable to identify which attributes of account belongs to which account. Is there any way out to maintain order? or tokenize the field name so that primary key can be appended (rec.updated_rec.account.1.loc) however still be able to search rec.updated_rec.account.loc field... Need some suggestion.. maybe my approach is totally wrong in dealing with this problem. -- View this message in context: http://www.nabble.com/Input-XML-duplicate-fields-uniqueness-tp22042765p22042765.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: delete snapshot??
Hi Noble, But how come I've got a "no space left on device" error?? :( Thanks a lot,

Feb 16 18:28:34 search-07 jsvc.exec[8872]: ataImporter.java:361) Caused by: java.io.IOException: No space left on device
    at java.io.RandomAccessFile.writeBytes(Native Method)
    at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
    at org.apache.lucene.store.FSDirectory$FSIndexOutput.flushBuffer(FSDirectory.java:679)
    at org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:96)
    at org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:85)
    at org.apache.lucene.store.BufferedIndexOutput.seek(BufferedIndexOutput.java:124)
    at org.apache.lucene.store.FSDirectory$FSIndexOutput.seek(FSDirectory.java:704)
    at org.apache.lucene.index.TermInfosWriter.close(TermInfosWriter.java:220)
    at org.apache.lucene.index.FormatPostingsFieldsWriter.finish(FormatPostingsFieldsWriter.java:70)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:494)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:141)
    at org.apache.lucene.index.IndexW

Noble Paul നോബിള് नोब्ळ् wrote: they are just hardlinks. they do not consume space on disk On Mon, Feb 16, 2009 at 10:34 PM, sunnyfr johanna...@gmail.com wrote: Hi, Ok but can I use it more often then every day like every three hours, because snapshot are quite big. Thanks a lot, Bill Au wrote: The --delete option of the rsync command deletes extraneous files from the destination directory. It does not delete Solr snapshots. To do that you can use the snapcleaner on the master and/or slave. Bill On Fri, Feb 13, 2009 at 10:15 AM, sunnyfr johanna...@gmail.com wrote: root 26834 16.2 0.0 19412 824 ?S16:05 0:08 rsync -Wa --delete rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ /data/solr/books/data/snapshot.20090213160051-wip Hi obviously it can't delete them because the adress is bad it shouldnt be : rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ but: rsync://##.##.##.##:18180/solr/books/snapshot.20090213160051/ Where should I change this, I checked my script.conf on the slave server but it seems good. Because files can be very big and my server in few hours is getting full. So actually snapcleaner is not necessary on the master ? what about the slave? Thanks a lot, Sunny -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p21998333.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p22041332.html Sent from the Solr - User mailing list archive at Nabble.com. -- --Noble Paul -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p22044788.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Input XML duplicate fields uniqueness
On Mon, Feb 16, 2009 at 11:47 PM, Adi_Jinx rohit_wa...@yahoo.com wrote: I was able to index it. Just put this single xml and searched based on rec.id and response xml returned however input xml tag order was not maintained. So I was unable to identify which attributes of account belongs to which account. Is there any way out to maintain order? or tokenize the field name so that primary key can be appended (rec.updated_rec.account.1.loc) however still be able to search rec.updated_rec.account.loc field... How about creating a Solr document for each account and adding the recid and updt attributes from the record tag? -- Regards, Shalin Shekhar Mangar.
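A minimal SolrJ sketch of that approach -- one document per account, with the record-level attributes copied onto each. The field names follow the original mail, but the schema (and the composite "id" used as the unique key) is just an assumption for illustration:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexAccounts {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // One Solr document per <account>, carrying the parent <rec> attributes.
            SolrInputDocument acc1 = new SolrInputDocument();
            acc1.addField("id", "1-1");            // hypothetical unique key: rec.id + account.id
            acc1.addField("rec.id", "1");
            acc1.addField("rec.updt", "12-Feb-2009");
            acc1.addField("account.id", "1");
            acc1.addField("account.loc", "NJ");
            acc1.addField("account.updated_item.loc.new", "NJ");
            acc1.addField("account.updated_item.loc.old", "CP");

            SolrInputDocument acc2 = new SolrInputDocument();
            acc2.addField("id", "1-2");
            acc2.addField("rec.id", "1");
            acc2.addField("rec.updt", "12-Feb-2009");
            acc2.addField("account.id", "2");
            acc2.addField("account.loc", "KL");

            server.add(acc1);
            server.add(acc2);
            server.commit();
        }
    }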
can the TermsComponent be used in combination with fq?
We have been trying to figure out how to construct, for example, a directory page with an overview of available facets for several fields. Looking at the wiki and the issue:

http://wiki.apache.org/solr/TermsComponent
https://issues.apache.org/jira/browse/SOLR-877

it would seem like this component would be useful for this. However, we often require that some filtering be applied to search results based on which user is searching (e.g. public vs. private content). Is it possible to apply filtering here, or will we need to do something like running q=*:*&fq=status:1 and then getting facets?

Note also: the wiki page references a tutorial including this /autocomplete path, but I cannot find any trace of such. I was able to get results similar to the examples on the wiki page by adding the following to solrconfig.xml:

  <searchComponent name="terms" class="org.apache.solr.handler.component.TermsComponent" />

  <!-- a request handler utilizing the elevator component -->
  <requestHandler name="/autocomplete" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
    </lst>
    <arr name="components">
      <str>terms</str>
    </arr>
  </requestHandler>

Is this the right way to activate this? Thanks, Peter -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
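For what it's worth, the q=*:*&fq=status:1 fall-back mentioned above can be expressed in SolrJ roughly as follows; the field names ("status", "keywords") and the prefix are placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FilteredFacets {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("status:1");       // only content this user may see
            q.setRows(0);                       // we only want the facet counts
            q.setFacet(true);
            q.addFacetField("keywords");
            q.setFacetMinCount(1);
            q.set("facet.prefix", "so");        // what the user has typed so far

            QueryResponse rsp = server.query(q);
            System.out.println(rsp.getFacetField("keywords").getValues());
        }
    }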
Re: term offsets not returned with tv=true
Your request seems to be fine. Have you reindexed after adding the termOffsets definition to the document field? Koji

Jeffrey Baker wrote: I'm trying to exercise the termOffset functions in the nightly build (2009-02-11) but it doesn't seem to do anything. I have an item in my schema like so:

  <field name="document" type="text" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />

And I attempt this query: qt=tvrh tv=true tv.offsets=true indent=true wt=json facet.mincount=1 facet=true hl=on hl.fl=document hl.mergeContiguous=true hl.requireFieldMatch=true fl=document,id,title,doctype,score hl.usePhraseHighlighter=true hl.snippets=3 hl.fragsize=200 hl.maxAnalyzedChars=1048576 hl.simple.pre=[[[hit] hl.simple.post=[[[/hit] rows=20 q=iphone ... where most of those parameters are irrelevant to this question (I think). The response looks like this:

  termVectors:[ doc-51630,[ uniqueKey,streetevents:2012449], doc-19343,[ uniqueKey,streetevents:1904785], doc-22599,[ uniqueKey,streetevents:1873725], doc-52660,[ uniqueKey,streetevents:2029389], doc-37532,[ uniqueKey,streetevents:1665907], doc-49797,[ uniqueKey,streetevents:1996051], doc-21476,[ uniqueKey,streetevents:1885188], doc-24671,[ uniqueKey,streetevents:1820498], doc-25617,[ uniqueKey,streetevents:1794743], doc-48135,[ uniqueKey,streetevents:1981537], doc-47239,[ uniqueKey,streetevents:1940855], doc-54651,[ uniqueKey,streetevents:2069828], doc-48085,[ uniqueKey,streetevents:1979847], doc-28956,[ uniqueKey,streetevents:1766038], doc-47986,[ uniqueKey,streetevents:1978001], doc-32287,[ uniqueKey,streetevents:1740905], doc-41568,[ uniqueKey,streetevents:1599906], doc-44964,[ uniqueKey,streetevents:1782481], doc-43900,[ uniqueKey,streetevents:1748639], doc-45390,[ uniqueKey,streetevents:1811998],

I guess I was expecting to get some lists of term offsets. Am I doing it wrong? -jwb
Re: Release of solr 1.4 autosuggest
Sorry for budding in on this thread but what value is added by TermComponent when you can use faceting for auto-suggest? And with faceting, you can limit the suggestion by existing words before the word the user is typing by using it for q. ~ David Smiley Pooja Verlani wrote: Hi All, I am interested in TermComponent addition in solr 1.4 ( http://wiki.apache.org/solr/TermsComponent). When should we expect solr 1.4 to be available for use? Also, can this Termcomponent be made available as a plugin for solr 1.3? Kindly reply if you have any idea. Regards, Pooja -- View this message in context: http://www.nabble.com/Release-of-solr-1.4---autosuggest-tp22031697p22047806.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: delete snapshot??
Hi Noble, I maybe don't get something Ok if it's hard link but how come i've not space left on device error and 30G shown on the data folder ?? sorry I'm quite new 6.0G/data/solr/book/data/snapshot.20090216214502 35M /data/solr/book/data/snapshot.20090216195003 12M /data/solr/book/data/snapshot.20090216195502 12K /data/solr/book/data/spellchecker2 36M /data/solr/book/data/snapshot.20090216185502 37M /data/solr/book/data/snapshot.20090216203502 6.0M/data/solr/book/data/index 12K /data/solr/book/data/snapshot.20090216204002 5.8G/data/solr/book/data/snapshot.20090216172020 12K /data/solr/book/data/spellcheckerFile 28K /data/solr/book/data/snapshot.20090216200503 40K /data/solr/book/data/snapshot.20090216194002 24K /data/solr/book/data/snapshot.2009021622 32K /data/solr/book/data/snapshot.20090216184502 20K /data/solr/book/data/snapshot.20090216191004 1.1M/data/solr/book/data/snapshot.20090216213502 1.1M/data/solr/book/data/snapshot.20090216201502 1.1M/data/solr/book/data/snapshot.20090216213005 24K /data/solr/book/data/snapshot.20090216191502 1.1M/data/solr/book/data/snapshot.20090216212503 107M/data/solr/book/data/snapshot.20090216212002 14M /data/solr/book/data/snapshot.20090216190502 32K /data/solr/book/data/snapshot.20090216201002 2.3M/data/solr/book/data/snapshot.20090216204502 28K /data/solr/book/data/snapshot.20090216184002 5.8G/data/solr/book/data/snapshot.20090216181425 44K /data/solr/book/data/snapshot.20090216190001 20K /data/solr/book/data/snapshot.20090216183401 1.1M/data/solr/book/data/snapshot.20090216203002 44K /data/solr/book/data/snapshot.20090216194502 36K /data/solr/book/data/snapshot.20090216185004 12K /data/solr/book/data/snapshot.20090216182720 12K /data/solr/book/data/snapshot.20090216214001 5.8G/data/solr/book/data/snapshot.20090216175106 1.1M/data/solr/book/data/snapshot.20090216202003 5.8G/data/solr/book/data/snapshot.20090216173224 12K /data/solr/book/data/spellchecker1 1.1M/data/solr/book/data/snapshot.20090216202502 30G /data/solr/book/data thanks a lot, Noble Paul നോബിള് नोब्ळ् wrote: they are just hardlinks. they do not consume space on disk On Mon, Feb 16, 2009 at 10:34 PM, sunnyfr johanna...@gmail.com wrote: Hi, Ok but can I use it more often then every day like every three hours, because snapshot are quite big. Thanks a lot, Bill Au wrote: The --delete option of the rsync command deletes extraneous files from the destination directory. It does not delete Solr snapshots. To do that you can use the snapcleaner on the master and/or slave. Bill On Fri, Feb 13, 2009 at 10:15 AM, sunnyfr johanna...@gmail.com wrote: root 26834 16.2 0.0 19412 824 ?S16:05 0:08 rsync -Wa --delete rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ /data/solr/books/data/snapshot.20090213160051-wip Hi obviously it can't delete them because the adress is bad it shouldnt be : rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ but: rsync://##.##.##.##:18180/solr/books/snapshot.20090213160051/ Where should I change this, I checked my script.conf on the slave server but it seems good. Because files can be very big and my server in few hours is getting full. So actually snapcleaner is not necessary on the master ? what about the slave? Thanks a lot, Sunny -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p21998333.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p22041332.html Sent from the Solr - User mailing list archive at Nabble.com. 
-- --Noble Paul -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p22048391.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: delete snapshot??
Hi Noble, I maybe don't get something Ok if it's hard link but how come i've not space left on device error and 30G shown on the data folder ?? sorry I'm quite new 6.0G/data/solr/book/data/snapshot.20090216214502 35M /data/solr/book/data/snapshot.20090216195003 12M /data/solr/book/data/snapshot.20090216195502 12K /data/solr/book/data/spellchecker2 36M /data/solr/book/data/snapshot.20090216185502 37M /data/solr/book/data/snapshot.20090216203502 6.0M/data/solr/book/data/index 12K /data/solr/book/data/snapshot.20090216204002 5.8G/data/solr/book/data/snapshot.20090216172020 12K /data/solr/book/data/spellcheckerFile 28K /data/solr/book/data/snapshot.20090216200503 40K /data/solr/book/data/snapshot.20090216194002 24K /data/solr/book/data/snapshot.2009021622 32K /data/solr/book/data/snapshot.20090216184502 20K /data/solr/book/data/snapshot.20090216191004 1.1M/data/solr/book/data/snapshot.20090216213502 1.1M/data/solr/book/data/snapshot.20090216201502 1.1M/data/solr/book/data/snapshot.20090216213005 24K /data/solr/book/data/snapshot.20090216191502 1.1M/data/solr/book/data/snapshot.20090216212503 107M/data/solr/book/data/snapshot.20090216212002 14M /data/solr/book/data/snapshot.20090216190502 32K /data/solr/book/data/snapshot.20090216201002 2.3M/data/solr/book/data/snapshot.20090216204502 28K /data/solr/book/data/snapshot.20090216184002 5.8G/data/solr/book/data/snapshot.20090216181425 44K /data/solr/book/data/snapshot.20090216190001 20K /data/solr/book/data/snapshot.20090216183401 1.1M/data/solr/book/data/snapshot.20090216203002 44K /data/solr/book/data/snapshot.20090216194502 36K /data/solr/book/data/snapshot.20090216185004 12K /data/solr/book/data/snapshot.20090216182720 12K /data/solr/book/data/snapshot.20090216214001 5.8G/data/solr/book/data/snapshot.20090216175106 1.1M/data/solr/book/data/snapshot.20090216202003 5.8G/data/solr/book/data/snapshot.20090216173224 12K /data/solr/book/data/spellchecker1 1.1M/data/solr/book/data/snapshot.20090216202502 30G /data/solr/book/data thanks a lot, Noble Paul നോബിള് नोब्ळ् wrote: they are just hardlinks. they do not consume space on disk On Mon, Feb 16, 2009 at 10:34 PM, sunnyfr johanna...@gmail.com wrote: Hi, Ok but can I use it more often then every day like every three hours, because snapshot are quite big. Thanks a lot, Bill Au wrote: The --delete option of the rsync command deletes extraneous files from the destination directory. It does not delete Solr snapshots. To do that you can use the snapcleaner on the master and/or slave. Bill On Fri, Feb 13, 2009 at 10:15 AM, sunnyfr johanna...@gmail.com wrote: root 26834 16.2 0.0 19412 824 ?S16:05 0:08 rsync -Wa --delete rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ /data/solr/books/data/snapshot.20090213160051-wip Hi obviously it can't delete them because the adress is bad it shouldnt be : rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ but: rsync://##.##.##.##:18180/solr/books/snapshot.20090213160051/ Where should I change this, I checked my script.conf on the slave server but it seems good. Because files can be very big and my server in few hours is getting full. So actually snapcleaner is not necessary on the master ? what about the slave? Thanks a lot, Sunny -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p21998333.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/delete-snapshot---tp21998333p22041332.html Sent from the Solr - User mailing list archive at Nabble.com. 
Re: Release of solr 1.4 autosuggest
On Feb 16, 2009, at 6:13 PM, David Smiley @MITRE.org wrote: Sorry for butting in on this thread, but what value is added by TermsComponent when you can use faceting for auto-suggest? Yeah, you can do auto-suggest w/ faceting, no doubt. In fact the TermsComponent could just as well be called Term Faceting or something like that. I mostly wrote the TermsComponent to expose Lucene's underlying TermEnum and thought the auto-suggest would be a bonus. And with faceting, you can limit the suggestion by existing words before the word the user is typing by using it for q. Not sure I follow, but the whole point of auto-suggest is to limit by existing words, right? The TermsComponent uses Lucene's internal TermEnum to return results without any of the other machinery related to faceting. And, of course, you would only ask for terms beginning with the word that is being typed. I haven't tested whether it is faster or not, but I do know there is a fair amount less code involved, so it _might_ be. It would be good to do some performance comparisons. -Grant -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
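For anyone comparing the two approaches, the requests look roughly like this. It assumes a Solr 1.4 build with a request handler for the TermsComponent registered at /terms and a field named "suggest" holding the candidate terms; both names are illustrative, not from the thread above.

TermsComponent:
  http://localhost:8983/solr/terms?terms=true&terms.fl=suggest&terms.prefix=ipo&terms.limit=10

Faceting:
  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=suggest&facet.prefix=ipo&facet.limit=10

Both return terms starting with what the user has typed so far; the faceting variant additionally lets you constrain q or fq to the words the user has already entered, which is the point David raises above.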
Re: Input XML duplicate fields uniqueness
Shalin Shekhar Mangar wrote: On Mon, Feb 16, 2009 at 11:47 PM, Adi_Jinx rohit_wa...@yahoo.com wrote: How about creating a Solr document for each account and adding the recid and updt attributes from the record tag? -- Regards, Shalin Shekhar Mangar. However, then I would need to allow duplicates for my unique key, which is rec.id. My purpose is to track account changes, and somebody should be able to query them. The XML posted here only has entity_updated; I have added and deleted as well. In that case I may have to post 4-5 docs with the same rec.id. Is there any other way out?
Re: Distributed search
Hi, That should work, yes, though it may not be a wise thing to do performance-wise if the number of CPU cores that Solr server has is lower than the number of Solr cores. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: revathy arun revas...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, February 16, 2009 8:18:36 PM Subject: Distributed search Hi, Can we use multicore to have several indexes per webapp and use distributed search to merge the indexes? For example, if we have 3 cores - core0, core1 and core2 - for 3 different languages, can we search across all 3 indexes using the shards parameter, as in shards=localhost:8080/solr/core0,localhost:8080/solr/core1,localhost:8080/solr/core2 Regards Sujatha
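Spelled out as a full request (host, port and query string are placeholders), the setup described above looks roughly like this: send the query to any one core and list all of them in the shards parameter. All cores need a compatible schema with the same uniqueKey field for the merged results to make sense.

  http://localhost:8080/solr/core0/select?q=some+query&shards=localhost:8080/solr/core0,localhost:8080/solr/core1,localhost:8080/solr/core2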
Re: indexing Chienese langage
Hi, While some of the characters in simplified and traditional Chinese do differ, the Chinese tokenizer doesn't care - it simply creates ngram tokens. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: revathy arun revas...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, February 16, 2009 4:30:47 PM Subject: indexing Chienese langage Hi, When I index chinese content using chinese tokenizer and analyzer in solr 1.3 ,some of the chinese text files are getting indexed but others are not. Since chinese has got many different language subtypes as in standard chinese,simplified chinese etc which of these does the chinese tokenizer support and is there any method to find the type of chiense language from the file? Rgds
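For reference, a minimal schema.xml field type along these lines might look like the following; the type and field names are made up for illustration, and solr.CJKTokenizerFactory (which emits overlapping bigrams) is just one of the CJK-capable tokenizers that ships with Solr.

  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
  </fieldType>

  <field name="content_cjk" type="text_cjk" indexed="true" stored="true"/>

Because the tokens are character n-grams rather than dictionary words, the same analyzer handles simplified and traditional text alike.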
Re: Multilanguage
Hi, The best option would be to identify the language after parsing the PDF and then index it using an appropriate analyzer defined in schema.xml. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: revathy arun revas...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, February 16, 2009 1:42:07 PM Subject: Multilanguage Hi, I have a scenario where I need to convert PDF content to text and then index it at run time. I do not know what language the PDF will be in. In this case, what is the best choice for the content field type in the schema the text would be indexed into? That is, can I use the default tokenizer for all languages? And since I would not know the language, and hence would not be able to stem the tokens, how would this impact search? Is there any other solution for the same? Rgds
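A minimal sketch of that approach, using Tika's LanguageIdentifier purely as an example of a language detector and assuming per-language fields (text_en, text_de, ...) declared in schema.xml with matching analyzers; the class, field names and server URL are all illustrative:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.language.LanguageIdentifier;

public class LanguageRouter {
    // Detect the language of the extracted PDF text and put it into a
    // language-specific field so the right analyzer is applied at index time.
    public static void index(SolrServer server, String id, String extractedText) throws Exception {
        String lang = new LanguageIdentifier(extractedText).getLanguage(); // e.g. "en", "de", "fr"
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("text_" + lang, extractedText); // schema.xml defines text_en, text_de, ... with per-language analyzers
        server.add(doc);
    }

    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        index(server, "doc1", "Der schnelle braune Fuchs springt über den faulen Hund.");
        server.commit();
    }
}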
Re: Outofmemory error for large files
Siddharth, At the end of your email you said: One option I see is to break the file in chunks, but with this, I won't be able to search with multiple words if they are distributed in different documents. Unless I'm missing something unusual about your application, I don't think the above is technically correct. Have you tried doing this and have you then tried your searches? Everything should still work, even if you index one document at a time. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
From: Gargate, Siddharth sgarg...@ptc.com To: solr-user@lucene.apache.org Sent: Monday, February 16, 2009 2:00:58 PM Subject: Outofmemory error for large files
I am trying to index a 150 MB text file with a 1024 MB max heap, but I get an OutOfMemoryError in the SolrJ code:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2882)
 at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
 at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
 at java.lang.StringBuffer.append(StringBuffer.java:320)
 at java.io.StringWriter.write(StringWriter.java:60)
 at org.apache.solr.common.util.XML.escape(XML.java:206)
 at org.apache.solr.common.util.XML.escapeCharData(XML.java:79)
 at org.apache.solr.common.util.XML.writeXML(XML.java:149)
 at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:115)
 at org.apache.solr.client.solrj.request.UpdateRequest.writeXML(UpdateRequest.java:200)
 at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:178)
 at org.apache.solr.client.solrj.request.UpdateRequest.getContentStreams(UpdateRequest.java:173)
 at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:136)
 at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:243)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)
I modified the UpdateRequest class to initialize the StringWriter object in UpdateRequest.getXML with an initial size, and cleared the SolrInputDocument that holds the reference to the file text. Then I get an OOM as below:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2786)
 at java.lang.StringCoding.safeTrim(StringCoding.java:64)
 at java.lang.StringCoding.access$300(StringCoding.java:34)
 at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:251)
 at java.lang.StringCoding.encode(StringCoding.java:272)
 at java.lang.String.getBytes(String.java:947)
 at org.apache.solr.common.util.ContentStreamBase$StringStream.getStream(ContentStreamBase.java:142)
 at org.apache.solr.common.util.ContentStreamBase$StringStream.getReader(ContentStreamBase.java:154)
 at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:61)
 at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
 at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
 at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:249)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)
After I increase the heap size up to 1250 MB, I get an OOM as:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOfRange(Arrays.java:3209)
 at java.lang.String.<init>(String.java:216)
 at java.lang.StringBuffer.toString(StringBuffer.java:585)
 at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:403)
 at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821)
 at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:276)
 at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
 at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
 at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
 at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
 at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:249)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)
So it looks like I won't be able to get out of these OOMs. Is there any way to avoid them? One option I see is to break the file in chunks, but with this, I won't be able to search with multiple words if they are distributed in different documents. Also, can somebody tell me the
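As a rough back-of-the-envelope check (an assumption about this particular setup, not something stated in the thread), the numbers are consistent with the stack traces above: 150 MB of text held as a single Java String is about 300 MB of UTF-16 char data; ClientUtils.writeXML then builds an escaped copy in a StringWriter (another ~300 MB, plus transient arrays while the underlying StringBuffer doubles its capacity); and StringStream.getStream calls String.getBytes, producing yet another ~150 MB byte array. Peak usage well over 1 GB for one 150 MB document is therefore expected, which is why splitting the file into smaller documents, as discussed later in this thread, is the usual way out.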
Re: Word Locations Search Components
Hi, Wouldn't this be as easy as:
- split each email into paragraphs
- for each paragraph compute a signature (MD5 or something fuzzier, like in SOLR-799)
- for each signature look for other emails with the same signature
- when you find an email with an identical signature, you know you've found the banner
I'd do this in a pre-processing phase. You may have to add special logic for '>' and other email-quoting characters. Perhaps you can make use of the assumption that banners always come at the end of emails. Perhaps you can make use of situations where the banner appears multiple times in a single email (the one with lots of back-and-forth replies, for example). This is similar to MoreLikeThis at the paragraph level. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: Johnny X jonathanwel...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, February 16, 2009 11:05:40 PM Subject: Re: Word Locations Search Components Basically I'm working on the Enron dataset, and I've already de-duplicated the collection and applied a spam filter. All the e-mails after this have been parsed to XML and each field (so To, From, Date etc.) has been separated, along with one large field for the remaining e-mail content (called Content). So yes, to answer your question. Bearing in mind, though, this still represents around 240,000-ish files to compute. I have no idea about Solr analyzers/search components, but my theory was that I'd need an analyzer to remove 'banner-like' content from being indexed and a search component to identify 'corporate-like' information in the content of the e-mails. What is a business logic solution and how would that work? Thanks. zayhen wrote: I would go for a business logic solution and not a Solr customization in this case, as you need to filter information that you actually would like to see in different fields on your index. Did you already try to split the email into several fields like subject, from, to, content, signature, etc.? 2009/2/16 Johnny X jonathanwel...@gmail.com Hi there, I was told before that I'd need to create a custom search component to do what I want to do, but I'm thinking it might actually be a custom analyzer. Basically, I'm indexing e-mail in XML in Solr and searching the 'content' field which is parsed as 'text'. I want to ignore certain elements of the e-mail (i.e. corporate banners), but also identify the actual content of those e-mails including corporate information. To identify the banners I need something a little more developed than a stop word list. I need to evaluate the frequency of certain words around words like 'privileged' and 'corporate' within a word window of about 100-ish words to determine whether they're banners and then remove them from being indexed. I need to do the opposite at the same time to identify, in a similar manner, which e-mails include corporate information in their actual content. I suppose if I'm doing this I don't want what's processed to be indexed as what's returned in a search, because then presumably it won't be the full e-mail, so do I need to store some kind of copy field that keeps the full e-mail and is fully indexed to be returned instead? Can what I'm suggesting be done, and can anyone direct me to a guide? On another note, is there an easy way to destroy an index... any custom code? Thanks for any help!
-- Alexander Ramos Jardim - RPG da Ilha
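A minimal sketch of the signature-per-paragraph idea described above, assuming the plain-text email bodies are already available as strings; splitting paragraphs on blank lines, the whitespace normalization, and the choice of MD5 are all illustrative, not part of the original posts:

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BannerFinder {
    // Count how often each paragraph signature occurs across the corpus.
    // Paragraphs whose signature shows up in many emails are banner candidates.
    public static Map<String, Integer> signatureCounts(List<String> emails) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String email : emails) {
            for (String para : email.split("\n\\s*\n")) {                 // one signature per paragraph
                String normalized = para.toLowerCase().replaceAll("\\s+", " ").trim();
                if (normalized.length() == 0) continue;
                String sig = new BigInteger(1, md5.digest(normalized.getBytes("UTF-8"))).toString(16);
                Integer c = counts.get(sig);
                counts.put(sig, c == null ? 1 : c + 1);                   // high count => likely banner text
            }
        }
        return counts;
    }
}

High-count paragraphs can then be stripped before the documents are posted to Solr, while the untouched email is kept in a stored-only field so searches still return the full message.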
RE: Outofmemory error for large files
Otis, I haven't tried it yet, but what I meant is: if we divide the content into multiple parts, then words will be split across different Solr documents. If the main document contains 'Hello World', those two words might get indexed in two different documents, and searching for 'Hello World' won't give me the required result unless I use OR in the query. Thanks, Siddharth -----Original Message----- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Tuesday, February 17, 2009 9:58 AM To: solr-user@lucene.apache.org Subject: Re: Outofmemory error for large files
Re: Outofmemory error for large files
Siddharth, But does your 150MB file represent a single Document? That doesn't sound right. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: Gargate, Siddharth sgarg...@ptc.com To: solr-user@lucene.apache.org Sent: Tuesday, February 17, 2009 12:39:53 PM Subject: RE: Outofmemory error for large files
How to fetch all matching records :urgent
Hello, I am using the getResults method of the QueryResponse class on a keyword that has more than a hundred matching records, but this method returns only 10 results and then throws an array index out of bounds exception. How can I fetch all the results? It's really important and urgent for me; kindly reply. Neha Bhardwaj | Software Engineer | Persistent Systems Limited. mailto:akshat_maheshw...@persistent.co.in neha_bhard...@persistent.co.in | Cell: +91 9272383082 | Tel: +91 (20) 302 35257
Re: How to fetch all matching records :urgent
Increment the start value by 10 and make another request. wunder On 2/16/09 9:13 PM, Neha Bhardwaj neha_bhard...@persistent.co.in wrote: how can I fetch all the results?
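In SolrJ terms, that paging loop might look something like the sketch below; the server URL, query string and field name are placeholders, and rows can be raised if larger pages are acceptable:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class FetchAll {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        int start = 0;
        int rows = 10;
        long numFound;
        do {
            SolrQuery q = new SolrQuery("keyword").setStart(start).setRows(rows);
            QueryResponse rsp = server.query(q);
            SolrDocumentList page = rsp.getResults();   // holds at most 'rows' docs
            numFound = page.getNumFound();              // total number of matches
            for (SolrDocument doc : page) {
                System.out.println(doc.getFieldValue("id"));
            }
            start += rows;
        } while (start < numFound);
    }
}

Alternatively, a first request with rows=0 returns numFound without any documents, after which a single request with rows set to that value fetches everything, though that can be very memory-hungry for large result sets.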
Re: delete snapshot??
The hard links prevent the unused index files from being cleaned up, so disk space is still consumed by them. You may need to delete unused snapshots from time to time. --Noble
-- --Noble Paul
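For reference, the cleanup Noble describes is what the snapcleaner script in the standard Solr 1.3 collection distribution scripts is for. A hedged example, with illustrative paths and retention values and the data directory taken from the listing earlier in this thread (check snapcleaner's usage output on your installation, since the exact flags may differ):

  # keep only the 4 most recent snapshots
  /opt/solr/bin/snapcleaner -d /data/solr/book/data -N 4

  # or, from cron every three hours, remove snapshots older than 1 day
  0 */3 * * * /opt/solr/bin/snapcleaner -d /data/solr/book/data -D 1

Run it on whichever machines accumulate snapshots: typically the master, and the slaves as well if snapshots are pulled there.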
Re: Outofmemory error for large files
On Tue, Feb 17, 2009 at 10:26 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Siddharth, But does your 150MB file represent a single Document? That doesn't sound right. Otis, SolrJ writes the whole XML in memory before sending it to the server. That may be one reason behind Siddharth's OOME. See https://issues.apache.org/jira/browse/SOLR-973 -- Regards, Shalin Shekhar Mangar.
Re: Outofmemory error for large files
Right. But I was trying to point out that a single 150MB Document is not in fact what the o.p. wants to do. For example, if your 150MB represents, say, a whole book, should that really be a single document? Or should individual chapters be separate documents, for example? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: Shalin Shekhar Mangar shalinman...@gmail.com To: solr-user@lucene.apache.org Sent: Tuesday, February 17, 2009 2:48:08 PM Subject: Re: Outofmemory error for large files On Tue, Feb 17, 2009 at 10:26 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Siddharth, But does your 150MB file represent a single Document? That doesn't sound right. Otis, SolrJ writes the whole XML in memory before sending it to the server. That may be one reason behind Siddharth's OOME. See https://issues.apache.org/jira/browse/SOLR-973 -- Regards, Shalin Shekhar Mangar.
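A rough SolrJ sketch of that approach, splitting one large text file into many smaller Solr documents; the chunk size, field names and the id_chunkNo key scheme are illustrative assumptions, and a real implementation would probably split on chapter or paragraph boundaries so phrases are not cut in half:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ChunkedIndexer {
    private static final int CHUNK_CHARS = 500000; // roughly 1 MB of text per document

    public static void index(SolrServer server, String id, File file) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"));
        try {
            char[] buf = new char[CHUNK_CHARS];
            int chunkNo = 0;
            int read;
            while ((read = in.read(buf)) != -1) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", id + "_" + chunkNo++);           // uniqueKey per chunk
                doc.addField("filename", file.getName());           // lets hits be grouped back to the source file
                doc.addField("content", new String(buf, 0, read));  // naive split; use paragraph boundaries in practice
                server.add(doc);
            }
            server.commit();
        } finally {
            in.close();
        }
    }
}

Phrase and multi-word queries still match within each chunk, and storing the source file name on every chunk makes it easy to group hits back to the original file at query time.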
Re: Need help with DictionaryCompoundWordTokenFilterFactory
Ralf, Not sure if you got this working or not, but perhaps a simple solution is changing the default boolean operator from OR to AND. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: Kraus, Ralf | pixelhouse GmbH r...@pixelhouse.de To: solr-user@lucene.apache.org Sent: Friday, February 6, 2009 6:23:51 PM Subject: Need help with DictionaryCompoundWordTokenFilterFactory Hi, Now I have run into another problem using solr.DictionaryCompoundWordTokenFilterFactory :-( If I search for the German word Spargelcremesuppe, which contains Spargel, Creme and Suppe, Solr finds way too many results. That's because Solr finds EVERY entry with any one of the three words in it :-( Here is my schema.xml:

<fieldType name="text_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>

Any help? Greets, Ralf Kraus
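If changing the default operator is the way to go, the usual places to do it are schema.xml (globally) or the q.op request parameter (per query). The snippet below is a generic illustration, not taken from Ralf's configuration:

  <!-- schema.xml: make multi-term queries require all terms by default -->
  <solrQueryParser defaultOperator="AND"/>

  # or per request
  http://localhost:8983/solr/select?q=Spargelcremesuppe&q.op=AND

With the compound-word filter also applied at query time, the decompounded parts become separate query terms; under the default OR, any one of them is enough to match, which is the behaviour described above, while AND requires all of them.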