Solr UpdateJSON - extra fields
If JSON being posted to the http://localhost:8983/solr/update/json URL has extra fields that are not defined in the index schema, will those be silently ignored or will an error be thrown?
Re: Update ingest rate drops suddenly
Thanks Otis, we will look into these issues again, slightly deeper. Network problems are not likely, but the DB, I do not know; this is a huge select... We will try to scan the db without indexing, just to see if it can sustain the rate, but gut feeling says nope, this is not the one. IO saturation would surprise me, but you never know. It might very well be that the SSD is somehow having problems with this sustained throughput.

8 cores... no, this was a single update thread. We left the default index settings (do not tweak if it works :). <ramBufferSizeMB>32</ramBufferSizeMB>: 32MB holds a lot of our documents (100 bytes average on-disk size). Assuming a RAM efficiency of 50% (?), we land at 100k buffered documents. Yes, this is kind of smallish, as every ~3 seconds we fill up the ramBuffer (our analyzers surprised me with 30k+ records per second). 256 will do the job; ~24 seconds should be plenty of idle time for IO/OS/JVM to sort out MMAP issues, if any (Windows was never an MMAP performance champion when used from Java, but once you dance around it, it works ok)... Max JVM heap on this test was 768m, memory never went above 500m, using -XX:-UseParallelGC... this is definitely not a GC problem. cheers, eks

On Sun, Sep 25, 2011 at 6:20 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

eks, This is clear as day - you're using Winblows! Kidding. I'd:
* watch IO with something like vmstat 2 and see if the rate drops correlate to increased disk IO or IO wait time
* monitor the DB from which you were pulling the data - maybe the DB or the server that runs it had issues
* monitor the network over which you pull data from the DB
If none of the above reveals the problem, I'd still:
* grab all the data you need to index and copy it locally
* index everything locally
Out of curiosity, how big is your ramBufferSizeMB and your -Xmx? And on that 8-core box you have ~8 indexing threads going? Otis

Sematext is Hiring -- http://sematext.com/about/jobs.html

From: eks dev eks...@yahoo.co.uk
To: solr-user solr-user@lucene.apache.org
Sent: Saturday, September 24, 2011 3:18 PM
Subject: Update ingest rate drops suddenly

Just looking for hints on where to look... We were testing single-threaded ingest rate on Solr, trunk version, on an atypical collection (a lot of small documents), and we noticed something we are not able to explain.

Setup: We use defaults for index settings; Windows 64-bit, JDK 7 u2, on SSD, a machine with enough memory and 8 cores. The schema has 5 stored fields, 4 of them indexed, no positions, no norms. Average net document size (optimized index size / number of documents) is around 100 bytes.

On a test with 40 Mio documents:
- we had an update ingest rate on the first 4.4 Mio documents @ an incredible 34k records/second...
- then it dropped, suddenly, to 20k records per second, and this rate remained stable (variance 1k) until...
- we hit 13 Mio, where the ingest rate dropped really hard again, from one instant in time to another, to 10k records per second. It stayed there until we reached the end @ 40 Mio (slightly reducing, to ca. 9k, but this is not long enough to see a trend).

Nothing unusual was happening with JVM memory (a fully regular 200-450M saw-tooth). CPU in turn was following the ingest rate trend, indicating that we were waiting on something. No searches, no commits, nothing; autoCommit was turned off. Updates were streaming directly from the database.

I did not expect something like this, knowing Lucene merges in the background. Also, having such sudden drops in ingest rate is indicative that we are not leaking something (a leak's drop would have been much more gradual). Maybe it is some caches, but why two really significant drops? 33k/sec to 20k and then to 10k... We would love to keep it @ 34k/second :) I am not really acquainted with the new MergePolicy and flushing settings, but I suspect there is something there we could tweak. Could it be Windows is somehow, hmm, quirky with the Solr default directory on win64/JVM (I think it is MMAP by default)... We did not saturate IO with such small documents, I guess; it is just a couple of gigs over 1-2 hours. All in all, it works well, but is having such hard update ingest rate drops normal? Thanks, eks.
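For reference, the buffer setting discussed above lives in solrconfig.xml. A minimal sketch of the change being considered, assuming the Solr 1.4/3.x-era config layout (the values are the ones from the thread, not a general recommendation):

  <!-- solrconfig.xml: indexing RAM buffer. 32 (the old default) flushes
       roughly every ~3s at 30k+ docs/sec with ~100-byte documents;
       256 gives ~8x more headroom between flushes. -->
  <indexDefaults>
    <ramBufferSizeMB>256</ramBufferSizeMB>
  </indexDefaults>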
Re: matching response and request
Hi Otis, this is absolutely brilliant! I did not think it was possible. It opens up a new possibility. If I insert device IDs in this manner (as in a unique identifier of the device sending the request), might it be possible to control (at least block or permit) the permissions of the user? It seems like something of the sort is possible, but I only come up with this: http://search-lucene.com/m/Yuib11zCeYN No redirect to where the permissions can be set (in the schema) and how the requests are identified as coming from a particular user/device... Thanks for your help. Kind regards, Roland

Otis Gospodnetic wrote:

Hi Roland, Check this:

  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">0</int>
      <lst name="params">
        <str name="indent">on</str>
        <str name="start">0</str>
        <str name="q">solr</str>
        <str name="foo">1</str>   <=== from foo=1
        <str name="version">2.2</str>
        <str name="rows">10</str>
      </lst>

I added foo=1 to the request to Solr and got the above back. Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

From: Roland Tollenaar rwatollen...@gmail.com
To: solr-user@lucene.apache.org
Sent: Saturday, September 24, 2011 4:07 AM
Subject: matching response and request

Hi, sorry for this question but I am hoping it has a quick solution. I am sending multiple GET request queries to Solr, but Solr is not returning the responses in the sequence I send the requests. The shortest responses arrive back first. I am wondering whether I can add a tag to the request which will be given back to me in the response, so that when a response comes I can connect it to the original request and handle it in the appropriate manner. If this is possible, how? Help appreciated! Regards, Roland.
How to apply filters to stored data
Is it possible to apply filters to stored data, like we can apply filters when indexing? For example, I use KeepWordFilter on a field during indexing, but I don't want the filtered-out data to even be stored, i.e. I want the indexed and stored content for this field to be the same. Also, when retrieving data (querying Solr) I find that the content retrieved is the stored data. Is it possible to get the data that is indexed, as opposed to the stored one?
Re: matching response and request
Hi, actually you are right in the sense that this should be sorted out a layer lower, i.e. at the server-client connection level. Done that as well. Thanks for the response. Regards, Roland

rkuris wrote:

I don't think you can do this. If you are sending multiple GET requests, you are doing it across different HTTP connections. The web service has no way of knowing these are related. One solution would be to pass a spare, unused parameter with your request, like sequenceId=NNN, and get the response to echo that back. Then at least you can tell which one is coming back and fix the order up in your program.
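A concrete illustration of rkuris's suggestion, assuming the default echoParams=explicit behavior (Solr echoes explicitly passed parameters back in the responseHeader, as in Otis's example above); the sequenceId name is an arbitrary, unused parameter:

  Request: http://localhost:8983/solr/select?q=solr&sequenceId=42

  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <lst name="params">
        <str name="q">solr</str>
        <!-- echoed back untouched; match it to the originating request -->
        <str name="sequenceId">42</str>
      </lst>
    </lst>
    ...
  </response>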
Multiple servers support
Hi, I am new to Solr, and I am studying it currently. We are planning to implement Solr in our production setup. We have 15 servers where we are getting the data. The data is huge: we are supposed to keep 150 terabytes of data across all servers combined (in terms of documents it will be around 2,592,000 documents per server). We have the necessary storage capacity. Can anyone let me know whether Solr will be a good solution for our text search needs? We are required to provide text searches on a certain limited number of fields.
1- Does Solr support such an architecture, i.e. multiple servers? What specific areas in Solr do I need to explore (shards, cores, etc.)?
2- Any idea whether we will really benefit from a Solr implementation for text searches vs., let us say, Oracle Text Search? Currently our Oracle Text search is giving very bad performance and we are looking to somehow improve our text search performance.
Any high-level pointers or help will be greatly appreciated. Thanks in advance, guys. -- Regards, Raja
escaping HTML tags within XML file
Hello, I was wondering if it is necessary to escape HTML tags within an XML file for indexing? If so, it seems like large XML files with tons of HTML tags could get really messy (using CDATA). Has this been your experience? Do you escape the HTML tags? If so, what technique do you use? Or do you leave the HTML tags in place without escaping them? Thanks!
Re: Solr UpdateJSON - extra fields
This is really easy to try, and you get the same kind of error you get with undefined fields in XML. Best, Erick

On Sat, Sep 24, 2011 at 11:29 PM, msingla msin...@hotmail.com wrote:

If JSON being posted to the http://localhost:8983/solr/update/json URL has extra fields that are not defined in the index schema, will those be silently ignored or will an error be thrown?
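For reference, the behavior depends on the schema: with no matching field or dynamicField, the update fails with an "unknown field" error. A hedged schema.xml sketch of the catch-all alternative some example schemas ship with (the "text" type name assumes such a fieldType exists in your schema):

  <!-- Without a matching field or dynamicField, posting an undefined
       field fails with an "unknown field" error. A catch-all pattern
       like this instead absorbs the extras silently. -->
  <dynamicField name="*" type="text" indexed="true" stored="false"/>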
Re: How to apply filters to stored data
No and no. Hmmm, that's a bit terse. The split between stored and indexed happens quite early in the update process; there's no way I know of to use the tokenized stream as the input to your stored data. And there's no out-of-the-box way to get the indexed tokens back. For anything except very small fields, this would be quite costly. What problem are you trying to solve? Perhaps this is an XY problem. See: http://people.apache.org/~hossman/#xyproblem Best, Erick

On Sun, Sep 25, 2011 at 1:54 AM, drogon jithin1...@gmail.com wrote:

Is it possible to apply filters to stored data, like we can apply filters when indexing? For example, I use KeepWordFilter on a field during indexing, but I don't want the filtered-out data to even be stored, i.e. I want the indexed and stored content for this field to be the same. Also, when retrieving data (querying Solr) I find that the content retrieved is the stored data. Is it possible to get the data that is indexed, as opposed to the stored one?
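To make the split concrete, here is a hedged schema.xml sketch of the setup under discussion; the field, type, and file names are illustrative. The analysis chain (including KeepWordFilter) shapes only what is indexed, while the stored value is always the raw input:

  <fieldType name="keepwords_text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- drops every token not listed in keepwords.txt (indexed side only) -->
      <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>

  <!-- stored="true" keeps the original, unfiltered input verbatim;
       nothing in the analyzer above can change what is stored -->
  <field name="entity" type="keepwords_text" indexed="true" stored="true"/>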
Re: term vector parser in solr.NET
TermVectorComponent support is a pending issue: http://code.google.com/p/solrnet/issues/detail?id=68 Please use the SolrNet mailing list for specific questions about it: http://groups.google.com/group/solrnet Cheers, Mauricio

On Mon, Sep 19, 2011 at 7:33 AM, jame vaalet jamevaa...@gmail.com wrote:

Hi, I was wondering if there is any method to get back the term vector list from Solr through SolrNet? From the source code for SolrNet I couldn't find any term vector parser. -- JAME
Re: Multiple servers support
Well, this is not a neutral forum <g>... A common use-case for Solr is exactly to replace database searches because, as you say, search performance in a database is often slow and limited. RDBMSs do very complex stuff very well, but they are not designed for text searching.

Scaling is accomplished by either replication or sharding. Replication is used when the entire index fits on a single machine and you can get reasonable responses; I've seen 40-50M docs fit quite comfortably on one machine. But 150TB *probably* indicates that this isn't reasonable in your case. If you can't fit the entire index on one machine, then you shard, which splits up the single logical index into multiple slices; Solr automatically queries all the shards and assembles the parts into a single response. (A replication config sketch follows this message.)

But you absolutely cannot guess the hardware requirements ahead of time. It's like answering "How big is a Java program?" There are too many variables. But Solr is free, right? So you absolutely have to get a copy, put your 2.5M docs on it, and test (SolrMeter or jMeter are good options). If you get adequate throughput, add another 1M docs to the machine. Keep on until your QPS rate drops and you'll have a good idea how many documents you can put on a single machine. There's really no other way to answer that question. Best, Erick

On Sun, Sep 25, 2011 at 5:55 AM, Raja Ghulam Rasool the.r...@gmail.com wrote:

Hi, I am new to Solr, and I am studying it currently. We are planning to implement Solr in our production setup. We have 15 servers where we are getting the data. The data is huge: we are supposed to keep 150 terabytes of data across all servers combined (in terms of documents it will be around 2,592,000 documents per server). We have the necessary storage capacity. Can anyone let me know whether Solr will be a good solution for our text search needs? We are required to provide text searches on a certain limited number of fields.
1- Does Solr support such an architecture, i.e. multiple servers? What specific areas in Solr do I need to explore (shards, cores, etc.)?
2- Any idea whether we will really benefit from a Solr implementation for text searches vs., let us say, Oracle Text Search? Currently our Oracle Text search is giving very bad performance and we are looking to somehow improve our text search performance.
Any high-level pointers or help will be greatly appreciated. Thanks in advance, guys. -- Regards, Raja
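A hedged solrconfig.xml sketch of the master/slave replication half of Erick's answer (the HTTP ReplicationHandler available since Solr 1.4; the host name and poll interval are placeholders):

  <!-- On the master: publish a new index version after each commit -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
    </lst>
  </requestHandler>

  <!-- On each slave: poll the master for new versions. The pollInterval
       (plus your commit frequency) bounds how stale slaves can be. -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/replication</str>
      <str name="pollInterval">00:10:00</str>
    </lst>
  </requestHandler>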
Re: Production Issue: SolrJ client throwing this error even though field type is not defined in schema
If I had to give a gentle nudge, I would ask you to validate your schema XML file. You can do so by looking for any W3C XML validator website and just copy-pasting the text there to find out where it's malformed.

Sent from my iPhone

On Sep 24, 2011, at 2:01 PM, Erick Erickson erickerick...@gmail.com wrote:

You might want to review: http://wiki.apache.org/solr/UsingMailingLists There's really not much to go on here. Best, Erick

On Wed, Sep 21, 2011 at 12:13 PM, roz dev rozde...@gmail.com wrote:

Hi All, we are getting this error in our production Solr setup:

Message: Element type "t_sort" must be followed by either attribute specifications, ">" or "/>".

Solr version is 1.4.1. The stack trace indicates that Solr is returning a malformed document.

Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
  at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
  at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
  at com.gap.gid.search.impl.SearchServiceImpl.executeQuery(SearchServiceImpl.java:232)
  ... 15 more
Caused by: org.apache.solr.common.SolrException: parsing error
  at org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:140)
  at org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:101)
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:481)
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
  at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
  ... 17 more
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[3,136974]
Message: Element type "t_sort" must be followed by either attribute specifications, ">" or "/>".
  at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:594)
  at org.apache.solr.client.solrj.impl.XMLResponseParser.readArray(XMLResponseParser.java:282)
  at org.apache.solr.client.solrj.impl.XMLResponseParser.readDocument(XMLResponseParser.java:410)
  at org.apache.solr.client.solrj.impl.XMLResponseParser.readDocuments(XMLResponseParser.java:360)
  at org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:241)
  at org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:125)
  ... 21 more
Re: How to apply filters to stored data
Hi Erick, the problem I am trying to solve is filtering out invalid entities. Users might misspell or enter a new entity name. These new/invalid entities need to pass through a KeepWordFilter so that they won't pollute our autocomplete results. I was looking into Luke, and it does seem to solve my use case, but is Luke something I can use in a production setup? Also, when does copyField happen? Is the data being copied the result of the application of all filters, or the unmodified input?
Re: How to apply filters to stored data
See below:

On Sun, Sep 25, 2011 at 9:53 AM, Jithin jithin1...@gmail.com wrote:

Hi Erick, the problem I am trying to solve is filtering out invalid entities. Users might misspell or enter a new entity name. These new/invalid entities need to pass through a KeepWordFilter so that they won't pollute our autocomplete results.

Right. But if you have a KeepWordFilter, that implies that you have a list of known good words. Couldn't you use that file as your base for the autosuggest component?

I was looking into Luke, and it does seem to solve my use case, but is Luke something I can use in a production setup?

You'll find the performance unacceptably slow if you try to do something similar in production. The nature of an inverted index makes reconstructing a document from the various terms costly.

Also, when does copyField happen? Is the data being copied the result of the application of all filters, or the unmodified input?

copyField happens on the raw input, not the result of your analysis chain. And you can't chain copyField directives, i.e.

  <copyField source="field1" dest="field2"/>
  <copyField source="field2" dest="field3"/>

would not put the contents of field1 into field3. Best, Erick
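A hedged schema.xml sketch of what this means in practice; the field names and the autocomplete field type are illustrative:

  <!-- The raw, pre-analysis value of "name" is copied into
       "name_autocomplete"; the destination then applies its own
       analysis chain (e.g. one containing a KeepWordFilter) -->
  <copyField source="name" dest="name_autocomplete"/>
  <field name="name_autocomplete" type="keepwords_text" indexed="true" stored="false"/>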
Re: How to apply filters to stored data
Erick Erickson wrote:

See below:

On Sun, Sep 25, 2011 at 9:53 AM, Jithin jithin1...@gmail.com wrote:

Hi Erick, the problem I am trying to solve is filtering out invalid entities. Users might misspell or enter a new entity name. These new/invalid entities need to pass through a KeepWordFilter so that they won't pollute our autocomplete results.

Right. But if you have a KeepWordFilter, that implies that you have a list of known good words. Couldn't you use that file as your base for the autosuggest component?

I think that is possible. But is there any other mechanism within Solr/Lucene to preprocess stored data?
Re: escaping HTML tags within XML file
Assuming that the XML has the HTML as values inside fully formed tags, like so:

  <node><HTML>...</HTML></node>

then I think that using an HTML field type in schema.xml for indexing/storing will allow you to do meaningful searches on the content of the HTML without getting confused by the HTML syntax itself. If you have absolutely no need for the entire stored HTML when presenting results to the user, then stripping out the syntax at index time makes sense. This will adversely affect highlighting of that document field as well, so just know your requirements. If you don't want to present anything at all, then don't store, just index, and use the right field type (HTML) such that search results find the right document. Just because a field is helpful in finding the doc doesn't mean folks always want to present it or store it. With the Data Import Handler, an HTML-stripping transformer is available so that markup is removed before the indexer gets its hands on things. I can't be sure if that is how you get your data into Solr. - Pulkit

Sent from my iPhone

On Sep 25, 2011, at 8:00 AM, okayndc bodymo...@gmail.com wrote:

Hello, I was wondering if it is necessary to escape HTML tags within an XML file for indexing? If so, it seems like large XML files with tons of HTML tags could get really messy (using CDATA). Has this been your experience? Do you escape the HTML tags? If so, what technique do you use? Or do you leave the HTML tags in place without escaping them? Thanks!
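A hedged schema.xml sketch of the field type Pulkit describes, using the stock HTML-stripping char filter (the type name is illustrative): searches see the text content, while a stored copy keeps the original markup:

  <fieldType name="html_text" class="solr.TextField">
    <analyzer>
      <!-- strips HTML markup before tokenizing; the stored value,
           if stored="true", is unaffected and keeps the tags -->
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>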
Seek your wisdom for implementing 12 million docs..
Hi List, we are pretty new to Solr/Lucene and have just started indexing a few 10K documents using Solr. Before we attempt anything bigger, we want to see what the best approach should be.

Documents: We have close to ~12 million XML docs of varying sizes, average size 20 KB. These documents have 150 fields, which should be searchable/indexed. Over 80% are fixed-length string fields, and a few strings are multivalued ones (e.g. title, headline, id, submitter, reviewers, suggested-titles, etc.); another 15% are date-specific (added-on, reviewed-on, etc.). The rest are multivalued text fields (e.g. description, summary, comments, notes, etc.). Some of the documents have a large number of these text fields (so we are leaning against storing these in the index). Approximately ~6000 such documents are updated and 400-800 new ones are added each day.

Queries: A typical query would mainly be on string fields (~60% of queries), e.g. a simple one would be: find document ids of documents whose author is XYZ, submitted between [X-Z], whose status is reviewed or pending review, title has this string, etc. The results of these are of an exact nature (found 300 docs). The rest of the searches would include the text fields, where they search quoted snippets or phrases... Almost all queries have multiple operators. Also, each one would want to grab as many result rows as possible (we are limiting this to 2000). The output shall contain only 1-5 fields. (No highlighting etc. needed.)

Available hardware: Some of the existing hardware we could find consists of a ~300GB SAN each on 4 boxes with ~96 gigs each. We also have a couple of older HP DL380s (we mainly want to use these for offline indexing). All of this is on 10G Ethernet.

Questions: Our priority is to provide results fast, and new or updated documents should be indexed within 2 hours. Users are also known to use complex queries for data mining. Seeing all this, any recommendations for indexing data and fields? How do we scale; what architecture should we follow here? Slave/master servers? Any possible issues we may hit? Thanks
Re: How to apply filters to stored data
Not that I know of...

On Sun, Sep 25, 2011 at 11:15 AM, Jithin jithin1...@gmail.com wrote:

Erick Erickson wrote:

Right. But if you have a KeepWordFilter, that implies that you have a list of known good words. Couldn't you use that file as your base for the autosuggest component?

I think that is possible. But is there any other mechanism within Solr/Lucene to preprocess stored data?
Re: Seek your wisdom for implementing 12 million docs..
Round N+1 of "it depends" <g>. This isn't a very big index as Solr indexes go; my first guess would be that you can easily fit this on the machines you're talking about. But, as always, how you implement things may prove me wrong. Really, about the only thing you can do is try it.

Be aware that the "size of the index" is a tricky concept. For instance, if you store your data (stored="true"), the files in your index directory will NOT reflect the total memory requirements, since verbatim copies of your fields are held in the *.fdt files and really don't affect searching speed.

Here's what I claim:

1> You can index these 12M documents in a reasonable time. I index 1.9M documents (a Wikipedia dump) on my MacBook Pro in just a few minutes (< 10 as I remember). So you can just try things.

2> Use a master/slave architecture. You can control how fast the updates are available by the polling interval on the slave and how fast you commit. 2 hours is easy; 10 minutes is a reasonable goal here.

3> Consider edismax-style handlers. The point here is that they allow you to tune relevance much more finely than a "bag of words" approach in which you index many fields into a single text field.

4> You only really need to store the fields you intend to display as part of your search results. Assuming you're going to your system-of-record for the full document, your stored data may be very small.

5> Be aware that the first few queries will often be much slower than later queries, as there are certain caches that need to be filled up. See the various warming parameters on the caches and the firstSearcher and newSearcher entries in the config files (a sketch follows after this message).

6> Create a mix of queries and use something like jMeter or SolrMeter to determine where your target hardware falls down. You have to take some care to create a reasonable query set, not just the same query over and over, or you'll just get cached results. Fire enough queries at the searcher that it starts to perform poorly and tweak from there.

7> Really, really get familiar with two things: a> the admin/analysis page for understanding the analysis process, and b> adding &debugQuery=on to your queries when you don't understand what's happening. In particular, that will show you the parsed queries; you can defer digging into the scoring explanations for later.

8> "string" types aren't what you want very often. They're really suitable for things like IDs, serial numbers, etc. But they are NOT tokenized. So if your input is "some stuff" and you search for "stuff", you won't get a match. This often confuses people. For tokenized processing, you'll probably want one of the text variants. String types are even case-sensitive...

But all in all, I don't see what you've described as particularly difficult, although you'll doubtlessly run into things you don't expect. Hope that helps, Erick

On Sun, Sep 25, 2011 at 1:00 PM, Ikhsvaku S ikhsv...@gmail.com wrote:

Hi List, we are pretty new to Solr/Lucene and have just started indexing a few 10K documents using Solr. Before we attempt anything bigger, we want to see what the best approach should be.

Documents: We have close to ~12 million XML docs of varying sizes, average size 20 KB. These documents have 150 fields, which should be searchable/indexed. Over 80% are fixed-length string fields, and a few strings are multivalued ones (e.g. title, headline, id, submitter, reviewers, suggested-titles, etc.); another 15% are date-specific (added-on, reviewed-on, etc.). The rest are multivalued text fields (e.g. description, summary, comments, notes, etc.). Some of the documents have a large number of these text fields (so we are leaning against storing these in the index). Approximately ~6000 such documents are updated and 400-800 new ones are added each day.

Queries: A typical query would mainly be on string fields (~60% of queries), e.g. a simple one would be: find document ids of documents whose author is XYZ, submitted between [X-Z], whose status is reviewed or pending review, title has this string, etc. The results of these are of an exact nature (found 300 docs). The rest of the searches would include the text fields, where they search quoted snippets or phrases... Almost all queries have multiple operators. Also, each one would want to grab as many result rows as possible (we are limiting this to 2000). The output shall contain only 1-5 fields. (No highlighting etc. needed.)

Available hardware: Some of the existing hardware we could find consists of a ~300GB SAN each on 4 boxes with ~96 gigs each. We also have a couple of older HP DL380s (we mainly want to use these for offline indexing). All of this is on 10G Ethernet.

Questions: Our priority is to provide results fast, and new or updated documents should be indexed within 2 hours. Users are also known to use complex queries for data mining.
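A hedged solrconfig.xml sketch of the warming entries mentioned in point 5> above; the queries themselves are illustrative and should mirror your real traffic:

  <!-- solrconfig.xml: warm caches before a searcher starts serving traffic -->
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">author:XYZ</str><str name="rows">10</str></lst>
    </arr>
  </listener>
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">status:reviewed</str></lst>
    </arr>
  </listener>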
Re: escaping HTML tags within XML file
Here is a representation of the XML file...

  <root>
    <commenter>
      <comment><p>Text here</p><img src="image.gif" /><p>More text here</p></comment>
    </commenter>
  </root>

I want to keep the HTML tags because they keep the formatting (paragraph tags, etc.) intact for the output. It seems like you're saying that the HTML can be kept intact with the use of an HTML field type, without having to escape the HTML tags?

On Sun, Sep 25, 2011 at 2:52 PM, pulkitsing...@gmail.com wrote:

Assuming that the XML has the HTML as values inside fully formed tags, like so:

  <node><HTML>...</HTML></node>

then I think that using an HTML field type in schema.xml for indexing/storing will allow you to do meaningful searches on the content of the HTML without getting confused by the HTML syntax itself. If you have absolutely no need for the entire stored HTML when presenting results to the user, then stripping out the syntax at index time makes sense. This will adversely affect highlighting of that document field as well, so just know your requirements. If you don't want to present anything at all, then don't store, just index, and use the right field type (HTML) such that search results find the right document. Just because a field is helpful in finding the doc doesn't mean folks always want to present it or store it. With the Data Import Handler, an HTML-stripping transformer is available so that markup is removed before the indexer gets its hands on things. I can't be sure if that is how you get your data into Solr. - Pulkit

Sent from my iPhone

On Sep 25, 2011, at 8:00 AM, okayndc bodymo...@gmail.com wrote:

Hello, I was wondering if it is necessary to escape HTML tags within an XML file for indexing? If so, it seems like large XML files with tons of HTML tags could get really messy (using CDATA). Has this been your experience? Do you escape the HTML tags? If so, what technique do you use? Or do you leave the HTML tags in place without escaping them? Thanks!
Re: How to apply filters to stored data
Well... DIH can. And update processors can. And of course client-side indexers. But yeah... elbow grease required. Erik

On Sep 25, 2011, at 16:32, Erick Erickson erickerick...@gmail.com wrote:

Not that I know of...

On Sun, Sep 25, 2011 at 11:15 AM, Jithin jithin1...@gmail.com wrote:

Erick Erickson wrote:

Right. But if you have a KeepWordFilter, that implies that you have a list of known good words. Couldn't you use that file as your base for the autosuggest component?

I think that is possible. But is there any other mechanism within Solr/Lucene to preprocess stored data?
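A hedged solrconfig.xml sketch of the update-processor route Erik mentions. The chain wiring and the stock factories are standard, but the custom factory class here is hypothetical: a real implementation would rewrite the field value before it is indexed and stored:

  <!-- solrconfig.xml: run documents through a processor before indexing.
       com.example.KeepWordsFieldMutatingFactory is hypothetical; you
       would implement it to filter the field value yourself. -->
  <updateRequestProcessorChain name="preprocess" default="true">
    <processor class="com.example.KeepWordsFieldMutatingFactory"/>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>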
Re: escaping HTML tags within XML file
Yes - you can index the HTML text only, while keeping the tags in place in the stored field, using HTMLCharFilter (or possibly XMLCharFilter). But you will find that embedding HTML inside XML can be problematic, since HTML tags don't have to follow the well-formed constraints that XML requires. For example, old-style paragraph tags in HTML were often not closed: just <p> with no </p>. If you have stuff like that, you won't be able to embed it in XML without quoting the characters. You never said why you are embedding HTML in XML, though. -Mike

On 9/25/2011 5:06 PM, okayndc wrote:

Here is a representation of the XML file...

  <root>
    <commenter>
      <comment><p>Text here</p><img src="image.gif" /><p>More text here</p></comment>
    </commenter>
  </root>

I want to keep the HTML tags because they keep the formatting (paragraph tags, etc.) intact for the output. It seems like you're saying that the HTML can be kept intact with the use of an HTML field type, without having to escape the HTML tags?
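For the well-formedness problem Mike raises, the usual workaround when posting such content in a Solr XML update message is entity-escaping or a CDATA section; a minimal sketch (the field name is illustrative):

  <add>
    <doc>
      <!-- CDATA carries the embedded HTML verbatim, even when it is not
           well-formed XML (e.g. unclosed <p> tags) -->
      <field name="comment"><![CDATA[<p>Text here<p>more text, unclosed tags are fine]]></field>
    </doc>
  </add>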
Re: escaping HTML tags within XML file
Yes sir!

Sent from my iPhone

On Sep 25, 2011, at 4:06 PM, okayndc bodymo...@gmail.com wrote:

Here is a representation of the XML file...

  <root>
    <commenter>
      <comment><p>Text here</p><img src="image.gif" /><p>More text here</p></comment>
    </commenter>
  </root>

I want to keep the HTML tags because they keep the formatting (paragraph tags, etc.) intact for the output. It seems like you're saying that the HTML can be kept intact with the use of an HTML field type, without having to escape the HTML tags?

On Sun, Sep 25, 2011 at 2:52 PM, pulkitsing...@gmail.com wrote:

Assuming that the XML has the HTML as values inside fully formed tags, like so:

  <node><HTML>...</HTML></node>

then I think that using an HTML field type in schema.xml for indexing/storing will allow you to do meaningful searches on the content of the HTML without getting confused by the HTML syntax itself. If you have absolutely no need for the entire stored HTML when presenting results to the user, then stripping out the syntax at index time makes sense. This will adversely affect highlighting of that document field as well, so just know your requirements. If you don't want to present anything at all, then don't store, just index, and use the right field type (HTML) such that search results find the right document. Just because a field is helpful in finding the doc doesn't mean folks always want to present it or store it. With the Data Import Handler, an HTML-stripping transformer is available so that markup is removed before the indexer gets its hands on things. I can't be sure if that is how you get your data into Solr. - Pulkit
Re: Sending pdf files to Solr for indexing
Hi there, you can use DIH with Tika.
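A hedged data-config.xml sketch of that DIH + Tika route (TikaEntityProcessor, available from Solr 3.1; the directory path and target field name are illustrative):

  <!-- data-config.xml: crawl a directory of PDFs and extract text with Tika -->
  <dataConfig>
    <dataSource type="BinFileDataSource" name="bin"/>
    <document>
      <entity name="files" processor="FileListEntityProcessor" rootEntity="false"
              dataSource="null" baseDir="/path/to/pdfs" fileName=".*\.pdf" recursive="true">
        <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
                url="${files.fileAbsolutePath}" format="text">
          <!-- Tika's extracted plain text lands in the "content" field -->
          <field column="text" name="content"/>
        </entity>
      </entity>
    </document>
  </dataConfig>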