RE: full-text indexing XML files
CDATA didn't work either. It still complained about the input doc not being in the correct format.

-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Thursday, December 10, 2009 7:43 PM
To: solr-user@lucene.apache.org
Subject: Re: full-text indexing XML files

Or CDATA (much easier to work with).

--
Lance Norskog
goks...@gmail.com
RE: full-text indexing XML files
Yeah, XML tags as well. Essentially we want to full-text index the file, without the need for stemming the tokens. Will the Solr analyzer be able to tokenize the document correctly if it does not have any whitespace (besides what XML syntax requires)?

-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Thursday, December 10, 2009 8:00 PM
To: solr-user@lucene.apache.org
Subject: Re: full-text indexing XML files

What kind of searches do you want to do? Do you want to do searches that match the XML tags?

wunder
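
On the tokenization question: the answer depends on which tokenizer the field type uses. The sketch below is plain Python, not Solr's analyzer chain; the two splitting rules and the function names are assumptions made for illustration. It shows that splitting on whitespace alone leaves tags glued to the neighbouring words, while splitting on every non-alphanumeric character yields tag names and words as separate tokens.

```python
import re

# Sample markup with no whitespace between tags, as in the question.
SAMPLE = "<listing><id>1001</id><name>NASA Advanced Research Labs</name></listing>"


def whitespace_tokens(text):
    """Split on whitespace only; markup stays attached to adjacent words."""
    return text.split()


def alnum_tokens(text):
    """Hypothetical rule: every run of letters or digits is a token.

    Angle brackets and slashes act as separators, so tag names become
    ordinary tokens next to the words they enclose. No stemming is applied.
    """
    return re.findall(r"[A-Za-z0-9]+", text.lower())


if __name__ == "__main__":
    print(whitespace_tokens(SAMPLE))
    print(alnum_tokens(SAMPLE))
```

With the whitespace-only rule, a search for "NASA" would not match the token that still carries the leading tags, which is why the choice of tokenizer matters more here than stemming.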
Re: full-text indexing XML files
If you really want to do XML-sensitive search, it could be a lot of work in Solr. Lucene has a flat data model, so hierarchy requires a lot of mapping to the schema or fancy, slow queries.

There are engines that are designed for XML indexing and search, using XQuery, so consider whether that might be less work overall. XML engines are less mature than Lucene and Solr, so there is a big performance and scalability gap between the best free engines (eXist) and the best commercial engines (Mark Logic, where I work).

wunder

Walter Underwood
Lead Engineer, Mark Logic

On Dec 11, 2009, at 9:42 AM, Feroze Daud wrote:

> Yeah, XML tags as well. Essentially we want to full-text index the file, without the need for stemming the tokens. Will the Solr analyzer be able to tokenize the document correctly if it does not have any whitespace (besides what XML syntax requires)?
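
The "mapping to the schema" mentioned above can be pictured as flattening the element hierarchy into path-named fields before indexing. The following is a minimal sketch under stated assumptions: the dotted path naming (for example `listing.name`) and the `flatten` helper are hypothetical, not part of Solr or Lucene, and attributes and mixed content are ignored.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def flatten(xml_text):
    """Map each element's text to a dotted path such as 'listing.name'.

    Returns a dict of path -> list of values, since a path can repeat.
    Attributes and text that follows a child element are not handled.
    """
    fields = defaultdict(list)

    def walk(elem, parent_path):
        path = f"{parent_path}.{elem.tag}" if parent_path else elem.tag
        text = (elem.text or "").strip()
        if text:
            fields[path].append(text)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return dict(fields)


if __name__ == "__main__":
    listing = (
        "<listing><id>1001</id><name>NASA Advanced Research Labs</name>"
        "<address>1010 main street, chattanooga, FL 32212</address></listing>"
    )
    for path, values in flatten(listing).items():
        print(path, values)
```

Each path would then need a matching field (or dynamic field) in the schema, which is the extra work being described; deep or irregular documents make that mapping grow quickly.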
Re: full-text indexing XML files
Please post a small sample file that has this problem with CDATA.

On Fri, Dec 11, 2009 at 9:41 AM, Feroze Daud fero...@zillow.com wrote:

> CDATA didn't work either. It still complained about the input doc not being in the correct format.

--
Lance Norskog
goks...@gmail.com
Re: full-text indexing XML files
Or CDATA (much easier to work with).

On Wed, Dec 9, 2009 at 10:37 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

> You need to XML encode the value of the content field.

--
Lance Norskog
goks...@gmail.com
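
A minimal sketch of the CDATA approach, assuming the update message is assembled as a string in Python. The helper names `cdata` and `build_add_message_cdata` are hypothetical. One known pitfall is that a CDATA section ends at the first `]]>`, so embedded content that contains that sequence (for instance a document with its own CDATA sections) must have it split across two sections; whether that is what went wrong elsewhere in this thread is not known.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def cdata(text):
    """Wrap text in a CDATA section, splitting any ']]>' inside it."""
    # ']]>' would close the section early, so it is divided between two
    # adjacent CDATA sections: ']]' ends the first, '>' starts the second.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def build_add_message_cdata(doc_id, raw_xml):
    """Build a Solr-style <add> message with raw_xml carried as CDATA."""
    return (
        "<add><doc>"
        f'<field name="id">{escape(doc_id)}</field>'
        f'<field name="content">{cdata(raw_xml)}</field>'
        "</doc></add>"
    )


if __name__ == "__main__":
    listing = "<listing><id>1001</id><name>NASA Advanced Research Labs</name></listing>"
    message = build_add_message_cdata("1001", listing)
    # Parsing the message back confirms it is well formed and that the
    # field text is the original markup, not child elements.
    field = ET.fromstring(message).find("./doc/field[@name='content']")
    print(field.text == listing)
    print(message)
```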
Re: full-text indexing XML files
What kind of searches do you want to do? Do you want to do searches that match the XML tags?

wunder

On Dec 10, 2009, at 7:43 PM, Lance Norskog wrote:

> Or CDATA (much easier to work with).
Re: full-text indexing XML files
On Thu, Dec 10, 2009 at 5:13 AM, Feroze Daud fero...@zillow.com wrote:

> Hi!
>
> I am trying to full-text index an XML file. For various reasons, I cannot use Tika or other technology to parse the XML file. The requirement is to full-text index the XML file, including tags and everything.
>
> So, I created an input index spec like this:
>
>   <add>
>     <doc>
>       <field name="id">1001</field>
>       <field name="name">NASA Advanced Research Labs</field>
>       <field name="address">1010 Main Street, Chattanooga, FL 32212</field>
>       <field name="content"><listing><id>1001</id><name> NASA Advanced Research Labs </name><address>1010 main street, chattanooga, FL 32212</address></listing></field>
>     </doc>
>   </add>

You need to XML encode the value of the content field.

--
Regards,
Shalin Shekhar Mangar.
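
A minimal sketch of what XML-encoding the content field could look like, assuming the update message is built as a string in Python. The `build_add_message` helper is hypothetical; `escape` is the standard-library function from `xml.sax.saxutils`. Once the markup is escaped, the parser sees the listing as character data inside the field rather than as unexpected child elements.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

RAW_LISTING = (
    "<listing><id>1001</id><name>NASA Advanced Research Labs</name>"
    "<address>1010 main street, chattanooga, FL 32212</address></listing>"
)


def build_add_message(doc_id, raw_xml):
    """Build a Solr-style <add> message with raw_xml XML-encoded."""
    # escape() replaces &, < and > with &amp;, &lt; and &gt;.
    return (
        "<add><doc>"
        f'<field name="id">{escape(doc_id)}</field>'
        f'<field name="content">{escape(raw_xml)}</field>'
        "</doc></add>"
    )


if __name__ == "__main__":
    message = build_add_message("1001", RAW_LISTING)
    # Round trip: the field has no child elements and its text is the
    # original, unescaped markup.
    field = ET.fromstring(message).find("./doc/field[@name='content']")
    print(len(field) == 0, field.text == RAW_LISTING)
    print(message)
```

The stored and indexed value is the decoded text, tags included, so how it is searched still depends on the analyzer configured for the content field.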