Re: Problems with DIH XPath flatten

2009-10-11 Thread Shalin Shekhar Mangar
On Wed, Oct 7, 2009 at 6:54 PM, Adam Foltzer  wrote:

> Here's a sample:
>
> 
>  
> 
> 
> ]>
> 
>  
>In Mac OS X, how do I enable or disable the firewall?
>
> Mac OS
> Xallvisible includes
> an easy-to-use 
> access="allowed">firewallallvisible
> that
> can prevent potentially harmful incoming connections from other
> computers. To turn it on or off:
>
>
> Mac OS X 10.6 (Snow Leopard)
>
> From the Apple menu, select System Preferences...†.
> When the System Preferences window appears, from the
> View menu, select Security.
>
> 
> Click the Firewall tab.
>
> ...
>
> 
> 
>
>  macos
>  macintosh
>  apple
>  macosx
>
> ...
>
>
>  
>  
>aozg
>scmac
>
> ...
>
>  
> 
>
> The /document/kbml/kbq XPath works fine, but as you can see, that element
> has no children. The actual content of the document is inside the body
> element, which has nested markup and so requires some flattening.
>
>
Adam, I'm not able to reproduce your problem. I wrote a test case using your
XML and configuration, and it passes. Diff below:

Index: contrib/dataimporthandler/src/test/java/org/apache/solr/handler/dataimport/TestXPathEntityProcessor.java
===
--- contrib/dataimporthandler/src/test/java/org/apache/solr/handler/dataimport/TestXPathEntityProcessor.java	(revision 824015)
+++ contrib/dataimporthandler/src/test/java/org/apache/solr/handler/dataimport/TestXPathEntityProcessor.java	(working copy)
@@ -109,6 +109,85 @@
   }

   @Test
+  @SuppressWarnings("unchecked")
+  public void testFlatten() throws Exception {
+    String xml = "\n" +
+            "\n" +
+            "\n" +
+            "\n" +
+            "]>\n" +
+            "\n" +
+            " \n" +
+            "   In Mac OS X, how do I enable or disable the firewall?\n" +
+            "   \n" +
+            "Mac OS\n" +
+            "Xallvisible includes\n" +
+            "an easy-to-use firewallallvisible\n" +
+            "that\n" +
+            "can prevent potentially harmful incoming connections from other\n" +
+            "computers. To turn it on or off:\n" +
+            "\n" +
+            "\n" +
+            "Mac OS X 10.6 (Snow Leopard)\n" +
+            "\n" +
+            "From the Apple menu, select System Preferences...†.\n" +
+            "When the System Preferences window appears, from the\n" +
+            "View menu, select Security.\n" +
+            "\n" +
+            "\n" +
+            "Click the Firewall tab.\n" +
+            "\n" +
+            "...\n" +
+            "\n" +
+            "\n" +
+            "\n" +
+            "   \n" +
+            " macos\n" +
+            " macintosh\n" +
+            " apple\n" +
+            " macosx\n" +
+            "\n" +
+            "...\n" +
+            "\n" +
+            "   \n" +
+            " \n" +
+            " \n" +
+            "   aozg\n" +
+            "   scmac\n" +
+            "\n" +
+            "...\n" +
+            "\n" +
+            " \n" +
+            "";
+    Map entityAttrs = createMap("name", "kbxml", "url", "testdata.xml",
+            XPathEntityProcessor.FOR_EACH, "/document", "transformer", "HTMLStripTransformer");
+    List fields = new ArrayList();
+    fields.add(createMap("column", "content", "xpath", "/document/kbml/body" ,"flatten","true", "stripHTML", "true"));
+    fields.add(createMap("column", "title", "xpath", "/document/kbml/kbq"));
+    Context c = AbstractDataImportHandlerTest.getContext(null,
+            new VariableResolverImpl(), getDataSource(xml), Context.FULL_DUMP, fields, entityAttrs);
+    XPathEntityProcessor xPathEntityProcessor = new XPathEntityProcessor();
+    xPathEntityProcessor.init(c);
+    Map result = null;
+    while (true) {
+      Map row = xPathEntityProcessor.nextRow();
+      if (row == null)
+        break;
+      result = row;
+    }
+    System.out.println("result.get(\"content\") = " + result.get("content"));
+    Assert.assertNotNull(result.get("content"));
+    Assert.assertTrue(result.get("content").toString().trim().length() > 0);
+    HTMLStripTransformer t = new HTMLStripTransformer();
+    t.transformRow(result, c);
+    System.out.println("result.get(\"content\") = " + result.get("content"));
+    Assert.assertNotNull(result.get("content"));
+    Assert.assertTrue(result.get("content").toString().trim().length() > 0);
+  }
+
+  @Test
   public void withFieldsAndXpathStream() throws Exception {
     Map entityAttrs = createMap("name", "e", "url", "cd.xml",
             XPathEntityProcessor.FOR_EACH, "/catalog/cd", "stream", "true", "batchSize","1");


-- 
Regards,
Shalin Shekhar Mangar.

Re: Problems with DIH XPath flatten

2009-10-07 Thread Adam Foltzer
Here's a sample:





]>

  
In Mac OS X, how do I enable or disable the firewall?

Mac OS
Xallvisible includes
an easy-to-use firewallallvisible
that
can prevent potentially harmful incoming connections from other
computers. To turn it on or off:


Mac OS X 10.6 (Snow Leopard)

From the Apple menu, select System Preferences...†.
When the System Preferences window appears, from the
View menu, select Security.


Click the Firewall tab.

...




  macos
  macintosh
  apple
  macosx

...


  
  
aozg
scmac

...

  


The /document/kbml/kbq XPath works fine, but as you can see, that element
has no children. The actual content of the document is inside the body
element, which has nested markup and so requires some flattening.

Thanks for your time,
Adam

2009/10/6 Noble Paul നോബിള്‍  नोब्ळ् :
> Send a small sample XML snippet of what you are trying to index; it may help.
>
> On Tue, Oct 6, 2009 at 9:29 PM, Adam Foltzer  wrote:
>> Hi all,
>>
>> I'm trying to set up DataImportHandler to index some XML documents available
>> over web services. The XML includes both content and metadata, so for the
>> indexable content, I'm trying to just index everything under the content
>> tag:
>>
>> >        url="resturl" processor="XPathEntityProcessor"
>>        forEach="/document" transformer="HTMLStripTransformer"
>> flatten="true">
>> > flatten="true" stripHTML="true" />
>> 
>> 
>>
>> The result of this is that the title field gets populated and indexed (there
>> are no child nodes of /document/kbml/kbq), but content does not get indexed
>> at all. Since /document/kbml/body has many children, I expected that
>> flatten="true" would store all of the body text in the field. Instead, it
>> stores nothing at all. I've tried this with many combinations of
>> transformers and flatten options, and the result is the same each time.
>>
>> Here are the relevant field declarations from the schema (the type="text" is
>> just the one from the example's schema.xml). I have tried combinations here
>> as well of stored= and multiValued=, with the same result each time.
>>
>> > multiValued="true" />
>> > multiValued="true" />
>>
>> If it would help troubleshooting, I could send along some sample XML. I
>> don't want to spam the list with an attachment unless it's necessary, though
>> :)
>>
>> Thanks in advance for your help,
>>
>> Adam Foltzer
>>
>
>
>
> --
> -
> Noble Paul | Principal Engineer| AOL | http://aol.com
>


Re: Problems with DIH XPath flatten

2009-10-06 Thread Noble Paul നോബിള്‍ नोब्ळ्
Send a small sample XML snippet of what you are trying to index; it may help.

On Tue, Oct 6, 2009 at 9:29 PM, Adam Foltzer  wrote:
> Hi all,
>
> I'm trying to set up DataImportHandler to index some XML documents available
> over web services. The XML includes both content and metadata, so for the
> indexable content, I'm trying to just index everything under the content
> tag:
>
>         url="resturl" processor="XPathEntityProcessor"
>        forEach="/document" transformer="HTMLStripTransformer"
> flatten="true">
>  flatten="true" stripHTML="true" />
> 
> 
>
> The result of this is that the title field gets populated and indexed (there
> are no child nodes of /document/kbml/kbq), but content does not get indexed
> at all. Since /document/kbml/body has many children, I expected that
> flatten="true" would store all of the body text in the field. Instead, it
> stores nothing at all. I've tried this with many combinations of
> transformers and flatten options, and the result is the same each time.
>
> Here are the relevant field declarations from the schema (the type="text" is
> just the one from the example's schema.xml). I have tried combinations here
> as well of stored= and multiValued=, with the same result each time.
>
>  multiValued="true" />
>  multiValued="true" />
>
> If it would help troubleshooting, I could send along some sample XML. I
> don't want to spam the list with an attachment unless it's necessary, though
> :)
>
> Thanks in advance for your help,
>
> Adam Foltzer
>



-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: Problems with DIH XPath flatten

2009-10-06 Thread Adam Foltzer
Hi Shalin,

Good question; sorry I left that out of the initial post. I have tried both
a nightly build from earlier this month (Oct 2, I believe) and a build
from trunk as of yesterday afternoon.

Adam

On Tue, Oct 6, 2009 at 5:04 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> On Tue, Oct 6, 2009 at 9:29 PM, Adam Foltzer  wrote:
>
> > Hi all,
> >
> > I'm trying to set up DataImportHandler to index some XML documents
> > available
> > over web services. The XML includes both content and metadata, so for the
> > indexable content, I'm trying to just index everything under the content
> > tag:
> >
> >  >url="resturl" processor="XPathEntityProcessor"
> >forEach="/document" transformer="HTMLStripTransformer"
> > flatten="true">
> >  > flatten="true" stripHTML="true" />
> > 
> > 
> >
> > The result of this is that the title field gets populated and indexed
> > (there
> > are no child nodes of /document/kbml/kbq), but content does not get
> indexed
> > at all. Since /document/kbml/body has many children, I expected that
> > flatten="true" would store all of the body text in the field. Instead, it
> > stores nothing at all. I've tried this with many combinations of
> > transformers and flatten options, and the result is the same each time.
> >
> >
> Which Solr version are you using? The flatten attribute was introduced
> after 1.3 was released.
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: Problems with DIH XPath flatten

2009-10-06 Thread Shalin Shekhar Mangar
On Tue, Oct 6, 2009 at 9:29 PM, Adam Foltzer  wrote:

> Hi all,
>
> I'm trying to set up DataImportHandler to index some XML documents
> available
> over web services. The XML includes both content and metadata, so for the
> indexable content, I'm trying to just index everything under the content
> tag:
>
> url="resturl" processor="XPathEntityProcessor"
>forEach="/document" transformer="HTMLStripTransformer"
> flatten="true">
>  flatten="true" stripHTML="true" />
> 
> 
>
> The result of this is that the title field gets populated and indexed
> (there
> are no child nodes of /document/kbml/kbq), but content does not get indexed
> at all. Since /document/kbml/body has many children, I expected that
> flatten="true" would store all of the body text in the field. Instead, it
> stores nothing at all. I've tried this with many combinations of
> transformers and flatten options, and the result is the same each time.
>
>
Which Solr version are you using? The flatten attribute was introduced after
1.3 was released.

-- 
Regards,
Shalin Shekhar Mangar.


Problems with DIH XPath flatten

2009-10-06 Thread Adam Foltzer
Hi all,

I'm trying to set up DataImportHandler to index some XML documents available
over web services. The XML includes both content and metadata, so for the
indexable content, I'm trying to just index everything under the content
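tag:

<entity name="kbxml"
        url="resturl" processor="XPathEntityProcessor"
        forEach="/document" transformer="HTMLStripTransformer"
        flatten="true">
  <field column="content" xpath="/document/kbml/body"
         flatten="true" stripHTML="true" />
  <field column="title" xpath="/document/kbml/kbq" />
</entity>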

The result of this is that the title field gets populated and indexed (there
are no child nodes of /document/kbml/kbq), but content does not get indexed
at all. Since /document/kbml/body has many children, I expected that
flatten="true" would store all of the body text in the field. Instead, it
stores nothing at all. I've tried this with many combinations of
transformers and flatten options, and the result is the same each time.
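
For readers unfamiliar with the term, the expectation described above can be illustrated with a small standalone Java sketch. It is not DataImportHandler code and makes no claim about how XPathEntityProcessor is implemented; it only shows, using the JDK's StAX parser, the difference between reading an element's direct text and reading the text of all its descendants ("flattening"). The class name FlattenSketch, the method textOf, and the sample XML are hypothetical; only the element names kbq and body come from this thread, while p, b, ol and li are made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class FlattenSketch {

    // Hypothetical sample: kbq and body come from the thread; p, b, ol and li are made up.
    static final String SAMPLE =
        "<document><kbml>"
      + "<kbq>How do I enable the firewall?</kbq>"
      + "<body>\n"
      + "  <p>Mac OS X includes a <b>firewall</b>.</p>\n"
      + "  <ol><li>Click the Firewall tab.</li></ol>\n"
      + "</body>"
      + "</kbml></document>";

    /**
     * Returns the whitespace-normalized text of the first element named target.
     * flatten == false: only text that is a direct child of the element.
     * flatten == true:  text of the element and of all its descendants.
     */
    static String textOf(String xml, String target, boolean flatten) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // 0 = outside target, 1 = directly inside it, >1 = inside a descendant
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (depth > 0) {
                        depth++;
                    } else if (r.getLocalName().equals(target)) {
                        depth = 1;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (depth > 0 && --depth == 0) {
                        break; // left the target element
                    }
                } else if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA
                        || event == XMLStreamConstants.SPACE) {
                    if (depth == 1 || (flatten && depth > 1)) {
                        out.append(r.getText());
                    }
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // An element with no children: direct text is all there is.
        System.out.println("[" + textOf(SAMPLE, "kbq", false) + "]");
        // prints "[How do I enable the firewall?]"

        // An element whose text lives only in child elements: direct text is empty.
        System.out.println("[" + textOf(SAMPLE, "body", false) + "]");
        // prints "[]"

        // Flattened: text of all descendants, concatenated.
        System.out.println("[" + textOf(SAMPLE, "body", true) + "]");
        // prints "[Mac OS X includes a firewall. Click the Firewall tab.]"
    }
}
```

In this sketch, the childless kbq element yields its text either way, while body yields text only when descendants are included. That matches the behavior expected from flatten="true" in the post above, but it is an illustration of the semantics only, not a diagnosis of the reported problem.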

Here are the relevant field declarations from the schema (the type="text" is
just the one from the example's schema.xml). I have tried combinations here
as well of stored= and multiValued=, with the same result each time.




If it would help troubleshooting, I could send along some sample XML. I
don't want to spam the list with an attachment unless it's necessary, though
:)

Thanks in advance for your help,

Adam Foltzer