[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-08-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149041#comment-16149041
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-326309710
 
 
   OK this is an issue. The solution is to address 
https://issues.apache.org/jira/browse/ANY23-264
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.5
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 is not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-08-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149030#comment-16149030
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-326307268
 
 
   @lewismc It still didn't work, so I just grabbed the jar file at: 
http://svn.apache.org/repos/asf/any23/repo-ext/org/apache/commons/commons-csv/1.0-SNAPSHOT-rev1148315/.
 :)
 



> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.5
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 is not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 





[jira] [Commented] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-08-31 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148906#comment-16148906
 ] 

Markus Jelsma commented on NUTCH-2411:
--

Added param to explicitly list the fields that are supposed to be split by the 
separator.
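
The behaviour described by the two properties can be sketched roughly as follows. This is a minimal illustration only, not the code from the NUTCH-2411 patch: the class name MultiValueSplitter and its method are hypothetical, and it assumes a value is split only when a separator is configured and the field appears in the multi-valued field list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical helper; not the implementation attached to NUTCH-2411. */
class MultiValueSplitter {
  private final String separator;
  private final Set<String> multiValuedFields = new HashSet<>();

  /**
   * @param separator value of index.metadata.separator (empty means "do not split")
   * @param multiValuedFieldList value of index.metadata.multivalued.fields (comma separated)
   */
  MultiValueSplitter(String separator, String multiValuedFieldList) {
    this.separator = separator == null ? "" : separator;
    if (multiValuedFieldList != null) {
      for (String field : multiValuedFieldList.split(",")) {
        if (!field.trim().isEmpty()) {
          multiValuedFields.add(field.trim());
        }
      }
    }
  }

  /** Returns the values to index for one metadata field. */
  List<String> valuesFor(String field, String rawValue) {
    // Assumption: fields that are not listed keep their value as a single value.
    if (separator.isEmpty() || !multiValuedFields.contains(field)) {
      return Collections.singletonList(rawValue);
    }
    // Pattern.quote treats the separator literally, so "|" is not read as a regex.
    return Arrays.asList(rawValue.split(Pattern.quote(separator)));
  }
}
```

With a separator of "|" and "tags" listed as multi valued, a raw value such as "a|b|c" would be indexed as three values, while any unlisted field keeps its raw value untouched.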

> Index-metadata to support indexing multiple values for a field 
> ---
>
> Key: NUTCH-2411
> URL: https://issues.apache.org/jira/browse/NUTCH-2411
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch, 
> NUTCH-2411.patch
>
>
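> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>    Separator to use if you want to index multiple values for a given field. 
>    Leave empty to treat each value as a single value.
>   </description>
> </property>
> <property>
>   <name>index.metadata.multivalued.fields</name>
>   <value></value>
>   <description>
>    Comma separated list of fields that are multi valued.
>   </description>
> </property>
> {code}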





[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-08-31 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2411:
-
Description: 
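{code}
<property>
  <name>index.metadata.separator</name>
  <value></value>
  <description>
   Separator to use if you want to index multiple values for a given field. 
   Leave empty to treat each value as a single value.
  </description>
</property>

<property>
  <name>index.metadata.multivalued.fields</name>
  <value></value>
  <description>
   Comma separated list of fields that are multi valued.
  </description>
</property>
{code}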

  was:
{code}
<property>
  <name>index.metadata.separator</name>
  <value></value>
  <description>
   Separator to use if you want to index multiple values for a given field. 
   Leave empty to treat each value as a single value.
  </description>
</property>
{code}


> Index-metadata to support indexing multiple values for a field 
> ---
>
> Key: NUTCH-2411
> URL: https://issues.apache.org/jira/browse/NUTCH-2411
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch, 
> NUTCH-2411.patch
>
>
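> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>    Separator to use if you want to index multiple values for a given field. 
>    Leave empty to treat each value as a single value.
>   </description>
> </property>
> <property>
>   <name>index.metadata.multivalued.fields</name>
>   <value></value>
>   <description>
>    Comma separated list of fields that are multi valued.
>   </description>
> </property>
> {code}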





[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-08-31 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2411:
-
Attachment: NUTCH-2411-1.13.patch

> Index-metadata to support indexing multiple values for a field 
> ---
>
> Key: NUTCH-2411
> URL: https://issues.apache.org/jira/browse/NUTCH-2411
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch, 
> NUTCH-2411.patch
>
>
> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>    Separator to use if you want to index multiple values for a given field. 
>    Leave empty to treat each value as a single value.
>   </description>
> </property>
> {code}





[jira] [Commented] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-08-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148849#comment-16148849
 ] 

ASF GitHub Bot commented on NUTCH-2415:
---

jorgelbg commented on a change in pull request #219: NUTCH-2415 : Create a JEXL 
based IndexingFilter
URL: https://github.com/apache/nutch/pull/219#discussion_r136309345
 
 

 ##
 File path: 
src/plugin/index-jexl-filter/src/test/org/apache/nutch/indexer/filter/TestJexlIndexingFilter.java
 ##
 @@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer.filter;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseImpl;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.util.NutchConfiguration;
+import org.junit.Assert;
+import org.junit.Test;
+
+public class TestJexlIndexingFilter {
+
+   @Test
+   public void testBasicIndexingFilter() throws Exception {
+   Configuration conf = NutchConfiguration.create();
+   conf.set("index.jexl.filter", "doc.lang[0]=='en'");
+
+   JexlIndexingFilter filter = new JexlIndexingFilter();
+   filter.setConf(conf);
+   Assert.assertNotNull(filter);
+
+   NutchDocument doc = new NutchDocument();
+
+   String title = "The Foo Page";
+   Outlink[] outlinks = new Outlink[] { new Outlink("http://foo.com/", "Foo") };
+   Metadata metaData = new Metadata();
+   metaData.add("Language", "en/us");
+   ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title, outlinks, metaData);
+   ParseImpl parse = new ParseImpl("this is a sample foo bar page. hope you enjoy it.", parseData);
+
+   CrawlDatum crawlDatum = new CrawlDatum();
+   crawlDatum.setFetchTime(100L);
+
+   Inlinks inlinks = new Inlinks();
+
+   doc.add("lang", "en");
+
+   try {
+   NutchDocument result = filter.filter(doc, parse, new Text("http://nutch.apache.org/index.html"), crawlDatum, inlinks);
+   Assert.assertNotNull(result);
+   Assert.assertEquals(doc, result);
+   } catch (Exception e) {
+   e.printStackTrace();
+   Assert.fail(e.getMessage());
+   }
+   
+   doc.removeField("lang");
+   doc.add("lang", "ru");
+   
+   try {
+   NutchDocument result = filter.filter(doc, parse, new Text("http://nutch.apache.org/index.html"), crawlDatum, inlinks);
+   Assert.assertNull(result);
+   } catch (Exception e) {
+   e.printStackTrace();
+   Assert.fail(e.getMessage());
 
 Review comment:
   @pipldev Yes, for testing the exception you could use something like:
   
   ```java
   @Test(expected = IndexOutOfBoundsException.class)
   public void empty() {
     new ArrayList<Object>().get(0);
   }
   ```
   The idea is that if you isolate this into a separate test, you don't need 
the try ... catch, because for that particular test you know whether it's supposed 
to throw an exception or not (hence my suggestion of having separate tests). 
So for test `A` you know it's supposed to work as expected, and you don't need 
to handle the exception because you made sure of that when setting up the data for 
the filter. And for test `B` you know that it's supposed to throw an exception 
because you made sure of that when configuring the filter, and your assertions 
will reflect that. 
   
   We try to follow the _boy scout rule_: when some code is touched/added, we try 
to leave it in a better form than we found it :). With this approach, we 
try to keep the stability of 

[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-08-31 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2411:
-
Attachment: NUTCH-2411-1.13.patch

Patch for 1.13

> Index-metadata to support indexing multiple values for a field 
> ---
>
> Key: NUTCH-2411
> URL: https://issues.apache.org/jira/browse/NUTCH-2411
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2411-1.13.patch, NUTCH-2411.patch
>
>
> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>    Separator to use if you want to index multiple values for a given field. 
>    Leave empty to treat each value as a single value.
>   </description>
> </property>
> {code}





[jira] [Commented] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-08-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148758#comment-16148758
 ] 

ASF GitHub Bot commented on NUTCH-2415:
---

pipldev commented on a change in pull request #219: NUTCH-2415 : Create a JEXL 
based IndexingFilter
URL: https://github.com/apache/nutch/pull/219#discussion_r136294069
 
 

 ##
 File path: 
src/plugin/index-jexl-filter/src/test/org/apache/nutch/indexer/filter/TestJexlIndexingFilter.java
 ##
 @@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer.filter;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseImpl;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.util.NutchConfiguration;
+import org.junit.Assert;
+import org.junit.Test;
+
+public class TestJexlIndexingFilter {
+
+   @Test
+   public void testBasicIndexingFilter() throws Exception {
+   Configuration conf = NutchConfiguration.create();
+   conf.set("index.jexl.filter", "doc.lang[0]=='en'");
+
+   JexlIndexingFilter filter = new JexlIndexingFilter();
+   filter.setConf(conf);
+   Assert.assertNotNull(filter);
+
+   NutchDocument doc = new NutchDocument();
+
+   String title = "The Foo Page";
+   Outlink[] outlinks = new Outlink[] { new Outlink("http://foo.com/", "Foo") };
+   Metadata metaData = new Metadata();
+   metaData.add("Language", "en/us");
+   ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title, outlinks, metaData);
+   ParseImpl parse = new ParseImpl("this is a sample foo bar page. hope you enjoy it.", parseData);
+
+   CrawlDatum crawlDatum = new CrawlDatum();
+   crawlDatum.setFetchTime(100L);
+
+   Inlinks inlinks = new Inlinks();
+
+   doc.add("lang", "en");
+
+   try {
+   NutchDocument result = filter.filter(doc, parse, new Text("http://nutch.apache.org/index.html"), crawlDatum, inlinks);
+   Assert.assertNotNull(result);
+   Assert.assertEquals(doc, result);
+   } catch (Exception e) {
+   e.printStackTrace();
+   Assert.fail(e.getMessage());
+   }
+   
+   doc.removeField("lang");
+   doc.add("lang", "ru");
+   
+   try {
+   NutchDocument result = filter.filter(doc, parse, new Text("http://nutch.apache.org/index.html"), crawlDatum, inlinks);
+   Assert.assertNull(result);
+   } catch (Exception e) {
+   e.printStackTrace();
+   Assert.fail(e.getMessage());
 
 Review comment:
   I'm not sure I understand your comment. If I understand correctly, I should 
just remove the try...catch, and let the Exception be thrown, since that will 
fail the test with the exception.
   For the new (third) test, where we want to create an exception, I assume we 
should use ExpectedException.
   Agreed?
   (BTW, all these problems with the test were also in the test you pointed me 
to as a template!)
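
To make the question above concrete, here is a rough sketch of what an expected-exception check does under the hood. It is written in plain Java without JUnit, so the helper class ExpectThrows and its method are hypothetical stand-ins rather than JUnit's own API; in a real JUnit 4 test, `@Test(expected = ...)` or the `ExpectedException` rule would play this role.

```java
/** Hand-rolled stand-in for an expected-exception check; JUnit itself is not used here. */
class ExpectThrows {

  /** A block of test code that is allowed to throw checked exceptions. */
  interface ThrowingBlock {
    void run() throws Exception;
  }

  /**
   * Runs the block and returns the exception if it has the expected type.
   * Fails with an AssertionError if nothing is thrown or the type is wrong.
   */
  static <T extends Throwable> T expectThrows(Class<T> type, ThrowingBlock block) {
    try {
      block.run();
    } catch (Throwable t) {
      if (type.isInstance(t)) {
        return type.cast(t);
      }
      throw new AssertionError("unexpected exception type: " + t.getClass().getName(), t);
    }
    // Reached only when the block completed normally.
    throw new AssertionError("expected " + type.getName() + " but nothing was thrown");
  }
}
```

A call such as `ExpectThrows.expectThrows(IndexOutOfBoundsException.class, () -> new java.util.ArrayList<String>().get(0))` passes, while the two tests that are expected to succeed need no try ... catch at all: any exception that escapes them already fails the test.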
 



> Create a JEXL based IndexingFilter
> --
>
> Key: NUTCH-2415
> URL: https://issues.apache.org/jira/browse/NUTCH-2415
> Project: Nutch
>  Issue Type: New Feature
>

[jira] [Commented] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-08-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148743#comment-16148743
 ] 

ASF GitHub Bot commented on NUTCH-2415:
---

jorgelbg commented on issue #219: NUTCH-2415 : Create a JEXL based 
IndexingFilter
URL: https://github.com/apache/nutch/pull/219#issuecomment-326245187
 
 
   @pipldev Added a couple of comments about the test, but it's looking really 
good! Thanks for the contribution!
 



> Create a JEXL based IndexingFilter
> --
>
> Key: NUTCH-2415
> URL: https://issues.apache.org/jira/browse/NUTCH-2415
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> Following on NUTCH-2414 and NUTCH-2412, the requirement was raised for an 
> IndexingFilter plugin which will decide whether to index a document based on 
> a JEXL expression.





[jira] [Commented] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-08-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148741#comment-16148741
 ] 

ASF GitHub Bot commented on NUTCH-2415:
---

jorgelbg commented on a change in pull request #219: NUTCH-2415 : Create a JEXL 
based IndexingFilter
URL: https://github.com/apache/nutch/pull/219#discussion_r136290454
 
 

 ##
 File path: 
src/plugin/index-jexl-filter/src/test/org/apache/nutch/indexer/filter/TestJexlIndexingFilter.java
 ##
 @@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer.filter;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseImpl;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.util.NutchConfiguration;
+import org.junit.Assert;
+import org.junit.Test;
+
+public class TestJexlIndexingFilter {
+
+   @Test
+   public void testBasicIndexingFilter() throws Exception {
+   Configuration conf = NutchConfiguration.create();
+   conf.set("index.jexl.filter", "doc.lang[0]=='en'");
+
+   JexlIndexingFilter filter = new JexlIndexingFilter();
+   filter.setConf(conf);
+   Assert.assertNotNull(filter);
+
+   NutchDocument doc = new NutchDocument();
+
+   String title = "The Foo Page";
+   Outlink[] outlinks = new Outlink[] { new Outlink("http://foo.com/", "Foo") };
+   Metadata metaData = new Metadata();
+   metaData.add("Language", "en/us");
+   ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title, outlinks, metaData);
+   ParseImpl parse = new ParseImpl("this is a sample foo bar page. hope you enjoy it.", parseData);
+
+   CrawlDatum crawlDatum = new CrawlDatum();
+   crawlDatum.setFetchTime(100L);
+
+   Inlinks inlinks = new Inlinks();
+
+   doc.add("lang", "en");
+
+   try {
+   NutchDocument result = filter.filter(doc, parse, new Text("http://nutch.apache.org/index.html"), crawlDatum, inlinks);
+   Assert.assertNotNull(result);
+   Assert.assertEquals(doc, result);
+   } catch (Exception e) {
+   e.printStackTrace();
+   Assert.fail(e.getMessage());
+   }
+   
+   doc.removeField("lang");
+   doc.add("lang", "ru");
+   
+   try {
+   NutchDocument result = filter.filter(doc, parse, new Text("http://nutch.apache.org/index.html"), crawlDatum, inlinks);
+   Assert.assertNull(result);
+   } catch (Exception e) {
+   e.printStackTrace();
+   Assert.fail(e.getMessage());
 
 Review comment:
   When this check is moved to a different test method you should use the 
exception assertion from JUnit, instead of the `try ... catch` block.
 



> Create a JEXL based IndexingFilter
> --
>
> Key: NUTCH-2415
> URL: https://issues.apache.org/jira/browse/NUTCH-2415
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> Following on NUTCH-2414 and NUTCH-2412, the requirement was raised for an 
> IndexingFilter 

[jira] [Commented] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-08-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148734#comment-16148734
 ] 

ASF GitHub Bot commented on NUTCH-2415:
---

jorgelbg commented on a change in pull request #219: NUTCH-2415 : Create a JEXL 
based IndexingFilter
URL: https://github.com/apache/nutch/pull/219#discussion_r136290007
 
 

 ##
 File path: 
src/plugin/index-jexl-filter/src/test/org/apache/nutch/indexer/filter/TestJexlIndexingFilter.java
 ##
 @@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer.filter;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseImpl;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.util.NutchConfiguration;
+import org.junit.Assert;
+import org.junit.Test;
+
+public class TestJexlIndexingFilter {
+
+   @Test
+   public void testBasicIndexingFilter() throws Exception {
 
 Review comment:
   It would be best to have a meaningful test name, something like 
`testAllowMatchingDocument`, so that if the test fails, the name itself tells us 
what the test was checking. 
 



> Create a JEXL based IndexingFilter
> --
>
> Key: NUTCH-2415
> URL: https://issues.apache.org/jira/browse/NUTCH-2415
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> Following on NUTCH-2414 and NUTCH-2412, the requirement was raised for an 
> IndexingFilter plugin which will decide whether to index a document based on 
> a JEXL expression.





[jira] [Commented] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-08-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148736#comment-16148736
 ] 

ASF GitHub Bot commented on NUTCH-2415:
---

jorgelbg commented on a change in pull request #219: NUTCH-2415 : Create a JEXL 
based IndexingFilter
URL: https://github.com/apache/nutch/pull/219#discussion_r136290035
 
 

 ##
 File path: 
src/plugin/index-jexl-filter/src/test/org/apache/nutch/indexer/filter/TestJexlIndexingFilter.java
 ##
 @@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer.filter;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseImpl;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.util.NutchConfiguration;
+import org.junit.Assert;
+import org.junit.Test;
+
+public class TestJexlIndexingFilter {
+
+   @Test
+   public void testBasicIndexingFilter() throws Exception {
+   Configuration conf = NutchConfiguration.create();
+   conf.set("index.jexl.filter", "doc.lang[0]=='en'");
+
+   JexlIndexingFilter filter = new JexlIndexingFilter();
+   filter.setConf(conf);
+   Assert.assertNotNull(filter);
+
+   NutchDocument doc = new NutchDocument();
+
+   String title = "The Foo Page";
+   Outlink[] outlinks = new Outlink[] { new Outlink("http://foo.com/", "Foo") };
+   Metadata metaData = new Metadata();
+   metaData.add("Language", "en/us");
+   ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title, outlinks, metaData);
+   ParseImpl parse = new ParseImpl("this is a sample foo bar page. hope you enjoy it.", parseData);
+
+   CrawlDatum crawlDatum = new CrawlDatum();
+   crawlDatum.setFetchTime(100L);
+
+   Inlinks inlinks = new Inlinks();
+
+   doc.add("lang", "en");
+
+   try {
+   NutchDocument result = filter.filter(doc, parse, new Text("http://nutch.apache.org/index.html"), crawlDatum, inlinks);
+   Assert.assertNotNull(result);
+   Assert.assertEquals(doc, result);
+   } catch (Exception e) {
+   e.printStackTrace();
+   Assert.fail(e.getMessage());
+   }
+   
+   doc.removeField("lang");
 
 Review comment:
   This should be a separate test case, something like 
`testBlockNotMatchingDocuments`. In tests it is OK to have a bit of repetition, 
but the idea is that each test covers one specific scenario (test case). 
   
   It would also be great if you could add a test for when the expression is 
missing or an invalid expression is entered; just checking that the proper 
exception gets triggered would do the trick.
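
The one-scenario-per-test layout suggested above might look roughly like this. It is a sketch in plain Java under stated assumptions: LangFilterSketch is a hypothetical stand-in that only compares the first "lang" value (the real plugin evaluates a JEXL expression against a NutchDocument), and rejecting a missing expression with IllegalArgumentException is an assumption, not confirmed behaviour of the plugin. Without JUnit, the third scenario needs a local try ... catch; in JUnit 4 it would use the framework's exception support instead.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the filter under review. */
class LangFilterSketch {
  private final String expectedLang;

  LangFilterSketch(String expectedLang) {
    // Assumption: a missing expression is rejected up front.
    if (expectedLang == null || expectedLang.isEmpty()) {
      throw new IllegalArgumentException("filter expression is missing");
    }
    this.expectedLang = expectedLang;
  }

  /** Returns the document if it matches, or null to drop it. */
  Map<String, List<String>> filter(Map<String, List<String>> doc) {
    List<String> lang = doc.getOrDefault("lang", Collections.emptyList());
    return !lang.isEmpty() && expectedLang.equals(lang.get(0)) ? doc : null;
  }

  /** Small fixture helper: a little repetition between tests is acceptable. */
  static Map<String, List<String>> docWithLang(String lang) {
    Map<String, List<String>> doc = new HashMap<>();
    doc.put("lang", Collections.singletonList(lang));
    return doc;
  }

  // One scenario per method; each returns true when the scenario behaves as expected.

  static boolean testAllowMatchingDocument() {
    Map<String, List<String>> doc = docWithLang("en");
    return new LangFilterSketch("en").filter(doc) == doc;
  }

  static boolean testBlockNotMatchingDocuments() {
    return new LangFilterSketch("en").filter(docWithLang("ru")) == null;
  }

  static boolean testMissingExpressionThrows() {
    try {
      new LangFilterSketch("");
      return false; // no exception: the scenario failed
    } catch (IllegalArgumentException expected) {
      return true;
    }
  }
}
```

Because each method sets up its own document and filter, a failure in one scenario points directly at the behaviour that broke, and the name of the failing method says what was being checked.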
 



> Create a JEXL based IndexingFilter
> --
>
> Key: NUTCH-2415
> URL: https://issues.apache.org/jira/browse/NUTCH-2415
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> Following on NUTCH-2414 and NUTCH-2412, the requirement was raised for an 
> IndexingFilter plugin which will decide whether to index a document based on 
> a JEXL expression.




[jira] [Assigned] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-08-31 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez reassigned NUTCH-2415:
-

Assignee: Jorge Luis Betancourt Gonzalez

> Create a JEXL based IndexingFilter
> --
>
> Key: NUTCH-2415
> URL: https://issues.apache.org/jira/browse/NUTCH-2415
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> Following on NUTCH-2414 and NUTCH-2412, the requirement was raised for an 
> IndexingFilter plugin which will decide whether to index a document based on 
> a JEXL expression.


