[GitHub] nutch pull request: NUTCH-2038

2015-06-30 Thread asfgit
GitHub user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/42


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nutch pull request: NUTCH-2038

2015-06-29 Thread asitang
GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/42

NUTCH-2038

minor changes and suggestions by Sebastian.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/42.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #42


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra 
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra 
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra 
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra 
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra 
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra 
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra 
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra 
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra 
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra 
Date:   2015-06-24T17:31:09Z

commit for 3.3 patch of NUTCH-2038

commit 5ba14790c1367deeb54d4d61f87be3d602cecedf
Author: Asitang Mishra 
Date:   2015-06-25T22:59:45Z

patch 4.0 for NUTCH-2038

commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3
Author: Asitang Mishra 
Date:   2015-06-25T23:00:40Z

patch 4.0 for NUTCH-2038

commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d
Author: Asitang Mishra 
Date:   2015-06-25T23:05:20Z

patch 4.1 for NUTCH-2038

commit 830f05bfe77abf79b2877c2a9c388fa24b3df526
Author: Asitang Mishra 
Date:   2015-06-25T23:07:44Z

patch 4.1 for NUTCH-2038

commit 5e907b1109c8e623bfcdb25b4b467dd53fbec9f3
Author: Asitang Mishra 
Date:   2015-06-28T23:51:58Z

Patch 5.0 for NUTCH-2038

commit b984cdfac2d30ef38b1aebbc0330ba7eee1e12bf
Author: Asitang Mishra 
Date:   2015-06-28T23:53:22Z

Patch 5.0 for NUTCH-2038

commit ecbd4c27ae71b8c04e011c6b7106cc1fb324e04a
Author: Asitang Mishra 
Date:   2015-06-28T23:53:52Z

Patch 5.0 for NUTCH-2038

commit aba64fc941ed7616153d19410dbe9b9a0f8ef387
Author: Asitang Mishra 
Date:   2015-06-29T00:03:43Z

Patch 5.0 for NUTCH-2038

commit 71be15df81222adc6b58b6308e1dac7db23b6386
Author: Asitang Mishra 
Date:   2015-06-29T04:21:38Z

Patch 5.1 for NUTCH-2038

commit a9465c06d59e7ed2bd13d07c128bcea574fc9d6c
Author: Asitang Mishra 
Date:   2015-06-29T14:27:02Z

Patch 5.2 for NUTCH-2038

commit 8f45e634c942df66ea9c1ee775bb216d35fabb87
Author: Asitang Mishra 
Date:   2015-06-29T19:23:39Z

patch 6.0 for NUTCH-2038

commit 97278c5a09f5d4391473185d2268c7b26f151120
Author: Asitang Mishra 
Date:   2015-06-29T19:24:34Z

patch 6.0 for NUTCH-2038

commit 9b876bc8cbad902b094d696e3df751d9f163e4b3
Author: Asitang Mishra 
Date:   2015-06-29T19:25:06Z

patch 6.0 for NUTCH-2038

commit 866486e6be337e8a1e0e5209642649a1834278d3
Author: Asitang Mishra 
Date:   2015-06-29T19:35:24Z

patch 6.1 for NUTCH-2038

commit dd159175822a476cc5889da71a19272cf733e011
Author: Asitang Mishra 
Date:   2015-06-29T20:39:55Z

patch 6.2 for NUTCH-2038






[GitHub] nutch pull request: NUTCH-2038

2015-06-29 Thread asitang
GitHub user asitang closed the pull request at:

https://github.com/apache/nutch/pull/41




[GitHub] nutch pull request: NUTCH-2038

2015-06-29 Thread asitang
GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/41

NUTCH-2038

--added specific IOException messages
--added files: 
conf/naivebayes-train.txt.template
conf/naivebayes-wordlist.txt.template

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/41.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #41


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra 
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra 
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra 
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra 
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra 
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra 
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra 
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra 
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra 
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra 
Date:   2015-06-24T17:31:09Z

commit for 3.3 patch of NUTCH-2038

commit 5ba14790c1367deeb54d4d61f87be3d602cecedf
Author: Asitang Mishra 
Date:   2015-06-25T22:59:45Z

patch 4.0 for NUTCH-2038

commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3
Author: Asitang Mishra 
Date:   2015-06-25T23:00:40Z

patch 4.0 for NUTCH-2038

commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d
Author: Asitang Mishra 
Date:   2015-06-25T23:05:20Z

patch 4.1 for NUTCH-2038

commit 830f05bfe77abf79b2877c2a9c388fa24b3df526
Author: Asitang Mishra 
Date:   2015-06-25T23:07:44Z

patch 4.1 for NUTCH-2038

commit 5e907b1109c8e623bfcdb25b4b467dd53fbec9f3
Author: Asitang Mishra 
Date:   2015-06-28T23:51:58Z

Patch 5.0 for NUTCH-2038

commit b984cdfac2d30ef38b1aebbc0330ba7eee1e12bf
Author: Asitang Mishra 
Date:   2015-06-28T23:53:22Z

Patch 5.0 for NUTCH-2038

commit ecbd4c27ae71b8c04e011c6b7106cc1fb324e04a
Author: Asitang Mishra 
Date:   2015-06-28T23:53:52Z

Patch 5.0 for NUTCH-2038

commit aba64fc941ed7616153d19410dbe9b9a0f8ef387
Author: Asitang Mishra 
Date:   2015-06-29T00:03:43Z

Patch 5.0 for NUTCH-2038

commit 71be15df81222adc6b58b6308e1dac7db23b6386
Author: Asitang Mishra 
Date:   2015-06-29T04:21:38Z

Patch 5.1 for NUTCH-2038

commit a9465c06d59e7ed2bd13d07c128bcea574fc9d6c
Author: Asitang Mishra 
Date:   2015-06-29T14:27:02Z

Patch 5.2 for NUTCH-2038

commit 8f45e634c942df66ea9c1ee775bb216d35fabb87
Author: Asitang Mishra 
Date:   2015-06-29T19:23:39Z

patch 6.0 for NUTCH-2038

commit 97278c5a09f5d4391473185d2268c7b26f151120
Author: Asitang Mishra 
Date:   2015-06-29T19:24:34Z

patch 6.0 for NUTCH-2038

commit 9b876bc8cbad902b094d696e3df751d9f163e4b3
Author: Asitang Mishra 
Date:   2015-06-29T19:25:06Z

patch 6.0 for NUTCH-2038

commit 866486e6be337e8a1e0e5209642649a1834278d3
Author: Asitang Mishra 
Date:   2015-06-29T19:35:24Z

patch 6.1 for NUTCH-2038






[GitHub] nutch pull request: NUTCH-2038

2015-06-29 Thread asitang
GitHub user asitang closed the pull request at:

https://github.com/apache/nutch/pull/40




[GitHub] nutch pull request: NUTCH-2038

2015-06-29 Thread asitang
GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/40

NUTCH-2038

added all the jars in plugin.xml

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/40.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #40


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra 
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra 
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra 
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra 
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra 
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra 
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra 
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra 
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra 
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra 
Date:   2015-06-24T17:31:09Z

commit for 3.3 patch of NUTCH-2038

commit 5ba14790c1367deeb54d4d61f87be3d602cecedf
Author: Asitang Mishra 
Date:   2015-06-25T22:59:45Z

patch 4.0 for NUTCH-2038

commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3
Author: Asitang Mishra 
Date:   2015-06-25T23:00:40Z

patch 4.0 for NUTCH-2038

commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d
Author: Asitang Mishra 
Date:   2015-06-25T23:05:20Z

patch 4.1 for NUTCH-2038

commit 830f05bfe77abf79b2877c2a9c388fa24b3df526
Author: Asitang Mishra 
Date:   2015-06-25T23:07:44Z

patch 4.1 for NUTCH-2038

commit 5e907b1109c8e623bfcdb25b4b467dd53fbec9f3
Author: Asitang Mishra 
Date:   2015-06-28T23:51:58Z

Patch 5.0 for NUTCH-2038

commit b984cdfac2d30ef38b1aebbc0330ba7eee1e12bf
Author: Asitang Mishra 
Date:   2015-06-28T23:53:22Z

Patch 5.0 for NUTCH-2038

commit ecbd4c27ae71b8c04e011c6b7106cc1fb324e04a
Author: Asitang Mishra 
Date:   2015-06-28T23:53:52Z

Patch 5.0 for NUTCH-2038

commit aba64fc941ed7616153d19410dbe9b9a0f8ef387
Author: Asitang Mishra 
Date:   2015-06-29T00:03:43Z

Patch 5.0 for NUTCH-2038

commit 71be15df81222adc6b58b6308e1dac7db23b6386
Author: Asitang Mishra 
Date:   2015-06-29T04:21:38Z

Patch 5.1 for NUTCH-2038

commit a9465c06d59e7ed2bd13d07c128bcea574fc9d6c
Author: Asitang Mishra 
Date:   2015-06-29T14:27:02Z

Patch 5.2 for NUTCH-2038






[GitHub] nutch pull request: NUTCH-2038

2015-06-28 Thread asfgit
GitHub user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/39




[GitHub] nutch pull request: NUTCH-2038

2015-06-28 Thread asitang
GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/39

NUTCH-2038

Removed the TODO comments

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/39.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #39


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra 
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra 
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra 
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra 
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra 
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra 
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra 
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra 
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra 
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra 
Date:   2015-06-24T17:31:09Z

commit for 3.3 patch of NUTCH-2038

commit 5ba14790c1367deeb54d4d61f87be3d602cecedf
Author: Asitang Mishra 
Date:   2015-06-25T22:59:45Z

patch 4.0 for NUTCH-2038

commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3
Author: Asitang Mishra 
Date:   2015-06-25T23:00:40Z

patch 4.0 for NUTCH-2038

commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d
Author: Asitang Mishra 
Date:   2015-06-25T23:05:20Z

patch 4.1 for NUTCH-2038

commit 830f05bfe77abf79b2877c2a9c388fa24b3df526
Author: Asitang Mishra 
Date:   2015-06-25T23:07:44Z

patch 4.1 for NUTCH-2038

commit 5e907b1109c8e623bfcdb25b4b467dd53fbec9f3
Author: Asitang Mishra 
Date:   2015-06-28T23:51:58Z

Patch 5.0 for NUTCH-2038

commit b984cdfac2d30ef38b1aebbc0330ba7eee1e12bf
Author: Asitang Mishra 
Date:   2015-06-28T23:53:22Z

Patch 5.0 for NUTCH-2038

commit ecbd4c27ae71b8c04e011c6b7106cc1fb324e04a
Author: Asitang Mishra 
Date:   2015-06-28T23:53:52Z

Patch 5.0 for NUTCH-2038

commit aba64fc941ed7616153d19410dbe9b9a0f8ef387
Author: Asitang Mishra 
Date:   2015-06-29T00:03:43Z

Patch 5.0 for NUTCH-2038

commit 71be15df81222adc6b58b6308e1dac7db23b6386
Author: Asitang Mishra 
Date:   2015-06-29T04:21:38Z

Patch 5.1 for NUTCH-2038






[GitHub] nutch pull request: NUTCH-2038

2015-06-28 Thread asitang
GitHub user asitang closed the pull request at:

https://github.com/apache/nutch/pull/38




[GitHub] nutch pull request: NUTCH-2038

2015-06-28 Thread asitang
GitHub user asitang commented on a diff in the pull request:

https://github.com/apache/nutch/pull/38#discussion_r33433136
  
--- Diff: src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java ---
@@ -0,0 +1,204 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.parsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts see the
+ * description of parsefilter.naivebayes.trainfile) and if found irrelevent it
+ * gives the link a second chance if it contains any of the words from the list
+ * given in parsefilter.naivebayes.wordlist. CAUTION: Set the parser.timeout to
+ * -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(NaiveBayesParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "parsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "parsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public boolean filterParse(String text) {
+
+    try {
+      return classify(text);
+    } catch (IOException e) {
+      // TODO Auto-generated catch block
+      LOG.error("Error occured while classifying:: " + text + " ::"
+          + StringUtils.stringifyException(e));
+    }
+
+    return false;
+  }
+
+  public boolean filterUrl(String url) {
+
+    return containsWord(url, wordlist);
+
+  }
+
+  public boolean classify(String text) throws IOException {
+
+    // if classified as relevent "1" then return true
+    if (NaiveBayesClassifier.classify(text).equals("1"))
+      return true;
+    return false;
+  }
+
+  public void train() throws Exception {
+    // check if the model file exists, if it does then don't train
+    if (!FileSystem.get(conf).exists(new Path("model"))) {
+      LOG.info("Training the Naive Bayes Model");
+      NaiveBayesClassifier.createModel(inputFilePath);
+    } else {
+      LOG.info("Model file already exists. Skipping training.");
+    }
+  }
+
+  public boolean containsWord(String url, ArrayList<String> wordlist) {
+    for (String word : wordlist) {
+      if (url.contains(word)) {
+        return true;
+      }
+    }
+
+    return false;
+  }
+
+  public void setConf(Configuration conf) {
+    this.conf = conf;
+    inputFilePath = conf.get(TRAINFILE_MODELFILTER);
+    dictionaryFile = conf.get(DICTFILE_MODELFILTER);
+    if (inputFilePath == null || inputFilePath.trim().length() == 0
+        || dictionaryFile == null || dictionaryFile.trim().length() == 0) {
+      String message = "ParseFilter: NaiveBayes
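
The class Javadoc in the diff above describes a two-stage decision: outlinks are judged by the relevancy of the page text, and links from a page found irrelevant get a second chance if their URL contains a word from the configured list. The following is only a minimal sketch of that idea, not the plugin's code. The class name SecondChanceSketch, the method keepOutlinks, and the Predicate standing in for the Naive Bayes classifier are all hypothetical; only the substring check mirrors containsWord from the diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of the "second chance" rule from the Javadoc.
class SecondChanceSketch {

  // Returns the outlinks to keep. If the page text is judged relevant, every
  // outlink is kept; otherwise only URLs containing a wordlist entry survive.
  // isRelevant is a stand-in (assumption) for the trained classifier.
  static List<String> keepOutlinks(String pageText, List<String> outlinks,
      Predicate<String> isRelevant, List<String> wordlist) {
    if (isRelevant.test(pageText)) {
      return new ArrayList<>(outlinks);
    }
    List<String> kept = new ArrayList<>();
    for (String url : outlinks) {
      if (containsWord(url, wordlist)) {
        kept.add(url);
      }
    }
    return kept;
  }

  // Same substring test as containsWord in the diff.
  static boolean containsWord(String url, List<String> wordlist) {
    for (String word : wordlist) {
      if (url.contains(word)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> links = List.of("http://example.org/nutch-docs",
        "http://example.org/other");
    // The lambda always answers "irrelevant", so only the wordlist decides.
    System.out.println(keepOutlinks("some page text", links,
        text -> false, List.of("nutch")));
  }
}
```

Passing the classifier as a Predicate keeps the sketch free of any training data or model file; in the plugin itself that role is played by NaiveBayesClassifier.classify, as the diff shows.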

[GitHub] nutch pull request: NUTCH-2038

2015-06-28 Thread asitang
GitHub user asitang commented on a diff in the pull request:

https://github.com/apache/nutch/pull/38#discussion_r33433090
  
--- Diff: src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java ---

[GitHub] nutch pull request: NUTCH-2038

2015-06-28 Thread chrismattmann
GitHub user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/38#discussion_r33432911
  
--- Diff: src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java ---

[GitHub] nutch pull request: NUTCH-2038

2015-06-28 Thread chrismattmann
GitHub user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/38#discussion_r33432889
  
--- Diff: src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java ---
@@ -0,0 +1,204 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.parsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts; see the
+ * description of parsefilter.naivebayes.trainfile) and if found irrelevant it
+ * gives the link a second chance if it contains any of the words from the list
+ * given in parsefilter.naivebayes.wordlist. CAUTION: Set the parser.timeout to
+ * -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(NaiveBayesParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "parsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "parsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public boolean filterParse(String text) {
+
+try {
+  return classify(text);
+} catch (IOException e) {
+  // TODO Auto-generated catch block
--- End diff --

I asked for this to be removed.




[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread asitang
GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/36

NUTCH-2038

Made aesthetic changes suggested by Chris Mattmann. Removed the dependencies
from the main ivy.xml and added them to the plugin's ivy.xml.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/36.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #36


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra 
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra 
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra 
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra 
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra 
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra 
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra 
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra 
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra 
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra 
Date:   2015-06-24T17:31:09Z

commit for 3.3 patch of NUTCH-2038






[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread asitang
Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/35




[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165638
  
--- Diff: 
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
 ---
@@ -0,0 +1,214 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.htmlparsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts; see the
+ * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevant
+ * it gives the link a second chance if it contains any of the words from the
+ * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
+ * parser.timeout to -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(NaiveBayesHTMLParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public NaiveBayesHTMLParseFilter() {
+
+  }
+
+  public boolean filterParse(String text) {
+
+try {
+  return classify(text);
+} catch (IOException e) {
+  // TODO Auto-generated catch block
+  LOG.error("Error occured while classifying:: " + text);
+
+}
+
+return false;
+  }
+
+  public boolean filterUrl(String url) {
+
+return containsWord(url, wordlist);
+
+  }
+
+  public boolean classify(String text) throws IOException {
+
+// if classified as relevent "1" then return true
+if (NaiveBayesClassifier.classify(text).equals("1"))
+  return true;
+return false;
+  }
+
+  public void train() throws Exception {
+// check if the model file exists, if it does then don't train
+if (!FileSystem.get(conf).exists(new Path("model"))) {
+  LOG.info("Training the Naive Bayes Model");
+  NaiveBayesClassifier.createModel(inputFilePath);
+} else {
+  LOG.info("Model already exists. Skipping training.");
+}
+  }
+
+  public boolean containsWord(String url, ArrayList<String> wordlist) {
+for (String word : wordlist) {
+  if (url.contains(word)) {
+return true;
+  }
+}
+
+return false;
+  }
+
+  public void setConf(Configuration conf) {
+this.conf = conf;
+inputFilePath = conf.get(TRAINFILE_MODELFILTER);
+dictionaryFile = conf.get(DICTFILE_MODELFILTER);
+if (inputFilePath == null || inputFilePath.trim().length() == 0
+|| dictionaryFile == null || dictionaryFile.trim().length() == 0) {
+  String message = "Model URLFilter: t

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165623
  
--- Diff: 
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
 ---
@@ -0,0 +1,214 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.htmlparsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts; see the
+ * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevant
+ * it gives the link a second chance if it contains any of the words from the
+ * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
+ * parser.timeout to -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(NaiveBayesHTMLParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public NaiveBayesHTMLParseFilter() {
+
+  }
+
+  public boolean filterParse(String text) {
+
+try {
+  return classify(text);
+} catch (IOException e) {
+  // TODO Auto-generated catch block
+  LOG.error("Error occured while classifying:: " + text);
+
+}
+
+return false;
+  }
+
+  public boolean filterUrl(String url) {
+
+return containsWord(url, wordlist);
+
+  }
+
+  public boolean classify(String text) throws IOException {
+
+// if classified as relevent "1" then return true
+if (NaiveBayesClassifier.classify(text).equals("1"))
+  return true;
+return false;
+  }
+
+  public void train() throws Exception {
+// check if the model file exists, if it does then don't train
+if (!FileSystem.get(conf).exists(new Path("model"))) {
+  LOG.info("Training the Naive Bayes Model");
+  NaiveBayesClassifier.createModel(inputFilePath);
+} else {
+  LOG.info("Model already exists. Skipping training.");
+}
+  }
+
+  public boolean containsWord(String url, ArrayList<String> wordlist) {
+for (String word : wordlist) {
+  if (url.contains(word)) {
+return true;
+  }
+}
+
+return false;
+  }
+
+  public void setConf(Configuration conf) {
+this.conf = conf;
+inputFilePath = conf.get(TRAINFILE_MODELFILTER);
+dictionaryFile = conf.get(DICTFILE_MODELFILTER);
+if (inputFilePath == null || inputFilePath.trim().length() == 0
+|| dictionaryFile == null || dictionaryFile.trim().length() == 0) {
+  String message = "Model URLFilter: t

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165581
  
--- Diff: 
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
 ---
@@ -0,0 +1,214 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.htmlparsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts; see the
+ * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevant
+ * it gives the link a second chance if it contains any of the words from the
+ * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
+ * parser.timeout to -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(NaiveBayesHTMLParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public NaiveBayesHTMLParseFilter() {
+
+  }
+
+  public boolean filterParse(String text) {
+
+try {
+  return classify(text);
+} catch (IOException e) {
+  // TODO Auto-generated catch block
+  LOG.error("Error occured while classifying:: " + text);
+
+}
+
+return false;
+  }
+
+  public boolean filterUrl(String url) {
+
+return containsWord(url, wordlist);
+
+  }
+
+  public boolean classify(String text) throws IOException {
+
+// if classified as relevent "1" then return true
+if (NaiveBayesClassifier.classify(text).equals("1"))
+  return true;
+return false;
+  }
+
+  public void train() throws Exception {
+// check if the model file exists, if it does then don't train
+if (!FileSystem.get(conf).exists(new Path("model"))) {
+  LOG.info("Training the Naive Bayes Model");
+  NaiveBayesClassifier.createModel(inputFilePath);
+} else {
+  LOG.info("Model already exists. Skipping training.");
+}
+  }
+
+  public boolean containsWord(String url, ArrayList<String> wordlist) {
+for (String word : wordlist) {
+  if (url.contains(word)) {
+return true;
+  }
+}
+
+return false;
+  }
+
+  public void setConf(Configuration conf) {
+this.conf = conf;
+inputFilePath = conf.get(TRAINFILE_MODELFILTER);
+dictionaryFile = conf.get(DICTFILE_MODELFILTER);
+if (inputFilePath == null || inputFilePath.trim().length() == 0
+|| dictionaryFile == null || dictionaryFile.trim().length() == 0) {
+  String message = "Model URLFilter: t

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165500
  
--- Diff: 
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
 ---
@@ -0,0 +1,214 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.htmlparsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts; see the
+ * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevant
+ * it gives the link a second chance if it contains any of the words from the
+ * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
+ * parser.timeout to -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(NaiveBayesHTMLParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public NaiveBayesHTMLParseFilter() {
+
+  }
+
+  public boolean filterParse(String text) {
+
+try {
+  return classify(text);
+} catch (IOException e) {
+  // TODO Auto-generated catch block
--- End diff --

remove




[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165528
  
--- Diff: 
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
 ---
@@ -0,0 +1,214 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.htmlparsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts; see the
+ * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevant
+ * it gives the link a second chance if it contains any of the words from the
+ * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
+ * parser.timeout to -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(NaiveBayesHTMLParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public NaiveBayesHTMLParseFilter() {
+
+  }
+
+  public boolean filterParse(String text) {
+
+try {
+  return classify(text);
+} catch (IOException e) {
+  // TODO Auto-generated catch block
+  LOG.error("Error occured while classifying:: " + text);
--- End diff --

maybe print the e's stack trace too?
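
A minimal sketch of what logging the stack trace could look like. It uses java.util.logging only so that it runs without extra dependencies; the class name, the stand-in classify method and its keyword check are hypothetical and not taken from the patch. With the SLF4J logger used in the patch, the equivalent would be to pass the exception as the last argument, i.e. LOG.error(message, e).

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for the filter under review; names are illustrative.
class StackTraceLoggingSketch {

  private static final Logger LOG =
      Logger.getLogger(StackTraceLoggingSketch.class.getName());

  // Assumed classifier call that may fail with an IOException.
  static boolean classify(String text) throws IOException {
    if (text == null) {
      throw new IOException("no text to classify");
    }
    return text.contains("nutch");
  }

  // Returns false when classification fails, after logging the exception.
  static boolean filterParse(String text) {
    try {
      return classify(text);
    } catch (IOException e) {
      // Passing the exception object (not only a message) lets the logger
      // print the stack trace together with the message.
      LOG.log(Level.SEVERE, "Error occurred while classifying: " + text, e);
      return false;
    }
  }
}
```

Returning false from inside the catch block keeps the original behaviour (a failed classification counts as "not relevant") while making the cause visible in the log.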




[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165388
  
--- Diff: ivy/ivy.xml ---
@@ -78,7 +78,11 @@
 
  
 
-
+
+
--- End diff --

These dependencies should go into htmlparsefilter-naivebayes/ivy.xml,
not the main one. I mentioned this last time.




[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165405
  
--- Diff: ivy/ivy.xml ---
@@ -100,6 +104,8 @@



+   
--- End diff --

This should also go into the plugin's ivy.xml.




[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165338
  
--- Diff: ivy/ivy.xml ---
@@ -78,7 +78,11 @@
 
  
 
-
+
--- End diff --

extraneous.




[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165265
  
--- Diff: conf/nutch-default.xml ---
@@ -1208,6 +1208,28 @@
 
 
 
+  htmlparsefilter.naivebayes.trainfile
+  
+  Set the name of the file to be used for Naive Bayes training. The format will be: 
+Each line contains two tab separated parts
+There are two columns/parts:
+1. "1" or "0", "1" for relevant and "0" for irrelevant document.
+2. Text (text that will be used for training)
+
+Each row will be considered a new "document" for the classifier.
+CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this classifier.
+
+  
+
+
+
+  htmlparsefilter.naivebayes.wordlist
+  
+  Put the name of the file you want to be used as a list of important words to be matched in the url for the model filter. The format should be one word per line.
--- End diff --

Can you insert line breaks at around 80 characters so this doesn't run off
the screen? Thanks @asitang
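
For reference, the two file formats described in the property descriptions quoted above (a "1" or "0" label, a tab, then the training text; and a word list with one word per line) could be read roughly as follows. This is only a sketch under those stated formats: the class, the Example type and both method names are hypothetical and do not come from the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for the two formats; not the plugin's actual code.
class TrainingLineSketch {

  // One training row: the label ("1" relevant, "0" irrelevant) plus its text.
  static final class Example {
    final boolean relevant;
    final String text;

    Example(boolean relevant, String text) {
      this.relevant = relevant;
      this.text = text;
    }
  }

  // Parses a single "<label>TAB<text>" row of the training file.
  static Example parseLine(String line) {
    int tab = line.indexOf('\t');
    if (tab < 0) {
      throw new IllegalArgumentException("expected <label>TAB<text>: " + line);
    }
    String label = line.substring(0, tab).trim();
    if (!label.equals("1") && !label.equals("0")) {
      throw new IllegalArgumentException("label must be 1 or 0: " + label);
    }
    return new Example(label.equals("1"), line.substring(tab + 1));
  }

  // Parses the word list: one word per line, blank lines skipped.
  static List<String> parseWordlist(String contents) {
    List<String> words = new ArrayList<>();
    for (String line : contents.split("\r?\n")) {
      String word = line.trim();
      if (!word.isEmpty()) {
        words.add(word);
      }
    }
    return words;
  }
}
```

Rejecting rows without a tab or with an unknown label is a design choice of this sketch; the quoted description does not say how malformed rows are handled.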




[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165299
  
--- Diff: conf/nutch-default.xml ---
@@ -1258,6 +1280,7 @@
 
 
 
+
--- End diff --

Extraneous, not needed.




[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread asitang
Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/34




[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread asitang
GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/35

NUTCH-2038



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/35.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #35


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra 
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra 
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra 
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra 
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra 
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra 
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra 
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra 
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra 
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038






[GitHub] nutch pull request: NUTCH-2038

2015-06-19 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/34#discussion_r32870349
  
--- Diff: src/java/org/apache/nutch/parse/ParseSegment.java ---
@@ -140,6 +177,37 @@ public void map(WritableComparable key, Content 
content,
   LOG.warn("Error passing score: " + url + ": " + e.getMessage());
 }
   }
+  
+  if (filterflag) {
+
+if (!filter.filterParse(parse.getText())) { // kick in the second 
tier
+// if parent page found
+// irrelevent
+  LOG.info("ModelURLFilter: Page found irrelevent:: " + url);
--- End diff --

This needs to be insulated to the plugin.




[GitHub] nutch pull request: NUTCH-2038

2015-06-19 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/34#discussion_r32870336
  
--- Diff: src/java/org/apache/nutch/parse/ParseSegment.java ---
@@ -69,6 +77,35 @@ public void configure(JobConf job) {
 setConf(job);
 this.scfilters = new ScoringFilters(job);
 skipTruncated = job.getBoolean(SKIP_TRUNCATED, true);
+
+filterflag = job.getBoolean(PARSER_MODELFILTER, true);
+if (filterflag) {
+  String[] args = new String[2];
+  args[0] = getConf().get(TRAINFILE_MODELFILTER);
+  args[1] = getConf().get(DICTFILE_MODELFILTER);
+
+  if (args[0] == null || args[0].trim().length() == 0 || args[1] == 
null
+  || args[1].trim().length() == 0) {
+String message = "Model URLFilter: trainfile or wordlist not set 
in the urlfilter.model.trainfile or urlfilter.model.wordlist";
+if (LOG.isErrorEnabled()) {
+  filterflag = false;
+  LOG.error(message);
+}
+throw new IllegalArgumentException(message);
+  } else {
+try {
+  filters = new URLFilters(job);
+  filter = (ModelURLFilterAbstract) filters
--- End diff --

This ties us to a specific filter, the ModelURLFilter, in the core Nutch 
classes. Why can't the URL filter simply be insulated to the plugin? This 
shouldn't have to touch the Nutch core.
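
The dependency direction asked for here can be sketched roughly as follows. This is only an illustration under stated assumptions: `ParseTextFilter`, `KeywordTextFilter` and `ParseTextFilters` are hypothetical names, and the sketch does not use Nutch's actual plugin or extension-point API. The point is that core code refers only to a small interface, while the concrete filter lives elsewhere.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical core-side contract: core code knows only this interface. */
interface ParseTextFilter {
  /** Returns true if the parsed text should be kept. */
  boolean accept(String text);
}

/** Hypothetical plugin-side implementation; the core never names this class. */
final class KeywordTextFilter implements ParseTextFilter {
  private final String keyword;

  KeywordTextFilter(String keyword) {
    this.keyword = keyword;
  }

  @Override
  public boolean accept(String text) {
    return text != null && text.contains(keyword);
  }
}

/** Hypothetical core-side chain that runs whatever filters were registered. */
final class ParseTextFilters {
  private final List<ParseTextFilter> filters = new ArrayList<>();

  void register(ParseTextFilter filter) {
    filters.add(filter);
  }

  /** Text is accepted only if every registered filter accepts it. */
  boolean accept(String text) {
    for (ParseTextFilter filter : filters) {
      if (!filter.accept(text)) {
        return false;
      }
    }
    return true;
  }
}
```

With this shape, removing the plugin leaves an empty chain that accepts everything, so the core keeps working without any model-specific code.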




[GitHub] nutch pull request: NUTCH-2038

2015-06-19 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/34#discussion_r32869857
  
--- Diff: ivy/ivy.xml ---
@@ -78,7 +78,11 @@
 
  
 
-
+
+
--- End diff --

can these just be in the plugin's ivy.xml?




[GitHub] nutch pull request: NUTCH-2038

2015-06-19 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/34#discussion_r32869870
  
--- Diff: src/java/org/apache/nutch/net/URLFilters.java ---
@@ -41,4 +42,28 @@ public String filter(String urlString) throws 
URLFilterException {
 }
 return urlString;
   }
+
+  /**
+   * Get a filter with the full classname if only it is activated through 
the
+   * nutch-site.xml
+   */
+  public URLFilter getFilter(String classname) {
--- End diff --

this is orthogonal to this patch, no?




[GitHub] nutch pull request: NUTCH-2038

2015-06-19 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/34#discussion_r32869372
  
--- Diff: conf/nutch-default.xml ---
@@ -1259,6 +1259,34 @@
 
 
 
+  urlfilter.model.trainfile
+  
+  Set the name of the file to be used for Naive Bayes 
training. The format will be: 
+Each line contains two tab seperted parts
+There are two columns/parts:
+1. "1" or "0", "1" for relevent and "0" for irrelevent document.
+3. Text (text that will be used for training)
+
+Each row will be considered a new "document" for the classifier.
+
+  
+
+
+
+  urlfilter.model.wordlist
+  
+  Put the name of the file you want to be used as a list of 
"hot words" to be matched in the url for the model filter. The format should be 
one word per line.
+  
+
+
+
+  urlfilter.model.filter
+  false
+  A boolean. Set it to true if using the model filter.
--- End diff --

What does it mean to use the model filter (or not)? What are the 
implications of using it, or of not using it?




[GitHub] nutch pull request: NUTCH-2038

2015-06-19 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/34#discussion_r32869350
  
--- Diff: conf/nutch-default.xml ---
@@ -1259,6 +1259,34 @@
 
 
 
+  urlfilter.model.trainfile
+  
+  Set the name of the file to be used for Naive Bayes 
training. The format will be: 
+Each line contains two tab seperted parts
+There are two columns/parts:
+1. "1" or "0", "1" for relevent and "0" for irrelevent document.
+3. Text (text that will be used for training)
+
+Each row will be considered a new "document" for the classifier.
+
+  
+
+
+
+  urlfilter.model.wordlist
+  
+  Put the name of the file you want to be used as a list of 
"hot words" to be matched in the url for the model filter. The format should be 
one word per line.
--- End diff --

Formatting: this is a run-on line. Also, please define "hot".
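
One possible reading of "hot words", sketched under assumptions: the thread does not say how the words are matched, so this sketch assumes a case-insensitive substring match against the URL. `HotWordMatcher` is a hypothetical name, not a class from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical matcher for a one-word-per-line "hot words" list. */
final class HotWordMatcher {
  private final List<String> words = new ArrayList<>();

  /** Builds the matcher from file content, one word per line; blank lines are skipped. */
  HotWordMatcher(String wordListContent) {
    for (String line : wordListContent.split("\\R")) {
      String word = line.trim().toLowerCase(Locale.ROOT);
      if (!word.isEmpty()) {
        words.add(word);
      }
    }
  }

  /** Assumption: a URL matches when it contains any listed word, ignoring case. */
  boolean matches(String url) {
    String lower = url.toLowerCase(Locale.ROOT);
    for (String word : words) {
      if (lower.contains(word)) {
        return true;
      }
    }
    return false;
  }
}
```

Whatever rule the plugin really applies (substring, token, or something else) is what the property description would need to state.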




[GitHub] nutch pull request: NUTCH-2038

2015-06-19 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/34#discussion_r32869303
  
--- Diff: conf/nutch-default.xml ---
@@ -1259,6 +1259,34 @@
 
 
 
+  urlfilter.model.trainfile
+  
+  Set the name of the file to be used for Naive Bayes 
training. The format will be: 
+Each line contains two tab seperted parts
--- End diff --

spell-check: separated, not seperted.
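
As a minimal sketch of the two-column format the quoted description documents (a label of "1" or "0", a tab, then the training text), one row could be parsed as below. `TrainingLine` and `parse` are hypothetical names; how the patch itself reads the file is not shown in this thread.

```java
import java.util.Optional;

/** Hypothetical holder for one row of the tab-separated training file. */
final class TrainingLine {
  final boolean relevant; // true for label "1", false for label "0"
  final String text;      // text used for training

  private TrainingLine(boolean relevant, String text) {
    this.relevant = relevant;
    this.text = text;
  }

  /**
   * Parses "label<TAB>text". Returns empty for lines that do not match
   * the two-column format, rather than guessing at a label.
   */
  static Optional<TrainingLine> parse(String line) {
    if (line == null) {
      return Optional.empty();
    }
    // Split on the first tab only, so tabs inside the text are preserved.
    String[] parts = line.split("\t", 2);
    if (parts.length != 2) {
      return Optional.empty();
    }
    String label = parts[0].trim();
    if (!label.equals("1") && !label.equals("0")) {
      return Optional.empty();
    }
    return Optional.of(new TrainingLine(label.equals("1"), parts[1]));
  }
}
```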




[GitHub] nutch pull request: NUTCH-2038

2015-06-19 Thread asitang
GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/34

NUTCH-2038



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/34.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #34


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra 
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra 
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra 
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra 
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra 
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038






[GitHub] nutch pull request: NUTCH-2038

2015-06-19 Thread asitang
Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/32




[GitHub] nutch pull request: NUTCH-2038

2015-06-18 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32798921
  
--- Diff: src/plugin/urlfilter-model/src/java/org/apache/nutch/urlfilter/model/NBClassifier.java ---
@@ -0,0 +1,234 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.urlfilter.model;
+
+import java.io.BufferedReader;
+import java.io.FileReader;
+import java.io.IOException;
+import java.io.StringReader;
+import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.SequenceFile;
+import org.apache.hadoop.io.SequenceFile.Writer;
+import org.apache.hadoop.io.Text;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.standard.StandardAnalyzer;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.apache.lucene.util.Version;
+import org.apache.mahout.classifier.naivebayes.BayesUtils;
+import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
+import 
org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
+import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;
+import org.apache.mahout.common.Pair;
+import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
+import org.apache.mahout.math.RandomAccessSparseVector;
+import org.apache.mahout.math.Vector;
+import org.apache.mahout.math.Vector.Element;
+import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;
+import org.apache.mahout.vectorizer.TFIDF;
+
+import com.google.common.collect.ConcurrentHashMultiset;
+import com.google.common.collect.Multiset;
+
+public class NBClassifier {
+
+   public static Map readDictionnary(Configuration conf,
+   Path dictionnaryPath) {
+   Map dictionnary = new HashMap();
+   for (Pair pair : new 
SequenceFileIterable(
+   dictionnaryPath, true, conf)) {
+   dictionnary.put(pair.getFirst().toString(), 
pair.getSecond().get());
+   }
+   return dictionnary;
+   }
+
+   public static Map readDocumentFrequency(Configuration 
conf,
+   Path documentFrequencyPath) {
+   Map documentFrequency = new HashMap();
+   for (Pair pair : new 
SequenceFileIterable(
+   documentFrequencyPath, true, conf)) {
+   documentFrequency
+   .put(pair.getFirst().get(), 
pair.getSecond().get());
+   }
+   return documentFrequency;
+   }
+
+   public static void createModel(String inputTrainFilePath) throws 
Exception {
+
+   String[] args1 = new String[4];
+
+   args1[0] = "-i";
+   args1[1] = "outseq";
+   args1[2] = "-o";
+   args1[3] = "vectors";
+
+   String[] args2 = new String[9];
+
+   args2[0] = "-i";
+   args2[1] = "vectors/tfidf-vectors";
+   args2[2] = "-el";
+   args2[3] = "-li";
+   args2[4] = "labelindex";
+   args2[5] = "-o";
+   args2[6] = "model";
+   args2[7] = "-ow";
+   args2[8] = "-c";
+
+   convertToSeq(inputTrainFilePath, "outseq");
+
+   SparseVectorsFromSequenceFiles.main(args1);
+
+   TrainNaiveBayesJob.main(args2);
+   }
+
+   public static String classify(String text) throws IOException {
+   return classify(text, "model", "labelindex",
+   "vectors/dictionary.file-0", 
"vectors/df-count/part-r-0");
+   

[GitHub] nutch pull request: NUTCH-2038

2015-06-18 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32798910
  
--- Diff: src/plugin/urlfilter-model/src/java/org/apache/nutch/urlfilter/model/NBClassifier.java ---
@@ -0,0 +1,234 @@
+/**
--- End diff --

+1




[GitHub] nutch pull request: NUTCH-2038

2015-06-18 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32798896
  
--- Diff: src/java/org/apache/nutch/parse/ParseSegment.java ---
@@ -56,6 +57,14 @@
   private ParseUtil parseUtil;
 
   private boolean skipTruncated;
+  
+  public static final String PARSER_MODELFILTER="parser.modelfilter";
--- End diff --

Yeah, this should be insulated to the plugin.




[GitHub] nutch pull request: NUTCH-2038

2015-06-18 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32798873
  
--- Diff: ivy/ivy.xml ---
@@ -78,7 +78,11 @@
 
  
 
-
+
+
--- End diff --

Why not just put this in the plugin's ivy.xml? We can do that, Asitang and 
Lewis, right? @lewismc @asitang 




[GitHub] nutch pull request: NUTCH-2038

2015-06-18 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32741673
  
--- Diff: ivy/ivy.xml ---
@@ -78,7 +78,11 @@
 
  
 
-
+
+
--- End diff --

Hi Asitang,
I get your point; however, I am also trying to help you get your patch into
the codebase. Nutch is a crawler... Adding machine learning and indexing
components such as Mahout (what if someone does not wish to use Mahout?) and
Lucene (what if someone wishes to use ES?) back into the core codebase
dependency tree is, on this occasion, not the right way to go.
If you can please send a new pull request, we can take a look. Excellent work,
thank you :)

On Thursday, June 18, 2015, asitang  wrote:

> In ivy/ivy.xml:
>
> > @@ -78,7 +78,11 @@
> >  
> >  
> >  
> > -
> > +
> > +
>
> Was trying to pave the way for a machine learning library into nutch, so
> that anyone can use that in future
>


-- 
*Lewis*





[GitHub] nutch pull request: NUTCH-2038

2015-06-18 Thread asitang
Github user asitang commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32741196
  
--- Diff: ivy/ivy.xml ---
@@ -78,7 +78,11 @@
 
  
 
-
+
+
--- End diff --

I was trying to pave the way for a machine learning library in Nutch, so 
that anyone can use it in the future.




[GitHub] nutch pull request: NUTCH-2038

2015-06-18 Thread asitang
Github user asitang commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32742390
  
--- Diff: conf/nutch-default.xml ---
@@ -1136,6 +1136,28 @@
 
 
 
+  parser.modelfilter.trainfile
+  tweets-train.tsv
+  
--- End diff --

I will update the code for all the formatting and info changes. I had 
already mentioned this in JIRA. This is an "Uncouth" patch just to show the 
flow and the proof of concept. Thanks for the formatting pointers though :)




[GitHub] nutch pull request: Nutch 2038

2015-06-18 Thread asitang
Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/32




[GitHub] nutch pull request: Nutch 2038

2015-06-18 Thread asitang
GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/32

Nutch 2038



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/32.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #32


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra 
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra 
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra 
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038






[GitHub] nutch pull request: Nutch 2038

2015-06-18 Thread asitang
Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/31




[GitHub] nutch pull request: Nutch 2038

2015-06-18 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32702482
  
--- Diff: src/java/org/apache/nutch/parse/ModelURLFilterAbstract.java ---
@@ -0,0 +1,12 @@
+package org.apache.nutch.parse;
--- End diff --

We need license headers




[GitHub] nutch pull request: Nutch 2038

2015-06-18 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32702463
  
--- Diff: ivy/ivy.xml ---
@@ -78,7 +78,11 @@
 
  
 
-
+
+
--- End diff --

The Mahout and Lucene dependencies cannot be included in main ivy.xml if 
this is to be a plugin.




[GitHub] nutch pull request: Nutch 2038

2015-06-18 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32702499
  
--- Diff: conf/nutch-default.xml ---
@@ -1136,6 +1136,28 @@
 
 
 
+  parser.modelfilter.trainfile
+  tweets-train.tsv
+  
--- End diff --

These all need descriptions. If they do not have descriptions then no one 
has a clue what they do.




[GitHub] nutch pull request: Nutch 2038

2015-06-18 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32702537
  
--- Diff: src/java/org/apache/nutch/net/URLFilters.java ---
@@ -41,4 +41,24 @@ public String filter(String urlString) throws 
URLFilterException {
 }
 return urlString;
   }
+/**Get a filter with the full classname if only it is activated through 
the nutchsite.xml*/
--- End diff --

This is messy Javadoc. You should make an effort to reference the 
configuration file properly, e.g. nutch-site.xml, and include descriptions 
of the method parameter(s).
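
A sketch of what tidier Javadoc for such a lookup method might look like, with the configuration file name marked as code and the parameter and return value described. `FilterRegistry` and `activate` are hypothetical and exist only to make the example self-contained; the real method sits in `URLFilters` and its body is not shown in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical registry, used only to give the Javadoc something to document. */
final class FilterRegistry {
  private final Map<String, Object> activeFilters = new LinkedHashMap<>();

  /** Registers a filter instance under its fully qualified class name. */
  void activate(Object filter) {
    activeFilters.put(filter.getClass().getName(), filter);
  }

  /**
   * Looks up an activated filter by its fully qualified class name.
   *
   * <p>In this sketch "activated" means registered via {@link #activate(Object)};
   * in the patch it refers to activation through {@code nutch-site.xml}.
   *
   * @param classname
   *          fully qualified class name of the filter to look up
   * @return the matching filter, or {@code null} if no activated filter has
   *         that class name
   */
  Object getFilter(String classname) {
    return activeFilters.get(classname);
  }
}
```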




[GitHub] nutch pull request: Nutch 2038

2015-06-18 Thread asitang
GitHub user asitang reopened a pull request:

https://github.com/apache/nutch/pull/32

Nutch 2038



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/32.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #32


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra 
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra 
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra 
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038






[GitHub] nutch pull request: Nutch 2038

2015-06-18 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32702634
  
--- Diff: src/java/org/apache/nutch/parse/ParseSegment.java ---
@@ -56,6 +57,14 @@
   private ParseUtil parseUtil;
 
   private boolean skipTruncated;
+  
+  public static final String PARSER_MODELFILTER="parser.modelfilter";
--- End diff --

I would argue that the keys are unsuitable. As this is meant to be a 
URLFilter, the keys should reflect that: not parser-related, but 
URLFilter-specific.
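
A minimal sketch of URLFilter-scoped keys, using the `urlfilter.model.*` names that appear in the later nutch-default.xml revision quoted elsewhere in this thread. `ModelFilterKeys` is a hypothetical holder class, and `java.util.Properties` stands in for the Hadoop `Configuration` the real code uses.

```java
import java.util.Properties;

/** Hypothetical holder for URLFilter-scoped configuration keys. */
final class ModelFilterKeys {
  // Key names as quoted from the later nutch-default.xml revision in this thread.
  static final String ENABLED = "urlfilter.model.filter";
  static final String TRAIN_FILE = "urlfilter.model.trainfile";
  static final String WORD_LIST = "urlfilter.model.wordlist";

  private ModelFilterKeys() {
  }

  /** Reads the on/off flag; defaults to false when the key is absent. */
  static boolean isEnabled(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(ENABLED, "false"));
  }
}
```

Keeping the prefix aligned with the plugin name makes it clear which component owns each property.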




[GitHub] nutch pull request: Nutch 2038

2015-06-18 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32702839
  
--- Diff: src/plugin/urlfilter-model/src/java/org/apache/nutch/urlfilter/model/NBClassifier.java ---
@@ -0,0 +1,234 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.urlfilter.model;
+
+import java.io.BufferedReader;
+import java.io.FileReader;
+import java.io.IOException;
+import java.io.StringReader;
+import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.SequenceFile;
+import org.apache.hadoop.io.SequenceFile.Writer;
+import org.apache.hadoop.io.Text;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.standard.StandardAnalyzer;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.apache.lucene.util.Version;
+import org.apache.mahout.classifier.naivebayes.BayesUtils;
+import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
+import 
org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
+import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;
+import org.apache.mahout.common.Pair;
+import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
+import org.apache.mahout.math.RandomAccessSparseVector;
+import org.apache.mahout.math.Vector;
+import org.apache.mahout.math.Vector.Element;
+import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;
+import org.apache.mahout.vectorizer.TFIDF;
+
+import com.google.common.collect.ConcurrentHashMultiset;
+import com.google.common.collect.Multiset;
+
+public class NBClassifier {
+
+   public static Map readDictionnary(Configuration conf,
+   Path dictionnaryPath) {
+   Map dictionnary = new HashMap();
+   for (Pair pair : new 
SequenceFileIterable(
--- End diff --

Formatting in this file is all over the place; in Nutch we use 2-space 
indents.




[GitHub] nutch pull request: Nutch 2038

2015-06-18 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32702649
  
--- Diff: src/java/org/apache/nutch/parse/ParseSegment.java ---
@@ -140,6 +161,29 @@ public void map(WritableComparable key, Content 
content,
   LOG.warn("Error passing score: " + url + ": " + e.getMessage());
 }
   }
+  
+if(filterflag){
+  
+ 
--- End diff --

Code formatting is all over the place here. We have 2 space indents.




[GitHub] nutch pull request: Nutch 2038

2015-06-18 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32702733
  
--- Diff: src/plugin/urlfilter-model/src/java/org/apache/nutch/urlfilter/model/ModelURLFilter.java ---
@@ -0,0 +1,158 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.urlfilter.model;
+
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.parse.ModelURLFilterAbstract;
+
+
+import java.io.Reader;
+
+import java.io.BufferedReader;
+
+import java.io.IOException;
+
+import java.util.ArrayList;
+
+/**
+ * Filters URLs based on a file of URL prefixes. The file is named by (1)
+ * property "urlfilter.prefix.file" in ./conf/nutch-default.xml, and (2)
+ * attribute "file" in plugin.xml of this plugin Attribute "file" has higher
+ * precedence if defined.
+ * 
+ * 
+ * The format of this file is one URL prefix per line.
+ * 
+ */
+public class ModelURLFilter extends ModelURLFilterAbstract {
+
--- End diff --

Your formatting is all over the place. We use 2-space indents in the Nutch codebase.




[GitHub] nutch pull request: Nutch 2038

2015-06-18 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/32#discussion_r32702851
  
--- Diff: src/plugin/urlfilter-model/src/java/org/apache/nutch/urlfilter/model/NBClassifier.java ---
@@ -0,0 +1,234 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.urlfilter.model;
+
+import java.io.BufferedReader;
+import java.io.FileReader;
+import java.io.IOException;
+import java.io.StringReader;
+import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.SequenceFile;
+import org.apache.hadoop.io.SequenceFile.Writer;
+import org.apache.hadoop.io.Text;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.standard.StandardAnalyzer;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.apache.lucene.util.Version;
+import org.apache.mahout.classifier.naivebayes.BayesUtils;
+import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
+import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
+import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;
+import org.apache.mahout.common.Pair;
+import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
+import org.apache.mahout.math.RandomAccessSparseVector;
+import org.apache.mahout.math.Vector;
+import org.apache.mahout.math.Vector.Element;
+import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;
+import org.apache.mahout.vectorizer.TFIDF;
+
+import com.google.common.collect.ConcurrentHashMultiset;
+import com.google.common.collect.Multiset;
+
+public class NBClassifier {
+
+   public static Map<String, Integer> readDictionnary(Configuration conf,
+   Path dictionnaryPath) {
+   Map<String, Integer> dictionnary = new HashMap<String, Integer>();
+   for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(
+   dictionnaryPath, true, conf)) {
+   dictionnary.put(pair.getFirst().toString(), pair.getSecond().get());
+   }
+   return dictionnary;
+   }
+
+   public static Map<Integer, Long> readDocumentFrequency(Configuration conf,
+   Path documentFrequencyPath) {
+   Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
+   for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(
+   documentFrequencyPath, true, conf)) {
+   documentFrequency
+   .put(pair.getFirst().get(), pair.getSecond().get());
+   }
+   return documentFrequency;
+   }
+
+   public static void createModel(String inputTrainFilePath) throws Exception {
+
+   String[] args1 = new String[4];
+
+   args1[0] = "-i";
+   args1[1] = "outseq";
+   args1[2] = "-o";
+   args1[3] = "vectors";
+
+   String[] args2 = new String[9];
+
+   args2[0] = "-i";
+   args2[1] = "vectors/tfidf-vectors";
+   args2[2] = "-el";
+   args2[3] = "-li";
+   args2[4] = "labelindex";
+   args2[5] = "-o";
+   args2[6] = "model";
+   args2[7] = "-ow";
+   args2[8] = "-c";
+
+   convertToSeq(inputTrainFilePath, "outseq");
+
+   SparseVectorsFromSequenceFiles.main(args1);
+
+   TrainNaiveBayesJob.main(args2);
+   }
+
+   public static String classify(String text) throws IOException {
+   return classify(text, "model", "labelindex",
+   "vectors/dictionary.file-0", "vectors/df-count/part-r-0");
+   }