[GitHub] nutch pull request: fix for NUTCH-2039 contributed by Sujen Shah
GitHub user asfgit closed the pull request at:

    https://github.com/apache/nutch/pull/30

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
[GitHub] nutch pull request: fix for NUTCH-2039 contributed by Sujen Shah
GitHub user sujen1412 commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/30#discussion_r32698822

--- Diff: src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/Cosine/CosineSimilarityModel.java ---
@@ -0,0 +1,154 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.scoring.similarity.Cosine;
--- End diff --

Thank you for pointing it out. Corrected it.
[GitHub] nutch pull request: fix for NUTCH-2039 contributed by Sujen Shah
GitHub user chrismattmann commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/30#discussion_r32689261

--- Diff: src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/Cosine/CosineSimilarityModel.java ---
@@ -0,0 +1,154 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.scoring.similarity.Cosine;
--- End diff --

No need for upper case Cosine here. In fact it will look weird. Please lowercase the package name.
[GitHub] nutch pull request: fix for NUTCH-2039 contributed by Sujen Shah
GitHub user chrismattmann commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/30#discussion_r32413625

--- Diff: src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/SimilarityScoringFilter.java ---
@@ -0,0 +1,150 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.scoring.similarity;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.List;
+import java.util.Map.Entry;
+
+import org.apache.commons.io.FileUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.metadata.Nutch;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.scoring.ScoringFilter;
+import org.apache.nutch.scoring.ScoringFilterException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class SimilarityScoringFilter implements ScoringFilter {
+
+  private Configuration conf;
+  private String goldStandardDocPath;
+  private final static Logger LOG = LoggerFactory
+      .getLogger(SimilarityScoringFilter.class);
+
+  @Override
+  public Configuration getConf() {
+    return conf;
+  }
+
+  @Override
+  public void setConf(Configuration conf) {
+    this.conf = conf;
+    goldStandardDocPath = conf.get("similarity.model.path");
+    LOG.info("Getting the goldstanrd path {}",goldStandardDocPath);
+  }
+
+  @Override
+  public void injectedScore(Text url, CrawlDatum datum)
+      throws ScoringFilterException {
+    // TODO Auto-generated method stub
+
+  }
+
+  @Override
+  public void initialScore(Text url, CrawlDatum datum)
--- End diff --

I think in these cases, you should simply call Tika on the URL.
[GitHub] nutch pull request: fix for NUTCH-2039 contributed by Sujen Shah
GitHub user chrismattmann commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/30#discussion_r32413617

--- Diff: src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/SimilarityScoringFilter.java ---
@@ -0,0 +1,150 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.scoring.similarity;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.List;
+import java.util.Map.Entry;
+
+import org.apache.commons.io.FileUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.metadata.Nutch;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.scoring.ScoringFilter;
+import org.apache.nutch.scoring.ScoringFilterException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class SimilarityScoringFilter implements ScoringFilter {
+
+  private Configuration conf;
+  private String goldStandardDocPath;
+  private final static Logger LOG = LoggerFactory
+      .getLogger(SimilarityScoringFilter.class);
+
+  @Override
+  public Configuration getConf() {
+    return conf;
+  }
+
+  @Override
+  public void setConf(Configuration conf) {
+    this.conf = conf;
+    goldStandardDocPath = conf.get("similarity.model.path");
+    LOG.info("Getting the goldstanrd path {}",goldStandardDocPath);
+  }
+
+  @Override
+  public void injectedScore(Text url, CrawlDatum datum)
+      throws ScoringFilterException {
--- End diff --

I think in these cases, you should simply call Tika on the URL.
[GitHub] nutch pull request: fix for NUTCH-2039 contributed by Sujen Shah
GitHub user sujen1412 opened a pull request:

    https://github.com/apache/nutch/pull/30

    fix for NUTCH-2039 contributed by Sujen Shah

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sujen1412/nutch NUTCH-2039

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/30.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #30

----
commit 18737d63494ebe99ba62115d6b3232cf52e0092f
Author: Sujen Shah
Date:   2015-06-05T18:25:39Z

    Added support for REST services in IndexingJob

commit 67678ac67d481f3d6d746bc716d443b132433972
Author: Sujen Shah
Date:   2015-06-05T18:26:05Z

    Added IndexingJob in JObFactory

commit 59d2e1f51ce2a86f21c023c0d00f13c18df076e8
Author: Sujen Shah
Date:   2015-06-09T22:30:27Z

    Merge remote-tracking branch 'upstream/trunk' into trunk

commit 7717816ba2189dbac12ac0217b5bb837c153bebe
Author: Sujen Shah
Date:   2015-06-11T16:22:46Z

    Cosine similarity model scoring plugin

commit 38aa53fbdacd5c9bdaf4ea812ed1f5f287ecc0e7
Author: Sujen Shah
Date:   2015-06-11T16:23:31Z

    Added scoring-similarity plugin in build files

commit 2b712c0d07b2d98fed4b3fb91542a78c7973d29b
Author: Sujen Shah
Date:   2015-06-14T23:48:03Z

    Overriding method calculate similarity

commit 81ed178312eb1789f06d7a1e739aca4b45542382
Author: Sujen Shah
Date:   2015-06-14T23:48:29Z

    Added support to remove stop words

commit 5bbd0331e412bd07ebf8e01a76e402b6b087106d
Author: Sujen Shah
Date:   2015-06-14T23:49:38Z

    Averaging out similarity scores

commit 07b000cfc19058de9dc9e1804911b85f9bf4a296
Author: Sujen Shah
Date:   2015-06-14T23:52:01Z

    Added Apache license info

commit 671c54750f5a78bfb7275fae078310a9c804260c
Author: Sujen Shah
Date:   2015-06-15T05:45:05Z

    Deleted interface files

commit d00a64c14bbf3682952020337defacd13950434e
Author: Sujen Shah
Date:   2015-06-15T05:45:48Z

    Correct stopword.txt path

commit 5043e584e339fec4a2a04a092fd84a7493f5c953
Author: Sujen Shah
Date:   2015-06-15T05:56:39Z

    Removed debugging statements
----
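
Editorial note: the commit messages in this pull request mention a cosine similarity model, stop-word removal, and averaging of similarity scores, but the thread does not show that code. The block below is only a minimal sketch of how those three steps might fit together, not the plugin's actual CosineSimilarityModel. The class name CosineSketch, its method names, the tokenization rule, and the choice to average over a list of reference ("gold standard") documents are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; not taken from the Nutch scoring-similarity plugin. */
final class CosineSketch {

  private CosineSketch() {
  }

  /** Lowercases the text, splits on non-letter/non-digit runs, drops stop words, counts terms. */
  static Map<String, Integer> termFrequencies(String text, Set<String> stopWords) {
    Map<String, Integer> tf = new HashMap<>();
    for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
      if (token.isEmpty() || stopWords.contains(token)) {
        continue; // skip the empty leading token and any stop word
      }
      tf.merge(token, 1, Integer::sum);
    }
    return tf;
  }

  /** Cosine of the angle between two term-frequency vectors; 0 if either vector is empty. */
  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0.0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      Integer other = b.get(e.getKey());
      if (other != null) {
        dot += (double) e.getValue() * other;
      }
    }
    double normA = norm(a);
    double normB = norm(b);
    return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
  }

  /** Euclidean length of a term-frequency vector. */
  private static double norm(Map<String, Integer> v) {
    double sumOfSquares = 0.0;
    for (int count : v.values()) {
      sumOfSquares += (double) count * count;
    }
    return Math.sqrt(sumOfSquares);
  }

  /** Mean cosine score of a page against each reference document (assumed averaging scheme). */
  static double averageSimilarity(String page, List<String> referenceDocs, Set<String> stopWords) {
    if (referenceDocs.isEmpty()) {
      return 0.0;
    }
    Map<String, Integer> pageTf = termFrequencies(page, stopWords);
    double total = 0.0;
    for (String doc : referenceDocs) {
      total += cosine(pageTf, termFrequencies(doc, stopWords));
    }
    return total / referenceDocs.size();
  }
}
```

Under these assumptions, a scoring filter could call CosineSketch.averageSimilarity with the parsed page text and the documents loaded from the path configured as similarity.model.path, then use the result to adjust the page score. How the real plugin wires this into ScoringFilter is not shown in this thread.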