[GitHub] nutch pull request: fix for NUTCH-2039 contributed by Sujen Shah

2015-06-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/30


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nutch pull request: fix for NUTCH-2039 contributed by Sujen Shah

2015-06-17 Thread sujen1412
Github user sujen1412 commented on a diff in the pull request:

https://github.com/apache/nutch/pull/30#discussion_r32698822
  
--- Diff: src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/Cosine/CosineSimilarityModel.java ---
@@ -0,0 +1,154 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.scoring.similarity.Cosine;
--- End diff --

Thank you for pointing it out. Corrected it.




[GitHub] nutch pull request: fix for NUTCH-2039 contributed by Sujen Shah

2015-06-17 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/30#discussion_r32689261
  
--- Diff: src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/Cosine/CosineSimilarityModel.java ---
@@ -0,0 +1,154 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.scoring.similarity.Cosine;
--- End diff --

No need for the upper-case "Cosine" here. In fact, it will look weird. Please
lowercase the package name.




[GitHub] nutch pull request: fix for NUTCH-2039 contributed by Sujen Shah

2015-06-15 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/30#discussion_r32413625
  
--- Diff: src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/SimilarityScoringFilter.java ---
@@ -0,0 +1,150 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.scoring.similarity;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.List;
+import java.util.Map.Entry;
+
+import org.apache.commons.io.FileUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.metadata.Nutch;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.scoring.ScoringFilter;
+import org.apache.nutch.scoring.ScoringFilterException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class SimilarityScoringFilter implements ScoringFilter {
+
+  private Configuration conf;
+  private String goldStandardDocPath;
+  private final static Logger LOG = LoggerFactory
+      .getLogger(SimilarityScoringFilter.class);
+
+  @Override
+  public Configuration getConf() {
+    return conf;
+  }
+
+  @Override
+  public void setConf(Configuration conf) {
+    this.conf = conf;
+    goldStandardDocPath = conf.get("similarity.model.path");
+    LOG.info("Getting the gold standard path {}", goldStandardDocPath);
+  }
+
+  @Override
+  public void injectedScore(Text url, CrawlDatum datum)
+      throws ScoringFilterException {
+    // TODO Auto-generated method stub
+
+  }
+
+  @Override
+  public void initialScore(Text url, CrawlDatum datum)
--- End diff --

I think in these cases, you should simply call Tika on the URL.




[GitHub] nutch pull request: fix for NUTCH-2039 contributed by Sujen Shah

2015-06-15 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/30#discussion_r32413617
  
--- Diff: src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/SimilarityScoringFilter.java ---
@@ -0,0 +1,150 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.scoring.similarity;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.List;
+import java.util.Map.Entry;
+
+import org.apache.commons.io.FileUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.metadata.Nutch;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.scoring.ScoringFilter;
+import org.apache.nutch.scoring.ScoringFilterException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class SimilarityScoringFilter implements ScoringFilter {
+
+  private Configuration conf;
+  private String goldStandardDocPath;
+  private final static Logger LOG = LoggerFactory
+      .getLogger(SimilarityScoringFilter.class);
+
+  @Override
+  public Configuration getConf() {
+    return conf;
+  }
+
+  @Override
+  public void setConf(Configuration conf) {
+    this.conf = conf;
+    goldStandardDocPath = conf.get("similarity.model.path");
+    LOG.info("Getting the gold standard path {}", goldStandardDocPath);
+  }
+
+  @Override
+  public void injectedScore(Text url, CrawlDatum datum)
+      throws ScoringFilterException {
--- End diff --

I think in these cases, you should simply call Tika on the URL.




[GitHub] nutch pull request: fix for NUTCH-2039 contributed by Sujen Shah

2015-06-15 Thread sujen1412
GitHub user sujen1412 opened a pull request:

https://github.com/apache/nutch/pull/30

fix for NUTCH-2039 contributed by Sujen Shah



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sujen1412/nutch NUTCH-2039

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/30.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #30


commit 18737d63494ebe99ba62115d6b3232cf52e0092f
Author: Sujen Shah 
Date:   2015-06-05T18:25:39Z

Added support for REST services in IndexingJob

commit 67678ac67d481f3d6d746bc716d443b132433972
Author: Sujen Shah 
Date:   2015-06-05T18:26:05Z

Added IndexingJob in JObFactory

commit 59d2e1f51ce2a86f21c023c0d00f13c18df076e8
Author: Sujen Shah 
Date:   2015-06-09T22:30:27Z

Merge remote-tracking branch 'upstream/trunk' into trunk

commit 7717816ba2189dbac12ac0217b5bb837c153bebe
Author: Sujen Shah 
Date:   2015-06-11T16:22:46Z

Cosine similarity model scoring plugin

commit 38aa53fbdacd5c9bdaf4ea812ed1f5f287ecc0e7
Author: Sujen Shah 
Date:   2015-06-11T16:23:31Z

Added scoring-similarity plugin in build files

commit 2b712c0d07b2d98fed4b3fb91542a78c7973d29b
Author: Sujen Shah 
Date:   2015-06-14T23:48:03Z

Overriding method calculate similarity

commit 81ed178312eb1789f06d7a1e739aca4b45542382
Author: Sujen Shah 
Date:   2015-06-14T23:48:29Z

Added support to remove stop words

commit 5bbd0331e412bd07ebf8e01a76e402b6b087106d
Author: Sujen Shah 
Date:   2015-06-14T23:49:38Z

Averaging out similarity scores

commit 07b000cfc19058de9dc9e1804911b85f9bf4a296
Author: Sujen Shah 
Date:   2015-06-14T23:52:01Z

Added Apache license info

commit 671c54750f5a78bfb7275fae078310a9c804260c
Author: Sujen Shah 
Date:   2015-06-15T05:45:05Z

Deleted interface files

commit d00a64c14bbf3682952020337defacd13950434e
Author: Sujen Shah 
Date:   2015-06-15T05:45:48Z

Correct stopword.txt path

commit 5043e584e339fec4a2a04a092fd84a7493f5c953
Author: Sujen Shah 
Date:   2015-06-15T05:56:39Z

Removed debugging statements
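
For readers unfamiliar with the technique named in the commits above, here is a
minimal sketch of cosine similarity over term-frequency vectors with stop word
removal. It is not the code from this pull request: the class name
CosineSketch, the tokenization rule, and the tiny built-in stop word list are
all assumptions made for illustration (the plugin itself reads its list from
stopword.txt).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the implementation in pull request #30.
final class CosineSketch {

  // Assumed stop word list, kept tiny for the example.
  static final Set<String> STOP_WORDS =
      new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

  // Builds a term-frequency vector: lowercase the text, split on anything
  // that is not a letter or digit, and skip empty tokens and stop words.
  static Map<String, Integer> termFrequencies(String text) {
    Map<String, Integer> tf = new HashMap<>();
    for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
      if (token.isEmpty() || STOP_WORDS.contains(token)) {
        continue;
      }
      tf.merge(token, 1, Integer::sum);
    }
    return tf;
  }

  // Euclidean length of a term-frequency vector.
  static double norm(Map<String, Integer> v) {
    double sum = 0.0;
    for (int count : v.values()) {
      sum += (double) count * count;
    }
    return Math.sqrt(sum);
  }

  // Cosine similarity: dot(a, b) / (|a| * |b|).
  // Returns 0.0 when either vector is empty, to avoid dividing by zero.
  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0.0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      Integer other = b.get(e.getKey());
      if (other != null) {
        dot += (double) e.getValue() * other;
      }
    }
    double normA = norm(a);
    double normB = norm(b);
    return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
  }
}
```

In a scoring filter along these lines, one vector would presumably be built
from the gold standard document and the other from the parsed page text, with
the resulting value in [0, 1] feeding into the page score.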



