EBernhardson has submitted this change and it was merged.

Change subject: Add a filter to html results getter
......................................................................


Add a filter to html results getter

The selectors for Google ended up pulling in the answer box, which,
while interesting, isn't restricted by the provided site: filter and
duplicates results from the main search listing.

Change-Id: I47cf354d5fa5e9eb97c4bc90af592900039302a4
---
M src/RelevanceScoring/Import/HtmlResultGetter.php
M src/RelevanceScoring/RelevanceScoringProvider.php
2 files changed, 6 insertions(+), 1 deletion(-)

Approvals:
  EBernhardson: Verified; Looks good to me, approved



diff --git a/src/RelevanceScoring/Import/HtmlResultGetter.php b/src/RelevanceScoring/Import/HtmlResultGetter.php
index 635451b..21071b0 100644
--- a/src/RelevanceScoring/Import/HtmlResultGetter.php
+++ b/src/RelevanceScoring/Import/HtmlResultGetter.php
@@ -89,7 +89,11 @@
 
         $domain = strtolower($this->getWikiDomain($wiki));
         $results = [];
-        foreach ($doc[$this->selectors['results']] as $result) {
+        $resultElements = $doc[$this->selectors['results']];
+        if (isset($this->selectors['results_filter'])) {
+            $resultElements = $resultElements->filter($this->selectors['results_filter']);
+        }
+        foreach ($resultElements as $result) {
             $pq = \pq($result);
             $url = $pq[$this->selectors['url']]->attr('href');
             $urlDomain = strtolower(parse_url($url, PHP_URL_HOST));
diff --git a/src/RelevanceScoring/RelevanceScoringProvider.php b/src/RelevanceScoring/RelevanceScoringProvider.php
index 195b9ee..61b8d90 100644
--- a/src/RelevanceScoring/RelevanceScoringProvider.php
+++ b/src/RelevanceScoring/RelevanceScoringProvider.php
@@ -104,6 +104,7 @@
                 [
                     'is_valid' => '#ires',
                     'results' => '#ires .g',
+                    'results_filter' => '.g-blk',
                     'url' => 'h3 a',
                     'snippet' => '.st',
                 ],

-- 
To view, visit https://gerrit.wikimedia.org/r/286007
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I47cf354d5fa5e9eb97c4bc90af592900039302a4
Gerrit-PatchSet: 1
Gerrit-Project: wikimedia/discovery/discernatron
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <ebernhard...@wikimedia.org>
Gerrit-Reviewer: EBernhardson <ebernhard...@wikimedia.org>
