EBernhardson has submitted this change and it was merged.

Change subject: Add a filter to html results getter
......................................................................

Add a filter to html results getter

The selectors for google ended up pulling in the answer box, which
while interesting isn't limited by the provided site: filter and has
duplicate results to the main search listing.

Change-Id: I47cf354d5fa5e9eb97c4bc90af592900039302a4
---
M src/RelevanceScoring/Import/HtmlResultGetter.php
M src/RelevanceScoring/RelevanceScoringProvider.php
2 files changed, 6 insertions(+), 1 deletion(-)

Approvals:
  EBernhardson: Verified; Looks good to me, approved

diff --git a/src/RelevanceScoring/Import/HtmlResultGetter.php b/src/RelevanceScoring/Import/HtmlResultGetter.php
index 635451b..21071b0 100644
--- a/src/RelevanceScoring/Import/HtmlResultGetter.php
+++ b/src/RelevanceScoring/Import/HtmlResultGetter.php
@@ -89,7 +89,11 @@
 
         $domain = strtolower($this->getWikiDomain($wiki));
         $results = [];
-        foreach ($doc[$this->selectors['results']] as $result) {
+        $resultElements = $doc[$this->selectors['results']];
+        if (isset($this->selectors['results_filter'])) {
+            $resultElements = $resultElements->filter($this->selectors['results_filter']);
+        }
+        foreach ($resultElements as $result) {
             $pq = \pq($result);
             $url = $pq[$this->selectors['url']]->attr('href');
             $urlDomain = strtolower(parse_url($url, PHP_URL_HOST));
diff --git a/src/RelevanceScoring/RelevanceScoringProvider.php b/src/RelevanceScoring/RelevanceScoringProvider.php
index 195b9ee..61b8d90 100644
--- a/src/RelevanceScoring/RelevanceScoringProvider.php
+++ b/src/RelevanceScoring/RelevanceScoringProvider.php
@@ -104,6 +104,7 @@
                 [
                     'is_valid' => '#ires',
                     'results' => '#ires .g',
+                    'results_filter' => '.g-blk',
                     'url' => 'h3 a',
                     'snippet' => '.st',
                 ],

-- 
To view, visit https://gerrit.wikimedia.org/r/286007
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I47cf354d5fa5e9eb97c4bc90af592900039302a4
Gerrit-PatchSet: 1
Gerrit-Project: wikimedia/discovery/discernatron
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <ebernhard...@wikimedia.org>
Gerrit-Reviewer: EBernhardson <ebernhard...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
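
The change makes a second, optional selector narrow the set of result elements before they are iterated. A minimal standalone sketch of that pattern follows. It is not the project's code: it uses PHP's built-in DOMDocument and DOMXPath instead of phpQuery, and the function name extractResultUrls, the sample HTML, and the XPath expressions are all hypothetical. The filter is written as an XPath predicate that drops elements carrying the g-blk class, on the assumption that this class marks the answer box described in the commit message.

```php
<?php
// Hypothetical helper (not from the change): select result nodes, optionally
// narrow them with a second expression, then read each result's link.
function extractResultUrls(string $html, array $selectors): array
{
    $doc = new DOMDocument();
    // Suppress parser warnings that real-world HTML tends to trigger.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $query = $selectors['results'];
    if (isset($selectors['results_filter'])) {
        // Append the filter as a predicate so only nodes satisfying it remain.
        $query .= '[' . $selectors['results_filter'] . ']';
    }

    $urls = [];
    foreach ($xpath->query($query) as $result) {
        // Evaluate the URL expression relative to the current result node.
        $link = $xpath->query($selectors['url'], $result)->item(0);
        if ($link instanceof DOMElement) {
            $urls[] = $link->getAttribute('href');
        }
    }
    return $urls;
}

// Made-up markup: one answer-box-like block and two ordinary results.
$html = '<div id="ires">'
    . '<div class="g g-blk"><h3><a href="https://example.org/answer">Answer box</a></h3></div>'
    . '<div class="g"><h3><a href="https://example.org/one">One</a></h3></div>'
    . '<div class="g"><h3><a href="https://example.org/two">Two</a></h3></div>'
    . '</div>';

// Assumed XPath equivalents of the CSS selectors; 'results_filter' is optional.
$selectors = [
    'results' => '//div[@id="ires"]/div[contains(concat(" ", normalize-space(@class), " "), " g ")]',
    'results_filter' => 'not(contains(concat(" ", normalize-space(@class), " "), " g-blk "))',
    'url' => './/h3/a',
];

print_r(extractResultUrls($html, $selectors));
```

Because the filter key is only applied when it is set, a selector configuration without results_filter behaves as before, which mirrors the isset() guard in the diff.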