[openwayback-dev] Re: CDXFormatIndex.getPrefixIterator(prefix) returns entries that do not match the prefix

andrew.jackson Mon, 13 Feb 2017 08:42:30 -0800

This admittedly surprising behaviour appears to be intended: 
https://github.com/iipc/openwayback/blob/6475121cef79240b5e18a5f2c224ff9ba933b43d/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java#L39-L41


For example, here the match is tested in the client rather than in the 
iterator: 
https://github.com/iipc/openwayback/blob/6475121cef79240b5e18a5f2c224ff9ba933b43d/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/CaptureToUrlSearchResultIterator.java#L79

I'm a bit wary of changing FlatFile.java as this is used in a number of 
places that may depend on this behaviour. However, it would seem reasonable 
to modify the CDXIndex or CDXFormatIndex child classes to behave in a less 
surprising fashion.
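
As a rough illustration of the "test the match in the client" approach, here 
is a minimal sketch of a wrapper that stops at the first record whose key no 
longer starts with the prefix. PrefixBoundedIterator and its collect helper 
are hypothetical names, not part of OpenWayback. The sketch assumes the 
underlying iterator yields records in sorted key order starting at the first 
match (which is what the thread describes), and that each record can expose 
its canonicalized key as a String.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

/**
 * Hypothetical helper (not an OpenWayback class): wraps an iterator over
 * records sorted by key and ends iteration at the first record whose key
 * does not start with the given prefix.
 */
final class PrefixBoundedIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Function<T, String> keyOf;
    private final String prefix;
    private T buffered;           // next matching record, if already fetched
    private boolean hasBuffered;  // true when 'buffered' holds a valid record
    private boolean done;         // true once a non-matching record was seen

    PrefixBoundedIterator(Iterator<T> source, Function<T, String> keyOf, String prefix) {
        this.source = source;
        this.keyOf = keyOf;
        this.prefix = prefix;
    }

    @Override
    public boolean hasNext() {
        if (done) return false;
        if (hasBuffered) return true;
        if (!source.hasNext()) {
            done = true;
            return false;
        }
        T candidate = source.next();
        // Keys are sorted, so the first mismatch means no later record can match.
        if (!keyOf.apply(candidate).startsWith(prefix)) {
            done = true;
            return false;
        }
        buffered = candidate;
        hasBuffered = true;
        return true;
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        T result = buffered;
        buffered = null;
        hasBuffered = false;
        return result;
    }

    /** Demo helper: treats each string as its own key and collects the matches. */
    static List<String> collect(List<String> sortedKeys, String prefix) {
        Iterator<String> it = new PrefixBoundedIterator<String>(sortedKeys.iterator(), k -> k, prefix);
        List<String> out = new ArrayList<>();
        while (it.hasNext()) out.add(it.next());
        return out;
    }
}
```

The same idea could presumably be applied inside a CDXIndex or CDXFormatIndex 
subclass by wrapping whatever getPrefixIterator returns. A real version would 
also need to close the underlying iterator when it stops early, which this 
sketch leaves out.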


Best,
Andy


On Monday, 13 February 2017 15:43:26 UTC, David Portabella wrote:
>
> The API contains these two functions: getPrefixIterator and 
> getUrlIterator.
>
> With getPrefixIterator, I would expect to get all entries that match a 
> given prefix. With getUrlIterator, I would expect to get all entries that 
> match a URL. The problem is that getPrefixIterator returns the first 
> entry that matches the given prefix, followed by every subsequent entry up 
> to the end of the archive (including those that do not match the prefix). 
> The same goes for getUrlIterator: it returns the first entry that matches 
> the full URL, and all the following entries until the end of the archive.
>
> I think the example makes this clear. I asked for all the entries that 
> match the prefix rmspumptools.com/innovation.php, and I get the entry 
> http://www.rmspumptools.com/innovation.php (this is correct), but also 
> https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents, 
> http://www.slaperoo.com/, and so on until the end of the archive (none of 
> which match the prefix).
>
>
> On Monday, February 13, 2017 at 11:40:57 AM UTC+1, andrew.jackson wrote:
>>
>> I believe the idea is to allow you to use the same API to perform other 
>> queries, such as listing all URIs that start with a given prefix (e.g. all 
>> URIs for a host).
>>
>> If you only want the list that matches a specific URI, you should stop 
>> pulling results from the iterator when the URI is no longer the one you 
>> want.
>>
>> HTH,
>> Andy Jackson
>>
>> On Wednesday, 8 February 2017 17:20:16 UTC, David Portabella wrote:
>>>
>>> Using CDXFormatIndex.getPrefixIterator(prefix),
>>>
>>> I would expect to get only the entries that match this prefix.
>>>
>>> Instead, it finds the first entry matching this prefix, and then it returns 
>>> all entries from that point until the end of the archive.
>>>
>>> So it also returns entries that do not match the prefix.
>>>
>>>
>>> Why?
>>>
>>> How can I get *only* the entries that match the prefix?
>>>
>>>
>>>
>>> Example (written in Scala)
>>>
>>>
>>> package application
>>>
>>> import org.archive.wayback.core.CaptureSearchResult
>>> import org.archive.wayback.resourceindex.cdx.CDXFormatIndex
>>> import org.archive.wayback.util.url.AggressiveUrlCanonicalizer
>>>
>>> import scala.collection.JavaConverters._
>>>
>>> object ResponseWarcReaderExample {
>>>   def main(args: Array[String]): Unit = {
>>>     val index = new CDXFormatIndex()
>>>     index.setPath("/dataset/files.warc.cdx")
>>>
>>>     // Canonicalize the URL into the key form used by the CDX index.
>>>     val key = canonicalize("http://www.rmspumptools.com/innovation.php")
>>>     // Print every record the iterator yields.
>>>     index.getPrefixIterator(key).asScala.foreach { (r: CaptureSearchResult) =>
>>>       println(s"${r.getFile}:${r.getOffset}: ${r.getOriginalUrl}")
>>>     }
>>>   }
>>>
>>>   val canonicalizer = new AggressiveUrlCanonicalizer()
>>>   def canonicalize(url: String): String =
>>>     canonicalizer.urlStringToKey(url)
>>> }
>>>
>>>
>>> The output is as follows:
>>> files.warc.gz:1053529: http://www.rmspumptools.com/innovation.php
>>> files.warc.gz:1181319: https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents
>>> files.warc.gz:11538: http://www.slaperoo.com/
>>> files.warc.gz:1268086: https://www.smarttech.com/patents
>>> files.warc.gz:826021: http://speckip.com/
>>> ...
>>>
>>>
>>>
>>>
>>>
>>> Note that CDXFormatIndex/CDXIndex.getPrefixIterator calls 
>>> FlatFile.getRecordIterator(final String prefix),
>>>
>>> which returns an input stream starting at the offset of the first entry 
>>> matching the prefix, and then reads everything until the end of the 
>>> archive.
>>>
>>>   RandomAccessFile raf = new RandomAccessFile(file,"r");
>>>
>>>   findKeyOffset(raf, prefix);
>>>
>>>
>>>
>>>
>>>
