I see.
This is my current workaround:
// Bound each iterator on the client side: stop at the first non-matching key.
def filterByPrefix(prefix: String): Iterator[CaptureSearchResult] = {
  val key = canonicalize(prefix)
  getIndex.getPrefixIterator(key).asScala.takeWhile(_.getUrlKey.startsWith(key))
}

def filterByUrl(url: String): Iterator[CaptureSearchResult] = {
  val key = canonicalize(url)
  getIndex.getUrlIterator(key).asScala.takeWhile(key == _.getUrlKey)
}

val canonicalizer = new AggressiveUrlCanonicalizer()
def canonicalize(url: String): String =
  canonicalizer.urlStringToKey(url)
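For what it's worth, here is a minimal, self-contained sketch of why bounding the iterator with takeWhile is enough. It simulates the index with an in-memory Vector rather than using any OpenWayback classes; the object SortedIndexSketch, its methods seekFrom, byPrefix and byKey, and the sample keys are all made up for illustration. It assumes that the keys are in ascending order, as I understand a CDX file to be, so that all keys sharing a prefix are contiguous:

```scala
// Self-contained simulation; no OpenWayback classes are used.
object SortedIndexSketch {
  // Stand-in for a sorted CDX file: canonicalized URL keys in ascending order.
  // These keys are illustrative, not real index contents.
  val keys: Vector[String] = Vector(
    "example.com/",
    "rmspumptools.com/innovation.php",
    "rmspumptools.com/innovation.php?page=2",
    "sjm.com/en",
    "slaperoo.com/"
  )

  // Mimics the reported behaviour: seek to the first key >= prefix,
  // then stream every remaining record to the end of the file.
  def seekFrom(prefix: String): Iterator[String] =
    keys.iterator.dropWhile(_ < prefix)

  // Client-side bound: stop at the first key that no longer matches.
  // This relies on the keys being sorted.
  def byPrefix(prefix: String): Iterator[String] =
    seekFrom(prefix).takeWhile(_.startsWith(prefix))

  // Same idea for an exact key match.
  def byKey(key: String): Iterator[String] =
    seekFrom(key).takeWhile(_ == key)

  def main(args: Array[String]): Unit = {
    val prefix = "rmspumptools.com/innovation.php"
    println(seekFrom(prefix).toList) // matching keys plus the unrelated tail
    println(byPrefix(prefix).toList) // only keys starting with the prefix
    println(byKey(prefix).toList)    // only the exact key
  }
}
```

The first line printed includes the unrelated tail, while the other two contain only matching keys. Note that takeWhile would silently miss matches if the underlying keys were not sorted.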
What is also surprising is that calling getPrefixIterator a second time
throws an exception.
You need to close the index and open a new CDXFormatIndex.
Cheers,
David
On Monday, February 13, 2017 at 5:41:03 PM UTC+1, andrew.jackson wrote:
>
> This admittedly surprising behaviour appears to be intended:
> https://github.com/iipc/openwayback/blob/6475121cef79240b5e18a5f2c224ff9ba933b43d/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java#L39-L41
>
> e.g. here's an example of the match being tested in the client rather than
> in the iterator:
> https://github.com/iipc/openwayback/blob/6475121cef79240b5e18a5f2c224ff9ba933b43d/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/CaptureToUrlSearchResultIterator.java#L79
>
> I'm a bit wary of changing FlatFile.java as this is used in a number of
> places that may depend on this behaviour. However, it would seem reasonable
> to modify the CDXIndex or CDXFormatIndex child classes to behave in a less
> surprising fashion.
>
>
> Best,
> Andy
>
>
> On Monday, 13 February 2017 15:43:26 UTC, David Portabella wrote:
>>
>> The API contains these two functions: getPrefixIterator and
>> getUrlIterator.
>>
>> With getPrefixIterator, I would expect to get all entries that match
>> a given prefix. With getUrlIterator, I would expect to get all entries
>> that match a url. The problem is that getPrefixIterator returns the
>> first entry that matches a given prefix, and then all the following
>> entries until the end of the archive, including those that do not match
>> the prefix. The same goes for getUrlIterator: it returns the first entry
>> that matches the full url, and then all the following entries until the
>> end of the archive.
>>
>> I think that the example is pretty clear on this. I asked for all the
>> entries that match the prefix rmspumptools.com/innovation.php
>> and I get the entry http://www.rmspumptools.com/innovation.php
>> (this is correct), but also
>> https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents
>> and http://www.slaperoo.com/ and so on until the end of the archive
>> (none of which match the prefix).
>>
>>
>> On Monday, February 13, 2017 at 11:40:57 AM UTC+1, andrew.jackson wrote:
>>>
>>> I believe the idea is to allow you to use the same API to perform other
>>> queries, such as listing all URIs that start with a given prefix (e.g. all
>>> URIs for a host).
>>>
>>> If you only want the list that matches a specific URI, you should stop
>>> pulling results from the iterator when the URI is no longer the one you
>>> want.
>>>
>>> HTH,
>>> Andy Jackson
>>>
>>> On Wednesday, 8 February 2017 17:20:16 UTC, David Portabella wrote:
>>>>
>>>> Using CDXFormatIndex.getPrefixIterator(prefix), I would expect to get
>>>> only the entries that match this prefix.
>>>>
>>>> Instead, it finds the first entry matching this prefix and then returns
>>>> all entries from that point until the end of the archive. So it returns
>>>> entries that do not match the prefix.
>>>>
>>>> Why?
>>>>
>>>> How can I get *only* the entries that match the prefix?
>>>>
>>>> Example (written in Scala)
>>>>
>>>>
>>>> package application
>>>>
>>>> import org.archive.wayback.core.CaptureSearchResult
>>>> import org.archive.wayback.resourceindex.cdx.CDXFormatIndex
>>>> import org.archive.wayback.util.url.AggressiveUrlCanonicalizer
>>>>
>>>> import scala.collection.JavaConverters._
>>>>
>>>> object ResponseWarcReaderExample {
>>>>   def main(args: Array[String]): Unit = {
>>>>     val index = new CDXFormatIndex()
>>>>     index.setPath("/dataset/files.warc.cdx")
>>>>
>>>>     val key = canonicalize("http://www.rmspumptools.com/innovation.php")
>>>>     // Print every record the iterator yields, starting at the first match.
>>>>     index.getPrefixIterator(key).asScala.foreach { (r: CaptureSearchResult) =>
>>>>       println(s"${r.getFile}:${r.getOffset}: ${r.getOriginalUrl}")
>>>>     }
>>>>   }
>>>>
>>>>   val canonicalizer = new AggressiveUrlCanonicalizer()
>>>>   def canonicalize(url: String): String =
>>>>     canonicalizer.urlStringToKey(url)
>>>> }
>>>>
>>>>
>>>> The output is as follows:
>>>> files.warc.gz:1053529: http://www.rmspumptools.com/innovation.php
>>>> files.warc.gz:1181319: https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents
>>>> files.warc.gz:11538: http://www.slaperoo.com/
>>>> files.warc.gz:1268086: https://www.smarttech.com/patents
>>>> files.warc.gz:826021: http://speckip.com/
>>>> ...
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Note that CDXFormatIndex/CDXIndex.getPrefixIterator calls
>>>> FlatFile.getRecordIterator(final String prefix), which returns an input
>>>> stream starting at the offset of the first entry matching the prefix,
>>>> and then reads everything until the end of the archive:
>>>>
>>>>     RandomAccessFile raf = new RandomAccessFile(file, "r");
>>>>     findKeyOffset(raf, prefix);
>>>>
--
You received this message because you are subscribed to the Google Groups
"openwayback-dev" group.