Hi Hiran,
(looping the conversation back to user@nutch)
> I believe the 2 GB limit could be circumvented by using InputStream to
> serve bytes rather than byte[]. Changing the interface would make
> existing plugins incompatible, that is true.
But the underlying InputStream would need to implement the Writable
interface, same as the Content class does. Writable is used to efficiently
serialize objects
- to send them from mappers to reducers. Nutch is a distributed
web crawler that can be deployed on a cluster of machines.
- to store the objects on the Hadoop file system (or any compatible FS).
This allows them to be processed again, at another time or by another
workflow.
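
To illustrate what "implementing Writable" involves, here is a minimal
sketch. It assumes a local stand-in interface that mirrors the two methods
of org.apache.hadoop.io.Writable, so that it runs without Hadoop on the
classpath. FetchedDoc and roundTrip are hypothetical names used only for
this illustration, and FetchedDoc is much simpler than the real Content
class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class WritableSketch {

  /** Local stand-in mirroring the two methods of org.apache.hadoop.io.Writable. */
  interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
  }

  /** Hypothetical, much-simplified record; NOT Nutch's Content class. */
  static class FetchedDoc implements Writable {
    String url = "";
    byte[] content = new byte[0];

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeUTF(url);
      // An int length prefix is one place where a single-array size limit shows up.
      out.writeInt(content.length);
      out.write(content);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      url = in.readUTF();
      content = new byte[in.readInt()];
      in.readFully(content);
    }
  }

  /** Serializes a record to bytes and reads it back, as a shuffle or a file store would. */
  static FetchedDoc roundTrip(FetchedDoc doc) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    doc.write(new DataOutputStream(buf));
    FetchedDoc copy = new FetchedDoc();
    copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
    return copy;
  }

  public static void main(String[] args) throws IOException {
    FetchedDoc doc = new FetchedDoc();
    doc.url = "https://example.org/";
    doc.content = "hello".getBytes(StandardCharsets.UTF_8);
    FetchedDoc copy = roundTrip(doc);
    System.out.println(copy.url + " " + copy.content.length + " bytes");
  }
}
```

The point of the sketch: whatever holds the content must be able to write
itself completely to a DataOutput and restore itself from a DataInput. A
plain InputStream tied to a network connection cannot do that.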
Also: the InputStream should be reliable, so it needs to hold the data.
A parser should not wait until the data arrives from a website far away.
> How about introducing an additional interface with improved method
> signatures? If the plugin implements that, those methods would be
> preferred. It would keep backwards compatibility while allowing the
> framework to grow the handled data size.
Because Content supports versioning, it would indeed be possible to
upgrade the Content class to hold more than 2 GiB content. It is also
used in the Parser interface and all other interfaces.
And it could provide a method getContentAsStream() or similar which
would return the entire content. The old method `byte[] getContent()`
could still exist but return only the first 2 GiB (strictly, slightly less
than 2^31 bytes, because Java arrays are indexed by int).
Another question is how the additional data is held: it could be just
a list of byte arrays. Of course, then the next limit is the Java
heap space, but it's configurable.
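
A rough sketch of that idea, under my own assumptions: ChunkedContentSketch,
append() and the legacyLimit constructor parameter are hypothetical and exist
only so the behaviour can be tried with small numbers. In a real upgrade the
limit would be the maximum array size, and the class would also have to
serialize its chunks.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ChunkedContentSketch {
  private final List<byte[]> chunks = new ArrayList<>();
  private final int legacyLimit; // hypothetical: max bytes returned by the old accessor

  ChunkedContentSketch(int legacyLimit) {
    this.legacyLimit = legacyLimit;
  }

  void append(byte[] chunk) {
    chunks.add(chunk);
  }

  /** Total size as a long, so it may exceed what a single array can hold. */
  long length() {
    long total = 0;
    for (byte[] c : chunks) {
      total += c.length;
    }
    return total;
  }

  /** New-style accessor: streams all chunks one after another. */
  InputStream getContentAsStream() {
    List<InputStream> streams = new ArrayList<>();
    for (byte[] c : chunks) {
      streams.add(new ByteArrayInputStream(c));
    }
    return new SequenceInputStream(Collections.enumeration(streams));
  }

  /** Old-style accessor: returns only the leading bytes, up to legacyLimit. */
  byte[] getContent() {
    int n = (int) Math.min(length(), (long) legacyLimit);
    byte[] out = new byte[n];
    int pos = 0;
    for (byte[] c : chunks) {
      if (pos == n) {
        break;
      }
      int take = Math.min(c.length, n - pos);
      System.arraycopy(c, 0, out, pos, take);
      pos += take;
    }
    return out;
  }

  public static void main(String[] args) throws IOException {
    ChunkedContentSketch content = new ChunkedContentSketch(4);
    content.append(new byte[] {1, 2, 3});
    content.append(new byte[] {4, 5});
    System.out.println("total: " + content.length());
    System.out.println("legacy: " + content.getContent().length);
    System.out.println("stream: " + content.getContentAsStream().readAllBytes().length);
  }
}
```

Existing plugins calling getContent() would keep working on the truncated
prefix, while new code could read everything via the stream.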
> That seems to work for HTTP. Maybe for some more (am investigating imap
> right now, and it seems to support similar metadata). For filesystems
> this seems less usable, but then they can return always
> "application/octet-stream". Plugin developers just need to know.
Also not every web server sends the Content-Type header, nor is there
a guarantee that it's correct. That's why content-based MIME detection
is also used, with the original Content-Type as an additional hint.
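
The principle, as a deliberately tiny sketch: look at the first bytes, and
fall back to the header only if they are not recognized. This is not the
detector Nutch uses. MimeSniffSketch and detect() are hypothetical names,
and the three magic numbers are only a sample. A real detector knows far
more types and weighs the hint more carefully.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class MimeSniffSketch {

  static boolean startsWith(byte[] data, byte[] magic) {
    if (data.length < magic.length) {
      return false;
    }
    for (int i = 0; i < magic.length; i++) {
      if (data[i] != magic[i]) {
        return false;
      }
    }
    return true;
  }

  /**
   * Content first, header second.
   * head: the leading bytes of the fetched document.
   * contentTypeHeader: the server's Content-Type value, may be null or wrong.
   */
  static String detect(byte[] head, String contentTypeHeader) {
    if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
      return "application/pdf";
    }
    if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
      return "image/png";
    }
    if (startsWith(head, new byte[] {'P', 'K', 3, 4})) {
      return "application/zip";
    }
    if (contentTypeHeader != null) {
      // Drop parameters such as "; charset=UTF-8" and normalize the case.
      int semi = contentTypeHeader.indexOf(';');
      String type = semi >= 0 ? contentTypeHeader.substring(0, semi) : contentTypeHeader;
      type = type.trim().toLowerCase(Locale.ROOT);
      if (!type.isEmpty()) {
        return type;
      }
    }
    // Nothing recognized and no usable hint.
    return "application/octet-stream";
  }

  public static void main(String[] args) {
    byte[] pdf = "%PDF-1.7 ...".getBytes(StandardCharsets.US_ASCII);
    byte[] text = "plain words".getBytes(StandardCharsets.US_ASCII);
    System.out.println(detect(pdf, "text/html"));
    System.out.println(detect(text, "Text/Plain; charset=UTF-8"));
    System.out.println(detect(text, null));
  }
}
```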
Best,
Sebastian
On 10/5/24 23:04, Hiran Chaudhuri wrote:
> On 05.10.24 12:03, Sebastian Nagel wrote:
>> Hi Hiran,
>>
>> few comments to your questions and Lewis' exhaustive answer...
>>
>> Nutch was started in 2002 as a distributed web crawler and search engine
>> project. Many design decisions need to be viewed from that historic
>> perspective.
>>
>> For example, the 2 GiB limit of the content: in 2002 nobody would have
>> thought that this is or even might become an issue. Changing it now in a
>> backward-compatible way wouldn't be trivial.
>
> I believe the 2 GB limit could be circumvented by using InputStream to
> serve bytes rather than byte[]. Changing the interface would make
> existing plugins incompatible, that is true.
>
> How about introducing an additional interface with improved method
> signatures? If the plugin implements that, those methods would be
> preferred. It would keep backwards compatibility while allowing the
> framework to grow the handled data size.
>
>>> Why would a protocol plugin know about the content type?
>>
>> Because the HTTP protocol defines the Content-Type HTTP header.
>> Basically, HTTP headers are multi-valued metadata which helps or makes it
>> possible to process requests and responses. The Nutch Content class and
>> the protocol implementation are definitely influenced by the layout of
>> the HTTP protocol itself.
>
> That seems to work for HTTP. Maybe for some more (am investigating imap
> right now, and it seems to support similar metadata). For filesystems
> this seems less usable, but then they can return always
> "application/octet-stream". Plugin developers just need to know.