Stefan Groschupf wrote:
Code can be found here:
http://cvs.sourceforge.net/viewcvs.py/tiniplug/nutch-extractors/src/net/nutch/extractor/FlashExtractor.java?rev=1.3&view=markup
libs can be found here:
http://cvs.sourceforge.net/viewcvs.py/tiniplug/nutch-extractors/libs/
Please note that java
Andrzej Bialecki wrote:
Philipp Suter wrote:
I would have some spare cycles from the end of July until the end of
August, but I would need a short explanation of where and how to
integrate the Flash text extractor. Furthermore, is there any
document whatsoever explaining the Nutch design
Andrzej Bialecki wrote:
Have you ever thought about integrating a JavaScript interpreter into
Nutch? This could be another big step towards a wider range of
crawlable websites. If you need any help with this, I would be very much
interested in supporting anybody (timewise) implementing such a
Vacuum Joe wrote:
Have you evaluated Flash as well? Is it possible to
parse it?
Yes, definitely:
http://swift-tools.net/Flash/
Obviously, it's a non-trivial amount of work to take
the basic ideas from that and port it into Java.
However, we're only interested in grabbing text, so
it's
Andrzej Bialecki wrote:
Philipp Suter wrote:
Does anybody know how to crawl frames, or how to extend Nutch to be
able to crawl frames? We are using the API.
The development version (available from SVN) should handle frames just
fine, i.e. it should follow the src=... attributes in frames
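
To illustrate what following the src attribute of a frame could look like, here is a minimal sketch in plain Java. It is not the Nutch implementation: the class name FrameLinkSketch and the method extractFrameLinks are made up for this example, and it uses a simple regular expression where a real crawler would more likely use a proper HTML parser. It only shows the idea of collecting the frame targets and resolving them against the URL of the page.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects the targets of
// <frame> and <iframe> tags so a crawler could queue them as outlinks.
class FrameLinkSketch {

    // Matches <frame ... src="..."> and <iframe ... src='...'>, ignoring case.
    // The \b after "frame" keeps <frameset> from matching.
    private static final Pattern FRAME_SRC = Pattern.compile(
        "<i?frame\\b[^>]*?\\ssrc\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the absolute URLs of all frame sources found in the HTML.
    static List<String> extractFrameLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = FRAME_SRC.matcher(html);
        while (m.find()) {
            String src = m.group(1).trim();
            if (src.isEmpty()) {
                continue; // nothing to follow
            }
            try {
                // Resolve relative references against the page URL.
                links.add(base.resolve(src).toString());
            } catch (IllegalArgumentException e) {
                // Skip src values that are not valid URI references.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<frameset cols=\"20%,80%\">"
            + "<frame src=\"menu.html\">"
            + "<FRAME SRC='/content/main.html'>"
            + "</frameset>";
        for (String link : extractFrameLinks(html, "http://example.com/site/index.html")) {
            System.out.println(link);
        }
    }
}
```

The returned URLs would then be handed to whatever queue of pages to fetch the crawler keeps, the same way ordinary links are.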
Does anybody know how to crawl frames, or how to extend Nutch to be able
to crawl frames? We are using the API.
cheers
ph