Hi,
There is an Apache project for that: Nutch (https://nutch.apache.org/)
You could write a plugin for Gora (https://gora.apache.org/) that saves
data into Neo4j.
Cheers
On 20/11/2014 01:46, Michael Hunger wrote:
Probably not so good because you want to run the crawler
multi-threaded across a lot of network connections and this would
affect Neo4j's performance (also in terms of GC).
Probably easier to use a message queue to send crawled pages to a
neo4j extension and then let the extension run the graph algorithms
you want to use to integrate the crawling results best into your graph.
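A minimal sketch of that decoupling, assuming Python's in-process queue.Queue as a stand-in for a real message broker and a plain dict as a stand-in for the graph that the Neo4j extension would maintain. FAKE_WEB, crawl_worker, graph_consumer and run are hypothetical names used only for illustration; none of this is Neo4j API.

```python
import queue
import threading

# Hypothetical stand-in for the web: URL -> outgoing links.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}

SENTINEL = None  # message meaning "one crawler worker has finished"


def crawl_worker(urls, outbox):
    """Crawler side: fetch pages and publish them; never touches the graph."""
    for url in urls:
        links = FAKE_WEB.get(url, [])  # a real crawler would do an HTTP fetch here
        outbox.put((url, links))
    outbox.put(SENTINEL)


def graph_consumer(inbox, graph, n_workers):
    """Database side: the only writer to the graph, fed by the queue."""
    finished = 0
    while finished < n_workers:
        msg = inbox.get()
        if msg is SENTINEL:
            finished += 1
            continue
        url, links = msg
        # Record the page and its outgoing links.
        graph.setdefault(url, set()).update(links)
        # Make sure every linked page exists as a node too.
        for link in links:
            graph.setdefault(link, set())


def run(batches):
    """Start one crawler thread per batch of URLs and one graph consumer."""
    pages = queue.Queue()
    graph = {}
    workers = [
        threading.Thread(target=crawl_worker, args=(batch, pages))
        for batch in batches
    ]
    consumer = threading.Thread(
        target=graph_consumer, args=(pages, graph, len(workers))
    )
    consumer.start()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    consumer.join()
    return graph
```

The crawler threads only do network-style work and the single consumer is the only writer to the graph, so crawl concurrency and database load can be tuned separately.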
HTH Michael
On Wed, Nov 19, 2014 at 6:45 PM, Pedro Montoto García
<[email protected]> wrote:
Considering the situation of implementing a domain-specific web
crawler I've come across a number of technologies, but I had an
idea to implement it as a server extension in neo4j.
The idea would be to use the graph database to implement the
concepts of "already explored pages" and "frontier" as server-side
algorithms and use them to feed the crawling algorithm but, as you
see, you can go a step further and implement the crawling on the
server side too. Could this be a bad idea? If so, why?
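For reference, the "already explored pages" and "frontier" concepts can be sketched independently of any database. This is a minimal breadth-first version, assuming an in-memory set and deque in place of graph queries; crawl, get_links and max_pages are hypothetical names used only for illustration.

```python
from collections import deque


def crawl(seed, get_links, max_pages=100):
    """Breadth-first crawl from seed; returns URLs in the order they were visited.

    get_links is a caller-supplied function mapping a URL to its outgoing links.
    """
    explored = set()          # "already explored pages"
    frontier = deque([seed])  # pages discovered but not yet visited
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in explored:
            continue  # the same URL can be queued more than once
        explored.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in explored:
                frontier.append(link)
    return order
```

In a graph-backed variant, the set lookup and the frontier pop would become queries against the database, which is the part the question proposes to run server-side.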
--
You received this message because you are subscribed to the Google
Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to [email protected]
<mailto:[email protected]>.
For more options, visit https://groups.google.com/d/optout.