Re: [Neo4j] WebCrawler-Data in Neo4j

Mattias Persson Thu, 21 Apr 2011 02:13:36 -0700

Hi Marc,

2011/4/19 Marc Seeger <m...@marc-seeger.de>:
> Hey,
> I'm currently thinking about how my current data (in mysql + solr)
> would fit into Neo4j.
>
> In one of my "documents", there are the 3 types of data I have:
>
> 1. Properties that have high cardinality: e.g. the domain name
> ("www.example.org", unique), the subdomain name ("www."), the
> host-name ("example")
> 2. A bunch of numbers (the website latency (1244ms), the amount of
> incoming links (e.g. 2321))
> 3. A number of 'tags' that have a relatively low cardinality (<100).
> Things like the webserver ("apache"), the country ("germany")
>
> As for the model, I think it would be something like this:
> - Every domain gets a node
> - #1 would be modeled as a property on the domain node
> - #2 would probably be put into a lucene index so I can sort on it later on
> - #3 could be modeled using relations. E.g. a node that has two
> properties: type:webserver and name:apache. All of the "domain"-nodes
> can have a relation called "runs on the webserver"
>
> Does this make sense?
> I am used to Document DBs, relational DBs and Column Stores, but Graph
> DBs are still pretty new to me and I don't think I got the model 100%
> :)
>
> Using this model, would I be able to filter subsets of the data such
> as "All Domains that run on apache and are in Germany and have more
> than 200 incoming links sorted by the amount of links"?


Even every subdomain and tag could be a node:

("www") <--SUBDOMAIN_OF-- ("example.org") --RUNS_ON--> ("apache")
                                        \
                                         ---RUNS_IN--> ("germany")

You could then start from the apache or germany node:

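  Node apache = ...
  Node germany = ...
  List<Node> hits = new ArrayList<Node>();
  for ( Relationship runsIn : germany.getRelationships( RUNS_IN, INCOMING ) ) {
      Node domain = runsIn.getStartNode();
      // Follow RUNS_ON to the webserver node so that a node is compared with a node
      Relationship runsOn = domain.getSingleRelationship( RUNS_ON, OUTGOING );
      if ( runsOn != null && apache.equals( runsOn.getEndNode() ) ) {
          int incomingLinks = (Integer) domain.getProperty( "links" );
          if ( incomingLinks > 200 ) {
              hits.add( domain ); // This is a hit
          }
      }
  }
  // sort the hits by their "links" property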
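
The collect-and-sort step that the loop above leaves as a comment could look
roughly like this in plain Java. This is only a sketch: DomainHits and Hit are
hypothetical names, not part of the Neo4j API, and it assumes you copy the
domain name and its "links" value out of each matching node first. The sort
order (most links first) is also an assumption:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class DomainHits {
    /** Hypothetical holder for one matching domain; not a Neo4j type. */
    static final class Hit {
        final String domain;
        final int links;

        Hit(String domain, int links) {
            this.domain = domain;
            this.links = links;
        }
    }

    /** Keeps hits with more than minLinks links, sorted by link count, highest first. */
    static List<Hit> filterAndSort(List<Hit> candidates, int minLinks) {
        List<Hit> result = new ArrayList<>();
        for (Hit hit : candidates) {
            if (hit.links > minLinks) {
                result.add(hit);
            }
        }
        result.sort(Comparator.comparingInt((Hit h) -> h.links).reversed());
        return result;
    }
}
```

Calling DomainHits.filterAndSort( candidates, 200 ) would then return only the
domains with more than 200 links, largest link count first.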

Or go the other way around: start from the number of links, via a sorted
Lucene lookup, and check the relationships afterwards. Sorry for the quite
verbose Lucene query code:

  Node apache = ...
  Node germany = ...

  // More than 200 links: lower bound 200 (exclusive), no upper bound
  Query rangeQuery = NumericRangeQuery.newIntRange( "links", 200, null,
      false, true );
  QueryContext query = new QueryContext( rangeQuery ).sort(
      new Sort( new SortField( "links", SortField.INT ) ) );

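  for ( Node domain : domainIndex.query( query ) ) {
      // Follow each relationship to its end node before comparing
      Relationship runsOn = domain.getSingleRelationship( RUNS_ON, OUTGOING );
      Relationship runsIn = domain.getSingleRelationship( RUNS_IN, OUTGOING );
      if ( runsOn != null && apache.equals( runsOn.getEndNode() ) &&
              runsIn != null && germany.equals( runsIn.getEndNode() ) ) {
          // This is a hit
      }
  }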


If performance becomes a problem then I'd guess you'll have to index
more fields (links, webserver, country) into the same index, so that a
single compound query can answer the whole question.
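
To illustrate what keeping all three fields in one index buys you, here is a
small in-memory stand-in. It is not Lucene or the Neo4j index API, just a
sketch of the idea: CompoundDomainIndex and its methods are hypothetical names,
and it assumes each domain has exactly one webserver and one country. Entries
are grouped by (webserver, country) and kept ordered by link count, so one
lookup answers the whole question:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CompoundDomainIndex {
    // "webserver|country" -> (link count -> domains with that count)
    private final Map<String, TreeMap<Integer, List<String>>> entries = new HashMap<>();

    // Assumes tag values never contain '|'
    private static String key(String webserver, String country) {
        return webserver + "|" + country;
    }

    public void add(String domain, String webserver, String country, int links) {
        entries.computeIfAbsent(key(webserver, country), k -> new TreeMap<>())
               .computeIfAbsent(links, k -> new ArrayList<>())
               .add(domain);
    }

    /** Domains matching both tags with more than minLinks links, most links first. */
    public List<String> query(String webserver, String country, int minLinks) {
        TreeMap<Integer, List<String>> byLinks = entries.get(key(webserver, country));
        if (byLinks == null) {
            return Collections.emptyList();
        }
        List<String> result = new ArrayList<>();
        // tailMap(minLinks, false) excludes the bound itself: strictly more than minLinks
        for (List<String> domains : byLinks.tailMap(minLinks, false).descendingMap().values()) {
            result.addAll(domains);
        }
        return result;
    }
}
```

With a real Lucene-backed index the same effect would come from indexing
webserver and country as extra fields next to "links" and combining them in
one query, instead of checking relationships for every candidate node.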

> I played around a bit with the neography gem in Ruby and I could do stuff
> like:
>
> germany_nginx = germany_nodel.shortest_path_to(websrv_nginx).depth(2).nodes
>
> But I couldn't figure out how to "expand" this "query"
>
> Looking forward to the feedback!
> Marc
>
>
>
> --
> Pessimists, we're told, look at a glass containing 50% air and 50%
> water and see it as half empty. Optimists, in contrast, see it as half
> full. Engineers, of course, understand the glass is twice as big as it
> needs to be. (Bob Lewis)
> _______________________________________________
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>



-- 
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com
