Re: [Neo4j] WebCrawler-Data in Neo4j

2011-04-21 Thread Mattias Persson
Hi Marc,

2011/4/19 Marc Seeger :
> Hey,
> I'm currently thinking about how my current data (in mysql + solr)
> would fit into Neo4j.
>
> In one of my "documents", these are the 3 types of data I have:
>
> 1. Properties that have high cardinality: e.g. the domain name
> ("www.example.org", unique), the subdomain name ("www."), the
> host-name ("example")
> 2. A bunch of numbers (the website latency (1244ms), the amount of
> incoming links (e.g. 2321))
> 3. A number of 'tags' that have a relatively low cardinality (<100).
> Things like the webserver ("apache"), the country ("germany")
>
> As for the model, I think it would be something like this:
> - Every domain gets a node
> - #1 would be modeled as a property on the domain node
> - #2 would probably be put into a lucene index so I can sort on it later on
> - #3 could be modeled using relations. E.g. a node that has two
> properties: type:webserver and name:apache. All of the "domain"-nodes
> can have a relation called "runs on the webserver"
>
> Does this make sense?
> I am used to Document DBs, relational DBs and Column Stores, but Graph
> DBs are still pretty new to me and I don't think I got the model 100%
> :)
>
> Using this model, would I be able to filter subsets of the data such
> as "All Domains that run on apache and are in Germany and have more
> than 200 incoming links sorted by the amount of links"?

Even every subdomain and tag could be a node:

("www") <--SUBDOMAIN_OF-- ("example.org") --RUNS_ON--> ("apache")
                               \
                                --RUNS_IN--> ("germany")

You could then start from the apache or germany node:

  Node apache = ...
  Node germany = ...
  for ( Relationship runsIn : germany.getRelationships( RUNS_IN, INCOMING ) ) {
      Node domain = runsIn.getStartNode();
      Relationship runsOn = domain.getSingleRelationship( RUNS_ON, OUTGOING );
      if ( runsOn != null && apache.equals( runsOn.getEndNode() ) ) {
          int incomingLinks = (Integer) domain.getProperty( "links" );
          if ( incomingLinks > 200 ) {
              // This is a hit, store in a list
          }
      }
  }
  // sort the result list

Or the other way around (start from number of links, via a sorted
lucene lookup). Sorry for the quite verbose lucene query code:

  Node apache = ...
  Node germany = ...

  Query rangeQuery = NumericRangeQuery.newIntRange( "links", 200, null,
      false, true );
  QueryContext query = new QueryContext( rangeQuery ).sort(
      new Sort( new SortField( "links", SortField.INT ) ) );

  for ( Node domain : domainIndex.query( query ) ) {
      Relationship runsOn = domain.getSingleRelationship( RUNS_ON, OUTGOING );
      Relationship runsIn = domain.getSingleRelationship( RUNS_IN, OUTGOING );
      if ( runsOn != null && apache.equals( runsOn.getEndNode() )
              && runsIn != null && germany.equals( runsIn.getEndNode() ) ) {
          // This is a hit
      }
  }


If performance becomes a problem, I'd guess you'll have to index more
fields (links, webserver, country) in the same index so that compound
queries can be run against it.
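As a toy illustration of the compound query in question ("apache AND germany AND more than 200 incoming links, sorted by links"), here is a plain-Java sketch: the `Domain` class and the sample data are invented stand-ins for the graph nodes, not Neo4j API. In Neo4j the same predicate would be evaluated via the relationship checks above or served by a compound Lucene index.

```java
import java.util.*;

public class CompoundFilter {
    // Toy stand-in for a domain node: properties plus tag "relationships".
    static class Domain {
        final String name, webserver, country;
        final int links;
        Domain(String name, String webserver, String country, int links) {
            this.name = name; this.webserver = webserver;
            this.country = country; this.links = links;
        }
    }

    // "All domains that run on apache, are in germany and have more than
    // 200 incoming links, sorted by the amount of links" (descending here).
    static List<Domain> query(List<Domain> all) {
        List<Domain> hits = new ArrayList<Domain>();
        for (Domain d : all) {
            if ("apache".equals(d.webserver) && "germany".equals(d.country)
                    && d.links > 200) {
                hits.add(d);
            }
        }
        Collections.sort(hits, new Comparator<Domain>() {
            public int compare(Domain a, Domain b) { return b.links - a.links; }
        });
        return hits;
    }

    public static void main(String[] args) {
        List<Domain> all = Arrays.asList(
                new Domain("example.org", "apache", "germany", 2321),
                new Domain("foo.de", "nginx", "germany", 900),
                new Domain("bar.org", "apache", "germany", 150),
                new Domain("baz.com", "apache", "usa", 500));
        for (Domain d : query(all)) {
            System.out.println(d.name + " " + d.links); // prints: example.org 2321
        }
    }
}
```

The point of the sketch is only the shape of the predicate; where that filter runs (traversal filter vs. index lookup) is exactly the trade-off discussed above.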

> I played around a bit with the neography gem in Ruby and I could do stuff 
> like:
>
> germany_nginx = germany_node.shortest_path_to(websrv_nginx).depth(2).nodes
>
> But I couldn't figure out how to "expand" this "query"
>
> Looking forward to the feedback!
> Marc
>
>
>
> --
> Pessimists, we're told, look at a glass containing 50% air and 50%
> water and see it as half empty. Optimists, in contrast, see it as half
> full. Engineers, of course, understand the glass is twice as big as it
> needs to be. (Bob Lewis)
> ___
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>



-- 
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com


Re: [Neo4j] Error building Neo4j

2011-04-21 Thread Anders Nawroth
We're successfully building it with Maven 2.

/anders

On 04/21/2011 04:15 AM, Kevin Moore wrote:
> I've tried 1.3 tag, master, etc.
>
> Always the same error.
>
> Maven 3.0.2
>
> Should I be using a different version?
>
> [INFO] Unpacking /Users/kevin/source/github/neo4j/graph-algo/target/classes
> to /Users/kevin/source/github/neo4j/neo4j/target/sources
> with includes null and excludes null
> org.codehaus.plexus.archiver.ArchiverException: The source must not be a
> directory.
> at org.codehaus.plexus.archiver.AbstractUnArchiver.validate(AbstractUnArchiver.java:174)
> at org.codehaus.plexus.archiver.AbstractUnArchiver.extract(AbstractUnArchiver.java:107)
> at org.apache.maven.plugin.dependency.AbstractDependencyMojo.unpack(AbstractDependencyMojo.java:260)
> at org.apache.maven.plugin.dependency.UnpackDependenciesMojo.execute(UnpackDependenciesMojo.java:90)
> at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:107)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
> at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
> at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
> at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:319)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:534)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
> at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
> at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
> at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Neo4j - Graph Database Kernel . SUCCESS [3:53.536s]
> [INFO] Neo4j - JMX support ... SUCCESS [1.291s]
> [INFO] Neo4j - Usage Data Collection . SUCCESS [13.238s]
> [INFO] Neo4j - Lucene Index .. SUCCESS [5.020s]
> [INFO] Neo4j - Graph Algorithms .. SUCCESS [0.204s]
> [INFO] Neo4j . FAILURE [1:16.071s]
> [INFO] Neo4j Community ... SKIPPED
> [INFO] Neo4j - Generic shell . SKIPPED
> [INFO] Neo4j Examples  SKIPPED
> [INFO] Neo4j Server API .. SKIPPED
> [INFO] Neo4j Server .. SKIPPED
> [INFO] Neo4j Server Examples . SKIPPED
> [INFO] Neo4j Community Build . SKIPPED
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 6:57.812s
> [INFO] Finished at: Wed Apr 20 18:58:58 PDT 2011
> [INFO] Final Memory: 17M/81M
> [INFO]
> 
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-dependency-plugin:2.1:unpack-dependencies
> (get-sources) on project neo4j: Error unpacking file:
> /Users/kevin/source/github/neo4j/graph-algo/target/classes to:
> /Users/kevin/source/github/neo4j/neo4j/target/sources
> [ERROR] org.codehaus.plexus.archiver.ArchiverException: The source must not
> be a directory.
> [ERROR] ->  [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please
> read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
> [ERROR]

Re: [Neo4j] Error building Neo4j

2011-04-21 Thread Jim Webber
Hi Kevin,

I can replicate your problem. The way I worked around it was to use Maven
2.2.1 rather than Maven 3.0.x. Then I get a green build for the community
edition.

I'll poke the devteam and see what Maven versions they're running on.

Jim




Re: [Neo4j] REST results pagination

2011-04-21 Thread Jacob Hansson
On Wed, Apr 20, 2011 at 7:42 PM, Craig Taverner  wrote:

> To respond to your arguments it would be worth noting a comment by Michael
> DeHaan later on in this thread. He asked for 'something more or less
> resembling a database cursor (see MongoDB's API).' The trick is to achieve
> this without having to store a lot of state on the server, so it is robust
> against server restarts or crashes.
>
> If we compare to the SQL situation, there are two numbers passed by the
> client, the page size and the offset. The state can be re-created by the
> database server entirely from this information. How this is implemented in
> a
> relational database I do not know, but whether the database is relational
> or
> a graph, certain behaviors would be expected, like robustness against
> database content changes between the requests, and coping with very long
> gaps between requests. In my opinion the database cursor could be achieved
> by both of the following approaches:
>
>   - Starting the traversal from the beginning, and only returning results
>   after passing the cursor offset position
>

I assume this:

Traverser x = Traversal.description().traverse( someNode );
x.nodes();
x.nodes(); // Not necessarily in the same order as previous call.

If that assumption is false or there is some workaround, then I agree that
this is a valid approach, and a good efficient alternative when sorting is
not relevant. Glancing at the code in TraverserImpl though, it really looks
like the call to .nodes  will re-run the traversal, and I thought that would
mean the two calls can yield results in different order?

>   - Keeping a live traverser in the server, and continuing it from the
>   previous position
>
> Personally I think the second approach is simply a performance optimization
> of the first. So robustness is achieved by having both, with the second one
> working when possible (no server restarts, timeout not expiring, etc.), and
> falling back to the first in other cases. This achieves performance and
> robustness. What we do not need to do with either case is keep an entire
> result set in memory between client requests.
>

I understand, and completely agree. My problem with the approach is that I
think it's harder than it looks at first glance.
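For reference, the offset-skip mechanism itself is simple to sketch in plain Java, independent of the Neo4j API; the hard part is precisely the stable ordering being debated here. The method name and parameters below are illustrative:

```java
import java.util.*;

public class OffsetPage {
    // Sketch of the "re-traverse and skip" approach: re-run the traversal
    // for every request, skip `offset` results, return the next `pageSize`.
    // This is only correct if the traversal yields the same order on every
    // run against an unchanged graph.
    static <T> List<T> page(Iterator<T> traversal, int offset, int pageSize) {
        List<T> page = new ArrayList<T>();
        for (int i = 0; i < offset && traversal.hasNext(); i++) {
            traversal.next(); // pay the cost of walking past earlier pages
        }
        for (int i = 0; i < pageSize && traversal.hasNext(); i++) {
            page.add(traversal.next());
        }
        return page;
    }

    public static void main(String[] args) {
        List<Integer> results = Arrays.asList(10, 20, 30, 40, 50);
        System.out.println(page(results.iterator(), 2, 2)); // prints [30, 40]
    }
}
```

Keeping a live traverser on the server (the second approach) amounts to caching the iterator between requests so the first loop can be skipped.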


>
> Now when you add sorting into the picture, then you need to generate the
> complete result-set in memory, sort, paginate and return only the requested
> page. If the entire process has to be repeated for every page requested,
> this could perform very badly for large result sets. I must believe that
> relational databases do not do this (but I do not know how they paginate
> sorted results, unless the sort order is maintained in an index).
>

This is what makes me push for the sorted approach - relational databases
are doing this. I don't know how they do it, but they are, and we should be
at least as good.


>
> To avoid keeping everything in memory, or repeatedly reloading everything
> to
> memory on every page request, we need sorted results to be produced on the
> stream. This can be done by keeping the sort order in an index. This is
> very
> hard to do in a generic way, which is why I thought it best done in a
> domain
> specific way.
>

I agree the issue of what should be indexed to optimize sorting is a
domain-specific problem, but I think that is how relational databases treat
it as well. If you want sorting to be fast, you have to tell them to index
the field you will be sorting on. The only difference versus having the user
put the sorting index in the graph is that relational databases will handle
the indexing for you, saving you a *ton* of work, and I think we should too.

There are cases where you need to add this sort of meta data to your domain
model, where the sorting logic is too complex, and you see that in
relational dbs as well, where people create lookup tables for various
things. There are for sure valid uses for that too, but the generic approach
I believe covers the *vast* majority of the common use cases.


> Finally, I think we are really looking at two, different but valid use
> cases. The need for generic sorting combined with pagination, and the need
> for pagination on very large result sets. The former use case can work with
> re-traversing and sorting on each client request, is fully generic, but
> will
> perform badly on large result sets. The latter can perform adequately on
> large result sets, as long as you do not need to sort (and use the database
> cursor approach to avoid loading the result set into memory).
>

I agree, this is important. I'd like to change "the need for pagination on
very large result sets" to "the ability to return very large result sets
over the wire". That opens up the debate to solutions like http streaming,
which do not have the problems that come with keeping state on the server
between calls.


>
> On Wed, Apr 20, 2011 at 2:01 PM, Jacob Hansson 
> wrote:
>
> > On Wed, Apr 20, 2011 at 11:25 AM, Craig Ta

Re: [Neo4j] Question from Webinar - traversing a path with nodes of different types

2011-04-21 Thread Rick Bullotta
Sounds like a simulation/operations research application.  The graph database 
will be suitable for modeling the entities and their characteristics (transfer 
times = properties on relationships, setup and service times = properties on 
nodes, queue sizes, etc.), but I think you'll need a layer on top of the 
traversal framework for managing the overall simulation logic.



- Reply message -
From: "Vipul Gupta" 
Date: Thu, Apr 21, 2011 2:16 am
Subject: [Neo4j] Question from Webinar - traversing a path with nodes of 
different types
To: "David Montag" 
Cc: "UserList" , "michael.hun...@neotechnology.com" 


Hi David,

Inputs are 1 and 6 + Graph is acyclic.

domain.Client@1 -> domain.Router@2 -> domain.Router@3 -> domain.Router@5 -> domain.Server@6
domain.Client@1 -> domain.Router@7 -> domain.Router@8 -> domain.Router@5

I want a way to start from 1,

process the path through 2 until it reaches 5 (say, in one thread),
process the path through 7 until it reaches 5 (in another thread),

then process 5 and eventually 6.
The above step of processing an intermediate path and waiting at the
blocking point can happen over and over again in a more complex graph (that
is, there could even be a number of loops in between), and the traversal
stops only when we reach 6.

I hope this makes it a bit clear. I was working out something for this, but
it is turning out to be too complex a solution for this sort of traversal of
a graph, so I am hoping if you can suggest something.

Best Regards,
Vipul


On Thu, Apr 21, 2011 at 11:36 AM, David Montag <
david.mon...@neotechnology.com> wrote:

> Hi Vipul,
>
> Zooming out a little bit, what are the inputs to your algorithm, and what
> do you want it to do?
>
> For example, given 1 and 6, do you want to find any points in the chain
> between them that are join points of two (or more) subchains (5 in this
> case)?
>
> David
>
>
> On Wed, Apr 20, 2011 at 10:56 PM, Vipul Gupta wrote:
>
>> my mistake - I meant "5" depends on both 3 and 8 and acts as a blocking
>> point till 3 and 8 finishes
>>
>>
>> On Thu, Apr 21, 2011 at 11:19 AM, Vipul Gupta wrote:
>>
>>> David/Michael,
>>>
>>> Let me modify the example a bit.
>>> What if my graph structure is like this
>>>
>>> domain.Client@1 -> domain.Router@2 -> domain.Router@3 -> domain.Router@5 -> domain.Server@6
>>> domain.Client@1 -> domain.Router@7 -> domain.Router@8 -> domain.Router@5
>>>
>>>
>>> Imagine a manufacturing line.
>>> 6 depends on both 3 and 8 and acts as a blocking point till 3 and 8
>>> finishes.
>>>
>>> Is there a way to get a cleaner traversal for such kind of relationship. I
>>> want to get a complete intermediate traversal from Client to Server.
>>>
>>> Thank a lot for helping out on this.
>>>
>>> Best Regards,
>>> Vipul
>>>
>>>
>>>
>>>
>>> On Thu, Apr 21, 2011 at 12:09 AM, David Montag <
>>> david.mon...@neotechnology.com> wrote:
>>>
 Hi Vipul,

 Thanks for listening!

 It's a very good question, and the short answer is: yes! I'm cc'ing our
 mailing list so that everyone can take part in the answer.

 Here's the long answer, illustrated by an example:

 Let's assume you're modeling a network. You'll have some domain classes
 that are all networked entities with peers:

 @NodeEntity
 public class NetworkEntity {
     @RelatedTo(type = "PEER", direction = Direction.BOTH, elementClass = NetworkEntity.class)
     private Set<NetworkEntity> peers;

     public void addPeer(NetworkEntity peer) {
         peers.add(peer);
     }
 }

 public class Server extends NetworkEntity {}
 public class Router extends NetworkEntity {}
 public class Client extends NetworkEntity {}

 Then we can build a small network:

 Client c = new Client().persist();
 Router r1 = new Router().persist();
 Router r21 = new Router().persist();
 Router r22 = new Router().persist();
 Router r3 = new Router().persist();
 Server s = new Server().persist();

 c.addPeer(r1);
 r1.addPeer(r21);
 r1.addPeer(r22);
 r21.addPeer(r3);
 r22.addPeer(r3);
 r3.addPeer(s);

 c.persist();

 Note that after linking the entities, I only call persist() on the
 client. You can read more about this in the reference documentation, but
 essentially it will cascade in the direction of the relationships created,
 and will in this case cascade all the way to the server entity.

 You can now query this:

 Iterable<EntityPath<Client, NetworkEntity>> paths =
     c.findAllPathsByTraversal(Traversal.description());

 The above code will get you an EntityPath per node visited during the
 traversal from c. The example does however not use a very interesting
 traversal description, but you can still print the results:

 for (EntityPath<Client, NetworkEntity> path : paths) {
     StringBuilder sb = new StringBuilder();
     Iterator<NetworkEntity> iter = path.nodeEntities().iterator();
     while (iter.hasNext()) {
         sb.append(iter.next());
         i

Re: [Neo4j] REST results pagination

2011-04-21 Thread Craig Taverner
>
> I assume this:
>Traverser x = Traversal.description().traverse( someNode );
>x.nodes();
>x.nodes(); // Not necessarily in the same order as previous call.
>
> If that assumption is false or there is some workaround, then I agree that
> this is a valid approach, and a good efficient alternative when sorting is
> not relevant. Glancing at the code in TraverserImpl though, it really looks
> like the call to .nodes  will re-run the traversal, and I thought that
> would
> mean the two calls can yield results in different order?
>

OK. My assumptions were different. I assume that while the order is not
easily predictable, it is reproducible as long as the underlying graph has
not changed. If the graph changes, then the order can change also. But I
think this is true of a relational database also, is it not?

So, obviously pagination is expected (by me at least) to give page X as it
is at the time of the request for page X, not at the time of the request for
page 1.

But my assumptions could be incorrect too...

I understand, and completely agree. My problem with the approach is that I
> think its harder than it looks at first glance.
>

I guess I cannot argue that point. My original email said I did not know if
this idea had been solved yet. Since some of the key people involved in this
have not chipped into this discussion, either we are reasonably correct in
our ideas, or so wrong that they don't know where to begin correcting us ;-)

This is what makes me push for the sorted approach - relational databases
> are doing this. I don't know how they do it, but they are, and we should be
> at least as good.
>

Absolutely. We should be as good. Relational databases manage to serve a page
deep down the list quite fast. I must believe if they had to complete the
traversal, sort the results and extract the page on every single page
request, they could not be so fast. I think my ideas for the traversal are
'supposed' to be performance enhancements, and that is why I like them ;-)

I agree the issue of what should be indexed to optimize sorting is a
> domain-specific problem, but I think that is how relational databases treat
> it as well. If you want sorting to be fast, you have to tell them to index
> the field you will be sorting on. The only difference contra having the
> user
> put the sorting index in the graph is that relational databases will handle
> the indexing for you, saving you a *ton* of work, and I think we should
> too.
>

Yes. I was discussing automatic indexing with Mattias recently. I think (and
hope I am right) that once we move to automatic indexes, it will be
possible to put external indexes (à la Lucene) and graph indexes (like the
ones I favour) behind the same API. In this case perhaps the database will
more easily be able to make the right optimized decisions, and use the index
for providing sorted results fast and with a low memory footprint where
possible, based on the existence or non-existence of the necessary indices.
Then all the developer needs to do to make things really fast is put in the
right index. For some data, that would be Lucene and for others it would be
a graph index. If we get to this point, I think we will have closed a key
usability gap with relational databases.

There are cases where you need to add this sort of meta data to your domain
> model, where the sorting logic is too complex, and you see that in
> relational dbs as well, where people create lookup tables for various
> things. There are for sure valid uses for that too, but the generic
> approach
> I believe covers the *vast* majority of the common use cases.
>

Perhaps. But I'm not sure the two extremes are as lop-sided as you think. I
think large data users are very interested in Neo4j.

I agree, this is important. I'd like to change "the need for pagination on
> very large result sets" to "the ability to return very large result sets
> over the wire". That opens up the debate to solutions like http streaming,
> which do not have the problems that come with keeping state on the server
> between calls.
>

I think there are two separate, but related, problems to solve. One is the
transfer of large result-sets over the wire for people that need that. The
other is efficiently providing the small page of results from a large
dataset. Most of our discussion has so far focused on the latter.

For the former, I did a bit of experimenting last year and was able to
shrink my JSON severalfold by moving all meta-data into a header
section. This works very well for data that has a repeating structure, for
example a large number of records with similar schema. I know schema is a
nasty word in the nosql world, but it is certainly common for data to have a
repeating pattern, especially when dealing with very large numbers. Then you
find that something like CSV is actually an efficient format, since the bulk
of the text is only the data. We did this in JSON by simply specifying a
meta-data element (wit
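The header-section compaction described above can be sketched in plain Java. This is a sketch of the idea only, not the actual format that was used; the "header"/"rows" key names are invented for illustration:

```java
import java.util.*;

public class CompactRows {
    // Turn a list of records sharing a repeating schema into a compact
    // header-plus-rows form: the field names are emitted once, and each
    // record becomes a bare row of values, much like CSV inside JSON.
    // Assumes a non-empty list whose records all share the first
    // record's field set.
    static Map<String, Object> compact(List<Map<String, Object>> records) {
        List<String> header = new ArrayList<String>(records.get(0).keySet());
        List<List<Object>> rows = new ArrayList<List<Object>>();
        for (Map<String, Object> record : records) {
            List<Object> row = new ArrayList<Object>();
            for (String field : header) {
                row.add(record.get(field));
            }
            rows.add(row);
        }
        Map<String, Object> result = new LinkedHashMap<String, Object>();
        result.put("header", header);
        result.put("rows", rows);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> r1 = new LinkedHashMap<String, Object>();
        r1.put("domain", "example.org");
        r1.put("links", 2321);
        Map<String, Object> r2 = new LinkedHashMap<String, Object>();
        r2.put("domain", "foo.de");
        r2.put("links", 900);
        System.out.println(compact(Arrays.asList(r1, r2)));
    }
}
```

The saving grows with the number of records, since the per-record key repetition is what dominates naive JSON for large, uniform result sets.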

Re: [Neo4j] REST results pagination

2011-04-21 Thread Rick Bullotta
Fwiw, I think paging is an outdated "crutch", for a few reasons:

1) bandwidth and browser processing/parsing are largely non issues, although 
they used to be

2) human users rarely have the patience (and usability sucks) to go beyond 2-4 
pages of information.  It is far better to allow incrementally refined filters 
and searches to get to a workable subset of data.

3) machine users couldn't care less about paging

4) when doing visualization of a large dataset, you generally want the whole 
dataset, not a page of it, so that's another "non use case"

Discuss and debate please!

Rick



- Reply message -
From: "Craig Taverner" 
Date: Thu, Apr 21, 2011 8:52 am
Subject: [Neo4j] REST results pagination
To: "Neo4j user discussions" 


Re: [Neo4j] REST results pagination

2011-04-21 Thread Georg Summer
Think of a legacy application that just uses a new data source. It can be
quite hard to get users away from their trusty old-chap UI. In the case of
pagination, "legacy" might only mean a few years, but it is still legacy :-).

@1-2) In the wake of mobile applications and mobile sites, a pagination
system might be more relevant than bulk loading everything and displaying
it. Defining smart filters might be problematic in such a use case as well.

Parallelism of an application could also be an interesting aspect: each
worker retrieves a different page of the graph, and the user does not have
to care at all about partitioning the graph after downloading it. This would
only be interesting, though, if the graph relations are not important.

Georg

On 21 April 2011 14:59, Rick Bullotta  wrote:

> Fwiw, I think paging is an outdated "crutch", for a few reasons:
>
> 1) bandwidth and browser processing/parsing are largely non issues,
> although they used to be
>
> 2) human users rarely have the patience (and usability sucks) to go beyond
> 2-4 pages of information.  It is far better to allow incrementally refined
> filters and searches to get to a workable subset of data.
>
> 3) machine users couldn't care less about paging
>
> 4) when doing visualization of a large dataset, you generally want the
> whole dataset, not a page of it, so that's another "non use case"
>
> Discuss and debate please!
>
> Rick
>
>
>
> - Reply message -
> From: "Craig Taverner" 
> Date: Thu, Apr 21, 2011 8:52 am
> Subject: [Neo4j] REST results pagination
> To: "Neo4j user discussions" 
>

Re: [Neo4j] REST results pagination

2011-04-21 Thread Michael DeHaan
>
> 3) machine users could care less about paging

My thoughts are that parsing very large documents can perform poorly and
requires that the entire document be slurped into (available) RAM.
This puts a cap on the size of a usable result set and slows
processing, or at least makes you pay an up-front cost, and decreases
the potential for parallelism in other parts of your app.
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] REST results pagination

2011-04-21 Thread Rick Bullotta
That can be dealt with via more "streamable" content structures.

- Reply message -
From: "Michael DeHaan" 
Date: Thu, Apr 21, 2011 9:44 am
Subject: [Neo4j] REST results pagination
To: "Neo4j user discussions" 



Re: [Neo4j] REST results pagination

2011-04-21 Thread Jacob Hansson
On Thu, Apr 21, 2011 at 2:52 PM, Craig Taverner  wrote:

> >
> > I assume this:
> >Traverser x = Traversal.description().traverse( someNode );
> >x.nodes();
> >x.nodes(); // Not necessarily in the same order as previous call.
> >
> > If that assumption is false or there is some workaround, then I agree
> that
> > this is a valid approach, and a good efficient alternative when sorting
> is
> > not relevant. Glancing at the code in TraverserImpl though, it really
> looks
> > like the call to .nodes  will re-run the traversal, and I thought that
> > would
> > mean the two calls can yield results in different order?
> >
>
> OK. My assumptions were different. I assume that while the order is not
> easily predictable, it is reproducible as long as the underlying graph has
> not changed. If the graph changes, then the order can change also. But I
> think this is true of a relational database also, is it not?
>
> So, obviously pagination is expected (by me at least) to give page X as it
> is at the time of the request for page X, not at the time of the request
> for
> page 1.
>
> But my assumptions could be incorrect too...
>

I think you are probably right about that; if you don't provide a sort
order, then I think a SQL database will exhibit the same sort of undefined
behaviour, like you say.

Leaving out single-user, single-threaded applications, then, it must be
assumed that the database will be accessed by other parties while we page
through our result. If the cache-first traversal optimization gets
implemented, it might even be enough to simply re-read the results of the
traversal for the ordering to be different the next time around. The point
being, it is reasonable to assume that parts of the traversal result will
never be returned due to the shifting sort order.

I can only think of a few use cases where losing some of the expected
result is OK, for instance if you want to "peek" at the result.


>
> I understand, and completely agree. My problem with the approach is that I
> > think its harder than it looks at first glance.
> >
>
> I guess I cannot argue that point. My original email said I did not know if
> this idea had been solved yet. Since some of the key people involved in
> this
> have not chipped into this discussion, either we are reasonably correct in
> our ideas, or so wrong that they don't know where to begin correcting us
> ;-)
>

I'm waiting for one of those SlapOnTheFingersExceptions that Tobias has
been handing out :)


>
> This is what makes me push for the sorted approach - relational databases
> > are doing this. I don't know how they do it, but they are, and we should
> be
> > at least as good.
> >
>
> Absolutely. We should be as good. Relational databases manage to serve a
> page
> deep down the list quite fast. I must believe if they had to complete the
> traversal, sort the results and extract the page on every single page
> request, they could not be so fast. I think my ideas for the traversal are
> 'supposed' to be performance enhancements, and that is why I like them ;-)
>

I think they are performance enhancements, huge ones. But I still think
there are hard problems involved in putting them into practice.


>
> > I agree the issue of what should be indexed to optimize sorting is a
> > domain-specific problem, but I think that is how relational databases
> treat
> > it as well. If you want sorting to be fast, you have to tell them to
> index
> > the field you will be sorting on. The only difference contra having the
> > user
> > put the sorting index in the graph is that relational databases will
> handle
> > the indexing for you, saving you a *ton* of work, and I think we should
> > too.
> >
>
> Yes. I was discussing automatic indexing with Mattias recently. I think
> (and
> hope I am right) that once we move to automatic indexes, it will be
> possible to put external indexes (à la Lucene) and graph indexes (like the
> ones I favour) behind the same API. In this case perhaps the database will
> more easily be able to make the right optimized decisions, and use the
> index
> for providing sorted results fast and with a low memory footprint where
> possible, based on the existence or non-existence of the necessary indices.
> Then all the developer needs to do to make things really fast is put in the
> right index. For some data that would be Lucene, and for others it would be
> a graph index. If we get to this point, I think we will have closed a key
> usability gap with relational databases.
>

Couldn't agree more :)


>
> There are cases where you need to add this sort of meta data to your domain
> > model, where the sorting logic is too complex, and you see that in
> > relational dbs as well, where people create lookup tables for various
> > things. There are for sure valid uses for that too, but the generic
> > approach
> > I believe covers the *vast* majority of the common use cases.
> >
>
> Perhaps. But I'm not sure the two extremes are as lop-sided as you think.

Re: [Neo4j] REST results pagination

2011-04-21 Thread Jacob Hansson
On Thu, Apr 21, 2011 at 2:59 PM, Rick Bullotta
wrote:

> Fwiw, I think paging is an outdated "crutch", for a few reasons:
>
> 1) bandwidth and browser processing/parsing are largely non issues,
> although they used to be
>

I disagree. They have improved significantly, for sure, but that is no
reason to download massive amounts of data that will never be used.


>
> 2) human users rarely have the patience (and usability sucks) to go beyond
> 2-4 pages of information.  It is far better to allow incrementally refined
> filters and searches to get to a workable subset of data.
>

I agree with the suckiness of paging and the awesomeness of filtering - but
what do you do when the user's filter returns 40 million results? You somehow
have to tell the user: "damn, that filter returned forty freaking million
results, you need to refine your search, buddy".

The way the user expects that to happen is through a paged, infinitely
scrolled or similar interface, where she can see how many results were
returned and act on that feedback.


> 3) machine users could care less about paging
>
>
Agreed, streaming is a much better way for machines to talk about data that
doesn't fit in memory.


> 4) when doing visualization of a large dataset, you generally want the
> whole dataset, not a page of it, so that's another "non use case"
>

Not necessarily true. You need all the data that you want to visualize, but
that is not necessarily all the data the user has asked for. You can be
clever about the visualization to keep it uncluttered, and "paging"-like
behaviours may be a way to do that.


>
> Discuss and debate please!
>
> Rick
>
>
>

Re: [Neo4j] REST results pagination

2011-04-21 Thread Rick Bullotta
Fwiw, we use an "idiot resistant" (no such thing as "idiot proof") approach 
that clamps the number of returned items on the server side by default. We 
allow the user to explicitly request to do something foolish and ask for more 
data, but it requires a conscious effort.


- Reply message -
From: "Jacob Hansson" 
Date: Thu, Apr 21, 2011 10:06 am
Subject: [Neo4j] REST results pagination
To: "Neo4j user discussions" 


Re: [Neo4j] REST results pagination

2011-04-21 Thread Rick Bullotta
Good dialog, btw!

- Reply message -
From: "Jacob Hansson" 
Date: Thu, Apr 21, 2011 10:06 am
Subject: [Neo4j] REST results pagination
To: "Neo4j user discussions" 


Re: [Neo4j] REST results pagination

2011-04-21 Thread Jim Webber
This is indeed a good dialogue. I'd previously thought of pagination versus
streaming as orthogonal issues, but I like the direction
this is going. Let's break it down to fundamentals:

As a remote client, I want to be just as rich and performant as a local client. 
Unfortunately,  Deutsch, Amdahl and Einstein are against me on that, and I 
don't think I am tough enough to defeat those guys.

So what are my choices? I know I have to be more "granular" to try to alleviate
some of the network penalty, so doing operations in bulk sounds great.

Now what I need to decide is whether I control the rate at which those bulk 
operations occur or whether the server does. If I want to control those 
operations, then paging seems sensible. Otherwise a streamed (chunked) encoding 
scheme would make sense if I'm happy for the server to throw results back at me 
at its own pace. Or indeed you can mix both so that pages are streamed.

In either case if I get bored of those results, I'll stop paging or I'll 
terminate the connection.

So what does this mean for implementation on the server? I guess this is 
important since it affects the likelihood of the Neo Tech team implementing it.

If the server supports pagination, it means we need a paging controller in 
memory per paginated result set being created. If we assume that we'll only go 
forward in pages, that's effectively just a wrapper around the traversal that's 
been uploaded. The overhead should be modest, and apart from the paging 
controller and the traverser, it doesn't need much state. We would need to add 
some logic to the representation code to support "next" links, but that seems a 
modest task.
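The forward-only paging controller Jim describes could be sketched as follows. This is a hypothetical plain-Java sketch (the class and method names are my own, and an ordinary Iterator stands in for the uploaded traversal), not the server's actual implementation:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of a forward-only paging controller: a thin
// wrapper around a lazy result iterator (standing in for a traverser).
// One instance would live in memory per paginated result set.
class PagingController<T> {
    private final Iterator<T> results;

    PagingController(Iterator<T> results) {
        this.results = results;
    }

    // Pulls at most pageSize elements from the underlying iterator.
    List<T> nextPage(int pageSize) {
        List<T> page = new ArrayList<>();
        while (page.size() < pageSize && results.hasNext()) {
            page.add(results.next());
        }
        return page;
    }

    // The representation code would emit a "next" link only while true.
    boolean hasNext() {
        return results.hasNext();
    }

    public static void main(String[] args) {
        PagingController<Integer> pc =
                new PagingController<>(List.of(1, 2, 3, 4, 5).iterator());
        System.out.println(pc.nextPage(2)); // [1, 2]
        System.out.println(pc.nextPage(2)); // [3, 4]
        System.out.println(pc.nextPage(2)); // [5]
        System.out.println(pc.hasNext());   // false
    }
}
```

As Jim notes, the state held is modest: just the wrapped traversal position, since pages are only requested forward.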

If the server streams, we will need to decouple the representation generation 
from the existing representation logic since that builds an in-memory 
representation which is then flushed. Instead we'll need a streaming 
representation implementation which seems to be a reasonable amount of 
engineering. We'll also need a new streaming binding to the REST server in 
JAX-RS land.
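The streaming alternative can be sketched in a few lines of plain Java (hypothetical code, not the server's representation classes): instead of building the whole document in memory and flushing it at the end, each result element is serialized and flushed as it is produced.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of a streaming representation: each element is
// written to the output stream as the traversal produces it, so memory
// use stays constant regardless of result size. In JAX-RS terms this
// logic would sit behind a streaming binding such as StreamingOutput.
class StreamingRepresentation {
    static void write(Iterator<String> results, Writer out) throws IOException {
        out.write("[");
        boolean first = true;
        while (results.hasNext()) {
            if (!first) out.write(",");
            out.write("\"" + results.next() + "\"");
            first = false;
            out.flush(); // each element leaves the server as soon as it exists
        }
        out.write("]");
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        StringWriter sw = new StringWriter();
        write(List.of("a", "b", "c").iterator(), sw);
        System.out.println(sw); // ["a","b","c"]
    }
}
```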

I'm still a bit concerned about how "rude" it is for a client to just drop a 
streaming connection. I've asked Mark Nottingham for his authoritative opinion 
on that. But still, this does seem popular and feasible.

Jim







Re: [Neo4j] REST results pagination

2011-04-21 Thread Craig Taverner
>
> I can only think of a few use cases where losing some of the expected
> result is OK, for instance if you want to "peek" at the result.
>

IMHO, paging is, by definition, a "peek". Since the client controls when the
next page will be requested, it is not possible, or reasonable, to enforce
that the complete set of pages (if ever requested) will represent a
consistent result set. This is not supported by relational databases either.
The result set, and the meaning of a page, can change between requests. So it
can, and does, happen that some of the expected result is lost.

This is completely different to the streaming result, which I see Jim
commented on, and so I might just reply to his mail too :-)

> I'm waiting for one of those SlapOnTheFingersExceptions that Tobias has
> been handing out :)
>

My fingers are, as yet, unscathed. The slap can come at any moment! :-)

> This sounds really cool, would be a great thing to look into!
>

Should you want examples, I have a wiki page on this topic at
http://redmine.amanzi.org/wiki/geoptima/Geoptima_Event_Log




Re: [Neo4j] REST results pagination

2011-04-21 Thread Craig Taverner
I think Jim makes a great point about the differences between paging and
streaming, being client or server controlled. I think there is a related
point to be made, and that is that paging does not, and cannot, guarantee a
consistent total result set. Since the database can change between page
requests, the pages can be inconsistent. It is possible for the same record to
appear in two pages, or for a record to be missed. This is certainly how
relational databases work in this regard.

But in the streaming case, we expect a complete and consistent result set.
Unless, of course, the client cuts off the stream. The use cases are very
different: paging is about getting a peek at the data, and rarely about
paging all the way to the end, while streaming is about getting the entire
result, delivered incrementally for efficiency.
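The missed-record case is easy to reproduce in a plain-Java sketch (illustrative only, no Neo4j API involved; all names are mine): with offset-based pages, a delete between two page requests silently shifts a record out of view.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: offset-based paging over a result set that is
// modified between page requests. A record ("c") is silently skipped;
// an insert instead of a delete would make a record appear twice.
class PageConsistencyDemo {
    static List<String> page(List<String> results, int offset, int size) {
        int to = Math.min(offset + size, results.size());
        return offset >= to ? List.of() : new ArrayList<>(results.subList(offset, to));
    }

    public static void main(String[] args) {
        List<String> results = new ArrayList<>(List.of("a", "b", "c", "d"));
        System.out.println(page(results, 0, 2)); // [a, b]
        results.remove("a"); // another client deletes "a" between requests
        System.out.println(page(results, 2, 2)); // [d] -- "c" was never returned
    }
}
```

This is exactly the behaviour relational databases exhibit for OFFSET/LIMIT paging without snapshot isolation across requests.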



[Neo4j] New blog post on non-graph stores for graph-y things

2011-04-21 Thread Jim Webber
Hi guys,

A while ago we were discussing using non-graph native backend for graph 
operations. I've finally gotten around to writing up my thoughts on the thread 
here:

http://jim.webber.name/2011/04/21/e2f48ace-7dba-4709-8600-f29da3491cb4.aspx

As always, I'd value your thoughts and feedback.

Jim


[Neo4j] Basic Node storage/retrieval related question?

2011-04-21 Thread G
I have a POJO with a field "a", which I initialize like this:

Object a = 10;

I store the POJO containing this field using Neo4j.

When I load this POJO, I have a getter method to get the object:

Object getA() {
return a;
}

What should be the class type of a?
I am of the opinion it should be java.lang.Integer, but it is coming out to
be java.lang.String.

I am assuming this is because of node.getProperty(...).
Is there a way I can get an Integer object only?


Also, what types can be stored?

thanks,
Karan
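For what it's worth, in plain Java (no Neo4j involved) an int literal assigned to an Object reference is autoboxed to java.lang.Integer, so a String coming back suggests the value was written as a String at store time. A quick sanity check, with hypothetical names of my own:

```java
// Plain-Java sanity check: `Object a = 10` autoboxes the int to
// java.lang.Integer. If getProperty() later yields a java.lang.String,
// the property was most likely stored as a String, not as an int.
class BoxingCheck {
    static String runtimeClassOf(Object value) {
        return value.getClass().getName();
    }

    public static void main(String[] args) {
        Object a = 10;
        System.out.println(runtimeClassOf(a));    // java.lang.Integer
        System.out.println(a instanceof Integer); // true
    }
}
```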


Re: [Neo4j] Question about REST interface concurrency

2011-04-21 Thread Stephen Roos
Hi Peter,

I'd be glad to share the code; I'll commit soon and share it with the users list.

I've run some more load/concurrency tests and am seeing some strange results.  
Maybe someone can help explain this to me:

I run a load test where I fire off 100K "create empty node" REST requests to 
Neo as quickly as possible.  With my code updates to allow configuration of the 
Jetty thread pool size, I can effectively reduce or increase the maximum 
concurrent transaction limit on the server.  If I limit the thread pool so that 
there is only 1 thread available for requests, I see (as expected) the 
PeakNumberOfConcurrentTransactions reported by the Neo4j Transactions MBean is 
1.  If I scale the thread pool up so that there are 800 available request 
threads, I can throw enough load at the server to cause 800 concurrent 
transactions.  From what I have read, node creation causes a node-local lock, 
not a global node lock, so there shouldn't be a lock-imposed concurrency 
bottleneck.

The strange thing is, no matter whether I have 1 or 800 concurrent 
transactions, my total node creation throughput is always the same (~1600 
nodes/sec).  Even with 800 concurrent transactions, my server is only using 
~15% CPU and ~25% memory (JVM Xms/Xmx = 1024m/2048m), so server load wouldn't 
appear to be an issue.  I've followed all the recommendations I could find 
including sysctl limits and JVM settings, but the rate doesn't change.  I have 
also tried running the load test from multiple clients simultaneously (just to 
be sure I'm not running into any limits on the client machine), and indeed as 
soon as I add a second load test client, the throughput on each client gets cut 
in half.  If I'm talking to Neo in a way that is unrestricted by things like 
thread pool size and concurrency limits, I'd expect to be able to scale up my 
load tests and see at least some level of throughput improvement until I start 
to saturate/overload the box.  The fact that increasing concurrency doesn't 
increase throughput makes me think that there's some internal bottleneck or 
synchronization point that's limiting.

Any thoughts?  I'm glad to look through the code and investigate, any ideas you 
have would be a big help.

Thanks, and sorry for the long question!

Stephen
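
A flat rate across concurrency levels is the classic signature of a fully 
serialized stage: if every request must pass through a shared critical section 
of s seconds (a global lock, a transaction-log fsync), throughput caps at 1/s 
no matter how many threads run. A back-of-envelope sketch (the 1/1600 s figure 
below is inferred from the ~1600 nodes/sec observation above, not measured):

```java
// Sketch: steady-state throughput for n concurrent workers when each request
// does `parallel` seconds of freely overlapping work plus `serial` seconds
// inside a globally serialized stage. The serial stage imposes a hard ceiling.
public class SerialBottleneck {
    public static double throughput(int n, double parallel, double serial) {
        double scaled = n / (parallel + serial); // what n workers could do alone
        double cap = 1.0 / serial;               // ceiling from the serial stage
        return Math.min(scaled, cap);
    }

    public static void main(String[] args) {
        double serial = 1.0 / 1600; // ~0.625 ms fully serialized per request
        for (int n : new int[] { 1, 8, 800 }) {
            // Flat at 1600 req/s for every n -- matching the observation that
            // 1 thread and 800 threads produce the same node-creation rate.
            System.out.println(n + " workers -> "
                    + throughput(n, 0.0, serial) + " req/s");
        }
    }
}
```

Measuring the time span of the serialized portion (e.g. around commit/fsync), 
as you suggest, would confirm or rule this out.
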


-Original Message-
From: Peter Neubauer [mailto:peter.neuba...@neotechnology.com] 
Sent: Monday, April 18, 2011 12:50 AM
To: Neo4j user discussions
Subject: Re: [Neo4j] Question about REST interface concurrency

Stephen,
did you fork the code? Would be good to merge in the changes or at
least take a look at them!

Cheers,

/peter neubauer

GTalk:      neubauer.peter
Skype       peter.neubauer
Phone       +46 704 106975
LinkedIn   http://www.linkedin.com/in/neubauer
Twitter      http://twitter.com/peterneubauer

http://www.neo4j.org               - Your high performance graph database.
http://startupbootcamp.org/    - Öresund - Innovation happens HERE.
http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.



On Mon, Apr 18, 2011 at 4:08 AM, Stephen Roos  wrote:
> Hi Jim,
>
> Thanks for the quick reply.  I tried the configuration mentioned here 
> ("rest_max_jetty_threads"):
>
> https://trac.neo4j.org/changeset/6157/laboratory/components/rest
>
> But it doesn't seem to have changed anything.  I took a look through the code 
> and didn't see any configuration settings exposed in Jetty6WebServer.  I 
> added the changes myself and am starting to see some good results (I've 
> exposed settings for min/max threadpool size, # acceptor threads, acceptor 
> queue size, and request buffer size).  Is there anything else that you'd 
> recommend tweaking to improve throughput?
>
> Thanks again for your help!
>
>
>
> -Original Message-
> From: Jim Webber [mailto:j...@neotechnology.com]
> Sent: Friday, April 15, 2011 1:57 AM
> To: Neo4j user discussions
> Subject: Re: [Neo4j] Question about REST interface concurrency
>
> Hi Stephen,
>
> The same Jetty tweaks that worked in previous versions will work with 1.3. We 
> haven't changed any of the Jetty stuff under the covers.
>
> Jim
> ___
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>


___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Basic Node storage/retrieval related question?

2011-04-21 Thread David Montag
Hi Karan,

Are you using Spring Data Graph, or the "native" Neo4j API?

David

On Thu, Apr 21, 2011 at 10:21 AM, G  wrote:

> I have a pojo with a field "a".
>
> which i initialize like this
> Object a  = 10;
> I store the POJO containing this field using neo4j..
>
> When I load this POJO, I have a getter method to get the object
>
> Object getA() {
>return a;
> }
>
> *What should be the class type of a ? *
> I am of the opinion it should be java.lang.Integer but it is coming out to
> be java.lang.String
>
> I am assuming this is because of node.getProperty(... )
> Is there a way I can get Integer object only.
>
>
> Also what all types can be stored  ?
>
> thanks,
> Karan
>  .
> ___
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>



-- 
David Montag 
Neo Technology, www.neotechnology.com
Cell: 650.556.4411
Skype: ddmontag
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] REST results pagination

2011-04-21 Thread Rick Bullotta
Jim, we should schedule a group chat on this topic.



- Reply message -
From: "Jim Webber" 
Date: Thu, Apr 21, 2011 11:01 am
Subject: [Neo4j] REST results pagination
To: "Neo4j user discussions" 

This is indeed a good dialogue. The pagination versus streaming was something 
I'd previously had in my mind as orthogonal issues, but I like the direction 
this is going. Let's break it down to fundamentals:

As a remote client, I want to be just as rich and performant as a local client. 
Unfortunately,  Deutsch, Amdahl and Einstein are against me on that, and I 
don't think I am tough enough to defeat those guys.

So what are my choices? I know I have to be more "granular" to try to alleviate 
some of the network penalty so doing operations bulkily sounds great.

Now what I need to decide is whether I control the rate at which those bulk 
operations occur or whether the server does. If I want to control those 
operations, then paging seems sensible. Otherwise a streamed (chunked) encoding 
scheme would make sense if I'm happy for the server to throw results back at me 
at its own pace. Or indeed you can mix both so that pages are streamed.

In either case if I get bored of those results, I'll stop paging or I'll 
terminate the connection.

So what does this mean for implementation on the server? I guess this is 
important since it affects the likelihood of the Neo Tech team implementing it.

If the server supports pagination, it means we need a paging controller in 
memory per paginated result set being created. If we assume that we'll only go 
forward in pages, that's effectively just a wrapper around the traversal that's 
been uploaded. The overhead should be modest, and apart from the paging 
controller and the traverser, it doesn't need much state. We would need to add 
some logic to the representation code to support "next" links, but that seems a 
modest task.
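
The forward-only paging controller described above could be little more than a 
thin wrapper over the traversal's iterator; a hypothetical minimal version:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch: a forward-only paging controller wrapping a (lazy) traversal
// iterator. It keeps no state beyond the iterator's own position and hands
// out fixed-size pages on demand.
public class PagingController<T> {
    private final Iterator<T> traversal;
    private final int pageSize;

    public PagingController(Iterator<T> traversal, int pageSize) {
        this.traversal = traversal;
        this.pageSize = pageSize;
    }

    // Drives whether the representation renders a "next" link.
    public boolean hasNextPage() {
        return traversal.hasNext();
    }

    public List<T> nextPage() {
        List<T> page = new ArrayList<T>();
        while (page.size() < pageSize && traversal.hasNext()) {
            page.add(traversal.next());
        }
        return page;
    }
}
```

The server would hold one such controller per paginated result set and render 
a "next" link while hasNextPage() is true.
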

If the server streams, we will need to decouple the representation generation 
from the existing representation logic since that builds an in-memory 
representation which is then flushed. Instead we'll need a streaming 
representation implementation which seems to be a reasonable amount of 
engineering. We'll also need a new streaming binding to the REST server in 
JAX-RS land.

I'm still a bit concerned about how "rude" it is for a client to just drop a 
streaming connection. I've asked Mark Nottingham for his authoritative opinion 
on that. But still, this does seem popular and feasible.

Jim





___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Question about REST interface concurrency

2011-04-21 Thread Jim Webber
Hi Stephen,

Are you running on Linux (or Windows) by any chance? I wonder whether the 
asymptotic performance you're seeing is because you've gotten to a point 
where you're exercising the IO channel and file system.

Jim
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Strange performance difference on different machines

2011-04-21 Thread Bob Hutchison

On 2011-04-20, at 7:30 AM, Tobias Ivarsson wrote:

> Sorry I got a bit distracted when writing this. I should have added that I
> then want you to send the results of running that benchmark to me so that I
> can further analyze what the cause of these slow writes might be.
> 
> Thank you,
> Tobias

That's what I figured you meant. Sorry for the delay, here they are:

On a HP z400, quad Xeon W3550 @ 3.07GHz
ext4 filesystem
-

>> dd if=/dev/urandom of=store bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 111.175 s, 9.4 MB/s
>> dd if=store of=/dev/null bs=100M
10+0 records in
10+0 records out
1048576000 bytes (1.0 GB) copied, 0.281153 s, 3.7 GB/s
>> dd if=store of=/dev/null bs=100M
10+0 records in
10+0 records out
1048576000 bytes (1.0 GB) copied, 0.244339 s, 4.3 GB/s
>> dd if=store of=/dev/null bs=100M
10+0 records in
10+0 records out
1048576000 bytes (1.0 GB) copied, 0.242583 s, 4.3 GB/s


>> ./run ../store logfile 33 100 500 100
tx_count[100] records[31397] fdatasyncs[100] read[0.9881029 MB] wrote[1.9762058 MB]
Time was: 5.012
19.952114 tx/s, 6264.365 records/s, 19.952114 fdatasyncs/s, 201.87897 kB/s on reads, 403.75793 kB/s on writes

>> ./run ../store logfile 33 1000 5000 10
tx_count[10] records[30997] fdatasyncs[10] read[0.9755144 MB] wrote[1.9510288 MB]
Time was: 0.604
16.556292 tx/s, 51319.54 records/s, 16.556292 fdatasyncs/s, 1653.8523 kB/s on reads, 3307.7046 kB/s on writes

>> ./run ../store logfile 33 1000 5000 100
tx_count[100] records[298245] fdatasyncs[100] read[9.386144 MB] wrote[18.772287 MB]
Time was: 199.116
0.5022198 tx/s, 1497.8455 records/s, 0.5022198 fdatasyncs/s, 48.270412 kB/s on reads, 96.540825 kB/s on writes

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free   buff   cache   si   so    bi    bo    in    cs  us sy id wa
 1  2      0 8541712 336716 3670940    0    0     1     7    12    20   4  1 95  0
 0  2      0 8525712 336716 3670948    0    0     0   979  1653  3186   4  1 60 35
 1  2      0 8525220 336716 3671204    0    0     0  1244  1671  3150   4  1 71 24
 0  2      0 8524724 336716 3671332    0    0     0   709  1517  3302   4  1 65 30
 0  2      0 8524476 336716 3671460    0    0     0  1033  1680 69342   5  7 59 29
 0  2      0 8539168 336716 3671588    0    0     0  1375  1599  3272   3  1 70 25
 1  2      0 8538860 336716 3671716    0    0     0  1157  1594  3097   3  1 72 24
 0  1      0 8541340 336716 3671844    0    0     0  1151  1512  3182   3  2 70 25
 0  1      0 8524812 336716 3671972    0    0     0  1597  1641  3391   4  2 72 22


___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Strange performance difference on different machines

2011-04-21 Thread Michael Hunger
Bob,

I don't know if you have already answered these questions. 

Which JDK (also version) are you using for that, what are the JVM memory 
settings?

Do you have a profiler handy that you could throw at your benchmark? (E.g. 
yourkit has a 30 day trial, other profilers surely too).

Do you have the source code of your tests at hand? So we could run exactly the 
same code on our own Linux systems for cross checking?

What Linux distribution is it, and 64 or 32 bit? Do you also have a disk 
formatted with ext3 to cross check? (Perhaps just a loopback device).

How much memory does the linux box have available?

Thanks so much.

Michael

Am 21.04.2011 um 21:53 schrieb Bob Hutchison:

> 
> On 2011-04-20, at 7:30 AM, Tobias Ivarsson wrote:
> 
>> Sorry I got a bit distracted when writing this. I should have added that I
>> then want you to send the results of running that benchmark to me so that I
>> can further analyze what the cause of these slow writes might be.
>> 
>> Thank you,
>> Tobias
> 
> That's what I figured you meant. Sorry for the delay, here they are:
> 
> On a HP z400, quad Xeon W3550 @ 3.07GHz
> ext4 filesystem
> -
> 
>>> dd if=/dev/urandom of=store bs=1M count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 111.175 s, 9.4 MB/s
>>> dd if=store of=/dev/null bs=100M
> 10+0 records in
> 10+0 records out
> 1048576000 bytes (1.0 GB) copied, 0.281153 s, 3.7 GB/s
>>> dd if=store of=/dev/null bs=100M
> 10+0 records in
> 10+0 records out
> 1048576000 bytes (1.0 GB) copied, 0.244339 s, 4.3 GB/s
>>> dd if=store of=/dev/null bs=100M
> 10+0 records in
> 10+0 records out
> 1048576000 bytes (1.0 GB) copied, 0.242583 s, 4.3 GB/s
> 
> 
>>> ./run ../store logfile 33 100 500 100
> tx_count[100] records[31397] fdatasyncs[100] read[0.9881029 MB] 
> wrote[1.9762058 MB]
> Time was: 5.012
> 19.952114 tx/s, 6264.365 records/s, 19.952114 fdatasyncs/s, 201.87897 kB/s on 
> reads, 403.75793 kB/s on writes
> 
>>> ./run ../store logfile 33 1000 5000 10 
> tx_count[10] records[30997] fdatasyncs[10] read[0.9755144 MB] wrote[1.9510288 
> MB]
> Time was: 0.604
> 16.556292 tx/s, 51319.54 records/s, 16.556292 fdatasyncs/s, 1653.8523 kB/s on 
> reads, 3307.7046 kB/s on writes
> 
>>> ./run ../store logfile 33 1000 5000 100 
> tx_count[100] records[298245] fdatasyncs[100] read[9.386144 MB] 
> wrote[18.772287 MB]
> Time was: 199.116
> 0.5022198 tx/s, 1497.8455 records/s, 0.5022198 fdatasyncs/s, 48.270412 kB/s 
> on reads, 96.540825 kB/s on writes
> 
> procs ---memory-- ---swap-- -io -system-- cpu
> r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
> 1  2  0 8541712 336716 367094000 1 7   12   20  4  1 95  0
> 0  2  0 8525712 336716 367094800 0   979 1653 3186  4  1 60 35
> 1  2  0 8525220 336716 367120400 0  1244 1671 3150  4  1 71 24
> 0  2  0 8524724 336716 367133200 0   709 1517 3302  4  1 65 30
> 0  2  0 8524476 336716 367146000 0  1033 1680 69342  5  7 59 
> 29
> 0  2  0 8539168 336716 367158800 0  1375 1599 3272  3  1 70 25
> 1  2  0 8538860 336716 367171600 0  1157 1594 3097  3  1 72 24
> 0  1  0 8541340 336716 367184400 0  1151 1512 3182  3  2 70 25
> 0  1  0 8524812 336716 367197200 0  1597 1641 3391  4  2 72 22
> 
> 
> ___
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user

___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] REST results pagination

2011-04-21 Thread Michael Hunger
Really cool discussion so far,

I would also prefer streaming over paging as with that approach we can give 
both ends more of the control they need.

The server doesn't have to keep state over a long time (and also implement 
timeouts and clearing of that state, and keeping that state for lots of clients 
also adds up).
The client can decide how much of the result it's interested in, be it just 
1 entry or 100k, and then drop the connection.
Streaming calls can also have a request-timeout, so keeping those open for too 
long (with no activity) will close them automatically.
The server doesn't use up lots of memory for streaming; one could even leverage 
the laziness of traversers (and indexes) to avoid executing/fetching results 
that are not going to be sent over the wire.

This should accommodate every kind of client from the mobile phone which only 
lists a few entries, to the big machine that can eat a firehose of result data 
in milliseconds.

For this kind of "look-ahead" support we could (and should) add a possible 
offset, so that a client can request data (whose order _he_ is sure hasn't 
changed) by having the server skip the first n entries (so they don't have 
to be serialized/put on the wire).

I also think that this streaming API could already address many of the 
pain-points of the current REST API. Perhaps we even want to provide a 
streaming interface in both directions, having the client being able to for 
instance stream the creation of nodes and relationships and their indexing 
without restarting a connection for each operation. Whatever comes in this 
stream could also be processed in one TX (or with TX tokens embedded in the 
stream the client could even control that).

The only question here for me is whether we want to put it on top of 
the existing REST API or rather create a more concise API/format for that 
(with the later option of the format even degrading to binary for 
high-bandwidth interaction). I'd prefer the latter.

Cheers

Michael

Am 21.04.2011 um 21:09 schrieb Rick Bullotta:

> Jim, we should schedule a group chat on this topic.
> 
> 
> 
> - Reply message -
> From: "Jim Webber" 
> Date: Thu, Apr 21, 2011 11:01 am
> Subject: [Neo4j] REST results pagination
> To: "Neo4j user discussions" 
> 
> This is indeed a good dialogue. The pagination versus streaming was something 
> I'd previously had in my mind as orthogonal issues, but I like the direction 
> this is going. Let's break it down to fundamentals:
> 
> As a remote client, I want to be just as rich and performant as a local 
> client. Unfortunately,  Deutsch, Amdahl and Einstein are against me on that, 
> and I don't think I am tough enough to defeat those guys.
> 
> So what are my choices? I know I have to be more "granular" to try to 
> alleviate some of the network penalty so doing operations bulkily sounds 
> great.
> 
> Now what I need to decide is whether I control the rate at which those bulk 
> operations occur or whether the server does. If I want to control those 
> operations, then paging seems sensible. Otherwise a streamed (chunked) 
> encoding scheme would make sense if I'm happy for the server to throw results 
> back at me at its own pace. Or indeed you can mix both so that pages are 
> streamed.
> 
> In either case if I get bored of those results, I'll stop paging or I'll 
> terminate the connection.
> 
> So what does this mean for implementation on the server? I guess this is 
> important since it affects the likelihood of the Neo Tech team implementing 
> it.
> 
> If the server supports pagination, it means we need a paging controller in 
> memory per paginated result set being created. If we assume that we'll only 
> go forward in pages, that's effectively just a wrapper around the traversal 
> that's been uploaded. The overhead should be modest, and apart from the 
> paging controller and the traverser, it doesn't need much state. We would 
> need to add some logic to the representation code to support "next" links, 
> but that seems a modest task.
> 
> If the server streams, we will need to decouple the representation generation 
> from the existing representation logic since that builds an in-memory 
> representation which is then flushed. Instead we'll need a streaming 
> representation implementation which seems to be a reasonable amount of 
> engineering. We'll also need a new streaming binding to the REST server in 
> JAX-RS land.
> 
> I'm still a bit concerned about how "rude" it is for a client to just drop a 
> streaming connection. I've asked Mark Nottingham for his authoritative 
> opinion on that. But still, this does seem popular and feasible.
> 
> Jim
> 
> 
> 
> 
> 
> ___
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] REST results pagination

2011-04-21 Thread Rick Otten
Half-baked thoughts from a neo4j newbie hacker type on this topic:

1)  I think it is very important, even with modern infrastructures, for
the client to be able to optionally throttle the result set it generates
with a query as it sees fit, and not just because of client memory and
bandwidth limitations.

With regular old SQL databases if you send a careless large query, you
can chew up significant system resources, for significant amounts of
time while it is being processed.  At a minimum, a rowcount/pagination
option allows you to build something into your client which can
minimize accidental denial of service queries.   I'm not sure if it is
possible to construct a query against a large Neo4j database that
would temporarily cripple it, but it wouldn't surprise me if you
could.


2) Sometimes with regular old SQL databases I'll run a sanity check
"count()" function with the query to just return the size of the expected
result set before I try to pull it back into my data structure.  Many
times "count()" is all I needed anyhow.   Does Neo4j have a result set
size function?  Perhaps a client that really could only handle small
result sets could run a count(), and then filter the search somehow, if
necessary, until the count() was smaller?  (I guess it would depend on the
problem domain...)

   In other words it may be possible, when it is really important, to
implement pagination logic on the client side, if you don't mind
running multiple queries for each set of data you get back.


3)  If the result set was broken into pages, you could organize the pages
in the server with a set of [temporary] graph nodes with relationships to
the results in the database -- one node for each page, and a parent node
for the result set.   If order of the pages is important, you could add
directed relationships between the page nodes.  If the order within the
pages is important you could either apply a sequence numbering to the
page-result relationship, or add directed temporary result-set
relationships too.

Subsequent page retrievals would be new traversals based on the search
result set graph.  In a sense you would be building a temporary
graph-index I suppose.

An advantage to organizing search result sets this way is that you
could then union and intersect result sets (and do other set
operations) without a huge memory overhead.  (Which means you could
probably store millions of search results at one time, and you could
persist them through restarts.)
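
Point 3 could look like this in plain-Java form (ordered pages snapshotted 
once from a result set; in Neo4j the pages and their ordering would live as 
temporary nodes and directed relationships, as described):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: snapshot a result set into fixed-size, ordered pages so follow-up
// page requests re-read a page instead of re-running the query. Plain lists
// stand in for the temporary page nodes and their ordering relationships.
public class ResultPages<T> {
    private final List<List<T>> pages = new ArrayList<List<T>>();

    public ResultPages(List<T> results, int pageSize) {
        for (int i = 0; i < results.size(); i += pageSize) {
            int end = Math.min(i + pageSize, results.size());
            pages.add(new ArrayList<T>(results.subList(i, end)));
        }
    }

    public int pageCount() {
        return pages.size();
    }

    public List<T> page(int index) {
        return pages.get(index);
    }
}
```

Set operations (union, intersection) over such persisted page structures then 
work on the stored references rather than re-materialized result sets.
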



4) In some HA architectures you may have multiple database copies behind a
load balancer.  Would the search result pages be stored equally on all of
them?  Would the client require a "sticky" flag, to always go back to the
same specific server instance for more pages?

   Depending on how fast writes get propagated across the cluster
(compared to requests for the next page), if you were creating nodes as
described in (3) would that work?



5) As for sorting:

   In my experience, if I need a result set sorted from a regular SQL
database, I will usually sort it myself.  Most databases I've ever
worked with routinely have performance problems.  You can minimize
finger pointing and the risk of complicating those other performance
problems by just directing the database to get me what I need, I'll do
the rest of it back in the client.

   On the other hand, sometimes it is quicker and easier to let the
database do the work. (Usually when I can only handle the data in small
chunks on the client.)

   What I'm trying to say, is that I think sorting is going to be more
important to clients who want paginated results (ie, using resource
limited clients), than to clients who are grabbing large chunks of data
at a time (and will want to "own" any post-query processing steps
anyhow).


-- 
Rick Otten
rot...@windfish.net
O=='=+


___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] REST results pagination

2011-04-21 Thread Michael Hunger
Rick,

great thoughts.

Good catch, forgot to add the in-graph representation of the results to my 
mail, thanks for adding that part. Temporary (transient) nodes and 
relationships would really rock here, with the advantage that with HA you have 
them distributed to all cluster nodes.
Certainly Craig has to add some interesting things to this, as those resemble 
probably his in graph indexes / R-Trees.

As traversers are lazy, a count operation is not so easily possible; you could 
run the traversal and discard the results. But then the client could also just 
pull those results until it reaches its internal thresholds and then decide to 
use more filtering, or stop the pulling and ask the user for more filtering 
(you can always retrieve n+1 and show the user that there are more than "n" 
results available).
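
The retrieve-n+1 trick amounts to consuming a lazy result iterator just far 
enough to learn whether more than n results exist; a small sketch:

```java
import java.util.Iterator;

// Sketch: bounded counting over a lazy result iterator. We consume at most
// n + 1 elements, so the full (possibly huge) result set is never
// materialized just to answer "are there more than n?".
public class BoundedCount {
    /** Returns the number of elements, but never more than n + 1. */
    public static int countUpTo(Iterator<?> results, int n) {
        int count = 0;
        while (count <= n && results.hasNext()) {
            results.next();
            count++;
        }
        return count; // count == n + 1 means "more than n results available"
    }

    public static void main(String[] args) {
        Iterator<String> three = java.util.Arrays.asList("a", "b", "c").iterator();
        System.out.println(countUpTo(three, 2)); // 3, i.e. "more than 2"
    }
}
```
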

The index result size() method only returns an estimate of the result size 
(which might not contain currently changed index entries).

Please don't forget that a count() query in an RDBMS can be as ridiculously 
expensive as the original query (especially if just the column selection was 
replaced with count, and sorting, grouping, etc. were still left in place 
together with lots of joins).

Sorting on your own instead of letting the db do that mostly harms the 
performance as it requires you to build up all the data in memory, sort it and 
then use it. Instead of having the db do that more efficiently, stream the data 
and you can use it directly from the stream.

Cheers

Michael

Am 21.04.2011 um 23:04 schrieb Rick Otten:

> Half-baked thoughts from a neo4j newbie hacker type on this topic:
> 
> 1)  I think it is very important, even with modern infrastructures, for
> the client to be able to optionally throttle the result set it generates
> with a query as it sees fit, and not just because of client memory and
> bandwidth limitations.
> 
>With regular old SQL databases if you send a careless large query, you
> can chew up significant system resources, for significant amounts of
> time while it is being processed.  At a minimum, a rowcount/pagination
> option allows you to build something into your client which can
> minimize accidental denial of service queries.   I'm not sure if it is
> possible to construct a query against a large Neo4j database that
> would temporarily cripple it, but it wouldn't surprise me if you
> could.
> 
> 
> 2) Sometimes with regular old SQL databases I'll run a sanity check
> "count()" function with the query to just return the size of the expected
> result set before I try to pull it back into my data structure.  Many
> times "count()" is all I needed anyhow.   Does Neo4j have a result set
> size function?  Perhaps a client that really could only handle small
> result sets could run a count(), and then filter the search somehow, if
> necessary, until the count() was smaller?  (I guess it would depend on the
> problem domain...)
> 
>   In other words it may be possible, when it is really important, to
> implement pagination logic on the client side, if you don't mind
> running multiple queries for each set of data you get back.
> 
> 
> 3)  If the result set was broken into pages, you could organize the pages
> in the server with a set of [temporary] graph nodes with relationships to
> the results in the database -- one node for each page, and a parent node
> for the result set.   If order of the pages is important, you could add
> directed relationships between the page nodes.  If the order within the
> pages is important you could either apply a sequence numbering to the
> page-result relationship, or add directed temporary result set directed
> relationships too.
> 
>Subsequent page retrievals would be new traversals based on the search
> result set graph.  In a sense you would be building a temporary
> graph-index I suppose.
> 
>And advantage to organizing search result sets this way is that you
> could then union and intersect result sets (and do other set
> operations) without a huge memory overhead.  (Which means you could
> probably store millions of search results at one time, and you could
> persist them through restarts.)
> 
> 
> 
> 4) In some HA architectures you may have multiple database copies behind a
> load balancer.  Would the search result pages be stored equally on all of
> them?  Would the client require a "sticky" flag, to always go back to the
> same specific server instance for more pages?
> 
>   Depending on how fast writes get propagated across the cluster
> (compared to requests for the next page), if you were creating nodes as
> described in (3) would that work?
> 
> 
> 
> 5) As for sorting:
> 
>   In my experience, if I need a result set sorted from a regular SQL
> database, I will usually sort it myself.  Most databases I've ever
> worked with routinely have performance problems.  You can minimize
> finger pointing and the risk of complicating those other performance
> problems by just directing the database to get me what I need, I'll 

[Neo4j] Sobre neo

2011-04-21 Thread Jose Angel Inda Herrera
Thanks, Javier, for your help; those articles are very useful.

___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


[Neo4j] about two database

2011-04-21 Thread Jose Angel Inda Herrera
Hello list, I have a couple of questions:
1. I have two (graph) databases, and I need to get information from one 
database into another without having to take the target database instance 
from the other database.
2. I need to know how to open a (graph) database if it already exists.
Thanks beforehand
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Question about REST interface concurrency

2011-04-21 Thread Stephen Roos
I'm running on Linux (2.6.18).  Watching network utilization, I never see rates 
higher than ~2.5 MBps on the server.  I've also set net.core.rmem_min/max and 
net.ipv4.tcp_rmem/wmem in sysctl to be quite high based on some recommendations 
I've found.  Is this contrary to your own load tests?  Are you able to hit the 
server with enough load that the system is maxed out?  I was considering adding 
some instrumentation around transactions so that I can see the average internal 
transaction time span during a load test.  If you have any other thoughts on 
what to look for/test, I'd be very appreciative.

Thanks again,
Stephen

-Original Message-
From: Jim Webber [mailto:j...@neotechnology.com] 
Sent: Thursday, April 21, 2011 12:24 PM
To: Neo4j user discussions
Subject: Re: [Neo4j] Question about REST interface concurrency

Hi Stephen,

Are you running on Linux (or Windows) by any chance? I wonder whether the 
asymptotic performance you're seeing is because you've gotten to a point 
where you're exercising the IO channel and file system.

Jim
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Basic Node storage/retrieval related question?

2011-04-21 Thread G
Hello David,

The problem occurs while developing with Spring Data Graph.
Can you help me out with these issues?

Best,
Karan

On Fri, Apr 22, 2011 at 12:19 AM, David Montag <
david.mon...@neotechnology.com> wrote:

> Hi Karan,
>
> Are you using Spring Data Graph, or the "native" Neo4j API?
>
> David
>
> On Thu, Apr 21, 2011 at 10:21 AM, G  wrote:
>
> > I have a pojo with a field "a".
> >
> > which i initialize like this
> > Object a  = 10;
> > I store the POJO containing this field using neo4j..
> >
> > When I load this POJO, I have a getter method to get the object
> >
> > Object getA() {
> >return a;
> > }
> >
> > *What should be the class type of a ? *
> > I am of the opinion it should be java.lang.Integer but it is coming out
> to
> > be java.lang.String
> >
> > I am assuming this is because of node.getProperty(... )
> > Is there a way I can get Integer object only.
> >
> >
> > Also what all types can be stored  ?
> >
> > thanks,
> > Karan
> >  .
> > ___
> > Neo4j mailing list
> > User@lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user
> >
>
>
>
> --
> David Montag 
> Neo Technology, www.neotechnology.com
> Cell: 650.556.4411
> Skype: ddmontag
> ___
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Basic Node storage/retrieval related question?

2011-04-21 Thread G
David,

and the issue is that I want to store different types of objects, retrieve
them, and call different methods using reflection, where this value acts as
a parameter.
Unfortunately, I am storing an Integer and getting back an instance of type
String.

Is there something I need to do differently ?
Please let me know asap.

Best,
Karan

On Fri, Apr 22, 2011 at 8:30 AM, G  wrote:

> Hello David,
>
> The problem is while developing using Spring Data Graph.
> Can you help me out with these issues.
>
> Best,
> Karan
>
>
> On Fri, Apr 22, 2011 at 12:19 AM, David Montag <
> david.mon...@neotechnology.com> wrote:
>
>> Hi Karan,
>>
>> Are you using Spring Data Graph, or the "native" Neo4j API?
>>
>> David
>>
>> On Thu, Apr 21, 2011 at 10:21 AM, G  wrote:
>>
>> > I have a pojo with a field "a".
>> >
>> > which i initialize like this
>> > Object a  = 10;
>> > I store the POJO containing this field using neo4j..
>> >
>> > When I load this POJO, I have a getter method to get the object
>> >
>> > Object getA() {
>> >     return a;
>> > }
>> >
>> > *What should be the class type of a ? *
>> > I am of the opinion it should be java.lang.Integer but it is coming out
>> to
>> > be java.lang.String
>> >
>> > I am assuming this is because of node.getProperty(... )
>> > Is there a way I can get Integer object only.
>> >
>> >
>> > Also what all types can be stored  ?
>> >
>> > thanks,
>> > Karan
>> >  .
>> >
>>
>>
>>
>> --
>> David Montag 
>> Neo Technology, www.neotechnology.com
>> Cell: 650.556.4411
>> Skype: ddmontag
>>
>
>


Re: [Neo4j] Question from Webinar - traversing a path with nodes of different types

2011-04-21 Thread David Montag
Hi Vipul,

Out of curiosity, what does "process" in this context mean?

As Rick alludes to, you'd have some component performing the simulation
using the domain objects and possibly a graph traversal.

An example of an algorithm for this would be to walk the graph from 1 and,
whenever you find a branch, split the walk. When a walk reaches a point where
more than one branch joins, you use some kind of synchronization to join the
walks before continuing.
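[Editor's note: an illustrative sketch, not from the original thread. It shows the split-and-join idea with plain Java threads and a CountDownLatch, using the node numbers from the example in this thread. The two branches are hard-coded for brevity; a real implementation would discover them from a graph traversal.]

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Split-and-join walk of the example network 1 -> {2 -> 3 | 7 -> 8} -> 5 -> 6:
// each branch after the split runs in its own thread, and the join point (5)
// waits until both branches have finished before being processed.
public class BranchJoinWalk {
    static List<Integer> walk() throws InterruptedException {
        List<Integer> processed = Collections.synchronizedList(new ArrayList<Integer>());
        CountDownLatch joinAt5 = new CountDownLatch(2); // node 5 waits for both branches

        processed.add(1); // start node; the walk splits here

        Thread viaRouter2 = new Thread(() -> {
            processed.add(2);
            processed.add(3);
            joinAt5.countDown(); // this branch has reached the join point
        });
        Thread viaRouter7 = new Thread(() -> {
            processed.add(7);
            processed.add(8);
            joinAt5.countDown();
        });
        viaRouter2.start();
        viaRouter7.start();

        joinAt5.await();  // block until both sub-chains are done
        processed.add(5); // now safe to process the join point
        processed.add(6); // and finally the server

        viaRouter2.join();
        viaRouter7.join();
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(walk());
    }
}
```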

Does this make sense?

David

On Wed, Apr 20, 2011 at 11:16 PM, Vipul Gupta wrote:

> Hi David,
>
> Inputs are 1 and 6 + Graph is acyclic.
>
> domain.Client@1 -> domain.Router@2 -> domain.Router@3 -> domain.Router@5 -> domain.Server@6
>                 -> domain.Router@7 -> domain.Router@8 ->
>
> I want a way to start from 1,
>
> process the 2 path till it reaches 5 (say, in one thread),
> process the 7 path till it reaches 5 (in another thread),
>
>  then process 5 and eventually 6.
> The above step of processing an intermediate path and waiting on the blocking
> point can happen over and over again in a more complex graph (that is, there
> could be a number of such branch-and-join sections in between), and the
> traversal stops only when we reach 6.
>
> I hope this makes it a bit clearer. I was working on something for this, but
> it is turning out to be too complex a solution for this sort of graph
> traversal, so I am hoping you can suggest something.
>
> Best Regards,
> Vipul
>
>
> On Thu, Apr 21, 2011 at 11:36 AM, David Montag <
> david.mon...@neotechnology.com> wrote:
>
>> Hi Vipul,
>>
>> Zooming out a little bit, what are the inputs to your algorithm, and what
>> do you want it to do?
>>
>> For example, given 1 and 6, do you want to find any points in the chain
>> between them that are join points of two (or more) subchains (5 in this
>> case)?
>>
>> David
>>
>>
>> On Wed, Apr 20, 2011 at 10:56 PM, Vipul Gupta wrote:
>>
>>> my mistake - I meant "5" depends on both 3 and 8 and acts as a blocking
>>> point till 3 and 8 finishes
>>>
>>>
>>> On Thu, Apr 21, 2011 at 11:19 AM, Vipul Gupta 
>>> wrote:
>>>
 David/Michael,

 Let me modify the example a bit.
 What if my graph structure is like this

 domain.Client@1 -> domain.Router@2 -> domain.Router@3 -> domain.Router@5 -> domain.Server@6
                 -> domain.Router@7 -> domain.Router@8 ->


 Imagine a manufacturing line.
 6 depends on both 3 and 8 and acts as a blocking point till 3 and 8
 finishes.

 Is there a way to get a cleaner traversal for this kind of
 relationship? I want to get the complete intermediate traversal from
 Client to Server.

 Thanks a lot for helping out on this.

 Best Regards,
 Vipul




 On Thu, Apr 21, 2011 at 12:09 AM, David Montag <
 david.mon...@neotechnology.com> wrote:

> Hi Vipul,
>
> Thanks for listening!
>
> It's a very good question, and the short answer is: yes! I'm cc'ing our
> mailing list so that everyone can take part in the answer.
>
> Here's the long answer, illustrated by an example:
>
> Let's assume you're modeling a network. You'll have some domain classes
> that are all networked entities with peers:
>
> @NodeEntity
> public class NetworkEntity {
>     @RelatedTo(type = "PEER", direction = Direction.BOTH,
>                elementClass = NetworkEntity.class)
>     private Set<NetworkEntity> peers;
>
>     public void addPeer(NetworkEntity peer) {
>         peers.add(peer);
>     }
> }
>
> public class Server extends NetworkEntity {}
> public class Router extends NetworkEntity {}
> public class Client extends NetworkEntity {}
>
> Then we can build a small network:
>
> Client c = new Client().persist();
> Router r1 = new Router().persist();
> Router r21 = new Router().persist();
> Router r22 = new Router().persist();
> Router r3 = new Router().persist();
> Server s = new Server().persist();
>
> c.addPeer(r1);
> r1.addPeer(r21);
> r1.addPeer(r22);
> r21.addPeer(r3);
> r22.addPeer(r3);
> r3.addPeer(s);
>
> c.persist();
>
> Note that after linking the entities, I only call persist() on the
> client. You can read more about this in the reference documentation, but
> essentially it will cascade in the direction of the relationships created,
> and will in this case cascade all the way to the server entity.
>
> You can now query this:
>
> Iterable<EntityPath<Client, NetworkEntity>> paths =
>         c.findAllPathsByTraversal(Traversal.description());
>
> The above code will get you an EntityPath per node visited during the
> traversal from c. The example does however not use a very interesting
> traversal description, but you can still print the results:
>
> for (EntityPath<Client, NetworkEntity> path : paths) {
> StringBuilder sb = new StringBuilder();
> Iterator iter =
> path.nodeEntities().ite