Re: [Neo4j] Sampling a Neo4j instance?
Would this work in HA mode too (i.e. HighlyAvailableGraphDatabase)? I can see that the 'getConfig' is there -- but does the cast to NeoStoreXaDataSource work as well? Thanks. Date: Wed, 16 Nov 2011 21:40:32 +0200 From: chris.gio...@neotechnology.com To: user@lists.neo4j.org Subject: Re: [Neo4j] Sampling a Neo4j instance? No, GraphDatabaseService wisely hides those things away. I would suggest using instanceof and casting to EmbeddedGraphDatabase. cheers, CG 2011/11/16 Anders Lindström andli...@hotmail.com: Chris, thanks again for your replies. I realize now that I don't have the 'getConfig' method -- I'm writing a server plugin and I only get the GraphDatabaseService interface passed to my method, not an EmbeddedGraphDatabase. Is there an equivalent way of getting the highest node index through the interface? Thanks. Date: Thu, 10 Nov 2011 12:01:31 +0200 From: chris.gio...@neotechnology.com To: user@lists.neo4j.org Subject: Re: [Neo4j] Sampling a Neo4j instance? Answers inline. 2011/11/9 Anders Lindström andli...@hotmail.com: Thanks to both of you. I am very grateful that you took the time to put this into code -- how's that for community! I presume this way of getting 'highId' is constant in time? It looks rather messy though -- is it really the most straightforward way to do it? This is the safest way to do it, since it takes into consideration crashes and HA cluster membership. Another way to do it is long highId = db.getConfig().getIdGeneratorFactory().get( IdType.NODE ).getHighId(); which can return the same value as the first, if some conditions are met. It is shorter and cast-free but I'd still use the first way. getHighId() is a constant-time operation for both ways described - it is just a field access, with an additional long comparison for the first case. I am thinking about how efficient this will be. As I understand it, the sampling misses come from deleted nodes that once were there.
But if I remember correctly, Neo4j tries to reuse these unused node indices when new nodes are added. But is an unused node index _guaranteed_ to be used given that there is one, or could inserting another node result in increasing 'highId' even though some indices below it are not used? During the lifetime of a Neo4j instance there is no id reuse for Nodes and Relationships - deleted ids are saved however and will be reused the next time Neo4j starts. This means that if during run A you deleted nodes 3 and 5, the first two nodes returned by createNode() on the next run will have ids 3 and 5 - so highId will not change. Additionally, during run A, after deleting nodes 3 and 5, no new nodes would have the id 3 or 5. A crash (or improper shutdown) of the database will break this however, since the ids-to-recycle will probably not make it to disk. So, in short, it is guaranteed that ids *won't* be reused in the same run but not guaranteed to be reused between runs. My conclusion is that the sampling misses will increase with index usage sparseness and that we will have a high rate of sampling misses when we had many deletes and few insertions recently. Would you agree? Yes, that is true, especially given the cost of the wasted I/O and of handling the exception. However, this cost can go down significantly if you keep a hash set for the ids of nodes you have deleted and check that before asking for the node by id, instead of catching an exception. Persisting that between runs would move you away from encapsulated Neo4j constructs and would also be more efficient. Thanks again. Regards,Anders Date: Wed, 9 Nov 2011 19:30:36 +0200 From: chris.gio...@neotechnology.com To: user@lists.neo4j.org Subject: Re: [Neo4j] Sampling a Neo4j instance? 
Hi, Backing Jim's algorithm with some code:

public static void main( String[] args )
{
    long SAMPLE_SIZE = 1;
    EmbeddedGraphDatabase db = new EmbeddedGraphDatabase( "path/to/db/" );
    // Determine the highest possible id for the node store
    long highId = ( (NeoStoreXaDataSource) db.getConfig().getTxModule()
            .getXaDataSourceManager()
            .getXaDataSource( Config.DEFAULT_DATA_SOURCE_NAME ) )
            .getNeoStore().getNodeStore().getHighId();
    System.out.println( highId + " is the highest id" );
    long i = 0;
    long nextId;
    // Do the sampling
    Random random = new Random();
    while ( i < SAMPLE_SIZE )
    {
        nextId = Math.abs( random.nextLong() ) % highId;
        try
        {
            db.getNodeById( nextId );
            i++;
            System.out.println( "id " + nextId + " is there" );
        }
        catch ( NotFoundException e )
        {
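The deleted-id set optimization Chris suggests later in this thread (keep a hash set of deleted ids and check it before asking for the node) can be sketched without touching the database API. The class and method names below are hypothetical, and the real lookup (db.getNodeById) is replaced by a comment; the point is only that a cheap set lookup replaces the exception-driven miss handling:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class SampleWithDeletedSet {

    // Draw random ids below highId, skipping known-deleted ids with a cheap
    // set lookup instead of catching NotFoundException on every miss.
    static int sample(long highId, Set<Long> deletedIds, int sampleSize) {
        Random random = new Random();
        int hits = 0;
        while (hits < sampleSize) {
            long nextId = Math.abs(random.nextLong()) % highId;
            if (deletedIds.contains(nextId)) {
                continue; // a real miss would have cost an exception plus I/O
            }
            // in real code: db.getNodeById(nextId)
            hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<Long> deleted = new HashSet<>();
        deleted.add(3L);
        deleted.add(5L);
        System.out.println(sample(10, deleted, 4));
    }
}
```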
[Neo4j] best way to get all directly related nodes?
Hi everybody, what is the most performant way to get all directly related nodes? I know that there are the following possibilities:
- node.getRelationships()
- node.traverse(StopEvaluator.DEPTH_ONE)
- Cypher
In the first two cases I get the Relationship and still have to do relationship.getEndNode(), which seems to me like (a little) overhead. Naturally, I want to use the most performant way to realise the task. However, I am always puzzled about which way to use. Can someone please provide me some numbers or even a theoretical explanation? Thanks, Didi ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] best way to get all directly related nodes?
On Fri, Nov 18, 2011 at 10:56 AM, D. Frej dieter_f...@gmx.net wrote: Hi everybody, what is the most performant way to get all directly related nodes? I know that there are the following possibilities:
- node.getRelationships()
- node.traverse(StopEvaluator.DEPTH_ONE)
- Cypher
In the first two cases I get the Relationship and still have to do relationship.getEndNode(), which seems to me like (a little) overhead. [...] I'm pretty sure that the core API is the most performant one. So, the first option should be the fastest. The traversal is next, and Cypher is the slowest. The way I see it: the trade-off is how much work the database does for you versus how much you have to do yourself. Cypher is an abstraction layer built on top of traversals and the core API. It can do more things than the core API, but you pay for this with extra CPU cycles. Andrés
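A side note on the getEndNode() overhead Didi mentions: the core API also offers Relationship.getOtherNode(node), which returns the node on the far side of the relationship regardless of direction. The toy classes below are stand-ins (not the real org.neo4j.graphdb types) that just mimic that contract:

```java
public class OtherNodeDemo {

    // Toy stand-ins for Neo4j's Node/Relationship, purely to illustrate
    // the direction-agnostic getOtherNode contract.
    static class Node {
        final long id;
        Node(long id) { this.id = id; }
    }

    static class Rel {
        final Node start, end;
        Rel(Node start, Node end) { this.start = start; this.end = end; }

        // Mirrors Relationship#getOtherNode: returns the far side whichever
        // end you pass in, so callers need not care about direction.
        Node getOtherNode(Node node) {
            if (node == start) return end;
            if (node == end) return start;
            throw new IllegalArgumentException("node not part of this relationship");
        }
    }

    public static void main(String[] args) {
        Node a = new Node(1), b = new Node(2);
        Rel r = new Rel(a, b);
        System.out.println(r.getOtherNode(a).id); // 2
        System.out.println(r.getOtherNode(b).id); // 1
    }
}
```

With the real API, node.getRelationships() plus rel.getOtherNode(node) covers the "all direct neighbours" case without caring whether the node is the start or end of each relationship.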
[Neo4j] Max flow using gremlin
Dear all, has anyone implemented any of the max flow algorithms using gremlin? Alfredas
Re: [Neo4j] Batch Insert : poooor performance
Anyone? -- View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/Batch-Insert-pr-performance-tp3513211p3518340.html Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.
Re: [Neo4j] Batch Insert : poooor performance
Of course, providing some more context would be poor too? How are we supposed to know what the problem is?
Re: [Neo4j] Batch Insert : poooor performance
Yes, I think you should resend your original post that got stuck... On Nov 18, 2011 12:40 PM, Krzysztof Raczyński racz...@gmail.com wrote: Of course, providing some more context would be poor too? How are we supposed to know what the problem is?
Re: [Neo4j] Batch Insert : poooor performance
Btw, inserting 600k nodes over REST with about 8 properties, in batches of 100, takes 20-30 minutes for me. It's not awesomely fast, but it's not slow either. What settings are affecting insertion speeds, Peter?
Re: [Neo4j] Batch Insert : poooor performance
That seems about normal. The good news is that it is much faster (usually) than an RDBMS on the same hardware. -Original Message- From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On Behalf Of Krzysztof Raczynski Sent: Friday, November 18, 2011 6:47 AM To: Neo4j user discussions Subject: Re: [Neo4j] Batch Insert : pr performance Btw, inserting 600k nodes over REST with about 8 properties, in batches of 100, takes 20-30 minutes for me. It's not awesomely fast, but it's not slow either. What settings are affecting insertion speeds, Peter?
Re: [Neo4j] Batch Insert : poooor performance
Please try not to use lucene for lookups during batch-inserts: just index your nodes (for later use) but use a custom, in-memory cache for the insertion process, customID -> nodeId, like Map<String,Long>. Using lucene for lookups takes up to 1000 times longer during batch-inserts (probably because the merge threads in the background have to finish up before you can include their results in the query). The luceneBatchInserterIndex.setCacheCapacity() seems not to work as expected; we will investigate that. Cheers Michael Here is the original post: Hi, I am in almost the same case as a previous post concerning Batch Insert poor performance, but I still can't figure out how to do it correctly with good performance. Nodes: 30 million Relationships: 250 million I am on MacOSX 10.7.1, 4 cpus, 8GB RAM 1) Insert Nodes : JVM -server -d64 -Xmx4G -XX:+UseParNewGC -XX:+UseNUMA -XX:+UseConcMarkSweepGC from 80 000 down to 50 000 inserts/second with properties (customID, url), with LuceneIndexing on customID and url -- a bit disappointing 2) Insert Relationships JVM -server -d64 -Xmx6G -XX:+UseParNewGC -XX:+UseNUMA -XX:+UseConcMarkSweepGC Index cache capacity 30 000 000 (whole nodes) on customID
neostore.nodestore.db.mapped_memory=300M
neostore.relationshipstore.db.mapped_memory=1G
neostore.propertystore.db.mapped_memory=2.2G
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=10M
=> insertion rate ~50 relationships/second and going down ... (many many tests ... but always very poor performance) Do you have any idea how to make this work correctly? I am really stuck here. If you want to have a look at my code: no issues! :) Many many thanks for your help On 18.11.2011 at 12:47, Krzysztof Raczyński wrote: Btw, inserting 600k nodes over REST with about 8 properties, in batches of 100, takes 20-30 minutes for me. It's not awesomely fast, but it's not slow either. What settings are affecting insertion speeds, Peter?
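Michael's in-memory cache suggestion can be sketched as follows. CustomIdCache and getOrCreate are hypothetical names, and the call to the real batchInserter.createNode is replaced by a counter; the idea is just that relationship creation resolves customID -> nodeId from a plain HashMap instead of querying lucene during the insert:

```java
import java.util.HashMap;
import java.util.Map;

public class CustomIdCache {

    // customID -> neo4j node id, kept in memory so relationship creation
    // never has to ask lucene while the batch insert is running.
    private final Map<String, Long> cache = new HashMap<>();
    private long nextNodeId = 0; // stand-in for batchInserter.createNode(props)

    long getOrCreate(String customId) {
        Long nodeId = cache.get(customId);
        if (nodeId == null) {
            nodeId = nextNodeId++; // real code: batchInserter.createNode(props)
            cache.put(customId, nodeId); // still add to the lucene index for later use
        }
        return nodeId;
    }

    public static void main(String[] args) {
        CustomIdCache c = new CustomIdCache();
        System.out.println(c.getOrCreate("a")); // creates a new node id
        System.out.println(c.getOrCreate("b"));
        System.out.println(c.getOrCreate("a")); // cache hit, no index lookup
    }
}
```

For 30 million nodes this map fits in a few GB of heap; as Olivier notes below, a much bigger graph would need a different strategy.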
Re: [Neo4j] Max flow using gremlin
Alfredas, not that I know of. Do you have a good implementation idea? Cheers, /peter neubauer GTalk: neubauer.peter Skype peter.neubauer Phone +46 704 106975 LinkedIn http://www.linkedin.com/in/neubauer Twitter http://twitter.com/peterneubauer http://www.neo4j.org - NOSQL for the Enterprise. http://startupbootcamp.org/ - Öresund - Innovation happens HERE. On Fri, Nov 18, 2011 at 11:45 AM, Alfredas Chmieliauskas alfredas...@gmail.com wrote: Dear all, has anyone implemented any of the max flow algorithms using gremlin? Alfredas
Re: [Neo4j] Max flow using gremlin
Hi, has anyone implemented any of the max flow algorithms using gremlin? Most of the algorithms in my toolbox are flow-based algorithms. What in particular are you trying to do? Marko. http://markorodriguez.com
Re: [Neo4j] Batch Insert : poooor performance
Olivier, please let us know your progress, and feel free to issue a pull request when you get things working! Cheers, /peter neubauer GTalk: neubauer.peter Skype peter.neubauer Phone +46 704 106975 LinkedIn http://www.linkedin.com/in/neubauer Twitter http://twitter.com/peterneubauer http://www.neo4j.org - NOSQL for the Enterprise. http://startupbootcamp.org/ - Öresund - Innovation happens HERE. On Fri, Nov 18, 2011 at 2:16 PM, ov var...@echo.fr wrote: Thanks for your answer Michael, Indeed, when creating a relationship between 2 nodes, I need to retrieve the neo4j nodeID (from customID) for both nodes ... I expected the cache to have a really big effect on this mechanism, but alas ... For this small graph, I suppose I can fully work in RAM, but this surely won't do for a much bigger graph. Thanks a lot, I'll try with my own cache mechanism. Regards
Re: [Neo4j] Sampling a Neo4j instance?
They have a common abstract class, AbstractGraphDatabase. On 18 November 2011 09:46, Anders Lindström andli...@hotmail.com wrote: Would this work in HA mode too (i.e. HighlyAvailableGraphDatabase)? I can see that the 'getConfig' is there -- but does the cast to NeoStoreXaDataSource work as well? Thanks.
Re: [Neo4j] About Neo4j Indexing
Hello List, can anyone help me on this? Thanks and regards, Samuel On 14 November 2011 at 1:51 PM, Samuel Feng okos...@gmail.com wrote: Dear List, I have two questions about indexing. *Question 1* At the time of creation, extra configuration can be specified to control the behavior of the index and which backend to use, e.g.,

IndexManager index = graphDb.index();
Index<Node> movies = index.forNodes( "movies-fulltext",
    MapUtil.stringMap( IndexManager.PROVIDER, "lucene",
        "analyzer", "org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer" ) );
movies.add( theMatrix, "cTitle", "黑客帝国" );
movies.add( theMatrix, "date", "2000-01-01" );

When adding node theMatrix to the index, all the values will be analyzed/tokenized by SmartChineseAnalyzer. However, for some fields I do not want them to be analyzed/tokenized. Any interfaces to implement this? Could ValueContext be enhanced so that I can pass in something like Field.Index.NOT_ANALYZED when adding a node to the index? *Question 2* For Query,

IndexHits<Node> nodes = movies.query(new BooleanQuery(...));
Node currentNode = null;
List<Movie> result = new ArrayList<Movie>();
while (nodes.hasNext()) {
    currentNode = nodes.next();
    Movie m = new Movie(currentNode);
    if (m.getDate().equals("2001-01-01")) {
        result.add(m);
    }
}

I found that if the indexHits is large, say size() > 2, each m.getDate() will spend some time loading the value from the underlying node (especially on the first-time query), so the total elapsed time is very long. Any interface that lets me read the lucene document behind this node directly? Maybe you can use nodes.currentDoc() to expose it? Thanks and Regards, Samuel
Re: [Neo4j] About Neo4j Indexing
Samuel, in order to do this right, we would like to associate index-specific properties with nodes. This is planned for Neo4j 1.7, with a much more powerful (auto)indexing framework. Before that, things would be a hack, so I think we will be postponing this. However, this is very relevant even to Cypher query optimization. Thanks for bringing this up! If you like, please raise an issue on this so you can track it. Cheers, /peter neubauer GTalk: neubauer.peter Skype peter.neubauer Phone +46 704 106975 LinkedIn http://www.linkedin.com/in/neubauer Twitter http://twitter.com/peterneubauer http://www.neo4j.org - NOSQL for the Enterprise. http://startupbootcamp.org/ - Öresund - Innovation happens HERE. 2011/11/14 Samuel Feng okos...@gmail.com
[Neo4j] Lab day: Cypher queries in embedded python bindings
Hey all, Like we've mentioned before, we have lab-day Fridays at Neo4j, and today I hacked some stuff together that landed directly in trunk for the embedded python bindings. As of 1 minute ago, the following operations are now possible with the embedded python API:

from neo4j import GraphDatabase
db = GraphDatabase("/home/jake/db")

# Plain query
result = db.query("START n=node(0) RETURN n")

# Parameterized query
result = db.query("START n=node({id}) RETURN n", id=0)

# Pre-parsed query
get_node_by_id = db.prepare_query("START n=node({id}) RETURN n")
result = db.query(get_node_by_id, id=0)

# Read the result
for row in result:
    print row['n']
for value in result['n']:
    print value

node = db.query(get_node_by_id, id=0)['n'].single

Lemme know what you think :) This is not available on PyPI yet (it will be when the first 1.6 milestone is released), but you can build it super-easily yourself; instructions are in the readme at github: https://github.com/neo4j/python-embedded Cheers, -- Jacob Hansson Phone: +46 (0) 763503395 Twitter: @jakewins
[Neo4j] Scalability Roadmap
Will the following topics be treated in a future release (and when, if you know)? 1/ Supernodes I know there is a big downside in the handling of super-nodes, which can be a big issue in a twitter-like website with, for example, a user followed by more than 200k users (I have a real case in mind), or in a recommendation system with sophisticated rules. I would like to know if the super-node issue (as we name it) is planned to be investigated in future releases? 2/ Sharding and horizontal scalability I guess sharding is a complex problem to handle with a graph db, but is it planned to address the horizontal scalability goal? Even if it brings us towards a kind of inconsistent-but-acceptable situation (for example, there are many cases of synchronization latency a website can accept when it has a big load). Thanks -- View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/Scalability-Roadmap-tp3519034p3519034.html Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.
Re: [Neo4j] Scalability Roadmap
Hi Serge, Regarding supernodes, I already opened an issue about this some time ago: https://github.com/neo4j/community/issues/19 and as you can read there, at the end of the conversation Peter said: we will hopefully be on it for 1.6! I really hope they keep thinking of fixing this for the 1.6 release; I'd actually say that this is one of the most urgent points that should be covered right now... Cheers, Pablo Pareja On Fri, Nov 18, 2011 at 5:38 PM, serge s.fedoro...@gmail.com wrote: [...] -- Pablo Pareja Tobes My site http://about.me/pablopareja LinkedIn http://www.linkedin.com/in/pabloparejatobes Twitter http://www.twitter.com/pablopareja Creator of Bio4j -- http://www.bio4j.com http://www.ohnosequences.com
Re: [Neo4j] Scalability Roadmap
Thanks, it sounds great :) Is there a release date for 1.6? -- View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/Scalability-Roadmap-tp3519034p3519137.html Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.
[Neo4j] [URGENT] Recommended server configurations
Hi people, I'm creating a social network with a large number of expected hits and I need help with the recommended server configuration: 1 - Operating system (Linux or Windows? What specifically?) 2 - Hardware (How much memory is necessary?) Do you think the use of the Neo4j REST API will cause problems? I use it to develop my Asp.Net applications. I am open to suggestions! Thanks for the help. -- View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/URGENT-Recommended-server-configurations-tp3519328p3519328.html Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.
Re: [Neo4j] Scalability Roadmap
1/ Supernode 2012, around Q2. 2/ Sharding and horizontal scalability 2013, around Q1. These are guesses, not promises :-) Jim PS - sharding graphs is NP-complete. In theory, no efficient general solution exists.
Re: [Neo4j] [URGENT] Recommended server configurations
On Fri, Nov 18, 2011 at 7:21 PM, gustavoboby gustavob...@gmail.com wrote: Hi people, I'm creating a social network with a large number of expected hits and I need help with the recommended server configuration: 1 - Operating system (Linux or Windows? What specifically?) If you have the choice, Linux is preferable. We fully support both platforms, but generally get higher performance on Linux, and fewer problems. 2 - Hardware (How much memory is necessary?) This completely depends on how much data you intend to store. Can you provide an estimate of how big your dataset would be? Number of nodes, number of relationships per node, how many properties (on both nodes and relationships), and what types of property values. Do you think the use of the Neo4j REST API will cause problems? I use it to develop my Asp.Net applications. It depends on how you use it. Generally, you will get reasonable insert speed if the client you use supports the batch operations part of the REST API; query speed will depend on the query, of course. You will get significantly better performance with the embedded database right now, but that is only available in JVM languages and Python. -- Jacob Hansson Phone: +46 (0) 763503395 Twitter: @jakewins
Re: [Neo4j] Max flow using gremlin
Hey Marko, I'm modeling the European gas transport/pipeline network. I need a good way to calculate maximum flow from source to sink and get the nodes in the path. Alfredas On Fri, Nov 18, 2011 at 2:48 PM, Marko Rodriguez okramma...@gmail.com wrote: Hi, has anyone implemented any of the max flow algorithms using Gremlin? Most of the algorithms in my toolbox are flow-based algorithms. What in particular are you trying to do? Marko. http://markorodriguez.com
Re: [Neo4j] Max flow using gremlin
Hey, Perhaps the simplest way to explore flow is to simply get the paths between source and sink and then calculate some function f over the path to determine its flow. For example:

def f = { List path ->
  // some function over the path, where every other element is an edge (see traversal below)
}

source.outE.inV.loop(2){it.object.equals(sink)}.paths.each{ println(it + " has a flow of " + f(it)) }

This assumes you have a determined source and a determined sink, and that there are no cycles in your gas pipeline. If there are cycles, then you can tweak the expression to make sure you break out of the loop when appropriate. From this basic idea you can then tweak it to simulate decay over time/step, or implement random walks through the gas line if you are interested in sampling or studying local eigenvectors in the pipeline. Hope that provides you a good starting point. Enjoy!, Marko http://markorodriguez.com On Nov 18, 2011, at 1:20 PM, Alfredas Chmieliauskas wrote: Hey Marko, I'm modeling the European gas transport/pipeline network. I need a good way to calculate maximum flow from source to sink and get the nodes in the path. Alfredas [...]
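Marko's suggestion -- enumerate source-to-sink paths, then apply a flow function f over each path -- can be sketched in plain Python. This is only an illustrative translation, not part of the thread: the graph layout, the `enumerate_paths` helper, and the bottleneck-capacity choice of f are all assumptions.

```python
# Sketch of the path-flow idea in plain Python (hypothetical data layout):
# a path is a list alternating nodes and edges, and path_flow(path) returns
# the bottleneck capacity -- the minimum capacity over the path's edges.

def path_flow(path):
    """Flow supported by a single path: min capacity over its edges.
    Edges are dicts with a 'capacity' key; nodes are plain strings."""
    return min(e["capacity"] for e in path if isinstance(e, dict))

def enumerate_paths(graph, source, sink, path=None):
    """All simple source->sink paths; graph maps node -> [(edge, neighbor)]."""
    path = path or [source]
    if source == sink:
        yield path
        return
    for edge, nxt in graph.get(source, []):
        if nxt not in path:  # skip already-visited nodes to avoid cycles
            yield from enumerate_paths(graph, nxt, sink, path + [edge, nxt])

# Toy pipeline: s -> a -> t and s -> b -> t
graph = {
    "s": [({"capacity": 4}, "a"), ({"capacity": 2}, "b")],
    "a": [({"capacity": 3}, "t")],
    "b": [({"capacity": 5}, "t")],
}
for p in enumerate_paths(graph, "s", "t"):
    print(p, "has a flow of", path_flow(p))
```

As in the Gremlin traversal, every other element of a yielded path is an edge, so f can inspect edge properties while still seeing the visited nodes.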
Re: [Neo4j] Scalability Roadmap
Jim, Not to nitpick, but that's for ideal graph partitioning, not graph sharding overall, right? E.g. the problem is solvable in many specific domains? - Matt On Nov 18, 2011 1:27 PM, Jim Webber j...@neotechnology.com wrote: 1/ Supernode 2012, around Q2. 2/ Sharding and horizontal scalability 2013, around Q1. These are guesses, not promises :-) Jim PS - sharding graphs is NP-complete. In theory no general solution exists.
Re: [Neo4j] Max flow using gremlin
Great! Thanks. Also, it's missing the "!"... it should be source.outE.inV.loop(2){!it.object.equals(sink)}.paths.each{ A On Fri, Nov 18, 2011 at 9:47 PM, Marko Rodriguez okramma...@gmail.com wrote: Hey, Perhaps the simplest way to explore flow is to simply get the paths between source and sink and then calculate some function f over the path to determine its flow. [...]
Re: [Neo4j] Max flow using gremlin
Great! Thanks. Also, it's missing the "!"... it should be source.outE.inV.loop(2){!it.object.equals(sink)}.paths.each{ Yes... good catch. Good luck, Marko. http://markorodriguez.com On Fri, Nov 18, 2011 at 9:47 PM, Marko Rodriguez okramma...@gmail.com wrote: [...]
Re: [Neo4j] [URGENT] Recommended server configurations
You can use Neo Technology's Hardware Sizing Calculator to estimate the CPU, RAM and disk space needed for your setup: http://neotechnology.com/calculator/trial.html If you're not doing a batch insertion, the REST API should be fine, I guess, especially if you put the database on a separate machine. On Fri, Nov 18, 2011 at 8:02 PM, Jacob Hansson jacob.hans...@neotechnology.com wrote: [...]
-- José Devezas - http://www.josedevezas.com MSc Informatics and Computing Engineering Social Media and Network Theory Research
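Jacob's sizing questions (node count, relationships per node, properties per entity) lend themselves to a back-of-the-envelope calculation. A rough Python sketch follows; the fixed record sizes are the approximate per-record figures of the Neo4j 1.x store files (node ~9 B, relationship ~33 B, property ~41 B), and the example numbers are made up. Long strings and arrays spill into separate dynamic stores, so treat this as a lower bound, not a quote from the thread or the calculator.

```python
# Back-of-the-envelope store sizing, assuming the approximate fixed record
# sizes of the Neo4j 1.x store files. Long strings and arrays go to extra
# dynamic stores, so real usage will be somewhat higher.

NODE_RECORD = 9    # bytes per node record (assumed, Neo4j 1.x)
REL_RECORD = 33    # bytes per relationship record (assumed, Neo4j 1.x)
PROP_RECORD = 41   # bytes per property record (assumed, Neo4j 1.x)

def estimate_store_bytes(nodes, rels_per_node, props_per_entity):
    """Rough on-disk store size from Jacob's three sizing questions."""
    rels = nodes * rels_per_node // 2          # each rel is shared by two nodes
    props = (nodes + rels) * props_per_entity  # properties on nodes and rels
    return nodes * NODE_RECORD + rels * REL_RECORD + props * PROP_RECORD

# e.g. 10M nodes, 10 rels per node, 5 properties on every node and rel
size = estimate_store_bytes(10_000_000, 10, 5)
print(f"~{size / 1024**3:.1f} GiB of store files")
```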
Re: [Neo4j] Scalability Roadmap
Hey Matt, Not to nitpick, but that's for ideal graph partitioning, not graph sharding overall, right? E.g. the problem is solvable in many specific domains? You're right - it's the general case. I was just making the point that sharding isn't something that's an afternoon's hacking to complete. Jim
Re: [Neo4j] Max flow using gremlin
This seems to calculate the max flow (edges have capacity): source.outE.inV.loop(2){!it.object.equals(sink)}.paths.each{flow = it.capacity.min(); maxFlow += flow; it.findAll{it.capacity}.each{it.capacity -= flow}}; I can't believe this is so short! A On Fri, Nov 18, 2011 at 10:51 PM, Marko A. Rodriguez okramma...@gmail.com wrote: [...]
Re: [Neo4j] Max flow using gremlin
This seems to calculate the max flow (edges have capacity): source.outE.inV.loop(2){!it.object.equals(sink)}.paths.each{flow = it.capacity.min(); maxFlow += flow; it.findAll{it.capacity}.each{it.capacity -= flow}}; I can't believe this is so short! That's the beauty of Gremlin. Once you get it, you can rip some very complex traversals in just a few characters. NOTES: For speed, change it.capacity to it.getProperty('capacity'). Some good notes here: https://github.com/tinkerpop/gremlin/wiki/Gremlin-Groovy-Path-Optimizations Glad we could help you with your problem. Enjoy!, Marko. http://markorodriguez.com
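Alfredas's one-liner saturates each enumerated source-to-sink path by its bottleneck capacity and decrements the capacities along the way. A plain-Python restatement of the same greedy idea follows (graph layout and names are assumptions, not from the thread). One caveat worth flagging: without residual (backward) edges this is a greedy heuristic in the Ford-Fulkerson family, and on some graphs it can undershoot the true maximum flow.

```python
# The greedy path-saturation idea from the Gremlin one-liner, restated in
# plain Python with a hypothetical adjacency structure. Each source->sink
# path is saturated by its bottleneck capacity. NOTE: with no residual
# edges this is a heuristic; a true max-flow algorithm also pushes flow
# back along already-used edges.

def greedy_max_flow(graph, source, sink):
    """graph maps node -> list of edge dicts {'to': node, 'capacity': int}."""
    def paths(node, seen):
        if node == sink:
            yield []
            return
        for edge in graph.get(node, []):
            if edge["to"] not in seen:
                for rest in paths(edge["to"], seen | {edge["to"]}):
                    yield [edge] + rest

    max_flow = 0
    for path in paths(source, {source}):
        flow = min(e["capacity"] for e in path)  # bottleneck of this path
        if flow <= 0:
            continue                             # path already saturated
        max_flow += flow
        for e in path:                           # decrement capacities
            e["capacity"] -= flow
    return max_flow

graph = {
    "s": [{"to": "a", "capacity": 4}, {"to": "b", "capacity": 2}],
    "a": [{"to": "t", "capacity": 3}],
    "b": [{"to": "t", "capacity": 5}],
}
print(greedy_max_flow(graph, "s", "t"))  # prints 5
```

Like the Gremlin version, this mutates capacities while iterating paths, so already-saturated edges limit later paths automatically.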
Re: [Neo4j] Scalability Roadmap
...but I'm sure the community will come up with a wide range of sharding patterns, code, and best practices! On Nov 18, 2011, at 5:46 PM, Jim Webber j...@neotechnology.com wrote: Hey Matt, Not to nitpick, but that's for ideal graph partitioning, not graph sharding overall, right? E.g. the problem is solvable in many specific domains? You're right - it's the general case. I was just making the point that sharding isn't something that's an afternoon's hacking to complete. Jim
Re: [Neo4j] Max flow using gremlin
Guys, I could put this into the docs, just for future reference. Great contributions, Marko and Alfredas! On Nov 18, 2011 11:58 PM, Marko Rodriguez okramma...@gmail.com wrote: [...]
[Neo4j] REST, Gremlin and transactions (neo4django's type hierarchy)
Guys, I'm trying to get neo4django's type hierarchy behaving in a safe way for multiprocessing. I ducked the REST API proper and am using the Gremlin extension, since I need the type creation operation to be atomic. The hierarchy is a simple single-inheritance system represented in-graph as a tree rooted at the reference node. Each node in the tree represents a type, including its name (`model_name`) and the module the type was defined in (`app_label`). I came up with the following script

g.setMaxBufferSize(0)
g.startTransaction()
cur_vertex = g.v(0)
for (def type_props : types) {
    candidate = cur_vertex.outE('TYPE').inV.find{ it.map.subMap(type_props.keySet()) == type_props }
    if (candidate == null) {
        new_type_node = g.addVertex(type_props)
        name = type_props['app_label'] + ":" + type_props['model_name']
        new_type_node.name = name
        g.addEdge(cur_vertex, new_type_node, 'TYPE')
        cur_vertex = new_type_node
    } else {
        cur_vertex = candidate
    }
}
g.stopTransaction(TransactionalGraph.Conclusion.SUCCESS)
result = cur_vertex

which searches for a type node that fits the type lineage sent in through the JSON-encoded `types` list. The code works fine as a replacement for how I was managing types in-graph. However, if I send this script (again, through REST) from three threads simultaneously, I don't get the expected behavior. Instead of the first request resulting in one new type node and the other two returning the node created by the first, three nodes are created and returned. Which is irksome. I'm pretty sure this is due to my own ignorance, but I've tried to do my homework. http://wiki.neo4j.org/content/Transactions#Isolation leads me to believe that code like the above won't work, because it only writes conditionally after a read, but doesn't hold a read lock. Could this be the case? And if so, is there a suggested fix in Gremlin? Any help/intuition would be greatly appreciated!
-- Matt Luongo Co-Founder, Scholr.ly
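The race Matt describes is the classic check-then-create problem and is not specific to Neo4j: two concurrent transactions can both read "no matching type node" and both create one. A minimal Python sketch (all names hypothetical, an in-memory dict standing in for the graph) shows why the read and the conditional write must happen under one lock -- analogous to taking a write lock on the parent node before checking for an existing child.

```python
# Minimal sketch of the check-then-create race, outside Neo4j (all names
# hypothetical). Without the lock, two threads can both see "no type node"
# and both create one; holding a lock across the read *and* the conditional
# write makes get-or-create atomic -- the same reason the Gremlin script
# needs a write lock (e.g. on the parent node) before the existence check.

import threading

type_nodes = {}                 # stands in for the in-graph type tree
tree_lock = threading.Lock()    # stands in for a write lock on the parent

def get_or_create(type_key):
    with tree_lock:             # read and conditional write, atomically
        node = type_nodes.get(type_key)
        if node is None:
            node = {"name": type_key}
            type_nodes[type_key] = node
        return node

threads = [threading.Thread(target=get_or_create, args=("app:Person",))
           for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
print(len(type_nodes))   # one node, not three
```

The fix in the graph setting follows the same shape: serialize the get-or-create on some common ancestor (the parent type node) so that only one transaction at a time can run the existence check for a given lineage.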