Re: [Neo4j] Retrieving large-ish groups of nodes

Aru Sahni Mon, 13 Oct 2014 15:33:47 -0700

Hi Nigel,

Thanks for your response. It's a relief to hear that this might be related
to the deb package I used to install Neo4j. I will experiment with that
shortly.


I totally get what you're saying with the data modeling bit. In my case we
have already modeled our domain (i.e., these nodes already have relations),
and are enriching our model with information from an external entity
resolution process that produces clusters of node UIDs. This script:

1.) Requests the nodes by UID
2.) Creates a special :ResolvedEntity node with a combination of the
properties of the nodes
3.) Creates a :COMPONENT_OF relationship with the :Entity nodes
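
The three steps above could be sketched roughly as follows. This is only an
illustration under stated assumptions, not the actual script: the helper names
(combine_properties, build_resolve_statement), the "first non-null value wins"
merge rule, and the direction of the :COMPONENT_OF relationship are all made
up for the example. It assumes the properties have already been fetched in
step 1, and it only builds the statement and its parameters, so it has not
been run against a server.

```python
# Assumed Cypher for steps 2 and 3, using the 2.0-era {name} parameter
# placeholders. The relationship direction (member -> resolved entity)
# is a guess, not something stated in the thread.
RESOLVE_QUERY = (
    "MATCH (n:Entity) WHERE n.uid IN {uids} "
    "WITH collect(n) AS members "
    "CREATE (r:ResolvedEntity {props}) "
    "FOREACH (m IN members | CREATE (m)-[:COMPONENT_OF]->(r)) "
    "RETURN r"
)


def combine_properties(property_maps):
    """Merge property dicts; the first non-null value seen for a key wins.

    This merge rule is hypothetical; the real combination logic is not
    described in the thread.
    """
    combined = {}
    for props in property_maps:
        for key, value in props.items():
            if value is not None and key not in combined:
                combined[key] = value
    return combined


def build_resolve_statement(cluster_uids, property_maps):
    """Return (query, params) for one cluster of UIDs."""
    params = {
        "uids": list(cluster_uids),
        "props": combine_properties(property_maps),
    }
    return RESOLVE_QUERY, params
```

The returned pair can then be handed to whatever Cypher execution call the
client library provides.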

Thanks for the ID(n) tip. I assumed that for the purpose of
pagination/slicing queries it'd be fine. I'll switch to sorting by the UID
instead.
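
A minimal sketch of what paging by UID rather than by ID(n) might look like,
assuming pages of 500 nodes. PAGE_QUERY and page_params are hypothetical names
used only for this illustration, and the {name} placeholders are the parameter
form accepted by 2.0-era Cypher.

```python
# Order by the domain-level uid property instead of the internal ID(n),
# so page boundaries do not depend on how Neo4j allocates node IDs.
PAGE_QUERY = (
    "MATCH (n:Entity) "
    "RETURN n ORDER BY n.uid ASC "
    "SKIP {skip} LIMIT {limit}"
)


def page_params(page, page_size=500):
    """Return the SKIP/LIMIT parameters for a zero-based page number."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return {"skip": page * page_size, "limit": page_size}
```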

Would you happen to know if the IN operator uses schema indexes in newer
versions of Neo4j/Cypher?

Thanks for your help!
~Aru


On Mon, Oct 13, 2014 at 6:13 PM, Nigel Small <ni...@nigelsmall.com> wrote:

> Hi Aru
>
> Firstly, as a side note, the issue to which you link highlights a problem
> with the deb/rpm packages for some versions of Neo4j. The tarballs are fine
> and py2neo is definitely compatible with all versions from 1.8 upwards.
>
> Now to your main question....
>
> Neo4j is optimised for efficient graph traversal but, in your case, you
> are not really using any of this capability. You are instead attempting to
> fetch a set of unconnected nodes based on a single property so, whichever
> way you spin it, it's not really a "graphy" query. This query is in fact
> much more the kind you'd ask of an RDBMS and so I think this is more a
> problem with data modelling than with the query language itself.
>
> So, one option would be to rework your data model a little to add some
> (more) relationships. As I don't know much about your data model or the
> criteria for ETL selection, it's hard to say exactly what might be useful
> in your case. But, if you can group or segment your entity nodes then you
> can probably reduce the amount of data scanned by each run of your
> extraction query. Your query might then look something like this:
>
> MATCH (g:EntityGroup)-[:MEMBER]->(n:Entity)
> WHERE n.uid IN ["uid001", "uid002" ... "uid500"]
> AND g.group_name IN ["widgets", "things"]
> RETURN n ORDER BY ID(n) ASC
>
> This kind of approach would then only search a subset of your graph data
> and should speed up the query. Of course, if your extraction criteria are
> arbitrary or very variable then this option may be less viable.
>
> Incidentally, you are ordering by ID(n) in the return clause. I'd
> generally recommend against using entity IDs within any part of your domain
> logic as they are an internal artifact and may not always operate as you'd
> expect.
>
> Nigel
>
>
> On 13 October 2014 22:33, Aru Sahni <arusa...@gmail.com> wrote:
>
>> Hi,
>>
>> Disclaimer: I'm using Neo4j 2.0.3. I know that this is far from the most
>> recent version, but a bug between the latest stable py2neo and 2.1.x
>> builds has me stuck on this release
>> <https://groups.google.com/forum/#!topic/neo4j/-eqzLPxk0DI>.
>>
>> I'm writing an ETL script that needs to retrieve around 500 nodes per
>> request.  My nodes have a `uid` field that is indexed and has a
>> uniqueness constraint.
>>
>> :Entity(uid)
>>
>> To get these nodes, I'm issuing the following query:
>>
>> MATCH (n:Entity)
>> WHERE n.uid IN ["uid001", "uid002" ... "uid500"]
>> RETURN n ORDER BY ID(n) ASC;
>>
>> This takes quite a bit of time. Running this with the profiler indicates
>> that it's hitting the database for every filter.  Attempting to unroll this
>> (i.e. `WHERE n.uid = "uid001" OR n.uid = "uid002"`, etc) hits the
>> database just as heavily. If I try to specify the index with the USING
>> statement, I get the following error:
>>
>> IndexHintException: Cannot use index hint in this context. Index hints
>> require using a simple equality comparison in WHERE (either directly or as
>> part of a top-level AND).
>>
>> What I find ends up working is:
>>
>> MATCH (n:Entity) WHERE n.uid = "uid001" RETURN n
>> UNION ALL
>> MATCH (n:Entity) WHERE n.uid = "uid002" RETURN n
>> ...
>> MATCH (n:Entity) WHERE n.uid = "uid500" RETURN n
>>
>> This is a little too verbose and hacky for my tastes. I was wondering if
>> there's anything I can do to improve the performance and reduce the
>> complexity of this query.
>>
>> Regards,
>> ~Aru
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Neo4j" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to neo4j+unsubscr...@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
