" - On disk, and lucene is a good idea here. Why not index with lucene,
but without storing the property to the node?"

I like it!

This sounds like a cleaner approach than my current one, and (though I'm not
sure how to do it either) it may be no more complex than the way I'm doing
it.
Like you say, we can delete the Lucene index afterwards... or just the
Lucene folder associated with that one property.

I'm writing exams, thesis reports, and thesis opposition reports for the
next month, so I won't have time to try it out.
If you give it a try I'd be interested in hearing how the "Lucene-only"
approach works out, though.
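For illustration, here is a minimal sketch of the "Lucene-only" idea being discussed: index each new node under its old/global ID without ever storing that ID as a property on the node. A plain dict stands in for the Lucene index here, since this is just a simulation of the approach rather than the Neo4j/Lucene API; the class and method names are hypothetical.

```python
# Sketch of the "Lucene-only" idea: index nodes under an external ID
# without touching their properties. A dict simulates the Lucene index;
# with Neo4j's batch inserter an index service would play this role.

class ExternalIdIndex:
    """Maps an external (old/global) ID to an internal node ID,
    kept entirely outside the graph itself."""

    def __init__(self):
        self._index = {}

    def index(self, internal_id, key, value):
        # Index the node under (key, value) without setting any property.
        self._index[(key, value)] = internal_id

    def get_single_node(self, key, value):
        # Look up the internal node ID, or None if never indexed.
        return self._index.get((key, value))

idx = ExternalIdIndex()
idx.index(7, "GID", "old-node-42")
assert idx.get_single_node("GID", "old-node-42") == 7
# The graph's nodes never carry a GID property, and this index
# can simply be deleted once the import is finished.
```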

On Wed, Jun 2, 2010 at 6:42 PM, Craig Taverner <cr...@amanzi.com> wrote:

> Yes. I guess you cannot escape an old-new ID map (or in your case ID-GID).
> I think it is possible to maintain that outside the database:
>
>   - In memory, as I suggested, but only valid under some circumstances
>   - On disk, and lucene is a good idea here. Why not index with lucene, but
>     without storing the property to the node?
>
> Since the index method takes the node, the property and the value, I assume
> it might be possible to index the property and value without them actually
> being real properties and values? I've not tried, but this way the graph is
> cleaner, and we can delete the Lucene index afterwards!
>
> On Wed, Jun 2, 2010 at 6:12 PM, Alex Averbuch <alex.averb...@gmail.com>
> wrote:
>
> > Hi Craig,
> > Just a quick note about needing to keep all IDs in memory during an
> > import/export operation. The way I'm doing it at the moment, it's not
> > necessary to do so.
> >
> > When exporting:
> > Write IDs to the exported format (this could be JSON, XML, GML, GraphML,
> > etc.)
> >
> > When importing:
> > First import all Nodes; this is easy to do in most formats (all that I've
> > tried).
> > While importing Nodes, store & index one extra property in every Node,
> > which I call "GID" for global ID.
> > Next import all Relationships, using the GID and Lucene to locate the
> > start Node & end Node.
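
The two-pass import described above can be simulated in a few lines. Plain lists stand in for the node and relationship stores, and a dict plays the role of the Lucene GID index; with Neo4j you would use the BatchInserter plus an index service instead, so the names here are illustrative rather than real API.

```python
# Minimal simulation of the two-pass import: nodes first (indexed by GID),
# then relationships resolved through the GID index.

def import_graph(exported_nodes, exported_rels):
    nodes = []          # internal store: list index == internal node ID
    gid_index = {}      # GID -> internal node ID (the "Lucene" lookup)
    rels = []

    # Pass 1: create all nodes, indexing each one under its exported GID.
    for gid, props in exported_nodes:
        internal_id = len(nodes)
        nodes.append(props)
        gid_index[gid] = internal_id

    # Pass 2: create relationships, resolving endpoints via the GID index.
    for start_gid, end_gid, rel_type in exported_rels:
        rels.append((gid_index[start_gid], gid_index[end_gid], rel_type))

    return nodes, rels

nodes, rels = import_graph(
    exported_nodes=[("a", {"name": "A"}), ("b", {"name": "B"})],
    exported_rels=[("a", "b", "KNOWS")],
)
assert rels == [(0, 1, "KNOWS")]
```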
> >
> > The biggest graph I've tried with this approach had 2.5 million Nodes &
> > 250 million Relationships.
> > It took quite a long time, but much of the slowness was because it was
> > performed on an old laptop with 2 GB of RAM, I didn't give the
> > BatchInserter a properties file, and I used default JVM parameters.
> >
> > There is at least one obvious downside to this though, and that is that
> > you "pollute" the dataset with GID properties.
> >
> > Alex
> >
> > On Wed, Jun 2, 2010 at 5:53 PM, Craig Taverner <cr...@amanzi.com> wrote:
> >
> > > I've thought about this briefly, and somehow it actually seems easier
> > > (to me) to consider a compacting (defragmenting) algorithm than a
> > > generic import/export. The problem is that in both cases you have to
> > > deal with the same issue: the node/relationship IDs are changed. For
> > > the import/export this means you need another way to store the
> > > connectedness, so you export the entire graph into another format that
> > > maintains the connectedness in some way (perhaps a whole new set of
> > > IDs), and then re-import it again. Getting a very complex, large and
> > > cyclic graph to work like this seems hard to me, because you have to
> > > maintain a complete table of the identity map in memory during the
> > > export (which makes the export unscalable).
> > >
> > > But de-fragmenting can be done by changing IDs in batches, breaking
> > > the problem down into smaller steps, and never needing to deal with
> > > the entire graph at the same time at any point. For example, take the
> > > node table and scan from the base, collecting free IDs. Once you have
> > > a decent block, pull that many nodes down from above in the table.
> > > Since you keep the entire batch in memory, you maintain the mapping of
> > > old-new and can use that to 'fix' the relationship table also. Rinse
> > > and repeat :-)
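
The batched compaction quoted above can be sketched on toy data structures: collect a block of free slots from the bottom of the node table, move the highest-numbered nodes down into them, and patch the relationship table using only the small per-batch old-to-new map. The stores are modelled as a Python dict and list purely for illustration, not as Neo4j's actual file layout.

```python
# Rough sketch of batched compaction over a node table with holes.

def compact_batch(node_table, rel_table, batch_size):
    """node_table: dict id -> record (holes are simply missing ids).
    rel_table: list of (start_id, end_id) pairs. Mutates both in place.
    Returns the per-batch old->new mapping (empty when fully compacted)."""
    max_id = max(node_table) if node_table else -1
    # Scan from the base, collecting up to batch_size free IDs.
    free = [i for i in range(max_id) if i not in node_table][:batch_size]
    # Pull the highest-numbered nodes down into the free slots.
    mapping = {}
    for new_id in free:
        old_id = max(node_table)
        if old_id <= new_id:
            break  # nothing above this slot left to move
        node_table[new_id] = node_table.pop(old_id)
        mapping[old_id] = new_id
    # 'Fix' the relationship table using only the small per-batch mapping.
    for i, (s, e) in enumerate(rel_table):
        rel_table[i] = (mapping.get(s, s), mapping.get(e, e))
    return mapping

nodes = {0: "a", 3: "b", 7: "c"}   # holes at IDs 1, 2, 4, 5, 6
rels = [(0, 7), (3, 7)]
while compact_batch(nodes, rels, batch_size=2):
    pass                           # rinse and repeat
assert sorted(nodes) == [0, 1, 2]
assert rels == [(0, 1), (2, 1)]
```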
> > >
> > > One option for the entire graph export that might work for most
> > > datasets that have predominantly tree structures is to export to a
> > > common tree format, like JSON (or, ... XML). This maintains most of
> > > the relationships without requiring any memory of ID mappings. The
> > > less common cyclic connections can be maintained with temporary IDs
> > > and a table of such IDs maintained in memory (assuming it is much
> > > smaller than the total graph). This can allow complete export of very
> > > large graphs if the temp ID table does indeed remain small. Probably
> > > true for many datasets.
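
The tree-first export idea can be sketched as a depth-first walk: most relationships are implied by nesting, and only the non-tree (cross/cyclic) edges get temporary IDs, so only nodes touched by such edges ever enter the in-memory temp-ID table. The graph representation and function names below are illustrative assumptions, not an existing export format.

```python
# Sketch of tree-first export with a temp-ID table for non-tree edges.

def export_tree(graph, root):
    """graph: dict node -> list of child nodes (may contain cycles).
    Returns (tree, extra_edges): tree is a nested dict capturing the
    spanning tree, extra_edges lists (temp_id_from, temp_id_to) pairs."""
    temp_ids = {}        # node -> temp ID; stays small if cycles are rare
    extra_edges = []
    visited = set()

    def temp_id(node):
        # Assign a temporary ID only when a non-tree edge needs one.
        return temp_ids.setdefault(node, len(temp_ids))

    def walk(node):
        visited.add(node)
        children = {}
        for child in graph.get(node, []):
            if child in visited:
                # Non-tree edge: record via temp IDs instead of nesting.
                extra_edges.append((temp_id(node), temp_id(child)))
            else:
                children[child] = walk(child)
        return children

    return {root: walk(root)}, extra_edges

# A small graph with one back edge (c -> a).
g = {"a": ["b", "c"], "c": ["a"]}
tree, extras = export_tree(g, "a")
assert tree == {"a": {"b": {}, "c": {}}}
assert extras == [(0, 1)]  # only c and a ever received temp IDs
```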
> > >
> > > On Wed, Jun 2, 2010 at 2:30 PM, Johan Svensson <jo...@neotechnology.com>
> > > wrote:
> > >
> > > > Alex,
> > > >
> > > > You are correct about the "holes" in the store file, and I would
> > > > suggest you export the data and then re-import it again. Neo4j is not
> > > > optimized for the use case where more data is removed than added over
> > > > time.
> > > >
> > > > It would be possible to write a compacting utility, but since this is
> > > > not a very common use case I think it is better to put that time into
> > > > producing a generic export/import dump utility. The plan is to get an
> > > > export/import utility in place as soon as possible, so any input on
> > > > how that should work, what format to use, etc. would be great.
> > > >
> > > > -Johan
> > > >
> > > > On Wed, Jun 2, 2010 at 9:23 AM, Alex Averbuch <alex.averb...@gmail.com>
> > > > wrote:
> > > > > Hey,
> > > > > Is there a way to compact the data stores (relationships, nodes,
> > > > properties)
> > > > > in Neo4j?
> > > > > I don't mind if it's a manual operation.
> > > > >
> > > > > I have some datasets that have had a lot of relationships removed
> > > > > from them, but the file is still the same size, so I'm guessing
> > > > > there are a lot of holes in this file at the moment.
> > > > >
> > > > > Would this be hurting lookup performance?
> > > > >
> > > > > Cheers,
> > > > > Alex
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
