Hi Peter,
I have uploaded what I have so far; it's still at a very early stage. I have
added a bit of detail in the README, and there are a few steps to get it
going. Hopefully it works.
https://github.com/paddydub/NeoHadoopTester
I'm testing with a Hadoop job of 50 mappers, using the stopsListSmall file
as input. To precompute all the transfer patterns I would need 15,000
mappers using the stopsListLarge input file. I haven't had a chance to do
much testing yet, but it takes roughly 7.5 minutes per mapper: finding the
shortest path from the first departure event of the input stop to any stop
with Dijkstra's algorithm takes about 30 milliseconds on average.

(0.03 sec x 15000 x 15000) = 1875 instance-hours.
The cost to run this on Elastic MapReduce is approx: (1875 hrs x $0.015) =
$28.13.
Running on 20 instances (my limit at the moment) would take about 94 hours,
or about 2 hours running on 1000 instances. :)
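To sanity-check the arithmetic, here is a quick sketch (the 30 ms per query, 15,000 stops, and $0.015/hr small-instance price are the figures from above; everything else is just the calculation):

```java
// Back-of-envelope estimate for the all-pairs transfer-pattern precomputation.
public class CostEstimate {

    // Total compute time in instance-hours: one shortest-path query per
    // (source stop, target stop) pair at the measured seconds per query.
    static double instanceHours(double secondsPerQuery, int sources, int targets) {
        return secondsPerQuery * sources * targets / 3600.0;
    }

    static double dollars(double hours, double pricePerHour) {
        return hours * pricePerHour;
    }

    static double wallClockHours(double totalHours, int instances) {
        return totalHours / instances;
    }

    public static void main(String[] args) {
        double hours = instanceHours(0.030, 15000, 15000); // ~1875
        System.out.printf("%.1f instance-hours, $%.2f total, %.1f h on 20 instances%n",
                hours, dollars(hours, 0.015), wallClockHours(hours, 20));
    }
}
```

This is how the 1875 instance-hours, roughly $28, and ~94 wall-clock hours on 20 instances fall out of the 30 ms figure.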

I have one Reducer/BatchInserter which iterates through the results of all
the mappers, converts the list of patterns into a DAG, and inserts them into
a new graph db. I still have to rework this step, though. Then I'll output
the new database back to S3, or even access an EBS volume directly if
possible.
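This isn't the actual reducer code — just a hypothetical sketch of the prefix-merging idea behind the DAG step: patterns that share a leading sequence of stops reuse the same nodes, so common prefixes are stored once before anything is handed to the BatchInserter. The string node keys are illustrative only.

```java
import java.util.*;

// Hypothetical sketch: merge transfer patterns (sequences of stop ids)
// into a prefix-sharing DAG, so the graph stays compact before the
// BatchInserter writes nodes and relationships to the new db.
public class PatternDag {
    // nodeKey -> (stopId -> childNodeKey)
    final Map<String, Map<String, String>> children = new HashMap<>();
    int nodeCount = 1; // the shared root node

    // Insert one pattern, reusing existing prefix nodes where possible.
    void add(List<String> pattern) {
        String current = "root";
        for (String stop : pattern) {
            Map<String, String> edges =
                    children.computeIfAbsent(current, k -> new HashMap<>());
            String next = edges.get(stop);
            if (next == null) {           // first time we see this prefix
                next = current + "/" + stop;
                edges.put(stop, next);
                nodeCount++;
            }
            current = next;
        }
    }

    public static void main(String[] args) {
        PatternDag dag = new PatternDag();
        dag.add(Arrays.asList("A", "B", "C"));
        dag.add(Arrays.asList("A", "B", "D")); // shares the A -> B prefix
        System.out.println(dag.nodeCount);     // root + A + B + C + D = 5
    }
}
```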
Please let me know any suggestions on how to improve or speed it up.

cheers,
Paddy

On Sun, Jan 2, 2011 at 12:44 AM, Peter Neubauer <
peter.neuba...@neotechnology.com> wrote:

> Very cool!
> Is this part of the GITHub setup? And, are you inserting the
> precomputed patterns into the graph in a later step then?
>
> Cheers,
>
> /peter neubauer
>
> GTalk:      neubauer.peter
> Skype       peter.neubauer
> Phone       +46 704 106975
> LinkedIn   http://www.linkedin.com/in/neubauer
> Twitter      http://twitter.com/peterneubauer
>
> http://www.neo4j.org               - Your high performance graph database.
> http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.
>
>
>
> On Sat, Jan 1, 2011 at 11:38 PM, Paddy <paddyf...@gmail.com> wrote:
> > Hi,
> > Can I just add that setting another bootstrap action to control how many
> > tasks are run per machine helped:
> > --site-key-value mapred.tasktracker.map.tasks.maximum=1
> > By default it is set to 2 for small EC2 instances, and I was running into
> > the "Unable to lock store [/home/hadoop/neo-db/neostore]" error for some
> > jobs.
> > Just tested running Neo4j graph-algo parallel on 20 machines, sweet :)
> >
> > cheers,
> > Paddy
> >
> > On Sat, Nov 27, 2010 at 4:16 PM, Paddy <paddyf...@gmail.com> wrote:
> >
> >> Hi,
> >> I'm a very much a Hadoop newbie but I think that would be very possible,
> >> maybe even making use of: http://incubator.apache.org/whirr/
> >> to ensure capability with Amazon EC2 & Rackspace Cloud Servers.
> >>
> >> Would creating custom types, e.g. NodeWritable and WeightedPathWritable,
> >> which implement org.apache.hadoop.io.Writable, be a good method to
> >> serialize and deserialize Neo4j objects between jobs?
> >> http://developer.yahoo.com/hadoop/tutorial/module5.html#types
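A minimal sketch of what such a type could look like, shown self-contained without the Hadoop dependency (the real class would declare `implements org.apache.hadoop.io.Writable`; the nodeId-only payload is an assumption — a real NodeWritable would likely carry properties too):

```java
import java.io.*;

// Sketch of a NodeWritable following Hadoop's Writable contract:
// write(DataOutput) / readFields(DataInput). Hadoop uses exactly this
// pair to serialize values between map and reduce stages.
public class NodeWritable {
    private long nodeId;

    public NodeWritable() {}                       // Writables need a no-arg constructor
    public NodeWritable(long nodeId) { this.nodeId = nodeId; }

    public void write(DataOutput out) throws IOException {
        out.writeLong(nodeId);
    }

    public void readFields(DataInput in) throws IOException {
        nodeId = in.readLong();
    }

    public long getNodeId() { return nodeId; }

    public static void main(String[] args) throws IOException {
        // Round-trip through a byte stream, as Hadoop does between jobs.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new NodeWritable(42L).write(new DataOutputStream(bytes));

        NodeWritable copy = new NodeWritable();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.getNodeId()); // 42
    }
}
```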
> >>
> >> It would also be interesting to chain multiple traversals as Map-Reduce
> >> jobs.
> >>
> >> cheers
> >> Paddy
> >>
> >>
> >> On Fri, Nov 26, 2010 at 9:00 AM, Peter Neubauer <
> >> peter.neuba...@neotechnology.com> wrote:
> >>
> >>> This looks good to me, but then,
> >>> I am no Amazon or Hadoop expert. Do you think it would be possible to
> >>> do a generic integration component that lets you run traversals on
> >>> replicated Neo4j backends as Hadoop map-reduce jobs? I think that
> >>> would be interesting to a number of use cases and a very cool use of
> >>> Neo4j. Also, Alex Averbuch wants to look at using AKKA (Scala) to do
> >>> similar things, so comparing the approaches would be great!
> >>>
> >>> Cheers,
> >>>
> >>> /peter neubauer
> >>>
> >>> GTalk:      neubauer.peter
> >>> Skype       peter.neubauer
> >>> Phone       +46 704 106975
> >>> LinkedIn   http://www.linkedin.com/in/neubauer
> >>> Twitter      http://twitter.com/peterneubauer
> >>>
> >>> http://www.neo4j.org               - Your high performance graph
> >>> database.
> >>> http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing
> party.
> >>>
> >>>
> >>>
> >>> On Thu, Nov 25, 2010 at 6:33 AM, Paddy <paddyf...@gmail.com> wrote:
> >>> > Hi Guys,
> >>> >
> >>> > I was testing out accessing a Neo4j DB from a Hadoop job on Elastic
> >>> > MapReduce.
> >>> > I asked a question on the forum regarding loading a file from s3 to
> each
> >>> ec2
> >>> > at startup: https://forums.aws.amazon.com/thread.jspa?threadID=54919
> >>> >
> >>> > In case anyone faces the same issue, the following bootstrap action
> >>> > will download a compressed Neo4j database from an S3 bucket to each
> >>> > launched EC2 instance and change the directory permissions to allow
> >>> > access with:
> >>> > private static GraphDatabaseService graphDb = new
> >>> > EmbeddedGraphDatabase("/home/hadoop/neo-db");
> >>> >
> >>> >
> >>> > #!/bin/bash
> >>> > set -e
> >>> > sudo wget -S -T 10 -t 5 http://<yourbucket>.s3.amazonaws.com/neo-db.tar.gz
> >>> > sudo tar -C /home/hadoop -xzf neo-db.tar.gz
> >>> > sudo chmod -R 777 /home/hadoop/neo-db
> >>> >
> >>> >
> >>> > Does this sound like the best method to use?
> >>> > cheers
> >>> > Paddy
> >>> > _______________________________________________
> >>> > Neo4j mailing list
> >>> > User@lists.neo4j.org
> >>> > https://lists.neo4j.org/mailman/listinfo/user
> >>> >
> >>>
> >>
> >>
> >
>
