I have about 2 million records, each with four string fields, that need
to be checked for duplicates. To be more specific, the fields are name,
phone, address, and fathername, and each record must be checked for
duplicates against the rest of the data using all four fields. The
resulting unique records need to be written to the datastore.
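
For concreteness, here is what the duplicate check looks like if it were
a plain exact match on all four fields. My actual comparison algorithm
differs, and the class name is just illustrative:

    import java.util.Objects;

    // Illustrative record class; the field names mirror the ones above.
    class PersonRecord {
        String name;
        String phone;
        String address;
        String fathername;

        // Exact match on all four fields. My real comparison algorithm
        // is different, but the check has this shape.
        boolean isDuplicateOf(PersonRecord other) {
            return Objects.equals(this.name, other.name)
                    && Objects.equals(this.phone, other.phone)
                    && Objects.equals(this.address, other.address)
                    && Objects.equals(this.fathername, other.fathername);
        }
    }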

I have been able to implement MapReduce and iterate over all the
records. The task rate is set to 100/s and the bucket size to 100, and
billing is enabled.
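
The queue settings are the standard queue.xml entries; the queue name
here is just an example:

    <?xml version="1.0" encoding="UTF-8"?>
    <queue-entries>
      <queue>
        <!-- Example queue name; rate and bucket size as stated above. -->
        <name>mapreduce-queue</name>
        <rate>100/s</rate>
        <bucket-size>100</bucket-size>
      </queue>
    </queue-entries>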

Currently everything works, but performance is very slow: in 6 hours I
have been able to dedupe only 1,000 records out of a test dataset of
10,000.

The current design in Java is:

In every map iteration, I compare the current record with the previous
record (a rough sketch follows this list):
- The previous record is a single record in the datastore that acts like
a global variable, and I overwrite it with another previous record in
each map iteration.
- The comparison is done with an algorithm, and the result is written to
the datastore as a new entity.
- At the end of one MapReduce job, I programmatically create another
job.
- The previous-record variable lets the job compare the next candidate
record with the rest of the data.
- I am willing to spend any amount of GAE resources to finish this in
the shortest time.
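
Sketched in code, the map step is roughly the following. The kind
names, property names, and exact-match comparison are illustrative
stand-ins, not my real code:

    import java.util.Objects;

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.EntityNotFoundException;
    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.KeyFactory;

    public class DedupeMapStep {
        // Single datastore entity acting as the "global" previous record.
        private static final Key STATE_KEY =
                KeyFactory.createKey("GlobalState", "previousRecord");
        private static final String[] FIELDS =
                {"name", "phone", "address", "fathername"};

        void map(Entity current) {
            DatastoreService ds =
                    DatastoreServiceFactory.getDatastoreService();
            try {
                Entity previous = ds.get(STATE_KEY);
                if (isDuplicate(previous, current)) {
                    // The comparison result is written as a new entity.
                    Entity result = new Entity("DedupeResult");
                    result.setProperty("record", current.getKey());
                    result.setProperty("duplicateOf",
                            previous.getProperty("recordKey"));
                    ds.put(result);
                }
            } catch (EntityNotFoundException e) {
                // Very first map call: no previous record to compare with.
            }
            // Overwrite the global previous record with the current one.
            Entity state = new Entity(STATE_KEY);
            state.setProperty("recordKey", current.getKey());
            for (String f : FIELDS) {
                state.setProperty(f, current.getProperty(f));
            }
            ds.put(state);
        }

        // Same shape as the exact-match check sketched earlier in this post.
        private boolean isDuplicate(Entity a, Entity b) {
            for (String f : FIELDS) {
                if (!Objects.equals(a.getProperty(f), b.getProperty(f))) {
                    return false;
                }
            }
            return true;
        }
    }

As described, every map call does one extra datastore get and one extra
put against this single state entity.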

My questions are:

- Will the accuracy of the dedupe (duplicate checking) be affected by
parallel jobs/tasks?
- How can this design be improved?
- Will this scale to 20 million records?
- What's the fastest way to read/write variables (not just counters)
during a map iteration, such that they can be used across one MapReduce
job?
- Freelancers are most welcome to assist with this.

Thanks for your help.
