[Craig] >>If you haven't looked at clustering algorithms yet, you might want to do so. >>Your problem is a special case of clustering, where you have a large number >>of small clusters. A good place to start is the overview on Wikipedia: >>http://en.wikipedia.org/wiki/Cluster_analysis
According to this list, your method is similar to http://en.wikipedia.org/wiki/Basic_sequential_algorithmic_scheme, but with what seems to be some logical errors. >The simplest approach I could think of is that I process each row of the 3e8 >rows >sequentially and ask: >Do I have an identifier in the radius of 1 arcsec? >No: Generate one and assign me to it. >Yes: Update it and assigne me to it. The update is done as weighted average – >I keep the >number of how many observations the identifier has been computed. The result >is that the >identifier will have average coordinates of all the observations it identifies >– it will >be the center. Let say you have 2 single measures on a line at arcsec 1 and 2.1. which hence correspond to 2 ipix_cat. Now add a new measure at 1.9: as you choose any of the possible adjacent ipix_cat whitout considering the least distance, you may end with an ipix_at at 1.45 which is at less than 1 arcsec then the next one. Moreover you have raised the weight of both ipix_cat. Which increases the lock probability when trying a nutlithreaded upgrade. The max distance between 2 observations belonging to a same ipix_cat tends to 2 arcsec with your method. If this is ok, you should probbaly modify your method so that the 2 first points of my example would have megred to a single ipix_cat. You could use your weigth for this: increase your search radius to 2arcsec and then reject the candidates located between 1 and 2 arsec depending on their weight. The additional work load might be compensated by the smaller number of ipix_cat that woul will have.