Enumerate the file locations (the map step) and put them on a queue such as RabbitMQ or Kafka, which persists the map. Then have a bunch of threads, workers, containers, or whatever you like pop items off the queue and process each one (the reduce step).
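
A rough sketch of that pattern in Python, under some loud assumptions: the standard library's in-process `queue.Queue` stands in for RabbitMQ or Kafka (so nothing is persisted here), the workers are plain threads, and the names `enumerate_files`, `index_file`, and `build_index` are made up for illustration. The "index" is just a token-to-paths mapping, a placeholder for whatever indexing you actually need.

```python
import os
import queue
import threading
from collections import defaultdict


def enumerate_files(root):
    """Map step: yield the path of every file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def index_file(path):
    """Hypothetical per-file work: return the set of lowercased tokens in the file."""
    tokens = set()
    with open(path, "r", encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            tokens.update(line.lower().split())
    return tokens


def worker(tasks, index, lock):
    """Pop paths off the queue until a None sentinel arrives."""
    while True:
        path = tasks.get()
        if path is None:
            break  # sentinel: no more work for this worker
        try:
            tokens = index_file(path)
        except OSError:
            continue  # unreadable file: skipped in this sketch
        # Reduce step: merge this file's tokens into the shared index.
        with lock:
            for tok in tokens:
                index[tok].add(path)


def build_index(root, num_workers=4):
    """Enumerate files, queue them, and let num_workers threads process them."""
    tasks = queue.Queue()  # stand-in for a real broker (assumption)
    index = defaultdict(set)
    lock = threading.Lock()

    threads = [
        threading.Thread(target=worker, args=(tasks, index, lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()

    for path in enumerate_files(root):
        tasks.put(path)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker

    for t in threads:
        t.join()
    return dict(index)
```

Calling `build_index("/some/dir", num_workers=8)` would return a dict mapping each token to the set of files containing it. With a real broker, the enumeration would run once as a producer and the workers would be separate processes or containers consuming from the queue, which is also what you would want if the per-file work is CPU-bound.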

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On May 20, 2018, 7:24 AM -0400, Raymond Xie <xie3208...@gmail.com>, wrote:
> I know how to do indexing on file system like single file or folder, but
> how do I do that in a parallel way? The data I need to index is of huge
> volume and can't be put on HDFS.
>
> Thank you
>
> *------------------------------------------------*
> *Sincerely yours,*
>
>
> *Raymond*