Hi Steven,

Thank you for your detailed response. The code will be executed on a web server with limited memory, hence the desire to keep file loading in check. I like the scoring approach you have suggested for arriving at a best guess. It keeps things fairly modular: you can be as detailed as you like about adding checks specific to a particular format, and each extra check increases the certainty of choosing that format correctly.

I wish I had more control over the files I may receive, but I have to assume the worst. File extensions do not always tell the truth, and sometimes they are left off altogether. MIME types are not always interpreted properly either; I filter on those before the sniffing stage so that certain types of files never get that far.

I think what I might do is read the first x lines with readlines(). A sample of up to the first 100 lines should probably be good enough to generate a decent score for each type.
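A rough sketch of the sampling idea, assuming Python 3. One caveat: the optional argument to readlines() is a size hint in bytes rather than a line count, so this uses itertools.islice to cap the number of lines instead. The names read_sample, score_lines and SAMPLE_LINES are made up for illustration, and counting the token towards "tkn" is my assumption about what that check is for.

```python
from itertools import islice

SAMPLE_LINES = 100  # assumed cap on how many lines to inspect


def read_sample(filename, max_lines=SAMPLE_LINES):
    """Return at most max_lines lines from the start of filename."""
    # errors="replace" stops undecodable bytes in an unknown file
    # from raising UnicodeDecodeError while we are only sniffing.
    with open(filename, "r", errors="replace") as fp:
        # islice stops pulling lines once max_lines have been read,
        # so the rest of the file is never loaded into memory.
        return list(islice(fp, max_lines))


def score_lines(lines, token):
    """Score a sample of lines with one simple check per format."""
    scores = {"xml": 0, "csv": 0, "txt": 0, "tkn": 0}
    for line in lines:
        if line.startswith("<"):
            scores["xml"] += 1
        if "\t" in line:
            scores["txt"] += 1
        if "," in line:
            scores["csv"] += 1
        if token in line:  # token: hypothetical marker for the "tkn" format
            scores["tkn"] += 1
    return scores
```

Something like score_lines(read_sample(path), token=...) would then feed the same "pick the best guess" step, with memory use bounded by the sample size rather than the file size.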
Regards,
David

> def sniff(filename):
>     """Return one of "xml", "csv", "txt" or "tkn", or "???"
>     if it can't decide the file type.
>     """
>     fp = open(filename, "r")
>     scores = {"xml": 0, "csv": 0, "txt": 0, "tkn": 0}
>     for line in fp.readlines():
>         if not line:
>             continue
>         if line[0] == "<":
>             scores["xml"] += 1
>         if '\t' in line:
>             scores["txt"] += 1
>         if ',' in line:
>             scores["csv"] += 1
>         if SOMETOKEN in line:
>             scores["tkn"] += 1
>     # Pick the best guess:
>     L = [(score, name) for (name, score) in scores.items()]
>     L.sort()
>     L.reverse()
>     # L is now sorted from highest down to lowest by score.
>     best_guess = L[0]
>     second_best_guess = L[1]
>     if best_guess[0] > 10*second_best_guess[0]:
>         fp.close()
>         return best_guess[1]
>     fp.close()
>     return "???"