Yes, indeed. May not be necessary initially, but we could support XPath or
something down the road to allow us to specify what things.

> I wouldn't worry about generalizing too much to start with. Once we have a
> couple collections then we can go that
> route.

My thoughts, too.

I've been looking at the Reuters stuff. It uncompresses the distribution and
then creates per-article files. I can't decide whether that's a good idea for
Wikipedia. It's big (about 10G uncompressed) and has about 1.2M files (so I've
heard; unverified). On the one hand, creating separate per-article files is
"clean" in that when you then ingest, only disk I/O affects the ingest
performance (as opposed to, say, uncompressing/parsing). On the other hand,
that's a lot of disk I/O (the dump compresses by about 5X) and a lot of
directory lookups. Anybody have any opinions or relevant past experience?
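For comparison, here's a minimal sketch of the other option: skip the 1.2M
per-article files entirely and stream the compressed dump, handing articles to
the ingester as they're parsed. This is just an illustration, not anything in
the benchmark contrib today: it assumes the dump is bzip2-compressed XML, that
Apache Commons Compress is on the classpath for the decompression, and the
class/method names (WikipediaStreamer, handlePage) are made up.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class WikipediaStreamer {

  public static void main(String[] args) throws Exception {
    // Decompress on the fly instead of materializing ~10G / ~1.2M files on disk.
    InputStream in = new BZip2CompressorInputStream(
        new BufferedInputStream(new FileInputStream(args[0])));

    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(in, new DefaultHandler() {
      private final StringBuilder text = new StringBuilder();
      private boolean inText = false;

      public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("text".equals(qName)) {      // article body element in the dump schema
          inText = true;
          text.setLength(0);
        }
      }

      public void characters(char[] ch, int start, int length) {
        if (inText) text.append(ch, start, length);
      }

      public void endElement(String uri, String local, String qName) {
        if ("text".equals(qName)) {
          inText = false;
          handlePage(text.toString());   // hand the article straight to the ingester
        }
      }
    });
    in.close();
  }

  // Placeholder: in the benchmark this would become a Document fed to the indexer.
  static void handlePage(String articleText) {
    // no-op in this sketch
  }
}

The trade-off is the one described above: this spends decompression/parsing CPU
during the timed ingest, but avoids writing ~10G of small files and doing ~1.2M
directory lookups up front.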