>> any warning message given if things are removed (I hope)?
> No, maybe we should make it optional. Then the following would happen:
>
> 1. User selects a couple of files, one or more contains non-xml chars
> 2. Import fails because of that complaining about the first file and suggest
>    to enable "Remove non-xml chars" option
> 3. User enables "Remove non-xml chars" option and retries
>
> What do you think?

+1 for the warning ;) Having the option sounds like a good idea. I guess
there should be only very few of these illegal characters, and they
typically never occur in a text (control characters, etc.?)

> Maybe we should speak a little about how the import wizard should be.
> The current one can only import plain/text and rtf files. And it supports
> only one view.

One view is fine for me.

> One more restriction we currently have is that it only imports
> plain/text from
> files which end with .txt (and .rtf). Should we remove this limitation?

How about using TIKA in the importer?

> Do we need to set the language in the wizard?

Would be very nice to have the option.

> Do you think the name "Document" import wizard is fine?

I think that's ok. An audio or video file probably wouldn't be called a
document by most people. A Word or PDF file, however, would be, and it can
be converted to plain text.

Cheers,

Richard

-- 
-------------------------------------------------------------------
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab
FB 20 Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
www.ukp.tu-darmstadt.de
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-------------------------------------------------------------------
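
For illustration, the fail-then-retry behaviour discussed above could look roughly like the sketch below. It assumes that "non-xml chars" means characters outside the `Char` production of XML 1.0 (most control characters, lone surrogates, U+FFFE and U+FFFF). The class `NonXmlChars`, its methods, and the `removeNonXml` flag are hypothetical names made up for this example; they are not part of the existing import wizard.

```java
// Hypothetical helper, not part of the actual importer.
final class NonXmlChars {

    // Legal characters according to the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the char index of the first illegal character, or -1 if none.
    static int firstIllegalIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    // Drops every character that is not legal in XML 1.0.
    static String removeNonXmlChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            .filter(NonXmlChars::isXmlChar)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // Step 2: fail on the first offending file and point to the option.
    // Step 3: with the option enabled, strip the characters instead.
    static String importText(String fileName, String text, boolean removeNonXml) {
        if (removeNonXml) {
            return removeNonXmlChars(text);
        }
        int index = firstIllegalIndex(text);
        if (index >= 0) {
            throw new IllegalArgumentException(fileName
                    + " contains a non-xml character at index " + index
                    + "; enable \"Remove non-xml chars\" and retry");
        }
        return text;
    }
}
```

A warning could be derived from the same check, for example by comparing the text length before and after `removeNonXmlChars` and reporting the difference to the user.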
