On 16/02/2011 11:05, Marta Villegas wrote:
> Hi,

Hello

> When running Taverna under Windows, the system I/O functions Read Text
> File and Write Text File do not behave as expected,
> in case of UTF8 files.

[snip of specifying encoding for reading/writing files]

> Now, while this can be done explicitly in cases where the Beanshell code
> can be edited, it can not be
> done in cases where an input port is used; there is no option to edit
> the underlying code.
> Modifying the batch scripts (executeworkflow.bat) by adding the line
> set ARGS=%ARGS% -Dfile.encoding=UTF-8 was not successful either.

Yes, looking at the code, it seems Taverna reads the files as a byte
array because it cannot predict whether the services will consume the
data as character-encoded text or as bytes. The byte array is then not
properly converted using the current default charset
(Charset.defaultCharset()).

I think the only way round it at present would be to feed the data into 
a customized version of the local service "Byte array to string", rather 
than reading it directly.
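
As a minimal sketch of what such a customized conversion might do (this is
plain Java rather than the actual Taverna service script; the class name,
method name and sample bytes are hypothetical), the idea is to name the
charset explicitly instead of relying on the platform default:

```java
import java.nio.charset.Charset;

// Hypothetical sketch, not Taverna code: decode a byte array with an
// explicitly named charset rather than the platform default.
class ByteArrayToStringSketch {

    static String decode(byte[] bytes, String encoding) {
        return new String(bytes, Charset.forName(encoding));
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute, encoded as UTF-8: the accented
        // letter takes two bytes (0xC3 0xA9).
        byte[] utf8 = {0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9};

        // Decoding with the charset the file was written in gives 4 characters.
        String right = decode(utf8, "UTF-8");

        // ISO-8859-1 stands in here for a single-byte Windows default:
        // each of the two bytes becomes its own character, giving 5.
        String wrong = decode(utf8, "ISO-8859-1");

        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println("UTF-8 length: " + right.length());
        System.out.println("ISO-8859-1 length: " + wrong.length());
    }
}
```

In a Beanshell-based service the same one-line decode would presumably
replace whatever conversion relies on the default charset.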

> It would be nice if Taverna software is adapted so that it enforces
> UTF-8 processing by e.g.
> taking over the character code setting as the above, or by any other
> action. May be we miss something...

It is a difficult problem.  We did look at using an equivalent to 
mimemagic to try to detect the character set, but in general that does 
not work.
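
To illustrate why detection is unreliable in general, here is a small
sketch (hypothetical class and method names, not what Taverna does). A
strict UTF-8 decode can reject input that is definitely not UTF-8, but
passing the check proves little: plain ASCII bytes are equally valid in
ISO-8859-1, windows-1252 and many other encodings, and any byte sequence
at all decodes without error as ISO-8859-1.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical sketch: a strict well-formedness check for UTF-8.
class Utf8CheckSketch {

    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of silently
            // substituting replacement characters.
            Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8Accent = {(byte) 0xC3, (byte) 0xA9};  // e-acute in UTF-8
        byte[] latin1Accent = {0x63, (byte) 0xE9};       // "c" + e-acute in ISO-8859-1
        byte[] ascii = {0x63, 0x61};                     // valid in both

        System.out.println(isWellFormedUtf8(utf8Accent));
        System.out.println(isWellFormedUtf8(latin1Accent));
        System.out.println(isWellFormedUtf8(ascii));
    }
}
```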

> We are engaged in natural language processing projects and character
> encoding is crucial in our domain.
> (Note that, for example: we cannot read a file and send its content to a
> named entity recognizer or a
> translator system if we are in windows and we get unexpected results
> when typing inputs)

I have put an issue into Jira to look at this, preferably for Taverna
2.3 (the issue is at
http://www.mygrid.org.uk/dev/issues/browse/T2-1750). The external tool
service will almost certainly need this as well.

> Thanks!
>
> --
> Marta Villegas
> [email protected] <mailto:[email protected]>

Alan

_______________________________________________
taverna-users mailing list
[email protected]
Web site: http://www.taverna.org.uk
Mailing lists: http://www.taverna.org.uk/about/contact-us/
