Re: [galaxy-dev] character set encoding metadata

John Chilton Thu, 28 Nov 2013 12:48:28 -0800

Excellent talk! I watched it earlier in the week - I feel like it has
already made me a better programmer.

That said, I am not sure I have any really great answers for you. The
least disruptive thing you could do is add a new metadata element to
the Text datatype (hopefully most text based datatypes are sub-typing
that). Intuitively this makes sense to me because this attribute is only
valid for text-based datatypes, not all datasets or hdas. How to set
that is not entirely clear to me though - for any given datatype you
could override or modify set_meta to do this - but I don't know how I
would get that input from the user into the set_meta procedure.

Depending on your use cases - this next suggestion might not be viable
- but the easiest and most robust thing to do is probably not track
what character-set something is but instead assume it is all UTF-8
(ASCII files already are, as are ISO-8859-1 files with no bytes above
0x7F). Then modify the upload
form(s) to have a new option to convert incoming input files from a
supplied character set into UTF-8. Then in tools/data_source/upload.py
check for this new character set parameter and use a tool such as
recode or iconv to do this conversion during the upload/dataset
creation process. I am making no promises, but if you were hoping to
get these changes included in Galaxy this is what I would be most
willing to consider.
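
A minimal sketch of that conversion step, for illustration only. It
assumes the upload form passes the user-supplied character set through
as a string, and it uses Python's built-in codecs in place of an
external tool such as recode or iconv. The function name
convert_to_utf8 is hypothetical and is not part of upload.py.

```python
import codecs
import io
import os
import shutil
import tempfile


def convert_to_utf8(path, source_charset, chunk_size=65536):
    """Rewrite the file at `path` in place as UTF-8.

    Hypothetical helper: `source_charset` is whatever the user supplied
    on the upload form, e.g. "iso-8859-1" or "utf-16".
    """
    # Fail early with LookupError if Python does not know the charset name.
    codecs.lookup(source_charset)

    # Write to a temporary file in the same directory, then swap it in,
    # so a failed conversion never leaves a half-written dataset behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        # newline="" keeps the original line endings untouched.
        with io.open(path, "r", encoding=source_charset, newline="") as src:
            with io.open(tmp_path, "w", encoding="utf-8", newline="") as dst:
                shutil.copyfileobj(src, dst, chunk_size)
        os.replace(tmp_path, path)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Bytes that are not valid in the supplied character set raise
UnicodeDecodeError, which the upload code could report back to the
user instead of storing a corrupted dataset.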

Getting Galaxy to track, process, and serve a bunch of different
character sets would be a real challenge - allowing for the assumption
that it is all just UTF-8, however, is much easier. It is also a
variant on the advice given in the video you sent along: instead of
converting everything to unicode as soon as possible, you would be
converting everything to UTF-8 as soon as possible.
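
If everything is assumed to be UTF-8, it may be worth checking that
assumption when a file arrives. A small sketch of such a check (the
name is_valid_utf8 is hypothetical, not an existing Galaxy function):

```python
def is_valid_utf8(data):
    """Return True if the byte string `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Plain ASCII is always valid UTF-8.
assert is_valid_utf8(b"plain ASCII text")
# Text already encoded as UTF-8 passes.
assert is_valid_utf8(u"caf\u00e9".encode("utf-8"))
# The same text encoded as ISO-8859-1 does not.
assert not is_valid_utf8(u"caf\u00e9".encode("iso-8859-1"))
```

A check like this cannot tell which character set a file is really
in, only that it is not UTF-8, so the user would still need to supply
the source character set.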

Hope this helps.

-John

On Sun, Nov 24, 2013 at 5:22 PM, Robert Baertsch <baert...@soe.ucsc.edu> wrote:
> Does galaxy have an official place to store the character set for a text file?
>
> I'm thinking of modifying the upload tool to prompt for the character set, 
> since it cannot be sniffed.
>
> It should go in the hda/ldda metadata hash or extended_metadata for the 
> dataset.
>
> Any opinions?
>
> BTW: everyone should watch this talk on unicode.
>
> http://bit.ly/unipain
>
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   http://lists.bx.psu.edu/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/
