RE: XML Considered Harmful

2021-09-29 Thread Avi Gross via Python-list

I think that to make electricity comprehend, you need a room temperature
superconductor. The Cooper Pairs took a while to comprehend but now ...

I think, seriously, we have established the problems with guessing that
others are using the language in a way we assume. 

So how many comprehensions does Python have?

[] - list comprehension
{} - dictionary OR set comprehension
() - generator expression

Tuples are incomprehensible and I wonder if any other comprehensions might
make sense to add, albeit we may need new symbols.
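
For anyone keeping count, here is a small sketch of the forms listed above. The word list is made up purely for illustration.

```python
# Illustrative data only; none of these values come from the thread.
words = ["watt", "volt", "ohm", "volt"]

lengths_list = [len(w) for w in words]      # list comprehension
lengths_dict = {w: len(w) for w in words}   # dictionary comprehension
unique_words = {w for w in words}           # set comprehension
lengths_gen = (len(w) for w in words)       # generator expression, evaluated lazily

# There is no tuple comprehension: bare parentheses give a generator
# expression, so a tuple has to be built by passing one to tuple().
lengths_tuple = tuple(len(w) for w in words)
```

Note that the braces are shared: a "key: value" pair inside {} makes a dictionary, a bare expression makes a set.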

-----Original Message-----
From: Python-list  On
Behalf Of Michael F. Stemper
Sent: Wednesday, September 29, 2021 9:04 AM
To: python-list@python.org
Subject: Re: XML Considered Harmful

On 28/09/2021 18.21, Greg Ewing wrote:
> On 29/09/21 4:37 am, Michael F. Stemper wrote:
>> I'm talking about something made
>> from tons of iron and copper that is oil-filled and rotates at 1800 rpm.
> 
> To avoid confusion, we should rename them "electricity comprehensions".

Hah!

--
Michael F. Stemper
If you take cranberries and stew them like applesauce they taste much more
like prunes than rhubarb does.
--
https://mail.python.org/mailman/listinfo/python-list


Re: XML Considered Harmful

2021-09-29 Thread Michael F. Stemper


On 28/09/2021 18.21, Greg Ewing wrote:

On 29/09/21 4:37 am, Michael F. Stemper wrote:

I'm talking about something made
from tons of iron and copper that is oil-filled and rotates at 1800 rpm.


To avoid confusion, we should rename them "electricity comprehensions".


Hah!

--
Michael F. Stemper
If you take cranberries and stew them like applesauce they taste much
more like prunes than rhubarb does.

Re: XML Considered Harmful

2021-09-28 Thread dn via Python-list

On 29/09/2021 06.53, Michael F. Stemper wrote:
> On 28/09/2021 10.53, Stefan Ram wrote:
>> "Michael F. Stemper"  writes:
>>> Well, I could continue to hard-code the data into one of the test
>>> programs
>>
>>    One can employ a gradual path from a program with hardcoded
>>    data to an entity sharable by different programs.
>>
>>    When I am hurried to rush to a working program, I often
>>    end up with code that contains configuration data spread
>>    (interspersed) all over the code. For example:
> 
>>    1st step: give a name to all the config data:
> 
>>    2nd: move all config data to the top of the source code,
>>    directly after all the import statements:
> 
>>    3rd: move all config data to a separate "config.py" module:
>>
>> import ...
>> import config
>> ...
>>
>> ...
>> open( config.project_directory + "data.txt" )
>> ...
>>
>>> but that would mean that every time that I wanted to look
>>> at a different scenario, I'd need to modify a program.
>>
>>    Now you just have to modify "config.py" - clearly separated
>>    from the (rest of the) "program".
> 
> Well, that doesn't really address what format to store the data
> in. I was going to write a module that would read data from an
> XML file:
> 
> import EDXML
> gens = EDXML.GeneratorsFromXML( "gendata1.xml" )
> fuels = EDXML.FuelsFromXML( "fueldata3.xml" )
> 
> (Of course, I'd really get the file names from command-line arguments.)
> 
> Then I read a web page that suggested use of XML was a poor idea,
> so I posted here asking for a clarification and alternate suggestions.
> 
> One suggestion was that I use YAML, in which case, I'd write:
> 
> import EDfromYAML
> gens = EDfromYAML.GeneratorsFromYAML( "gendata1.yaml" )
> fuels = EDfromYAML.FuelsFromYAML( "fueldata3.yaml" )
> 
>>> And when I discover anomalous behavior, I'd need to copy the
>>> hard-coded data into another program.
>>
>>    Now you just have to import "config.py" from the other program.
> 
> This sounds like a suggestion that I hard-code the data into a
> module. I suppose that I could have half-a-dozen modules with
> different data sets and ln them as required:
> 
> $ rm GenData.py* FuelData.py*
> $ ln gendata1.py GenData.py
> $ ln fueldata3.py FuelData.py
> 
> It seems to me that a more thorough separation of code and data
> might be useful.

Dear Michael,

May I suggest that you are right - and that he is right!
(which is a polite way of saying, also, that both are wrong. Oops!)
(with any and all due apologies)

There are likely cross-purposes here.

I am interpreting various clues, from throughout the thread (from when
the snowflakes were still falling!) that you and I were trained
way-back: to first consider the problem, state the requirements
("hypothesis" in Scientific Method), and work our way to a solution
on-paper. Only when we had a complete 'working solution', did we step up
to the machine (quite possibly a Card Punch, cf a 'computer') and
implement.

Also, that we thought in terms of a clear distinction between
"program[me]" and "data" - and the compiler and link[age]-editor
software technology of the time maintained such.

Whereas 'today', many follow the sequence of "Test-Driven Development"
(er, um, often omitting the initial test) of attempting some idea as
code, reviewing the result, and then "re-factoring" (improving), in a
circular progression - until it not only works, but works well.

This requires similar "stepwise decomposition" to what we learned, but
differs when it comes to code-composition. This approach is more likely
to accumulate a solution 'bottom-up' and component-wise, rather than
creating an entire (and close-to-perfect) solution first and as a whole.

Let's consider the Python REPL. Opening a terminal and starting the
Python interpreter, gives us the opportunity to write short "snippets"
of code and see the results immediately. This is VERY handy for ensuring
that an idea is correct, or to learn exactly how a particular construct
works. Thus, we can 'test' before we write any actual code (and can
copy-paste the successful 'prototype' into our IDE/editor!).

We didn't enjoy such luxury back in the good?bad old days. Young people
today - they just don't know how lucky they are!
(cue other 'grumpy old man' mutterings)

Other points to consider: 'terminals' (cf mainframes), interpreted
languages, and 'immediacy'. These have all brought "opportunities" and
thus "change" to the way developers (can) work and think! (which is why
I outlined what I think of as 'our training' and thus 'our thinking
process' when it comes to software design, above)

Another 'tectonic shift' is that in the old days 'computer time' was
hugely expensive and thus had to be optimised. Whereas these days (even
in retirement) programming-time has become the more expensive component
as computers (or compute-time in cloud-speak) have become cheaper - and
thus we reveal one of THE major attractive attributes of the Python
programming language!

Accordingly, (and now any apologies-due ma

Re: XML Considered Harmful

2021-09-28 Thread Greg Ewing


On 29/09/21 4:37 am, Michael F. Stemper wrote:

I'm talking about something made
from tons of iron and copper that is oil-filled and rotates at 1800 rpm.


To avoid confusion, we should rename them "electricity comprehensions".

--
Greg

Re: XML Considered Harmful

2021-09-28 Thread Chris Angelico

On Wed, Sep 29, 2021 at 8:00 AM Stefan Ram  wrote:
>   JSON is a kind of a subset of JavaScript for JavaScript
>   programmers. In Python, we can use JSON too, or we can
>   use Python itself.
>
>   When some external requirement to use a data exchange
>   notation like JSON should appear, one can still "translate"
>   such Python modules to JSON. This path is not blocked.

JSON exists as a transport mechanism because it is restricted and
can't contain malicious code. A Python equivalent would be
ast.literal_eval - a strict subset of the language but restricted for
safety. For trusted code, yes, straight code can be used.

(And ast.literal_eval, unlike JSON, can handle comments.)
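
A minimal sketch of that difference, using only the standard library. The record and its field names are hypothetical, not taken from anyone's data in this thread.

```python
import ast
import json

# Hypothetical record; the keys and values are invented for this example.
text = """{
    # a comment on its own line
    "name": "unit1",
    "pmax_mw": 400,  # trailing comments are fine too
}"""

# literal_eval accepts Python literals (including comments and the
# trailing comma) but nothing executable.
record = ast.literal_eval(text)

# The same text is not valid JSON because of the comments.
try:
    json.loads(text)
    json_accepted = True
except json.JSONDecodeError:
    json_accepted = False

# Anything that is not a plain literal, such as a call, is rejected.
try:
    ast.literal_eval("__import__('os').getcwd()")
    call_accepted = True
except ValueError:
    call_accepted = False
```

Here record ends up as an ordinary dict, while both json_accepted and call_accepted come out False.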

ChrisA

RE: XML Considered Harmful

2021-09-28 Thread Avi Gross via Python-list

Not lucky at all, Michael. The problem is he published a number of things
with Paul Erdős a few years after I got my degrees and went to Bell
laboratories. I never met Erdős but he was prolific and had 507 people
publish with him as co-authors. I would have loved to as I also speak
languages he spoke including Hungarian and Math.

Well, time to get back to something remotely about Python. Is there any
concept of a Rossum Number where anyone who worked directly with Guido Van
Rossum is a 1 (or True or truthy) and ...

Hey I just realized my Berners-Lee number might be 1 but it was so long ago
we worked on what Hypertext should look like, ...

-----Original Message-----
From: Python-list  On
Behalf Of Michael F. Stemper
Sent: Tuesday, September 28, 2021 2:41 PM
To: python-list@python.org
Subject: Re: XML Considered Harmful

On 28/09/2021 13.27, Avi Gross wrote:
> Well, Michael, if you want to go back to the eighties, and people you
> worked with, I did my Thesis with a professor who later had an Erdős
> number of 1!
> Too bad I never got around to publishing something with him or I could
> have been a 2!

Lucky you. If a paper that a friend of mine is submitting to various
journals gets accepted by one of them, I'll end up with a 4 or 5 through
him. However, as the months pass, it's looking more like mine will end up
NaN.

--
Michael F. Stemper
Isaiah 58:6-7

Re: XML Considered Harmful

2021-09-28 Thread Michael F. Stemper


On 28/09/2021 13.27, Avi Gross wrote:

Well, Michael, if you want to go back to the eighties, and people you worked
with, I did my Thesis with a professor who later had an Erdős number of 1!
Too bad I never got around to publishing something with him or I could have
been a 2!


Lucky you. If a paper that a friend of mine is submitting to various
journals gets accepted by one of them, I'll end up with a 4 or 5 through
him. However, as the months pass, it's looking more like mine will end
up NaN.

--
Michael F. Stemper
Isaiah 58:6-7

Re: XML Considered Harmful

2021-09-28 Thread Karsten Hilbert

Am Tue, Sep 28, 2021 at 12:53:49PM -0500 schrieb Michael F. Stemper:

> This sounds like a suggestion that I hard-code the data into a
> module. I suppose that I could have half-a-dozen modules with
> different data sets and ln them as required:
>
> $ rm GenData.py* FuelData.py*
> $ ln gendata1.py GenData.py
> $ ln fueldata3.py FuelData.py

vi data.py

generators = {}
generators['name1'] = {'fuel': ..., ...}
generators['name2'] = {...}
...

vi simulation.py

import sys
import data

generator = data.generators[sys.argv[1]]
run_simulation(generator)

or some such ?

Your data "format" is ... Python code.

Karsten
--
GPG  40BE 5B0E C98E 1713 AFA6  5BC0 3BEA AC80 7D4F C89B

RE: XML Considered Harmful

2021-09-28 Thread Avi Gross via Python-list

Well, Michael, if you want to go back to the eighties, and people you worked
with, I did my Thesis with a professor who later had an Erdős number of 1!
Too bad I never got around to publishing something with him or I could have
been a 2!

But that work, being so long ago, was not in Python but mainly in PASCAL.

Ah the good old days.

-----Original Message-----
From: Python-list  On
Behalf Of Michael F. Stemper
Sent: Tuesday, September 28, 2021 11:45 AM
To: python-list@python.org
Subject: Re: XML Considered Harmful

On 28/09/2021 02.25, Peter J. Holzer wrote:
> On 2021-09-27 21:01:04 -0400, Avi Gross via Python-list wrote:
>> You keep talking about generators, though. If the generators are
>> outside of your program, then yes, you need to read in whatever they
>> produce.
> 
> As I understood it, the "generators" don't generate the data, they are 
> the subject of the data: Devices that generate electricity by burning 
> fuel and he's modelling some aspect of their operation. Maybe 
> efficiency or power output or something like that (I tried to search 
> for "IHR curve", but couldn't find anything).

If you expand "IHR curve" to "incremental heat rate curve", you'll get
better results. When power engineers talk, we say the first, when we publish
papers, we write the second.

If you want to see the bigger picture, search on "Economic Dispatch".
In fact, doing so points me to something written by a guy I worked with back
in the 1980s:
<http://www2.econ.iastate.edu/classes/econ458/tesfatsion/EconomicDispatchIntroToOptimization.DKirschen2004.LTEdits.pdf>

Slide 3 even shows a piecewise-linear curve.

--
Michael F. Stemper
A preposition is something you should never end a sentence with.

RE: XML Considered Harmful

2021-09-28 Thread Avi Gross via Python-list

I replied to Michael privately but am intrigued by his words here:

"The thing that creates realistic test cases is my brain."

I consider extensions to my brain to include using a language like Python on
my computer and in particular, to take a model I think of and instantiate
it. Lots of people have shared modules that can be tweaked to do all kinds
of simulations using a skeleton you provide that guides random number usage.
Some will generate lots of those and stare at them and use their brain to
further narrow it down to realistic ones. For example, in designing, say, a
car where miles per gallon should randomly range between 10 and 100 while
engine size ranges from this to that and so on, it may turn out that large
engines don't go well with large numbers for miles per gallon.

I have worked on projects where a set of guides then created hundreds of
thousands of fairly realistic scenarios using every combination of an
assortment of categorical variables and the rest of the program sliced and
diced the results and did all kinds of statistical calculations and then
generated all kinds of graphs. There was no real data but there was a
generator that was based on the kinds of distributions previously published
in the field that helped guide parameters to be somewhat realistic.

In your case, I understand you will decide how to do it and just note you
used language with multiple meanings that misled a few of us into thinking
you had a Python function in mind, one of the several things Python
refers to as generators, such as one that efficiently yields the next prime
number when asked. Clearly your explanation now shows you plan on making a
handful of data sets by hand using an editor like vi. Fair enough. No need
to write complex software if your mind is easily able to just make half a
dozen variations in files. And, frankly, not sure why you need XML or much
of anything. It obviously depends on how much you are working with and how
variable. For simpler things, you can hard-code your data structure directly
into your program, run an analysis, change the variables to your second
simulation and repeat.
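
For contrast with the 500-ton kind, this is roughly what the Python sense of "generator" looks like: a sketch of a function that yields the next prime each time it is asked. The function name is hypothetical, and the trial-division approach is chosen for brevity rather than speed.

```python
from itertools import count, islice


def primes():
    """Yield the primes 2, 3, 5, ... one at a time, on demand."""
    found = []  # primes seen so far, used as trial divisors
    for n in count(2):
        # n is prime if no earlier prime up to sqrt(n) divides it
        if all(n % p for p in found if p * p <= n):
            found.append(n)
            yield n


# Nothing is computed until values are requested.
first_ten = list(islice(primes(), 10))
```

The list first_ten holds 2, 3, 5, 7, 11, 13, 17, 19, 23 and 29; the generator itself would happily keep going for as long as someone asks.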

I am afraid that I, like a few others here, assumed a more abstract and much
more complex need to be addressed. Yours may be complex in other parts but
may need nothing much for the part we are talking about. It sounds like you
do want something easier to create while editing.

-----Original Message-----
From: Python-list  On
Behalf Of Michael F. Stemper
Sent: Tuesday, September 28, 2021 11:38 AM
To: python-list@python.org
Subject: Re: XML Considered Harmful

On 27/09/2021 20.01, Avi Gross wrote:
> Michael,
> 
> Given your further explanation, indeed reading varying numbers of 
> points in using a CSV is not valid, albeit someone might just make N 
> columns (maybe a few more than 7) to handle a hopefully worst case. 
> Definitely it makes more sense to read in a list or other data structure.
> 
> You keep talking about generators, though. If the generators are
> outside of your program, then yes, you need to read in whatever they
> produce.

My original post (which is as the snows of yesteryear) made explicit the
fact that when I refer to a generator, I'm talking about something made from
tons of iron and copper that is oil-filled and rotates at 1800 rpm.
(In most of the world other than North America, they rotate at 1500 rpm.)

Nothing to do with the similarly-named python construct. Sorry for the
ambiguity.

> But if
> your data generator is within your own program,

The data is created in my mind, and approximates typical physical
characteristics of real generators.

> My impression is you may not be using your set of data points for any 
> other purposes except when ready to draw a spline.

Nope, the points give a piecewise-linear curve, and values between two
consecutive points are found by linear interpolation. It's industry standard
practice.

> Can I just ask if by a generator, you do NOT mean the more typical use 
> of "generator" as used in python

Nope; I mean something that weighs 500 tons and rotates, producing
electrical energy.

>   Do you mean something that creates
> realistic test cases to simulate a real-world scenario?

The thing that creates realistic test cases is my brain.

>   These often can
> create everything at once and often based on random numbers.

I have written such, but not in the last thirty years. At that time, I
needed to make up data for fifty or one hundred generators, along with tie
lines and loads.

What I'm working on now only needs a handful of generators at a time; just
enough to test my hypothesis. (Theoretically, I could get by with two, but
that offends my engineering sensibilities.)

> create everything at once and often based on random numbers. Again, if 
> you have or build such code, it is not clear it needs to be written to 
>

Re: XML Considered Harmful

2021-09-28 Thread Michael F. Stemper


On 28/09/2021 10.53, Stefan Ram wrote:

"Michael F. Stemper"  writes:

Well, I could continue to hard-code the data into one of the test
programs


   One can employ a gradual path from a program with hardcoded
   data to an entity sharable by different programs.

   When I am hurried to rush to a working program, I often
   end up with code that contains configuration data spread
   (interspersed) all over the code. For example:



   1st step: give a name to all the config data:



   2nd: move all config data to the top of the source code,
   directly after all the import statements:



   3rd: move all config data to a separate "config.py" module:

import ...
import config
...

...
open( config.project_directory + "data.txt" )
...


but that would mean that every time that I wanted to look
at a different scenario, I'd need to modify a program.


   Now you just have to modify "config.py" - clearly separated
   from the (rest of the) "program".


Well, that doesn't really address what format to store the data
in. I was going to write a module that would read data from an
XML file:

import EDXML
gens = EDXML.GeneratorsFromXML( "gendata1.xml" )
fuels = EDXML.FuelsFromXML( "fueldata3.xml" )

(Of course, I'd really get the file names from command-line arguments.)
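
A sketch of what taking the file names from the command line could look like with the standard argparse module. The argument names gen_file and fuel_file, and the parse_args helper, are assumptions made for this example.

```python
import argparse


def parse_args(argv=None):
    """Parse two positional file names; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="Run one dispatch scenario.")
    parser.add_argument("gen_file", help="file describing the generators")
    parser.add_argument("fuel_file", help="file describing the fuels")
    return parser.parse_args(argv)


# Passing a list stands in for a real command line here.
args = parse_args(["gendata1.xml", "fueldata3.xml"])
```

Switching to a different scenario then means changing the command line rather than editing the program.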

Then I read a web page that suggested use of XML was a poor idea,
so I posted here asking for a clarification and alternate suggestions.

One suggestion was that I use YAML, in which case, I'd write:

import EDfromYAML
gens = EDfromYAML.GeneratorsFromYAML( "gendata1.yaml" )
fuels = EDfromYAML.FuelsFromYAML( "fueldata3.yaml" )


And when I discover anomalous behavior, I'd need to copy the
hard-coded data into another program.


   Now you just have to import "config.py" from the other program.


This sounds like a suggestion that I hard-code the data into a
module. I suppose that I could have half-a-dozen modules with
different data sets and ln them as required:

$ rm GenData.py* FuelData.py*
$ ln gendata1.py GenData.py
$ ln fueldata3.py FuelData.py

It seems to me that a more thorough separation of code and data
might be useful.

--
Michael F. Stemper
The name of the story is "A Sound of Thunder".
It was written by Ray Bradbury. You're welcome.

Re: XML Considered Harmful

2021-09-28 Thread Michael F. Stemper


On 28/09/2021 02.25, Peter J. Holzer wrote:

On 2021-09-27 21:01:04 -0400, Avi Gross via Python-list wrote:

You keep talking about generators, though. If the generators are outside of
your program, then yes, you need to read in whatever they produce.


As I understood it, the "generators" don't generate the data, they are
the subject of the data: Devices that generate electricity by burning
fuel and he's modelling some aspect of their operation. Maybe efficiency
or power output or something like that (I tried to search for "IHR
curve", but couldn't find anything).


If you expand "IHR curve" to "incremental heat rate curve", you'll get
better results. When power engineers talk, we say the first, when we
publish papers, we write the second.

If you want to see the bigger picture, search on "Economic Dispatch".
In fact, doing so points me to something written by a guy I worked with
back in the 1980s:


Slide 3 even shows a piecewise-linear curve.

--
Michael F. Stemper
A preposition is something you should never end a sentence with.

Re: XML Considered Harmful

2021-09-28 Thread Michael F. Stemper


On 27/09/2021 20.01, Avi Gross wrote:

Michael,

Given your further explanation, indeed reading varying numbers of points in
using a CSV is not valid, albeit someone might just make N columns (maybe a
few more than 7) to handle a hopefully worst case. Definitely it makes more
sense to read in a list or other data structure.

You keep talking about generators, though. If the generators are outside of
your program, then yes, you need to read in whatever they produce.


My original post (which is as the snows of yesteryear) made explicit the
fact that when I refer to a generator, I'm talking about something made
from tons of iron and copper that is oil-filled and rotates at 1800 rpm.
(In most of the world other than North America, they rotate at 1500 rpm.)

Nothing to do with the similarly-named python construct. Sorry for the
ambiguity.


But if
your data generator is within your own program,


The data is created in my mind, and approximates typical physical
characteristics of real generators.


My impression is you may not be using your set of data points for any other
purposes except when ready to draw a spline.


Nope, the points give a piecewise-linear curve, and values between two
consecutive points are found by linear interpolation. It's industry
standard practice.
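
For readers unfamiliar with the practice, a minimal sketch of linear interpolation on a piecewise-linear curve. The function name and the breakpoint values are invented for illustration, and the sketch assumes the breakpoints are sorted by increasing x.

```python
from bisect import bisect_right


def interpolate(points, x):
    """Value at x of the piecewise-linear curve through points.

    points is a list of (x, y) pairs sorted by strictly increasing x.
    """
    if len(points) < 2:
        raise ValueError("need at least two points to define a segment")
    xs = [px for px, _ in points]
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the curve")
    # Index of the right-hand end of the segment containing x.
    i = min(bisect_right(xs, x), len(points) - 1)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Made-up breakpoints: (output in MW, incremental heat rate).
curve = [(100, 9000), (200, 10000), (400, 12000)]
midway = interpolate(curve, 150)   # halfway along the first segment
```

With these made-up numbers, midway is 9500.0, and asking for a value outside 100 to 400 raises ValueError instead of extrapolating.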



Can I just ask if by a generator, you do NOT mean the more typical use of
"generator" as used in python


Nope; I mean something that weighs 500 tons and rotates, producing
electrical energy.


  Do you mean something that creates
realistic test cases to simulate a real-world scenario?


The thing that creates realistic test cases is my brain.


  These often can
create everything at once and often based on random numbers.


I have written such, but not in the last thirty years. At that time, I
needed to make up data for fifty or one hundred generators, along with
tie lines and loads.

What I'm working on now only needs a handful of generators at a time;
just enough to test my hypothesis. (Theoretically, I could get by with
two, but that offends my engineering sensibilities.)


create everything at once and often based on random numbers. Again, if you
have or build such code, it is not clear it needs to be written to disk and
then read back.


Well, I could continue to hard-code the data into one of the test
programs, but that would mean that every time that I wanted to look
at a different scenario, I'd need to modify a program. And when I
discover anomalous behavior, I'd need to copy the hard-coded data
into another program.

Having the data in a separate file means that I can provide a function
to read that file and return a list of generators (or fuels) to a
program. Multiple test cases are then just multiple files, all of which
are available to multiple programs.
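
A rough sketch of such a reader function. JSON is used here only because a parser ships with Python; the thread does not settle on a format, and the key names ("generators", "name", "ihr_points") are assumptions made for this example.

```python
import json


def parse_generators(text):
    """Turn JSON text into a list of (name, points) tuples."""
    data = json.loads(text)
    return [
        (gen["name"], [tuple(point) for point in gen["ihr_points"]])
        for gen in data["generators"]
    ]


def load_generators(path):
    """Read one scenario file and return its list of generators."""
    with open(path, encoding="utf-8") as handle:
        return parse_generators(handle.read())


# Made-up scenario; in practice this text would live in its own file.
sample = '{"generators": [{"name": "unit1", "ihr_points": [[100, 9000], [400, 12000]]}]}'
gens = parse_generators(sample)
```

Each test case is then one file, and any program that imports the reader can use any of them.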


  You may of course want to save it, perhaps as a log, to show
what your program was working on.


That's another benefit of having the data in external files.

--
Michael F. Stemper
A preposition is something you should never end a sentence with.

Re: XML Considered Harmful

2021-09-28 Thread dn via Python-list

On 25/09/2021 11.26, David L Neil via Python-list wrote:
> On 25/09/2021 11.00, Chris Angelico wrote:
> 
>> Invented because there weren't enough markup languages, so we needed another?
> 
> Anything You Can Do I Can Do Better
> https://www.youtube.com/watch?v=_UB1YAsPD6U


Article (rather brief) introducing YAML, of possible interest:
https://opensource.com/article/21/9/intro-yaml

-- 
Regards,
=dn

Re: XML Considered Harmful

2021-09-28 Thread Peter J. Holzer

On 2021-09-27 21:01:04 -0400, Avi Gross via Python-list wrote:
> You keep talking about generators, though. If the generators are outside of
> your program, then yes, you need to read in whatever they produce.

As I understood it, the "generators" don't generate the data, they are
the subject of the data: Devices that generate electricity by burning
fuel and he's modelling some aspect of their operation. Maybe efficiency
or power output or something like that (I tried to search for "IHR
curve", but couldn't find anything).

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


RE: XML Considered Harmful

2021-09-27 Thread Avi Gross via Python-list

Michael,

Given your further explanation, indeed reading varying numbers of points in
using a CSV is not valid, albeit someone might just make N columns (maybe a
few more than 7) to handle a hopefully worst case. Definitely it makes more
sense to read in a list or other data structure.

You keep talking about generators, though. If the generators are outside of
your program, then yes, you need to read in whatever they produce. But if
your data generator is within your own program, that opens up other
possibilities. I am not saying you necessarily would want to use the usual
numpy/pandas modules and have some kind of data.frame. I do know other
languages (like R) where I have used columns that are lists.

My impression is you may not be using your set of data points for any other
purposes except when ready to draw a spline. Again, in some languages this
opens up many possibilities. A fairly trivial one is if you store your
points as something like "1.2:3.86:12:83.2" meaning a character string with
some divider. When ready to use that, it is fairly straightforward to
convert it to a list to use for your purpose.
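
As a sketch of that conversion, using the same example string and assuming ":" never appears inside a value:

```python
encoded = "1.2:3.86:12:83.2"

# Split on the divider and convert each field to a float.
points = [float(field) for field in encoded.split(":")]
```

The result is an ordinary list of floats, ready to be handed to whatever draws or evaluates the curve.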

Can I just ask if by a generator, you do NOT mean the more typical use of
"generator" as used in python in which some code sort of runs as needed to
keep generating the next item to work on. Do you mean something that creates
realistic test cases to simulate a real-world scenario? These often can
create everything at once and often based on random numbers. Again, if you
have or build such code, it is not clear it needs to be written to disk and
then read back. You may of course want to save it, perhaps as a log, to show
what your program was working on. 

-----Original Message-----
From: Python-list  On
Behalf Of Michael F. Stemper
Sent: Monday, September 27, 2021 11:40 AM
To: python-list@python.org
Subject: Re: XML Considered Harmful

On 25/09/2021 16.39, Avi Gross wrote:
> Michael,
> 
> I don't care what you choose. Whatever works is fine for an internal use.

Maybe I should have taken the provoking article with a few more grains of
salt. At this point, I'm not seeing any issues that are applicable to my use
case.

> But is the data scheme you share representative of your actual
> application?
>
> From what I see below, unless the number of "point" variables is not
> always exactly four, the application might be handled well by any format
> that handles rectangular data, perhaps even CSV.
>
> You show a I mean anything like a data.frame can contain data columns
> like p1,p2,p3,p4 and a categorical one like IHRcurve_name.
> 
> Or do you have a need for more variability such as an undetermined 
> number of similar units in ways that might require more flexibility or 
> be more efficient done another way?

As far as the number of points per IHR curve, the only requirement is that
there must be at least two. It's hard to define a line segment with only
one. The mock data that I have so far has curves ranging from two to five
points. I didn't notice that the snippet that I posted had two curves with
the same number of breakpoints, which was misleading.

My former employer's systems had, IIRC, space for seven points per curve in
the database structures. Of all the sizing changes made over a long career,
I don't recall any customer ever requiring more than that. But, it's
cleanest to use python lists (with no inherent sizing limitations) to
represent the IHR (and incremental cost) curves.
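
One way the list-based representation might look, as a sketch only. The IHRCurve class name and its checks are assumptions for illustration; the only rule taken from the text is that a curve needs at least two points.

```python
from dataclasses import dataclass, field


@dataclass
class IHRCurve:
    """A curve as a plain list of (MW, heat rate) breakpoints."""

    points: list = field(default_factory=list)

    def __post_init__(self):
        # A single point cannot define a line segment.
        if len(self.points) < 2:
            raise ValueError("an IHR curve needs at least two points")
        # Extra assumption of this sketch: breakpoints come in
        # strictly increasing MW order.
        xs = [mw for mw, _ in self.points]
        if xs != sorted(set(xs)):
            raise ValueError("points must be in strictly increasing MW order")


# Made-up values; two, five or seven points are handled identically.
curve = IHRCurve([(100, 9000), (200, 10000), (400, 12000)])
```

Because the points live in a list, nothing in the structure caps the number of breakpoints.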

> MOST of the discussion I am seeing here seems peripheral to getting 
> you what you need for your situation and may require a learning curve 
> to learn to use properly. Are you planning on worrying about how to 
> ship your data encrypted, for example? Any file format you use for 
> storage can presumably be encrypted and sent and decrypted if that
> matters.

This work is intended to look at the feasibility of relaxing some
constraints normally required for the solution of Economic Dispatch.
So all of my data are hypothetical. Once I have stuff up and running, I'll
be making up data for lots of different generators.

Being retired, I don't have access to any proprietary information about any
specific generators, so all of the data is made up out of my head. I still
need a way to get it into my programs, of course.

> So, yes, from an abstract standpoint we can discuss the merits of 
> various approaches. If it matters that humans can deal with your data 
> in a file or that it be able to be imported into a program like EXCEL, 
> those are considerations. But if not, there are quite a few relatively 
> binary formats where your program can save a snapshot of the data into 
> a file and read it back in next time.

Not needed here. I'm strictly interested in getting the models of
(generic) generating fleets in. Output of significant results

Re: XML Considered Harmful

2021-09-27 Thread Chris Angelico

On Tue, Sep 28, 2021 at 2:30 AM Michael F. Stemper
 wrote:
> As far as power is concerned, this is utility-level generating fleets,
> so it's always going to be MW -- even in the US, where we still use
> BTUs for heat.
>

It's easy for *you* to know, and therefore assume, that it's always
MW. But someone else coming along will appreciate some sort of
indication that it's MW and not (say) KW or GW.

I've spent a long time decoding other people's file formats, trying to
figure out what unit something is in. "Huh. The date seems to be
stored in hours?!?"

ChrisA

Re: XML Considered Harmful

2021-09-27 Thread Michael F. Stemper


On 25/09/2021 16.48, 2qdxy4rzwzuui...@potatochowder.com wrote:

On 2021-09-25 at 15:20:19 -0500,
"Michael F. Stemper"  wrote:


... For instance, if
I modeled a fuel like this:

   
 ton
 21.96
 18.2
   

and a generating unit like this:

   
 
   
   
   
   
 
 
   
   
   
   
 
   

why would the fact that I could have chosen, instead, to model
the unit of measure as an attribute of the fuel, or its name
as a sub-element matter? Once the modeling decision has been
made, all of the decisions that might have been would seem to
be irrelevant.


Disclaimer:  I am not a big XML fan, for a number of reasons
already stated in this thread.

That said, please do include units in elements like heat_content,
whether or not it's Joules/kilogram/K, and price, even if it is the
local currency in the only country to which your data applies.


Since the units (dimensions) don't matter as long as they're consistent
between heat_content and the IHR value (MBTU and MBTU/MWh or GJ and
GJ/MWh), I was initially going to ignore this suggestion. However, it
seems that if I added attributes for the unit of measure of heat, that
would allow checking that the data provided are indeed consistent.
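
A minimal sketch of the consistency check described above, assuming the heat unit is carried in a `unit` attribute. The tag names (`heat_content`, `ihr`), the attribute name, and the sample values are assumptions made for illustration; they are not the actual schema from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: tag names, attribute names and values are illustrative.
SAMPLE = """
<fleet>
  <fuel name="example coal">
    <heat_content unit="MBTU/ton">18.2</heat_content>
  </fuel>
  <unit name="example unit">
    <ihr unit="MBTU/MWh">8.5</ihr>
  </unit>
</fleet>
"""

def heat_unit(text):
    """Return the heat part of a compound unit such as 'MBTU/MWh'."""
    return text.split("/")[0].strip()

def heat_units_consistent(xml_text):
    """True if every heat_content and ihr element uses the same heat unit."""
    root = ET.fromstring(xml_text)
    seen = {
        heat_unit(elem.get("unit", ""))
        for tag in ("heat_content", "ihr")
        for elem in root.iter(tag)
    }
    return len(seen) == 1

print(heat_units_consistent(SAMPLE))
```

The check only compares the unit labels; it does not convert between MBTU and GJ.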

Thanks for the suggestion.

With respect to currency, I've had customers (back when I had to work
for a living) use dollars, pesetas, Euros, and pounds. In Wood and
Wollenberg[1], the authors use \cancel{R} to represent a generic
currency. But I might even add a currency attribute to the price
element.


If there's a standard for your industry, or your company, or on
some other level, then at least document what it is and that
you're using it, so that the next person (which may be you a
year from now) doesn't have to guess.


As far as power is concerned, this is utility-level generating fleets,
so it's always going to be MW -- even in the US, where we still use
BTUs for heat.



[1] _Power Generation, Operation, and Control_; Allen J. Wood and Bruce
F. Wollenberg; (c) 1984, John Wiley & Sons.
--
Michael F. Stemper
The FAQ for rec.arts.sf.written is at

Please read it before posting.

Re: XML Considered Harmful

2021-09-27 Thread Michael F. Stemper


On 25/09/2021 16.39, Avi Gross wrote:

Michael,

I don't care what you choose. Whatever works is fine for an internal use.


Maybe I should have taken the provoking article with a few more grains
of salt. At this point, I'm not seeing any issues that are applicable to
my use case.


But is the data scheme you share representative of your actual application?


From what I see below, unless the number of "point" variables is not always
exactly four, the application might be handled well by any format that
handles rectangular data, perhaps even CSV.

I mean anything like a data.frame can contain data columns like
p1,p2,p3,p4 and a categorical one like IHRcurve_name.

Or do you have a need for more variability such as an undetermined number of
similar units in ways that might require more flexibility or be more
efficient done another way?


As far as the number of points per IHR curve, the only requirement
is that there must be at least two. It's hard to define a line segment
with only one. The mock data that I have so far has curves ranging
from two to five points. I didn't notice that the snippet that I
posted had two curves with the same number of breakpoints, which was
misleading.

My former employer's systems had, IIRC, space for seven points per curve
in the database structures. Of all the sizing changes made over a long
career, I don't recall any customer ever requiring more than that. But,
it's cleanest to use python lists (with no inherent sizing limitations)
to represent the IHR (and incremental cost) curves.
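
A minimal sketch of a curve held in a plain Python list, enforcing the "at least two points" rule mentioned above. The class name, the field names, and the use of linear interpolation between breakpoints are assumptions for illustration, not the poster's code.

```python
from dataclasses import dataclass, field

@dataclass
class IHRCurve:
    """Piecewise-linear curve stored as a list of (MW, heat rate) breakpoints."""
    name: str
    points: list = field(default_factory=list)

    def __post_init__(self):
        if len(self.points) < 2:
            raise ValueError("a curve needs at least two points")
        if len({mw for mw, _ in self.points}) != len(self.points):
            raise ValueError("breakpoints must have distinct MW values")
        # Keep breakpoints ordered by power so each segment is well defined.
        self.points = sorted(self.points)

    def value_at(self, mw):
        """Linearly interpolate between the two surrounding breakpoints."""
        for (x0, y0), (x1, y1) in zip(self.points, self.points[1:]):
            if x0 <= mw <= x1:
                return y0 + (y1 - y0) * (mw - x0) / (x1 - x0)
        raise ValueError("power level outside the curve")

curve = IHRCurve("demo", [(100.0, 9.0), (50.0, 8.0)])
print(curve.value_at(75.0))
```

The list can hold two points or twenty; nothing in the structure fixes the count.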



MOST of the discussion I am seeing here seems peripheral to getting you what
you need for your situation and may require a learning curve to learn to use
properly. Are you planning on worrying about how to ship your data
encrypted, for example? Any file format you use for storage can presumably
be encrypted and sent and decrypted if that matters.


This work is intended to look at the feasibility of relaxing some
constraints normally required for the solution of Economic Dispatch.
So all of my data are hypothetical. Once I have stuff up and running,
I'll be making up data for lots of different generators.

Being retired, I don't have access to any proprietary information
about any specific generators, so all of the data is made up out
of my head. I still need a way to get it into my programs, of course.


So, yes, from an abstract standpoint we can discuss the merits of various
approaches. If it matters that humans can deal with your data in a file or
that it be able to be imported into a program like EXCEL, those are
considerations. But if not, there are quite a few relatively binary formats
where your program can save a snapshot of the data into a file and read it
back in next time.


Not needed here. I'm strictly interested in getting the models of
(generic) generating fleets in. Output of significant results will
probably be in CSV, which nicely replicates tabular displays that
I used through most of my career.
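
A minimal sketch of tabular CSV output with the standard csv module. The column names and the sample rows are made up for illustration.

```python
import csv
import io

def write_results(rows, stream):
    """Write one header line followed by one line per generating unit."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["unit", "output_MW", "cost_per_hour"])
    writer.writerows(rows)

buffer = io.StringIO()
write_results([("Unit A", 250.0, 4100.5), ("Unit B", 120.0, 2300.0)], buffer)
print(buffer.getvalue())
```

Writing to a file instead only needs `open(path, "w", newline="")` in place of the `StringIO` buffer.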


Or, did I miss something and others have already produced the data using
other tools, in which case you have to read it in at least once?


Well, the "tool" is vi, but this is a good description of what I'm
doing.

--
Michael F. Stemper
The FAQ for rec.arts.sf.written is at

Please read it before posting.

Re: XML Considered Harmful

2021-09-25 Thread Chris Angelico

On Sun, Sep 26, 2021 at 9:09 AM Eli the Bearded <*@eli.users.panix.com> wrote:
>
> In comp.lang.python, Chris Angelico   wrote:
> > Eli the Bearded <*@eli.users.panix.com> wrote:
> >> I'd use one of the netpbm formats instead of JPEG. PBM for one bit
> >> bitmaps, PGM for one channel (typically grayscale), PPM for three
> >> channel RGB, and PAM for anything else (two channel gray plus alpha,
> >> CMYK, RGBA, HSV, YCbCr, and more exotic formats). JPEG is tricky to
> >> map to CSV since it is a three channel format (YCbCr), where the
> >> channels are typically not at the same resolution. Usually Y is full
> >> size and the Cb and Cr channels are one quarter size ("4:2:0 chroma
> >> subsampling"). The unequal size of the channels does not lend itself
> >> to CSV, but I can't say it's impossible.
> > Examine prior art, and I truly do mean art, from Matt Parker:
> > https://www.youtube.com/watch?v=UBX2QQHlQ_I
>
> His spreadsheet is a PPM file, not a JPEG. You can tell because all of
> the cells are the same size.
>
> He also ignores vector graphics when considering digital images. Often
> they are rendered in what he calls "spreadsheets" but not always. I have
> a Vectrex, for example.
>
> Elijah
> --
> then there's typewriter art with non-square "pixels"

Ah, I remember playing around with line printer art. We mostly had
Epsons and IBMs that did have some measure of graphical capabilities,
but it was WAY faster to print text, so we sometimes did things the
hacky and elegant way instead.

ChrisA

Re: XML Considered Harmful

2021-09-25 Thread Eli the Bearded

In comp.lang.python, Chris Angelico   wrote:
> Eli the Bearded <*@eli.users.panix.com> wrote:
>> I'd use one of the netpbm formats instead of JPEG. PBM for one bit
>> bitmaps, PGM for one channel (typically grayscale), PPM for three
>> channel RGB, and PAM for anything else (two channel gray plus alpha,
>> CMYK, RGBA, HSV, YCbCr, and more exotic formats). JPEG is tricky to
>> map to CSV since it is a three channel format (YCbCr), where the
>> channels are typically not at the same resolution. Usually Y is full
>> size and the Cb and Cr channels are one quarter size ("4:2:0 chroma
>> subsampling"). The unequal size of the channels does not lend itself
>> to CSV, but I can't say it's impossible.
> Examine prior art, and I truly do mean art, from Matt Parker:
> https://www.youtube.com/watch?v=UBX2QQHlQ_I

His spreadsheet is a PPM file, not a JPEG. You can tell because all of
the cells are the same size.
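
For what it's worth, the plain-text variant of PPM ("P3") really is close to a spreadsheet of numbers: a magic number, the width and height, the maximum sample value, then red/green/blue triples. A minimal sketch that writes one; the function name and the sample pixels are illustrative.

```python
def to_plain_ppm(pixels, maxval=255):
    """Render rows of (r, g, b) tuples as a plain-text (P3) PPM image."""
    height = len(pixels)
    width = len(pixels[0])
    lines = ["P3", f"{width} {height}", str(maxval)]
    for row in pixels:
        # One text line per image row keeps the grid visible in an editor.
        lines.append(" ".join(f"{r} {g} {b}" for r, g, b in row))
    return "\n".join(lines) + "\n"

image = [[(255, 0, 0), (0, 0, 255)]]  # one row: a red pixel, then a blue one
print(to_plain_ppm(image), end="")
```

Every sample has the same resolution here, which is what makes the row/column layout work.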

He also ignores vector graphics when considering digital images. Often
they are rendered in what he calls "spreadsheets" but not always. I have
a Vectrex, for example.

Elijah
--
then there's typewriter art with non-square "pixels"

Re: XML Considered Harmful

2021-09-25 Thread dn via Python-list

On 26/09/2021 10.48, 2qdxy4rzwzuui...@potatochowder.com wrote:
> On 2021-09-25 at 15:20:19 -0500,
> "Michael F. Stemper"  wrote:
> 
>> ... For instance, if
>> I modeled a fuel like this:
>>
>>   
>> ton
>> 21.96
>> 18.2
>>   
...


> Disclaimer:  I am not a big XML fan, for a number of reasons
> already stated in this thread.
> 
> That said, please do include units in elements like heat_content,
> whether or not it's Joules/kilogram/K, and price, even if it is the
> local currency in the only country to which your data applies.
> If there's a standard for your industry, or your company, or on
> some other level, then at least document what it is and that
> you're using it, so that the next person (which may be you a
> year from now) doesn't have to guess.

+1
*always* add unit attributes
-- 
Regards,
=dn

Re: XML Considered Harmful

2021-09-25 Thread dn via Python-list

On 26/09/2021 10.07, Stefan Ram wrote:
> "Michael F. Stemper"  writes:
>>   fitting hierarchical
>> data into rows/columns just seems wrong
> 
>   There were hierarchical database management systems like
>   IMS by IBM based on that point of view. Today, almost all
>   hierarchical data that is stored in databases is stored
>   in relational databases. Maybe, the relational model has
>   proven superior to the hierarchical data model after all.

Back in the days of mainframes (and when the Flintstones was 'filmed
before a live studio audience') hierarchical DBs were considerably
faster than RDBMS. Because of this, we used to take a daily 'snapshot'
of the transaction DBs (in IMS) and make a 'copy' as DB2 relational DBs,
which were (supposedly) used for MIS (Management Information Systems -
as distinct from TPS (Transaction Processing Systems)).

These days RDBMS are (a lot!) faster - much of which would be better
expressed as: the hardware these days is a lot faster. Therefore an
RDBMS is sufficiently responsive, and we no longer need to maintain
separate, 'parallel' systems (and multiple mainframes)!

Cue: NoSQL justifications...

Today's best example of an hierarchical DB is probably LDAP. It is most
commonly used within the 'directory' of communications systems, eg
email. Such waters muddied considerably by MSFT's attempts to 'improve'
international 'standards' and integrate AD with Exchange (so don't go
there!).

There have been some well-engineered systems based on LDAP, eg
organisational/personnel and part/component break-downs.

That said, unless looking at something such as just-mentioned,
overlaying hierarchy onto 3NF and using an RDBMS would be my first
thought - but because of the recursive JOINs, I recommend something more
capable than SQLite.
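
A minimal sketch of a hierarchy overlaid on a single relational table and walked with a recursive query. It uses the standard-library sqlite3 module only because it needs no setup; that is not a claim about how SQLite compares with a larger RDBMS for this kind of query. The table, the column names, and the part names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE part ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " parent_id INTEGER REFERENCES part(id))"
)
conn.executemany(
    "INSERT INTO part VALUES (?, ?, ?)",
    [(1, "generator", None), (2, "rotor", 1), (3, "stator", 1), (4, "winding", 3)],
)

def descendants(conn, root_id):
    """Return (name, depth) for the root part and everything below it."""
    rows = conn.execute(
        """
        WITH RECURSIVE tree(id, name, depth) AS (
            SELECT id, name, 0 FROM part WHERE id = ?
            UNION ALL
            SELECT p.id, p.name, t.depth + 1
            FROM part AS p JOIN tree AS t ON p.parent_id = t.id
        )
        SELECT name, depth FROM tree ORDER BY depth, name
        """,
        (root_id,),
    )
    return rows.fetchall()

print(descendants(conn, 1))
```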
-- 
Regards,
=dn

Re: XML Considered Harmful

2021-09-25 Thread 2QdxY4RzWzUUiLuE

On 2021-09-25 at 15:20:19 -0500,
"Michael F. Stemper"  wrote:

> ... For instance, if
> I modeled a fuel like this:
> 
>   
> ton
> 21.96
> 18.2
>   
> 
> and a generating unit like this:
> 
>   
> 
>   
>   
>   
>   
> 
> 
>   
>   
>   
>   
> 
>   
> 
> why would the fact that I could have chosen, instead, to model
> the unit of measure as an attribute of the fuel, or its name
> as a sub-element matter? Once the modeling decision has been
> made, all of the decisions that might have been would seem to
> be irrelevant.

Disclaimer:  I am not a big XML fan, for a number of reasons
already stated in this thread.

That said, please do include units in elements like heat_content,
whether or not it's Joules/kilogram/K, and price, even if it is the
local currency in the only country to which your data applies.
If there's a standard for your industry, or your company, or on
some other level, then at least document what it is and that
you're using it, so that the next person (which may be you a
year from now) doesn't have to guess.

You also never know when someone else on the other side of the
planet will notice your work and try to duplicate it and/or
update it (again, even if it's you).  The fewer assumptions
that person has to make, the better.

RE: XML Considered Harmful

2021-09-25 Thread Avi Gross via Python-list

Michael,

I don't care what you choose. Whatever works is fine for an internal use.

But is the data scheme you share representative of your actual application?

From what I see below, unless the number of "point" variables is not always
exactly four, the application might be handled well by any format that
handles rectangular data, perhaps even CSV.

I mean anything like a data.frame can contain data columns like
p1,p2,p3,p4 and a categorical one like IHRcurve_name.

Or do you have a need for more variability such as an undetermined number of
similar units in ways that might require more flexibility or be more
efficient done another way?

MOST of the discussion I am seeing here seems peripheral to getting you what
you need for your situation and may require a learning curve to learn to use
properly. Are you planning on worrying about how to ship your data
encrypted, for example? Any file format you use for storage can presumably
be encrypted and sent and decrypted if that matters.

So, yes, from an abstract standpoint we can discuss the merits of various
approaches. If it matters that humans can deal with your data in a file or
that it be able to be imported into a program like EXCEL, those are
considerations. But if not, there are quite a few relatively binary formats
where your program can save a snapshot of the data into a file and read it
back in next time. I often do that in another language that lets me share
variables, including nested components such as the complex structures that
come out of a statistical analysis or the components needed to make one or
more graphs later. If you write the program that creates the darn things as
well as the one that later reads them back in, you can do what you want.

Or, did I miss something and others have already produced the data using
other tools, in which case you have to read it in at least once?

-----Original Message-----
From: Python-list  On
Behalf Of Michael F. Stemper
Sent: Saturday, September 25, 2021 4:20 PM
To: python-list@python.org
Subject: Re: XML Considered Harmful

On 21/09/2021 13.12, Michael F. Stemper wrote:

> If XML is not the way to package data, what is the recommended 
> approach?

Well, there have been a lot of ideas put forth on this thread, many more
than I expected. I'd like to thank everyone who took the time to contribute.

Most of the reasons given for avoiding XML appear to be along the lines of
"XML has all of these different options that it supports."

However, it seems that I could ignore 99% of those things and just use a
teeny subset of its capabilities. For instance, if I modeled a fuel like
this:

 ton
 21.96
 18.2

and a generating unit like this:

why would the fact that I could have chosen, instead, to model the unit of
measure as an attribute of the fuel, or its name as a sub-element matter?
Once the modeling decision has been made, all of the decisions that might
have been would seem to be irrelevant.

Some years back, IEC's TC57 came up with CIM[1]. This nailed down a lot of
decisions. The fact that other decisions could have been made doesn't seem
to keep utilities from going forward with it as an enterprise-wide data
model.

My current interests are not anywhere so expansive, but it seems that the
situations are at least similar:
1. Look at an endless range of options for a data model.
2. Pick one.
3. Run with it.

To clearly state my (revised) question:

   Why does the existence of XML's many options cause a problem
   for my use case?

Other reactions:

Somebody pointed out that some approaches would require that I climb a
learning curve. That's appreciated, although learning new things is always
good.

NestedText looks cool, and a lot like YAML. Having not gotten around to
playing with YAML yet, I was surprised to learn that it tries to guess data
types. This sounds as if it could lead to the same type of problems that led
to the names of some genes being turned into dates.

It was suggested that I use an RDBMS, such as sqlite3, for the input data.
I've used sqlite3 for real-time data exchange between concurrently-running
programs. However, I don't see syntax like:

sqlite> INSERT INTO Fuels
...> (name,uom,price,heat_content)
...> VALUES ('Montana Sub-Bituminous', 'ton', 21.96, 13.65);

as being nearly as readable as the XML that I've sketched above.
Yeah, I could write a program to do this, but that doesn't really change
anything, since I'd still need to get the data into the program.

(Changing a value would be even worse, requiring the dreaded UPDATE
statement, instead of five seconds in vi.)

Many of the problems listed for CSV, which come from its lack of
standardization, seem similar to those given for XML. "Commas or tabs?" "How
are new

Re: XML Considered Harmful

2021-09-25 Thread Michael F. Stemper


On 21/09/2021 13.12, Michael F. Stemper wrote:


If XML is not the way to package data, what is the recommended
approach?


Well, there have been a lot of ideas put forth on this thread,
many more than I expected. I'd like to thank everyone who
took the time to contribute.

Most of the reasons given for avoiding XML appear to be along
the lines of "XML has all of these different options that it
supports."

However, it seems that I could ignore 99% of those things and
just use a teeny subset of its capabilities. For instance, if
I modeled a fuel like this:

  
ton
21.96
18.2
  

and a generating unit like this:

  

  
  
  
  


  
  
  
  

  

why would the fact that I could have chosen, instead, to model
the unit of measure as an attribute of the fuel, or its name
as a sub-element matter? Once the modeling decision has been
made, all of the decisions that might have been would seem to
be irrelevant.
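
A minimal sketch of reading such a "teeny subset" with the standard library, using nothing beyond elements, one attribute, and text. The tag names are an assumption borrowed from the column names used elsewhere in this thread (name, uom, price, heat_content), and the fuel name is made up; this is not the markup that was originally posted.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tag names and the fuel name are illustrative only.
FUEL_XML = """
<fuel name="Example Coal">
  <uom>ton</uom>
  <price>21.96</price>
  <heat_content>18.2</heat_content>
</fuel>
"""

def load_fuel(xml_text):
    """Turn one fuel element into a plain dict with numeric fields converted."""
    elem = ET.fromstring(xml_text)
    return {
        "name": elem.get("name"),
        "uom": elem.findtext("uom"),
        "price": float(elem.findtext("price")),
        "heat_content": float(elem.findtext("heat_content")),
    }

print(load_fuel(FUEL_XML))
```

Moving `name` into a sub-element, or `uom` into an attribute, would change two lines of `load_fuel` and nothing else.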

Some years back, IEC's TC57 came up with CIM[1]. This nailed down
a lot of decisions. The fact that other decisions could have been
made doesn't seem to keep utilities from going forward with it as
an enterprise-wide data model.

My current interests are not anywhere so expansive, but it seems
that the situations are at least similar:
1. Look at an endless range of options for a data model.
2. Pick one.
3. Run with it.

To clearly state my (revised) question:

  Why does the existence of XML's many options cause a problem
  for my use case?


Other reactions:

Somebody pointed out that some approaches would require that I
climb a learning curve. That's appreciated, although learning
new things is always good.

NestedText looks cool, and a lot like YAML. Having not gotten
around to playing with YAML yet, I was surprised to learn that it
tries to guess data types. This sounds as if it could lead to the
same type of problems that led to the names of some genes being
turned into dates.

It was suggested that I use an RDBMS, such as sqlite3, for the
input data. I've used sqlite3 for real-time data exchange between
concurrently-running programs. However, I don't see syntax like:

sqlite> INSERT INTO Fuels
   ...> (name,uom,price,heat_content)
   ...> VALUES ('Montana Sub-Bituminous', 'ton', 21.96, 13.65);

as being nearly as readable as the XML that I've sketched above.
Yeah, I could write a program to do this, but that doesn't really
change anything, since I'd still need to get the data into the
program.

(Changing a value would be even worse, requiring the dreaded
UPDATE statement, instead of five seconds in vi.)
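
For comparison, a minimal sketch of the same insert and a later price change driven from Python's sqlite3 module. The column types are assumptions, and the new price is made up; changing a value in SQL is done with UPDATE ... SET.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Fuels ("
    " name TEXT PRIMARY KEY, uom TEXT, price REAL, heat_content REAL)"
)
# Placeholders keep the values out of the SQL text.
conn.execute(
    "INSERT INTO Fuels (name, uom, price, heat_content) VALUES (?, ?, ?, ?)",
    ("Montana Sub-Bituminous", "ton", 21.96, 13.65),
)
# Changing a value afterwards:
conn.execute(
    "UPDATE Fuels SET price = ? WHERE name = ?",
    (22.50, "Montana Sub-Bituminous"),
)
price = conn.execute(
    "SELECT price FROM Fuels WHERE name = ?", ("Montana Sub-Bituminous",)
).fetchone()[0]
print(price)
```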

Many of the problems listed for CSV, which come from its lack of
standardization, seem similar to those given for XML. "Commas
or tabs?" "How are new-lines represented?" If I was to use CSV,
I'd be able to just pick answers. However, fitting hierarchical
data into rows/columns just seems wrong, so I doubt that I'll
end up going that way.

As far as disambiguating authors, I believe that most journals
are now expecting an ORCID[2] (which doesn't help with papers
published before that came around).

As far as use of XML to store program state, I wouldn't ever
consider that. As noted above, I've used an RDBMS to do so.
It handles all of the concurrency issues for me. The current use
case is specifically for raw, static input.

Fascinating to find out that XML was originally designed to
mark up text, especially legal text.

It was nice to be reminded of what Matt Parker looked like when
he had hair.


[1] 
[2] 
--
Michael F. Stemper
Psalm 82:3-4

Re: XML Considered Harmful

2021-09-25 Thread Jon Ribbens via Python-list

On 2021-09-25, Peter J. Holzer  wrote:
> On 2021-09-24 23:32:47 -, Jon Ribbens via Python-list wrote:
>> JSON Schema provides a way to denote composite types.
>
> I probably wasn't clear what I meant. In XML, every element has a tag,
> which is basically its type. So by looking at an XML file (without
> reference to a schema) you can tell what each element is. And a
> validator can say something like "expected a 'product' or 'service'
> element here but found a 'person'".
>
> In JSON everything is just an object or a list. You may guess that an
> object with a field "product_id" is a product, but is one with "name":
> "Billy" a person or a piece of furniture?
>
> I'm not familiar with JSON schema (I know that it exists and I've read a
> tutorial or two but I've never used it in a real project), but as far as
> I know it doesn't change that. It describes the structure of a JSON
> document but it doesn't add type information to that document. So a
> validator can at best guess what the malformed thing it just found was
> supposed to be.

JSON Schema absolutely does change that. You can create named types
and specify where they may appear in the document. With a well-defined
schema you do not need to make any guesses about what type something is.
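
A minimal sketch of what named types look like in a schema, written as a Python dict. It uses the `$defs` and `$ref` keywords from the 2020-12 draft of JSON Schema; the type names and fields are made up, and checking a document against the schema needs a separate validator, which is not shown here.

```python
import json

# Hypothetical schema: "product" and "person" are named once under $defs
# and then referenced wherever they may appear in the document.
SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$defs": {
        "product": {
            "type": "object",
            "properties": {
                "product_id": {"type": "integer"},
                "name": {"type": "string"},
            },
            "required": ["product_id", "name"],
            "additionalProperties": False,
        },
        "person": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string"},
            },
            "required": ["name", "email"],
            "additionalProperties": False,
        },
    },
    "type": "object",
    "properties": {
        "items": {"type": "array", "items": {"$ref": "#/$defs/product"}},
        "contact": {"$ref": "#/$defs/person"},
    },
    "required": ["items", "contact"],
}

def named_types(schema):
    """List the type names the schema defines."""
    return sorted(schema.get("$defs", {}))

print(named_types(SCHEMA))
print(json.dumps(SCHEMA["properties"], indent=2))
```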

Re: XML Considered Harmful

2021-09-25 Thread Karsten Hilbert

Am Fri, Sep 24, 2021 at 08:59:23PM +0200 schrieb Peter J. Holzer:

> JSON: Has a few primitive data types (bool, number, string) and two
> compound types (list, dict(string -> any)). Still missing many
> frequently used data types (e.g. dates)

But that (dates) at least has a well-known mapping to string,
which makes it usable within JSON.
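
A minimal sketch of that mapping, assuming the "well-known" form is the ISO 8601 calendar date (YYYY-MM-DD). The key name is made up. Note that the receiver still has to know which strings are dates; JSON itself does not say.

```python
import json
from datetime import date

def encode_extra(obj):
    """Called by json.dumps for objects it cannot serialise on its own."""
    if isinstance(obj, date):
        return obj.isoformat()
    raise TypeError(f"cannot serialise {type(obj).__name__}")

text = json.dumps({"sent": date(2021, 9, 25)}, default=encode_extra)
restored = date.fromisoformat(json.loads(text)["sent"])
print(text, restored)
```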

Karsten
--
GPG  40BE 5B0E C98E 1713 AFA6  5BC0 3BEA AC80 7D4F C89B

Re: XML Considered Harmful

2021-09-25 Thread Peter J. Holzer

On 2021-09-24 23:32:47 -, Jon Ribbens via Python-list wrote:
> On 2021-09-24, Chris Angelico  wrote:
> > On Sat, Sep 25, 2021 at 8:53 AM dn via Python-list
> > wrote:
> >> On 25/09/2021 06.59, Peter J. Holzer wrote:
> >> > CSV: Good for tabular data of a single data type (strings). As soon as
> >> > there's a second data type (numbers, dates, ...) you leave standard
> >> > territory and are into "private agreements".
> 
> CSV is not good for strings, as there is no one specification of how to
> encode things like newlines and commas within the strings, so you may
> find that your CSV data transfer fails or even silently corrupts data.

Those two cases are actually pretty straightforward: Just enclose the
field in quotes.

Handling quotes is less standardized. I think doubling quotes is much more
common than an escape character, but I've certainly seen both.
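
Both quoting conventions can be tried with the standard csv module. A minimal sketch; the helper names and the sample row are illustrative.

```python
import csv
import io

def to_csv_line(fields, **fmt):
    """Write one record with the given csv formatting parameters."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n", **fmt).writerow(fields)
    return buf.getvalue()

def from_csv_line(text, **fmt):
    """Read the first record back using the same parameters."""
    return next(csv.reader(io.StringIO(text, newline=""), **fmt))

row = ["Smith, J.", 'said "hi"', "two\nlines"]

# Convention 1: fields are quoted and embedded quotes are doubled (the default).
doubled = to_csv_line(row)

# Convention 2: embedded quotes are prefixed with an escape character.
escaped = to_csv_line(row, doublequote=False, escapechar="\\")

print(doubled)
print(escaped)
```

The two outputs differ, so a reader configured for one convention can misread a file written with the other.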

But if you get down to it, the problems with CSV start at a much lower
level:

1) The encoding is not defined. These days UTF-8 (with or without BOM)
is pretty common, but I still regularly get files in Windows-1252
encoding and occasionally something else.

2) The record separator isn't defined. CRLF is most common, followed by
   LF. But just recently I got a file with CR (Does Eurostat still use
   some Macs with MacOS 9?)

3) The field separator isn't defined. Officially the format is known as
   "comma separated values", but in my neck of the woods it's actually
   semicolon-separated in the vast majority of cases.

So even for the most simple files there are three parameters the sender
and the receiver have to agree on.
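
A minimal sketch that makes those three parameters explicit when reading with the standard csv module. The function name and the sample data are made up.

```python
import csv
import io

def read_table(raw_bytes, encoding, delimiter):
    """Decode raw bytes and parse them as delimiter-separated records."""
    text = raw_bytes.decode(encoding)
    # newline="" passes CR, LF and CRLF through untranslated, so the csv
    # module decides where each record ends.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))

# Semicolon-separated, CRLF-terminated, Windows-1252 encoded sample.
raw = "Name;Preis\r\nMüller;3,50\r\n".encode("cp1252")
rows = read_table(raw, encoding="cp1252", delimiter=";")
print(rows)
```

For UTF-8 files that may start with a BOM, the "utf-8-sig" codec drops the BOM while decoding.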

> >> > JSON: Has a few primitive data types (bool, number, string) and two
> >> > compound types (list, dict(string -> any)). Still missing many
> >> > frequently used data types (e.g. dates) and has no standard way to
> >> > denote composite types. But it's simple and if it's sufficient for your
> >> > needs, use it.
> 
> JSON Schema provides a way to denote composite types.

I probably wasn't clear what I meant. In XML, every element has a tag,
which is basically its type. So by looking at an XML file (without
reference to a schema) you can tell what each element is. And a
validator can say something like "expected a 'product' or 'service'
element here but found a 'person'".

In JSON everything is just an object or a list. You may guess that an
object with a field "product_id" is a product, but is one with "name":
"Billy" a person or a piece of furniture?

I'm not familiar with JSON schema (I know that it exists and I've read a
tutorial or two but I've never used it in a real project), but as far as
I know it doesn't change that. It describes the structure of a JSON
document but it doesn't add type information to that document. So a
validator can at best guess what the malformed thing it just found was
supposed to be.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


Re: XML Considered Harmful

2021-09-24 Thread Greg Ewing


On 25/09/21 10:51 am, dn wrote:

XML: Originally invented for text markup, and that shows. Can represent
different types (via tags), can define those types (via DTD and/or
schemas), can identify schemas in a globally-unique way and you can mix
them all in a single document (and there are tools available to validate
your files). But those features make it very complex


And for all that complexity, it still doesn't map very well
onto the kinds of data structures used inside programs (lists,
structs, etc.), so you end up having to build those structures
on top of it, and everyone does that in a different way.

--
Greg

Re: XML Considered Harmful

2021-09-24 Thread Greg Ewing


On 25/09/21 11:00 am, Chris Angelico wrote:

On Sat, Sep 25, 2021 at 8:53 AM dn via Python-list
 wrote:


and YAML?


Invented because there weren't enough markup languages, so we needed another?


There were *too many* markup languages, so we invented another!

--
Greg

Re: XML Considered Harmful

2021-09-24 Thread Greg Ewing


On 25/09/21 6:34 am, Peter J. Holzer wrote:

Several hundred genes were recently renamed because Excel was unable to
read their names as simply strings and insisted on interpreting them as
something else (e.g. dates).


Another fun one I've come across is interpreting phone numbers
as floating point and writing them out again with exponents...
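
A minimal sketch of how that happens, mimicking a tool that treats every numeric-looking cell as a float and prints it with default precision. The function name and the phone number are made up.

```python
def spreadsheet_style(cell):
    """Convert numeric-looking text to float and format it with %g."""
    try:
        return "%g" % float(cell)
    except ValueError:
        return cell  # not numeric, so it is left alone

print(spreadsheet_style("61412345678"))
print(spreadsheet_style("Greg"))
```

Once the digits have been rounded away, writing the value back out cannot recover them; keeping such fields as strings avoids the problem.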

--
Greg

Re: XML Considered Harmful

2021-09-24 Thread Greg Ewing


On 25/09/21 6:29 am, Peter J. Holzer wrote:

don't forget that
XML was intended to replace SGML, and that SGML was intended to mark up
text, not represent any data.


And for me this is the number one reason why XML is the wrong
tool for almost everything it's used for nowadays.

It's bizarre. It's as though there were a large community of
professional builders who insisted on using hammers to drive
screws, and extolled the advantages of doing so.

--
Greg


Re: XML Considered Harmful

2021-09-24 Thread Jon Ribbens via Python-list

On 2021-09-24, Chris Angelico  wrote:
> On Sat, Sep 25, 2021 at 8:53 AM dn via Python-list
> wrote:
>> On 25/09/2021 06.59, Peter J. Holzer wrote:
>> > CSV: Good for tabular data of a single data type (strings). As soon as
>> > there's a second data type (numbers, dates, ...) you leave standard
>> > territory and are into "private agreements".

CSV is not good for strings, as there is no one specification of how to
encode things like newlines and commas within the strings, so you may
find that your CSV data transfer fails or even silently corrupts data.

>> > JSON: Has a few primitive data types (bool, number, string) and two
>> > compound types (list, dict(string -> any)). Still missing many
>> > frequently used data types (e.g. dates) and has no standard way to
>> > denote composite types. But it's simple and if it's sufficient for your
>> > needs, use it.

JSON Schema provides a way to denote composite types.

>> > XML: Originally invented for text markup, and that shows. Can represent
>> > different types (via tags), can define those types (via DTD and/or
>> > schemas), can identify schemas in a globally-unique way and you can mix
>> > them all in a single document (and there are tools available to validate
>> > your files). But those features make it very complex (you almost
>> > certainly don't want to write your own parser) and you really have to
>> > understand the data model (especially namespaces) to use it.
>>
>> and YAML?
>
> Invented because there weren't enough markup languages, so we needed
> another?

Invented as a drunken bet that got out of hand, and used by people who
don't realise this.

Re: XML Considered Harmful

2021-09-24 Thread David L Neil via Python-list

On 25/09/2021 11.00, Chris Angelico wrote:

> Invented because there weren't enough markup languages, so we needed another?

Anything You Can Do I Can Do Better
https://www.youtube.com/watch?v=_UB1YAsPD6U

-- 
Regards =dn

Re: XML Considered Harmful

2021-09-24 Thread Chris Angelico

On Sat, Sep 25, 2021 at 8:53 AM dn via Python-list
 wrote:
>
> On 25/09/2021 06.59, Peter J. Holzer wrote:
> > There are a gazillion formats and depending on your needs one of them
> > might be perfect. Or you may have to define your own bespoke format (I
> > mean, nobody (except Matt Parker) tries to represent images or videos as
> > CSVs: There's PNG and JPEG and WEBP and H.264 and AV1 and whatever for
> > that).
> >
> > Of the three formats discussed here my take is:
> >
> > CSV: Good for tabular data of a single data type (strings). As soon as
> > there's a second data type (numbers, dates, ...) you leave standard
> > territory and are into "private agreements".
> >
> > JSON: Has a few primitive data types (bool, number, string) and two
> > compound types (list, dict(string -> any)). Still missing many
> > frequently used data types (e.g. dates) and has no standard way to
> > denote composite types. But it's simple and if it's sufficient for your
> > needs, use it.
> >
> > XML: Originally invented for text markup, and that shows. Can represent
> > different types (via tags), can define those types (via DTD and/or
> > schemas), can identify schemas in a globally-unique way and you can mix
> > them all in a single document (and there are tools available to validate
> > your files). But those features make it very complex (you almost
> > certainly don't want to write your own parser) and you really have to
> > understand the data model (especially namespaces) to use it.
>
> and YAML?

Invented because there weren't enough markup languages, so we needed another?

ChrisA

Re: XML Considered Harmful

2021-09-24 Thread dn via Python-list

On 25/09/2021 06.59, Peter J. Holzer wrote:
> There are a gazillion formats and depending on your needs one of them
> might be perfect. Or you may have to define your own bespoke format (I
> mean, nobody (except Matt Parker) tries to represent images or videos as
> CSVs: There's PNG and JPEG and WEBP and H.264 and AV1 and whatever for
> that).
> 
> Of the three formats discussed here my take is:
> 
> CSV: Good for tabular data of a single data type (strings). As soon as
> there's a second data type (numbers, dates, ...) you leave standard
> territory and are into "private agreements".
> 
> JSON: Has a few primitive data types (bool, number, string) and two
> compound types (list, dict(string -> any)). Still missing many
> frequently used data types (e.g. dates) and has no standard way to
> denote composite types. But it's simple and if it's sufficient for your
> needs, use it.
> 
> XML: Originally invented for text markup, and that shows. Can represent
> different types (via tags), can define those types (via DTD and/or
> schemas), can identify schemas in a globally-unique way and you can mix
> them all in a single document (and there are tools available to validate
> your files). But those features make it very complex (you almost
> certainly don't want to write your own parser) and you really have to
> understand the data model (especially namespaces) to use it.

and YAML?
-- 
Regards,
=dn
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-24 Thread Peter J. Holzer

On 2021-09-21 13:12:10 -0500, Michael F. Stemper wrote:
> I read this page right when I was about to write an XML parser
> to get data into the code for a research project I'm working on.
> It seems to me that XML is the right approach for this sort of
> thing, especially since the data is hierarchical in nature.
> 
> Does the advice on that page mean that I should find some other
> way to get data into my programs, or does it refer to some kind
> of misuse/abuse of XML for something that it wasn't designed
> for?
> 
> If XML is not the way to package data, what is the recommended
> approach?

There are a gazillion formats and depending on your needs one of them
might be perfect. Or you may have to define your own bespoke format (I
mean, nobody (except Matt Parker) tries to represent images or videos as
CSVs: There's PNG and JPEG and WEBP and H.264 and AV1 and whatever for
that).

Of the three formats discussed here my take is:

CSV: Good for tabular data of a single data type (strings). As soon as
there's a second data type (numbers, dates, ...) you leave standard
territory and are into "private agreements".

JSON: Has a few primitive data types (bool, number, string) and two
compound types (list, dict(string -> any)). Still missing many
frequently used data types (e.g. dates) and has no standard way to
denote composite types. But it's simple and if it's sufficient for your
needs, use it.

XML: Originally invented for text markup, and that shows. Can represent
different types (via tags), can define those types (via DTD and/or
schemas), can identify schemas in a globally-unique way and you can mix
them all in a single document (and there are tools available to validate
your files). But those features make it very complex (you almost
certainly don't want to write your own parser) and you really have to
understand the data model (especially namespaces) to use it.

You can of course represent any data in any format if you jump through
enough hoops, but the real question is "does the data I have fit
naturally within the data model of the format I'm trying to use". If it
doesn't, look for something else. For me, CSV, JSON and XML form a
hierarchy where each can naturally represent all the data of its
predecessors, but not vice versa.
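
The difference in data models shows up as soon as a record is round-tripped. A minimal sketch using only the standard library; the record and its field names are invented for illustration:

```python
import csv
import datetime
import io
import json

# Hypothetical record: the field names are made up for this example.
record = {"name": "unit1", "capacity_mw": 250.5, "online": True}

# CSV round trip: every value comes back as a string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buf.seek(0)
from_csv = next(csv.DictReader(buf))

# JSON round trip: bool, number and string survive unchanged.
from_json = json.loads(json.dumps(record))

# JSON has no date type, so the standard encoder refuses a date object.
try:
    json.dumps({"commissioned": datetime.date(2021, 9, 24)})
    date_rejected = False
except TypeError:
    date_rejected = True

print(from_csv)
print(from_json)
print(date_rejected)
```

Anything beyond strings in the CSV case, and anything beyond the JSON primitives in the JSON case, needs a private agreement between writer and reader.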

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"

signature.asc
Description: PGP signature
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-24 Thread Peter J. Holzer

On 2021-09-23 06:53:10 -0600, Mats Wichmann wrote:
> The problem with csv is that a substantial chunk of the world seems to
> live inside Excel,

This is made so much worse by Excel being exceptionally bad at reading
CSV.

Several hundred genes were recently renamed because Excel was unable to
read their names simply as strings and insisted on interpreting them as
something else (e.g. dates).

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-24 Thread Peter J. Holzer

On 2021-09-21 19:46:19 -0700, Dan Stromberg wrote:
> On Tue, Sep 21, 2021 at 7:26 PM Michael F. Stemper <
> michael.stem...@gmail.com> wrote:
> > If XML is not the way to package data, what is the recommended
> > approach?
> >
> 
> I prefer both JSON and YAML over XML.
> 
> XML has both elements and tags, but it didn't really need both.

I think you meant "both elements and attributes". Tags are how you
denote elements, so they naturally go together.

I agree that for representing data (especially object-oriented data) the
distinction between (sub-)elements and attributes seems moot (should I
represent that field as an attribute or a sub-element?), but don't forget
that XML was intended to replace SGML, and that SGML was intended to mark
up text, not to represent arbitrary data.

Would you really want to write

Mr. Smiths point was corroborated by
Ms. Jones point that bla, bla, which
seemed more plausibe than Mr. Willam
claim that blub, blub.

as

Mr. Smiths point was corroborated by
Ms. Jones point that bla, bla, which
seemed more plausibe than Mr. Willam
claim that blub, blub.

or

Mr. Smith<(defendant>s point was
corroborated by Ms. Jones point that bla,
bla, which seemed more plausibe than Mr. 
Willam claim that blub,
blub.

?

I probably chose an example (no doubt influenced by the fact that SGML
was originally invented to digitize court decisions) which is too simple
(in HTML I often see many attributes on a single element, even with
CSS), but even here you can see that attributes add clarity.
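
For text markup the distinction can be seen with the standard library's ElementTree. This is only a sketch: the element and attribute names (person, role, name) are invented here and are not the markup from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are made up for illustration.
as_attribute = ET.fromstring(
    '<p>Mr. <person role="defendant">Smith</person> disagreed.</p>'
)
as_subelement = ET.fromstring(
    "<p>Mr. <person><role>defendant</role><name>Smith</name></person> disagreed.</p>"
)

# With an attribute, the running text stays intact and the role is metadata.
text_attr = "".join(as_attribute.itertext())
role_attr = as_attribute.find("person").get("role")

# With sub-elements, the metadata ends up inside the text content.
text_sub = "".join(as_subelement.itertext())

print(text_attr)   # Mr. Smith disagreed.
print(role_attr)   # defendant
print(text_sub)    # Mr. defendantSmith disagreed.
```

In marked-up prose, attributes keep annotations out of the readable text; in pure data records that separation matters much less.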

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-24 Thread Mostowski Collapse



Or then use cryptographic methods to protect your XML
file when in transit. Like encryption and/or signatures.

Mostowski Collapse wrote on Friday, 24 September 2021 at 15:46:27 UTC+2:
> BTW: I think it's problematic to associate Java with XML.
> Michael F. Stemper wrote on Tuesday, 21 September 2021 at 20:12:33 UTC+2:
> > On the prolog thread, somebody posted a link to:
> >
> The above link is very old, from 2004, and might apply to
> how Java presented itself back in those days. But since
> the Jigsaw project, XML has practically left Java.
>
> None of it is part of the javax.* or java.* namespaces any more;
> Oracle got rid of the XML technologies housed in these
> namespaces, and there is now the jakarta.* namespace.
>
> Example JAXB: 
> Jakarta XML Binding (JAXB; formerly Java Architecture for XML Binding) 
> https://de.wikipedia.org/wiki/Jakarta_XML_Binding 
> 
> If I remember well, also XML never went into the Java 
> Language Specification, unlike the Scala programming 
> language, where you can have XML literals: 
> 
> XML literals in scala 
> https://tuttlem.github.io/2015/02/24/xml-literals-in-scala.html 
> 
> An easy protection against tampered XML data vulnerabilities 
> is DTD or some other XML schema language. It can at least catch 
> problems that are in the scope of the schema language.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-24 Thread Mostowski Collapse

BTW: I think it's problematic to associate Java with XML.

Michael F. Stemper wrote on Tuesday, 21 September 2021 at 20:12:33 UTC+2:
> On the prolog thread, somebody posted a link to:
>

The above link is very old, from 2004, and might apply to
how Java presented itself back in those days. But since
the Jigsaw project, XML has practically left Java.

None of it is part of the javax.* or java.* namespaces any more;
Oracle got rid of the XML technologies housed in these
namespaces, and there is now the jakarta.* namespace.

Example JAXB:
Jakarta XML Binding (JAXB; formerly Java Architecture for XML Binding) 
https://de.wikipedia.org/wiki/Jakarta_XML_Binding

If I remember correctly, XML also never went into the Java
Language Specification, unlike the Scala programming
language, where you can have XML literals:

XML literals in scala
https://tuttlem.github.io/2015/02/24/xml-literals-in-scala.html
 
An easy protection against tampered XML data vulnerabilities 
is DTD or some other XML schema language. It can at least catch
problems that are in the scope of the schema language.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-23 Thread Mike Dewhirst via Python-list

I had to use XML once because that was demanded by the receiving machine
over which I had no say. I wouldn't use it otherwise because staring at it
makes you dizzy.

I would want to know how the data are derived from the multiple sources
and transmitted to the collating platform before pontificating. Then I
would ignore any potential future enhancements and choose the easiest
possible mechanism.

I have used json with python and been delighted at the ease of converting
data into dicts and even arbitrary nesting where data values can also be
dicts etc.

Good luck

--
(Unsigned mail from my phone)

-Original Message-
From: dn via Python-list
Date: 24/9/21 15:42 (GMT+10:00)
To: python-list@python.org
Subject: Re: XML Considered Harmful

On 24/09/2021 14.07, Stefan Ram wrote:
> dn  writes:
>> With that, why not code it as Python expressions, and include the module?
>
>   This might create a code execution vulnerability if such
>   files are exchanged between multiple parties.

The OP's spec, as quoted earlier(!), reads:

"It's my own research, so I can give myself the data in any format that
I like."

Whither "files are exchanged" and/or "multiple parties"? Are these
anticipations of problems that may/won't ever apply? aka YAGNI.

Concern about such an approach *is* warranted.

However, the preceding question to be considered during the design-stage
is: 'does such concern apply?'. The OP describes full and unique agency.
Accordingly, "KISS"!

NB my personal choice would likely be JSON or YAML, but see reservations
(eg @Chris) - and with greater relevance: shouldn't we consider the OP's
'learning curve'?
(such deduced only from OP's subsequent reactions/responses 'here' -
with any and all due apologies)
--
Regards,
=dn
--
https://mail.python.org/mailman/listinfo/python-list
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-23 Thread dn via Python-list

On 24/09/2021 14.07, Stefan Ram wrote:
> dn  writes:
>> With that, why not code it as Python expressions, and include the module?
> 
>   This might create a code execution vulnerability if such 
>   files are exchanged between multiple parties.

The OP's spec, as quoted earlier(!), reads:

"It's my own research, so I can give myself the data in any format that
I like."

Whither "files are exchanged" and/or "multiple parties"? Are these
anticipations of problems that may/won't ever apply? aka YAGNI.

Concern about such an approach *is* warranted.

However, the preceding question to be considered during the design-stage
is: 'does such concern apply?'. The OP describes full and unique agency.
Accordingly, "KISS"!

NB my personal choice would likely be JSON or YAML, but see reservations
(eg @Chris) - and with greater relevance: shouldn't we consider the OP's
'learning curve'?
(such deduced only from OP's subsequent reactions/responses 'here' -
with any and all due apologies)
-- 
Regards,
=dn
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-23 Thread Chris Angelico

On Fri, Sep 24, 2021 at 1:44 PM Dan Stromberg  wrote:
>
>
> On Thu, Sep 23, 2021 at 8:12 PM Chris Angelico  wrote:
>>
>> One good hybrid is to take a subset of Python syntax (so it still
>> looks like a Python script for syntax highlighting etc), and then
>> parse that yourself, using the ast module. For instance, you can strip
>> out comments, then look for "VARNAME = ...", and parse the value using
>> ast.literal_eval(), which will give you a fairly flexible file format
>> that's still quite safe.
>
>
> Restricting Python with the ast module is interesting, but I don't think I'd 
> want to bet my career on the actual safety of such a thing.  Given that Java 
> bytecode was a frequent problem inside web browsers, imagine all the 
> messiness that could accidentally happen with a subset of Python syntax from 
> untrusted sources.
>
> ast.literal_eval might be a little better - or a list of such, actually.

Uhh, I specifically mention literal_eval in there :) Simple text
parsing followed by literal_eval for the bulk of it is a level of
safety that I *would* bet my career on.

> Better still to use JSON or ini format - IOW something designed for the 
> purpose.

It all depends on how human-editable it needs to be. JSON has several
problems in that respect, including some rigidities, and a lack of
support for comments. INI format doesn't have enough data types for
many purposes. YAML might be closer, but it's not for every situation
either.

That's why we have options.
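
The rigidities are easy to demonstrate with the standard json module. A small sketch; the sample strings are invented for illustration:

```python
import json

# Hypothetical snippets a human editor might plausibly type.
samples = {
    "plain": '{"price": 43.581}',
    "comment": '{"price": 43.581}  // per ton',
    "trailing_comma": '{"price": 43.581,}',
    "single_quotes": "{'price': 43.581}",
}

results = {}
for label, text in samples.items():
    try:
        json.loads(text)
        results[label] = "ok"
    except json.JSONDecodeError:
        results[label] = "rejected"

print(results)
```

Only the first sample parses; a comment, a trailing comma, or single quotes is enough to make the strict parser reject the file.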

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-23 Thread Dan Stromberg

On Thu, Sep 23, 2021 at 8:12 PM Chris Angelico  wrote:

> One good hybrid is to take a subset of Python syntax (so it still
> looks like a Python script for syntax highlighting etc), and then
> parse that yourself, using the ast module. For instance, you can strip
> out comments, then look for "VARNAME = ...", and parse the value using
> ast.literal_eval(), which will give you a fairly flexible file format
> that's still quite safe.
>

Restricting Python with the ast module is interesting, but I don't think
I'd want to bet my career on the actual safety of such a thing.  Given that
Java bytecode was a frequent problem inside web browsers, imagine all the
messiness that could accidentally happen with a subset of Python syntax
from untrusted sources.

ast.literal_eval might be a little better - or a list of such, actually.

Better still to use JSON or ini format - IOW something designed for the
purpose.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-23 Thread Chris Angelico

On Fri, Sep 24, 2021 at 12:22 PM Stefan Ram  wrote:
>
> dn  writes:
> >With that, why not code it as Python expressions, and include the module?
>
>   This might create a code execution vulnerability if such
>   files are exchanged between multiple parties.
>
>   If code execution vulnerabilities and human-readability are
>   not an issue, then one could also think about using pickle.
>
>   If one ignores security concerns for a moment, serialization into
>   a text format and subsequent deserialization can be as easy as:
>
> |>>> eval( str( [1, (2, 3)] ))
> |[1, (2, 3)]
>

One good hybrid is to take a subset of Python syntax (so it still
looks like a Python script for syntax highlighting etc), and then
parse that yourself, using the ast module. For instance, you can strip
out comments, then look for "VARNAME = ...", and parse the value using
ast.literal_eval(), which will give you a fairly flexible file format
that's still quite safe.
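
A minimal sketch of that idea, under some simplifying assumptions: one assignment per line, whole-line comments only, and no multi-line values. The function name parse_config and the sample variable names are hypothetical.

```python
import ast


def parse_config(text):
    """Parse lines of the form NAME = <Python literal> into a dict."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        name, sep, value = line.partition("=")
        name = name.strip()
        if not sep or not name.isidentifier():
            raise ValueError(f"not an assignment: {raw!r}")
        # literal_eval accepts only literals (numbers, strings, tuples,
        # lists, dicts, sets, booleans, None); it never calls functions.
        config[name] = ast.literal_eval(value.strip())
    return config


# Hypothetical config file content.
sample = """
# fuel data
FUEL = "lignite"
HEAT_CONTENT = 13.61
CURVE = [(0, 0.0), (100, 9.5)]
"""
config = parse_config(sample)
print(config)
```

A line such as `X = __import__('os').getcwd()` is rejected with a ValueError instead of being executed, which is the point of using literal_eval rather than eval.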

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-23 Thread dn via Python-list

On 22/09/2021 07.22, Michael F. Stemper wrote:
> On 21/09/2021 13.49, alister wrote:
>> On Tue, 21 Sep 2021 13:12:10 -0500, Michael F. Stemper wrote:
>>
>>> On the prolog thread, somebody posted a link to:
>>> 

Given the source, shouldn't one take any criticism of Python (or Java)
with at least the proverbial grain of salt!


>>> One thing that it tangentially says is "XML is not the answer."

"tangential" as in 'spinning off'?


...

> It's my own research, so I can give myself the data in any format that I
> like.
...
With that, why not code it as Python expressions, and include the module?
-- 
Regards,
=dn
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-23 Thread Jon Ribbens via Python-list

On 2021-09-23, Stefan Ram  wrote:
>   The real problem with CSV is that there is no CSV.
>
>   This is not a specific data language with a specific
>   specification. Instead it is a vague designation for
>   a plethora of CSV dialects, which usually do not even
>   have a specification.

Indeed. For example, at least at some points in its history,
Excel has been unable to import CSV written by itself, because
its importer was incompatible with its own exporter.

>   Compare this with XML. XML has a sole specification managed
>   by the W3C.

Other well-defined formats are also available ;-)
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-23 Thread Chris Angelico

On Fri, Sep 24, 2021 at 7:11 AM Eli the Bearded <*@eli.users.panix.com> wrote:
>
> In comp.lang.python, Christian Gollwitzer   wrote:
> > On 22.09.21 at 16:52, Michael F. Stemper wrote:
> >> On 21/09/2021 19.30, Eli the Bearded wrote:
> >>> Yes, CSV files can model that. But it would not be my first choice of
> >>> data format. (Neither would JSON.) I'd probably use XML.
> >> Okay. 'Go not to the elves for counsel, for they will say both no
> >> and yes.' (I'm not actually surprised to find differences of opinion.)
>
> Well, I have a recommendation with my answer.
>
> > It's the same as saying "CSV supports images". Of course it doesn't, its
> > a textfile, but you could encode a JPEG as base64 and then put this
> > string into the cell of a CSV table. That definitely isn't what a sane
> > person would understand as "support".
>
> I'd use one of the netpbm formats instead of JPEG. PBM for one bit
> bitmaps, PGM for one channel (typically grayscale), PPM for three
> channel RGB, and PAM for anything else (two channel gray plus alpha,
> CMYK, RGBA, HSV, YCbCr, and more exotic formats). JPEG is tricky to
> map to CSV since it is a three channel format (YCbCr), where the
> channels are typically not at the same resolution. Usually Y is full
> size and the Cb and Cr channels are one quarter size ("4:2:0 chroma
> subsampling"). The unequal size of the channels does not lend itself
> to CSV, but I can't say it's impossible.
>

Examine prior art, and I truly do mean art, from Matt Parker:

https://www.youtube.com/watch?v=UBX2QQHlQ_I

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

RE: XML Considered Harmful

2021-09-23 Thread Avi Gross via Python-list

What you are describing, Stefan, is what I meant by emulating a relational 
database with tables.

And, FYI, There is no guarantee that two authors with the same name will not be 
assumed to be the same person.

Besides the lack of any one official CSV format, there are oodles of features I 
have seen that are normally external to the CSV. For example, I have often read 
in data from a CSV or similar, where you could tell the software to treat a 
blank or 999 as NA, what marks a line in the file to be ignored as a comment, 
whether the separator is a space or any run of whitespace, what quoting lets 
you hide a comma, how to handle escapes, whether to skip blank lines, and more.
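
In Python, some of these knobs are parameters of the csv module and the rest have to be handled by the caller. A sketch under stated assumptions: the sample data, the ";" separator, the "#" comment marker and the NA markers are all invented for illustration, and every row is assumed to have the same number of fields.

```python
import csv

# Hypothetical export from some logger.
raw = """\
# readings exported by a hypothetical logger
name; value; note

unit1; 12.5; "contains; a separator"
unit2; 999; ok
unit3; ; missing
"""

NA_MARKERS = {"", "999"}  # assumption: blank or 999 means "not available"


def read_rows(text):
    # The csv module knows nothing about comments or blank lines,
    # so drop them before the reader sees the data.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    reader = csv.DictReader(lines, delimiter=";", quotechar='"',
                            skipinitialspace=True)
    rows = []
    for row in reader:
        rows.append({key: (None if val.strip() in NA_MARKERS else val)
                     for key, val in row.items()})
    return rows


rows = read_rows(raw)
for row in rows:
    print(row)
```

None of these conventions is recorded in the file itself; the reader has to be told about them out of band.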

Now a really good design might place some metadata into the file that can be 
used to set defaults for things like that or incorporate them into the format 
unambiguously. It might calculate the likely data type for various fields and 
store that in the metadata. So even if you stored rectangular data in a CSV 
file, perhaps the early lines would be in some format that can be read as 
comments and supply some info like the above.

Are any of the CSV variants more like that?

-Original Message-
From: Python-list  On 
Behalf Of Stefan Ram
Sent: Thursday, September 23, 2021 5:43 PM
To: python-list@python.org
Subject: Re: XML Considered Harmful

"Avi Gross"  writes:
>But scientific papers seemingly allow oodles of authors and any time 
>you update the data, you may need yet another column.

  You can use three CSV files: papers, persons, and authors:

  papers.csv

1, "Is the accelerated expansion evidence of a change of signature?"

  persons.csv

1, Marc Mars

  authors.csv

1, 1

  I.e., paper 1 is authored by person 1.

  Now, when we learn that José M. M. Senovilla also is a
  co-author of "Is the accelerated expansion evidence of a
  forthcoming change of signature?", we do only have to add
  new rows, no new columns.

  papers.csv

1, "Is the accelerated expansion evidence of a change of signature?"

  persons.csv

1, "Marc Mars"
2, "José M. M. Senovilla"

  authors.csv

1, 1
1, 2
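
  Putting the three tables back together is a small join. A sketch with the
  standard csv module, using the rows shown above as in-memory strings; the
  helper name read is hypothetical:

```python
import csv
import io

# The three tables from the example above, as in-memory strings.
papers_csv = '1, "Is the accelerated expansion evidence of a change of signature?"\n'
persons_csv = '1, "Marc Mars"\n2, "José M. M. Senovilla"\n'
authors_csv = "1, 1\n1, 2\n"


def read(text):
    # skipinitialspace lets the reader accept the space after each comma.
    return list(csv.reader(io.StringIO(text), skipinitialspace=True))


papers = {int(pid): title for pid, title in read(papers_csv)}
persons = {int(pid): name for pid, name in read(persons_csv)}

# authors.csv is the link table: (paper id, person id) pairs.
authors_of = {}
for paper_id, person_id in read(authors_csv):
    authors_of.setdefault(int(paper_id), []).append(persons[int(person_id)])

for paper_id, names in authors_of.items():
    print(papers[paper_id], "by", ", ".join(names))
```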

  The real problem with CSV is that there is no CSV.

  This is not a specific data language with a specific
  specification. Instead it is a vague designation for
  a plethora of CSV dialects, which usually do not even
  have a specification. Compare this with XML. XML has
  a sole specification managed by the W3C.

--
https://mail.python.org/mailman/listinfo/python-list

-- 
https://mail.python.org/mailman/listinfo/python-list

RE: XML Considered Harmful

2021-09-23 Thread Avi Gross via Python-list

Can we agree that there are way more general ways to store data than
anything currently in common use and that in some ways, CSV and cousins like
TSV are a subset of the others in a sense? There are trees and arbitrary
graphs and many complex data structures often encountered while a program is
running as in-memory objects. Many are not trivial to store.

But some are if all you see is table-like constructs including matrices and
data.frames.

I mean any rectangular data format with umpteen rows and N columns can
trivially be stored in many other formats and especially when it allows some
columns to have NA values. The other format would simply have major
categories that contain components with one per column, and if missing,
represents an NA. Is there any reason JSON or XML cannot include the
contents of any CSV with headers and without loss of info?

Going the other way is harder. Note that a data.frame type of structure
often imposes restrictions on a CSV and requires everything in a column to
be of the same type, or coercible to a common type. (well, not always true
as in using list columns in R.)  But given some arbitrary structure in XML,
can you look at all possible labels and if it is not too complex, make a CSV
with one or more columns for every possible need? It can be a problem if say
a record for an Author allows multiple actual co-authors. Normal books may
let you get by with multiple columns (mostly containing an NA) with names
like author1, author2, author3, ...

But scientific papers seemingly allow oodles of authors and any time you
update the data, you may need yet another column. And, of course, processing
data where many columns have the same meaning is a bit of a pain. Data
structures can also often be nested multiple levels and at some point, CSV
is not a reasonable fit unless you play database games and make multiple
tables you can store and retrieve to make complex queries, as in many
relational database systems. Yes, each such table can be a CSV.

But if you give someone a hammer, they tend to stop using thumbtacks or
other tools. The real question is what kind of data makes good sense for an
application. If a nice rectangular format works, great. Even if not, the
Author problem above can fairly easily be handled by making the author
column something like a character string you compose as "Last1, First1;
Last2, First2; Last3, First3" and that fits fine in a CSV but can be taken
apart in your software if looking for any book by a particular author. Not
optimal, but a workaround I am sure is used.
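
Taking such a packed column apart again could look like the sketch below. The
function names and the sample cell are hypothetical, and it assumes that
neither ";" nor "," occurs inside a name.

```python
def split_authors(cell):
    """Split 'Last1, First1; Last2, First2' into (last, first) tuples."""
    authors = []
    for part in cell.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing separator
        last, _, first = part.partition(",")
        authors.append((last.strip(), first.strip()))
    return authors


def has_author(cell, last_name):
    """True if any author packed into the cell has this last name."""
    return any(last == last_name for last, _ in split_authors(cell))


# Hypothetical cell content in the format described above.
cell = "Mars, Marc; Senovilla, José M. M."
print(split_authors(cell))
print(has_author(cell, "Senovilla"))
```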

But using the most abstract and complex storage method is very often
overkill and unless you are very good at it, may well be a fairly slow and
even error-prone way to solve a problem.

-Original Message-
From: Python-list  On
Behalf Of Chris Angelico
Sent: Thursday, September 23, 2021 9:27 AM
To: Python 
Subject: Re: XML Considered Harmful

On Thu, Sep 23, 2021 at 10:55 PM Mats Wichmann  wrote:
>
> On 9/22/21 10:31, Dennis Lee Bieber wrote:
>
> >   If you control both the data generation and the data 
> > consumption, finding some format  ...
>
> This is really the key.  I rant at people seeming to believe that csv 
> is THE data interchange format, and it's about as bad as it gets at 
> that, if you have a choice.  xml is noisy but at least (potentially) 
> self-documenting, and ought to be able to recover from certain errors.
> The problem with csv is that a substantial chunk of the world seems to 
> live inside Excel, and so data is commonly both generated in csv so it 
> can be imported into excel and generated in csv as a result of 
> exporting from excel, so the parts often are *not* in your control.
>
> Sigh.

The only people who think that CSV is *the* format are people who habitually
live in spreadsheets. People who move data around the internet, from program
to program, are much more likely to assume that JSON is the sole format. Of
course, there is no single ultimate data interchange format, but JSON is a
lot closer to one than CSV is.

(Or to be more precise: any such thing as a "single ultimate data
interchange format" will be so generic that it isn't enough to define
everything. For instance, "a stream of bytes" is a universal data
interchange format, but that's not ultimately a very useful claim.)

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-23 Thread Michael F. Stemper


On 23/09/2021 12.51, Eli the Bearded wrote:
>> On 22.09.21 at 16:52, Michael F. Stemper wrote:
>>> On 21/09/2021 19.30, Eli the Bearded wrote:
>>>> Yes, CSV files can model that. But it would not be my first choice of
>>>> data format. (Neither would JSON.) I'd probably use XML.
>>> Okay. 'Go not to the elves for counsel, for they will say both no
>>> and yes.' (I'm not actually surprised to find differences of opinion.)
>
> Well, I have a recommendation with my answer.

Sorry, didn't mean that to be disparaging.

--
Michael F. Stemper
This post contains greater than 95% post-consumer bytes by weight.
--
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-23 Thread Michael F. Stemper


On 22/09/2021 17.37, Dennis Lee Bieber wrote:
> On Wed, 22 Sep 2021 09:52:59 -0500, "Michael F. Stemper"
>  declaimed the following:
>> On 21/09/2021 19.30, Eli the Bearded wrote:
>>> In comp.lang.python, Michael F. Stemper  wrote:
>>>> How does CSV handle hierarchical data? For instance, I have
>>>>
>>>> Can CSV files model this sort of situation?
>>>
>>> Yes, CSV files can model that. But it would not be my first choice of
>>> data format. (Neither would JSON.) I'd probably use XML.
>>
>> Okay. 'Go not to the elves for counsel, for they will say both no
>> and yes.' (I'm not actually surprised to find differences of opinion.)
>
> You'd have to include a "level" (and/or data type if multiple objects
> can be at the same level) field (as the first field) in CSV which
> identifies how to parse the rest of the CSV data (well, technically, the
> CSV module has "parsed" it -- in terms of splitting at commas, handling
> quoted strings (which may contain commas which are not split points), etc.).
>
> 1-generator, name
> 2-fuel, name, UOM, heat-content, price
> 2-curve, name
> 3-point, X, Y
> 3-point, X, Y
> ...
> 2-curve, name
> 3-point, X, Y
> 3-point, X, Y

This reminds me of how my (former) employer imported data models into
our systems from the 1970s until the mid-2000s. We had 80-column records
(called "card images"), that would have looked like:

FUEL0 LIGNITETON13.610 043.581
UNIT1 COAL CREK1
UNIT2 ...

The specific columns for the start and end of each field on each record
were defined in a thousand-plus page document. (We modeled all of a
power system, not just economic data about generators.)

However, this doesn't seem like it would fit too well with the csv
module, since it requires a lot more logic on the part of the consuming
program.

Interesting flashback, though.
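
For what it's worth, the extra logic for the level-prefixed layout can be
sketched with a stack on top of the csv module. Everything here is an
assumption for illustration: the sample values, the dict layout of a node,
and the function name parse_levels.

```python
import csv
import io

# Hypothetical data in the "level-kind, fields..." layout quoted above.
raw = """\
1-generator,unit1
2-fuel,lignite,ton,13.61,43.581
2-curve,heat_rate
3-point,0,0.0
3-point,100,9.5
2-curve,cost
3-point,0,1.0
"""


def parse_levels(text):
    root = {"kind": "root", "fields": [], "children": []}
    stack = [root]  # stack[n] is the most recent node at level n
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        level_text, _, kind = row[0].partition("-")
        level = int(level_text)
        if level < 1 or level > len(stack):
            raise ValueError(f"level {level} has no parent: {row!r}")
        node = {"kind": kind, "fields": row[1:], "children": []}
        del stack[level:]  # drop nodes at this level and deeper
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


tree = parse_levels(raw)
generator = tree["children"][0]
print(generator["kind"], [c["kind"] for c in generator["children"]])
```

The csv module still does the splitting and quoting; the hierarchy lives
entirely in the consuming program, which is the cost of this approach.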

--
Michael F. Stemper
Deuteronomy 24:17
--
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-23 Thread Christian Gollwitzer


On 22.09.21 at 16:52, Michael F. Stemper wrote:
> On 21/09/2021 19.30, Eli the Bearded wrote:
>> Yes, CSV files can model that. But it would not be my first choice of
>> data format. (Neither would JSON.) I'd probably use XML.
>
> Okay. 'Go not to the elves for counsel, for they will say both no
> and yes.' (I'm not actually surprised to find differences of opinion.)

It is wrong, CSV has no model of hierarchical data. A CSV file is a 2D
table, just like a database table or an Excel sheet.

You can /layer/ high-dimensional data on top of a 2D table, there is the
relational algebra theory behind this, but it is wrong (or misleading at
best) to say that CSV can model hierarchical data.

It's the same as saying "CSV supports images". Of course it doesn't, it's
a text file, but you could encode a JPEG as base64 and then put this
string into the cell of a CSV table. That definitely isn't what a sane
person would understand as "support".
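
To make the point concrete, here is a sketch of that workaround with the
standard library. The payload is a stand-in for real image bytes, and the
column layout is invented for illustration.

```python
import base64
import csv
import io

# Stand-in for real image bytes; any binary payload behaves the same way.
payload = bytes(range(256))

# Write one row whose second cell is the base64 text of the payload.
buf = io.StringIO()
csv.writer(buf).writerow(["logo", base64.b64encode(payload).decode("ascii")])

# Read it back and decode the cell.
buf.seek(0)
name, cell = next(csv.reader(buf))
restored = base64.b64decode(cell)

print(name, len(cell), restored == payload)
```

The CSV layer only ever sees an opaque string; knowing that the cell holds an
image, and in which encoding, is again a private agreement between writer and
reader.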


Christian

--
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-23 Thread Eli the Bearded

In comp.lang.python, Christian Gollwitzer   wrote:
> On 22.09.21 at 16:52, Michael F. Stemper wrote:
>> On 21/09/2021 19.30, Eli the Bearded wrote:
>>> Yes, CSV files can model that. But it would not be my first choice of
>>> data format. (Neither would JSON.) I'd probably use XML.
>> Okay. 'Go not to the elves for counsel, for they will say both no
>> and yes.' (I'm not actually surprised to find differences of opinion.)

Well, I have a recommendation with my answer.

> It's the same as saying "CSV supports images". Of course it doesn't, its 
> a textfile, but you could encode a JPEG as base64 and then put this 
> string into the cell of a CSV table. That definitely isn't what a sane 
> person would understand as "support".

I'd use one of the netpbm formats instead of JPEG. PBM for one bit
bitmaps, PGM for one channel (typically grayscale), PPM for three
channel RGB, and PAM for anything else (two channel gray plus alpha,
CMYK, RGBA, HSV, YCbCr, and more exotic formats). JPEG is tricky to
map to CSV since it is a three channel format (YCbCr), where the
channels are typically not at the same resolution. Usually Y is full
size and the Cb and Cr channels are one quarter size ("4:2:0 chroma
subsampling"). The unequal size of the channels does not lend itself
to CSV, but I can't say it's impossible.

But maybe you meant the whole JFIF or Exif JPEG file format base64
encoded with no attempt to understand the image. That sort of thing
is common in JSON, and I've seen it in YAML, too. It wouldn't surprise
me if people do that in CSV or XML, but I have so far avoided seeing
that. I used that method for sticking a tiny PNG in a CSS file just
earlier this month. The whole PNG was smaller than the typical headers
of an HTTP/1.1 request and response, so I figured "don't make it a
separate file".
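
A minimal sketch of the "base64 the whole file into one cell" approach,
using only the standard library. The file name and the eight-byte payload
(just the PNG signature) are stand-ins invented for illustration, not a
real image.

```python
import base64
import csv
import io

# Stand-in for real image bytes (hypothetical: only the 8-byte PNG signature).
image_bytes = b"\x89PNG\r\n\x1a\n"

# Encode the raw bytes as ASCII text so they can sit in a single CSV cell.
encoded = base64.b64encode(image_bytes).decode("ascii")

# Write a one-row table: a label plus the encoded blob.
buffer = io.StringIO()
csv.writer(buffer).writerow(["tiny.png", encoded])

# Read the row back and decode the cell to recover the original bytes.
buffer.seek(0)
name, cell = next(csv.reader(buffer))
decoded = base64.b64decode(cell)
```

The CSV layer never looks inside the cell; it only carries an opaque
string, which is the sense in which this is not really "support".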

Elijah
--
can at this point recognize a bunch of "magic numbers" in base64

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-23 Thread Chris Angelico

On Thu, Sep 23, 2021 at 10:55 PM Mats Wichmann  wrote:
>
> On 9/22/21 10:31, Dennis Lee Bieber wrote:
>
> >   If you control both the data generation and the data consumption,
> > finding some format  ...
>
> This is really the key.  I rant at people seeming to believe that csv is
> THE data interchange format, and it's about as bad as it gets at that,
> if you have a choice.  xml is noisy but at least (potentially)
> self-documenting, and ought to be able to recover from certain errors.
> The problem with csv is that a substantial chunk of the world seems to
> live inside Excel, and so data is commonly both generated in csv so it
> can be imported into excel and generated in csv as a result of exporting
> from excel, so the parts often are *not* in your control.
>
> Sigh.

The only people who think that CSV is *the* format are people who
habitually live in spreadsheets. People who move data around the
internet, from program to program, are much more likely to assume that
JSON is the sole format. Of course, there is no single ultimate data
interchange format, but JSON is a lot closer to one than CSV is.

(Or to be more precise: any such thing as a "single ultimate data
interchange format" will be so generic that it isn't enough to define
everything. For instance, "a stream of bytes" is a universal data
interchange format, but that's not ultimately a very useful claim.)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-23 Thread Mats Wichmann


On 9/22/21 10:31, Dennis Lee Bieber wrote:


If you control both the data generation and the data consumption,
finding some format  ...


This is really the key.  I rant at people seeming to believe that csv is 
THE data interchange format, and it's about as bad as it gets at that, 
if you have a choice.  xml is noisy but at least (potentially) 
self-documenting, and ought to be able to recover from certain errors. 
The problem with csv is that a substantial chunk of the world seems to 
live inside Excel, and so data is commonly both generated in csv so it 
can be imported into excel and generated in csv as a result of exporting 
from excel, so the parts often are *not* in your control.


Sigh.

--
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-22 Thread Dennis Lee Bieber

On Wed, 22 Sep 2021 09:52:59 -0500, "Michael F. Stemper"
 declaimed the following:

>On 21/09/2021 19.30, Eli the Bearded wrote:
>> In comp.lang.python, Michael F. Stemper  wrote:
>>> How does CSV handle hierarchical data? For instance, I have
>>> generators[1], each of which has a name, a fuel and one or more
>>> incremental heat rate curves. Each fuel has a name, UOM, heat content,
>>> and price. Each incremental cost curve has a name, and a series of
>>> ordered pairs (representing a piecewise linear curve).
>>>
>>> Can CSV files model this sort of situation?
>> 
 
>> Yes, CSV files can model that. But it would not be my first choice of
>> data format. (Neither would JSON.) I'd probably use XML.
>
>Okay. 'Go not to the elves for counsel, for they will say both no
>and yes.' (I'm not actually surprised to find differences of opinion.)
>
You'd have to include a "level" (and/or data type if multiple objects
can be at the same level) field (as the first field) in CSV which
identifies how to parse the rest of the CSV data (well, technically, the
CSV module has "parsed" it, in terms of splitting at commas and handling
quoted strings, which may contain commas that are not split points).

1-generator, name
2-fuel, name, UOM, heat-content, price
2-curve, name
3-point, X, Y
3-point, X, Y
...
2-curve, name
3-point, X, Y
3-point, X, Y
...

You extract objects at each level; if the level is the same or "lower"
(numerically -- higher in hierarchy) you attach the "previously" extracted
object to the parent object... Whether list or dictionary, or class
instance(s):

class Point():
    # Point may be overkill; easier to just use a tuple (X, Y)
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

class Curve():
    def __init__(self, name):
        self.name = name
        self.points = []

# use as aCurve.points.append(currentPoint)

class Fuel():
    # fields follow the "2-fuel" row layout above
    def __init__(self, name, UOM, heat_content, price):
        self.name = name
        self.UOM = UOM
        self.heat_content = heat_content
        self.price = price

class Generator():
    def __init__(self, name):
        self.name = name
        self.fuel = None
        self.curves = []

# aGenerator.fuel = currentFuel
# aGenerator.curves.append(currentCurve)
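
A rough sketch of reading the level-tagged rows described above. It uses
plain dicts rather than the classes, and it attaches each object to its
parent as soon as the row is seen instead of deferring until the level
drops, which is a simplification. The sample values and the dict keys are
made up for illustration.

```python
import csv
import io

# Hypothetical sample in the level-tagged layout described above.
SAMPLE = """\
1-generator,G1
2-fuel,coal,ton,25.0,60.0
2-curve,summer
3-point,100,9.5
3-point,200,10.2
2-curve,winter
3-point,100,9.1
3-point,250,10.0
1-generator,G2
2-fuel,gas,MMBtu,1.0,3.5
2-curve,base
3-point,50,7.0
3-point,150,8.0
"""

def parse(text):
    """Build a list of generator dicts from level-tagged CSV rows."""
    generators = []
    generator = curve = None  # most recently opened parent objects
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        tag = row[0].strip()
        fields = [field.strip() for field in row[1:]]
        if tag == "1-generator":
            generator = {"name": fields[0], "fuel": None, "curves": []}
            generators.append(generator)
        elif tag == "2-fuel":
            name, uom, heat, price = fields
            generator["fuel"] = {"name": name, "uom": uom,
                                 "heat_content": float(heat),
                                 "price": float(price)}
        elif tag == "2-curve":
            curve = {"name": fields[0], "points": []}
            generator["curves"].append(curve)
        elif tag == "3-point":
            x, y = fields
            curve["points"].append((float(x), float(y)))
        else:
            raise ValueError(f"unknown record type: {tag!r}")
    return generators

generators = parse(SAMPLE)
```

The first field does all the structural work here; the csv module only
splits the row, and the loop decides where each record belongs.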



-- 
Wulfraed Dennis Lee Bieber AF6VN
wlfr...@ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-22 Thread Dennis Lee Bieber

On Tue, 21 Sep 2021 13:12:10 -0500, "Michael F. Stemper"
 declaimed the following:

>On the prolog thread, somebody posted a link to:
>
>
>One thing that it tangentially says is "XML is not the answer."
>
>I read this page right when I was about to write an XML parser
>to get data into the code for a research project I'm working on.
>It seems to me that XML is the right approach for this sort of
>thing, especially since the data is hierarchical in nature.
>
>Does the advice on that page mean that I should find some other
>way to get data into my programs, or does it refer to some kind
>of misuse/abuse of XML for something that it wasn't designed
>for?

There are some that try to use XML as a /live/ data /storage/ format
(such as http://www.drivehq.com/web/brana/pandora.htm which has to parse
XML files for all configuration data and filter definitions on start-up,
and update those files on any changes).

If you control both the data generation and the data consumption,
finding some format with less overhead than XML is probably to be
recommended. XML is more a self-documented (in theory) means of packaging
data for transport between widely disparate applications, which are likely
written by different teams, if not different companies, who only interface
via the definition of the data as seen by XML.

>
>If XML is not the way to package data, what is the recommended
>approach?

Again, if you control both generation and consumption... I'd probably
use an RDBMS. SQLite tends to be packaged with Python [Windows] or, at the
least, the DB-API adapter [Linux tends to expect SQLite as a standard
installed item]. SQLite is a "file server" model (as is the JET engine used
by M$ Access) -- each application (instance) is directly accessing the
database file; there is no server process mediating access.

Hierarchical (since you mention that in later posts) would be
represented by relations (terminology from relational theory -- a "table"
to most) linked by foreign keys.
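
As a sketch only, the relations-plus-foreign-keys idea might look like
this with the standard sqlite3 module. The table names, column names and
sample values are assumptions made up for illustration; an in-memory
database stands in for a real file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # in-memory DB for the sketch
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
CREATE TABLE fuel (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE,
    uom          TEXT NOT NULL,
    heat_content REAL NOT NULL,
    price        REAL NOT NULL
);
CREATE TABLE generator (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE,
    fuel_id INTEGER NOT NULL REFERENCES fuel(id)
);
CREATE TABLE curve (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    generator_id INTEGER NOT NULL REFERENCES generator(id)
);
CREATE TABLE point (
    curve_id INTEGER NOT NULL REFERENCES curve(id),
    seq      INTEGER NOT NULL,   -- keeps the ordered pairs in order
    x        REAL NOT NULL,
    y        REAL NOT NULL,
    PRIMARY KEY (curve_id, seq)
);
""")

# Insert one generator with its fuel and a two-point curve.
fuel_id = conn.execute(
    "INSERT INTO fuel (name, uom, heat_content, price) VALUES (?, ?, ?, ?)",
    ("coal", "ton", 25.0, 60.0)).lastrowid
gen_id = conn.execute(
    "INSERT INTO generator (name, fuel_id) VALUES (?, ?)",
    ("G1", fuel_id)).lastrowid
curve_id = conn.execute(
    "INSERT INTO curve (name, generator_id) VALUES (?, ?)",
    ("summer", gen_id)).lastrowid
conn.executemany(
    "INSERT INTO point (curve_id, seq, x, y) VALUES (?, ?, ?, ?)",
    [(curve_id, 0, 100.0, 9.5), (curve_id, 1, 200.0, 10.2)])

# Walk the hierarchy back out with joins on the foreign keys.
rows = conn.execute("""
    SELECT g.name, c.name, p.x, p.y
    FROM generator g
    JOIN curve c ON c.generator_id = g.id
    JOIN point p ON p.curve_id = c.id
    ORDER BY g.name, c.name, p.seq
""").fetchall()
```

Each level of the hierarchy becomes a table, and the parent/child links
become foreign keys rather than nesting.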


-- 
Wulfraed Dennis Lee Bieber AF6VN
wlfr...@ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-22 Thread Michael F. Stemper


On 21/09/2021 19.30, Eli the Bearded wrote:

In comp.lang.python, Michael F. Stemper  wrote:

I've heard of JSON, but never done anything with it.


You probably have used it inadvertently on a regular basis over the
past few years. Websites live on it.


I used to use javascript when I was running Windows (up until 2009),
since it was the only programming language to which I had ready
access. Then I got a linux box and quickly discovered python. I
dropped javascript like a hot potato.


How does CSV handle hierarchical data? For instance, I have
generators[1], each of which has a name, a fuel and one or more
incremental heat rate curves. Each fuel has a name, UOM, heat content,
and price. Each incremental cost curve has a name, and a series of
ordered pairs (representing a piecewise linear curve).

Can CSV files model this sort of situation?


Can a string of ones and zeros encode the sounds of Bach, the images
of his sheet music, the details to reproduce his bust in melted plastic
extruded from nozzle under the control of machines?

Yes, CSV files can model that. But it would not be my first choice of
data format. (Neither would JSON.) I'd probably use XML.


Okay. 'Go not to the elves for counsel, for they will say both no
and yes.' (I'm not actually surprised to find differences of opinion.)


[1] The kind made of tons of iron and copper, filled with oil, and
rotating at 1800 rpm.


Those are rather hard to model in CSV, too, but I'm sure it could be
done.



for bonus round, use punched holes in paper to encode the ones and zeros


I've done cardboard.


--
Michael F. Stemper
No animals were harmed in the composition of this message.
--
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-22 Thread Michael F. Stemper


On 21/09/2021 16.21, Pete Forman wrote:

"Michael F. Stemper"  writes:

On 21/09/2021 13.49, alister wrote:

On Tue, 21 Sep 2021 13:12:10 -0500, Michael F. Stemper wrote:

It's my own research, so I can give myself the data in any format that I
like.


as far as I can see the main issue with XML is bloat, it tries to do
too many things & is a very verbose format, often the quantity of
mark-up can easily exceed the data contained within it. other formats
such as JSON & csv have far less overhead, although again not always
suitable.


I've heard of JSON, but never done anything with it.


Then you should certainly try to get a basic understanding of it. One
thing JSON shares with XML is that it is best left to machines to
produce and consume. Because both can be viewed in a text editor there
is a common misconception that they are easy to edit. Not so, commas are
a common bugbear in JSON and non-trivial edits in (XML unaware) text
editors are tricky.


Okay, after playing around with the example in Lubanovic's book[1]
I've managed to create a dict of dicts of dicts and write it to a
json file. It seems to me that this is how json handles hierarchical
data. Is that understanding correct?

Is this then the process that I would use to create a *.json file
to provide data to my various programs? Copy and paste the current
hard-coded assignment statements into REPL, use json.dump(dict,fp)
to write it to a file, and then read the file into each program
with json.load(fp)? (Actually, I'd write a function to do that,
just as I would with XML.)
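
A small sketch of that round trip, assuming a dict of dicts shaped
roughly like the data described in this thread. The keys, values, file
name and the save/load helper names are all invented for illustration.

```python
import json
import os
import tempfile

# Hypothetical data: a dict of dicts of dicts, as described above.
generators = {
    "G1": {
        "fuel": {"name": "coal", "uom": "ton",
                 "heat_content": 25.0, "price": 60.0},
        "curves": {"summer": [[100, 9.5], [200, 10.2]]},
    },
}

def save(data, path):
    """Write the nested dict to a JSON file."""
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(data, fp, indent=2)

def load(path):
    """Read the nested dict back from a JSON file."""
    with open(path, encoding="utf-8") as fp:
        return json.load(fp)

# Use a temporary directory so the sketch leaves no stray files behind.
path = os.path.join(tempfile.mkdtemp(), "generators.json")
save(generators, path)
restored = load(path)
```

Two things to watch for: JSON object keys are always strings, and tuples
come back as lists, so ordered pairs are written as two-element lists
here.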


Consider what overhead you should worry about. If you are concerned
about file sizes then XML, JSON and CSV should all compress to a similar
size.


Not a concern at all for my current application.


How does CSV handle hierarchical data? For instance, I have
generators[1], each of which has a name, a fuel and one or more
incremental heat rate curves. Each fuel has a name, UOM, heat content,
and price. Each incremental cost curve has a name, and a series of
ordered pairs (representing a piecewise linear curve).

Can CSV files model this sort of situation?


The short answer is no. CSV files represent spreadsheet row-column
values with nothing fancier such as formulas or other redirections.


Okay, that was what I suspected.


CSV is quite good as a lowest common denominator exchange format. I say
quite because I would characterize it by 8 attributes and you need to
pick a dialect such as MS Excel which sets out what those are. XML and
JSON are controlled much better. You can easily verify that you conform
to those and guarantee that *any* conformant parser can read your
content. XML is more powerful in that respect than JSON in that you can
define and enforce schemas. In your case the fuel name, UOM, etc. can be
validated with standard tools.


Yeah, validating against a DTD is pretty easy, since lxml.etree does all
of the work.


  In JSON all that checking is entirely
handled by the consuming program(s).

Well, the consumer's (almost) always going to need to do *some*
validation. For instance, as far as I can tell, a DTD can't specify
that there must be at least two of a particular item.

The designers of DTD seem to have taken the advice of MacLennan[2]:
  "The only reasonable numbers are zero, one, or infinity."

Which is great until you need to make sure that you have enough
points to define at least one line segment.
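
A consumer-side check for that could be sketched as below, using
xml.etree.ElementTree. The element and attribute names are invented for
illustration. (It may also be worth checking whether a DTD content model
along the lines of (point, point+) expresses "two or more" for your
validator.)

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element and attribute names are made up.
DOC = """
<generator name="G1">
  <curve name="summer">
    <point x="100" y="9.5"/>
    <point x="200" y="10.2"/>
  </curve>
  <curve name="broken">
    <point x="100" y="9.1"/>
  </curve>
</generator>
"""

def short_curves(root, minimum=2):
    """Return the names of curves with fewer than `minimum` points."""
    return [curve.get("name")
            for curve in root.iter("curve")
            if len(curve.findall("point")) < minimum]

root = ET.fromstring(DOC)
bad = short_curves(root)  # curves that cannot define a line segment
```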


As in all such cases it is a matter of choosing the most appropriate tool
for the job in hand.


Naturally. That's what I'm exploring.


You might also like to consider HDF5. It is targeted at large volumes of
scientific data and its capabilities are well above what you need.


Yeah, I won't be looking at more than five or ten generators at most. A
small number is enough to confirm or refute the behavior that I'm
testing.


[1] _Introducing Python: Modern Computing in Simple Packages_,
Second Release, (c) 2015, Bill Lubanovic, O'Reilly Media, Inc.
[2] _Principles of Programming Languages: Design, Evaluation,
and Implementation_, Second Edition, (c) 1987, Bruce J. MacLennan,
Holt, Rinehart, & Winston
--
Michael F. Stemper
No animals were harmed in the composition of this message.
--
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-22 Thread Pete Forman

Jon Ribbens  writes:

> On 2021-09-21, Pete Forman  wrote:
>> CSV is quite good as a lowest common denominator exchange format. I
>> say quite because I would characterize it by 8 attributes and you
>> need to pick a dialect such as MS Excel which sets out what those
>> are. XML and JSON are controlled much better. You can easily verify
>> that you conform to those and guarantee that *any* conformant parser
>> can read your content. XML is more powerful in that respect than JSON
>> in that you can define and enforce schemas. In your case the fuel
>> name, UOM, etc. can be validated with standard tools. In JSON all
>> that checking is entirely handled by the consuming program(s).
>
> That's not true. You can use "JSON Schema" to create a schema for
> validating JSON files, and there appear to be at least four
> implementations in Python.

Fair point. It has been a while since I looked at JSON schemas and they
were rather less mature then.

-- 
Pete Forman
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-21 Thread Dan Stromberg

On Tue, Sep 21, 2021 at 7:26 PM Michael F. Stemper <
michael.stem...@gmail.com> wrote:

> If XML is not the way to package data, what is the recommended
> approach?
>

I prefer both JSON and YAML over XML.

XML has both elements and attributes, but it didn't really need both. This
results in more complexity than necessary.  Also, XSLT and XPath are not
really all that simple.

But there's hope.  If you're stuck with XML, you can use xmltodict, which
makes XML almost as easy as JSON.

HTH.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-21 Thread Ethan Furman


On 9/21/21 11:12 AM, Michael F. Stemper wrote:

> It seems to me that XML is the right approach for this sort of
> thing, especially since the data is hierarchical in nature.

If you're looking for a format that you can read (as a human) and possibly
hand-edit, check out NestedText:

  https://nestedtext.org/en/stable/

--
~Ethan~
--
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-21 Thread Eli the Bearded

In comp.lang.python, Michael F. Stemper  wrote:
> I've heard of JSON, but never done anything with it.

You probably have used it inadvertently on a regular basis over the
past few years. Websites live on it.

> How does CSV handle hierarchical data? For instance, I have
> generators[1], each of which has a name, a fuel and one or more
> incremental heat rate curves. Each fuel has a name, UOM, heat content,
> and price. Each incremental cost curve has a name, and a series of
> ordered pairs (representing a piecewise linear curve).
> 
> Can CSV files model this sort of situation?

Can a string of ones and zeros encode the sounds of Bach, the images
of his sheet music, the details to reproduce his bust in melted plastic
extruded from nozzle under the control of machines?

Yes, CSV files can model that. But it would not be my first choice of
data format. (Neither would JSON.) I'd probably use XML.

I rather suspect that all (many) of those genomes that end up in
Microsoft Excel files get there via a CSV export from a command line
tool. Once you can model life in CSV, everything seems possible.

> [1] The kind made of tons of iron and copper, filled with oil, and
> rotating at 1800 rpm.

Those are rather hard to model in CSV, too, but I'm sure it could be
done.

Elijah
--
for bonus round, use punched holes in paper to encode the ones and zeros
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-21 Thread Jon Ribbens via Python-list

On 2021-09-21, Pete Forman  wrote:
> CSV is quite good as a lowest common denominator exchange format. I say
> quite because I would characterize it by 8 attributes and you need to
> pick a dialect such as MS Excel which sets out what those are. XML and
> JSON are controlled much better. You can easily verify that you conform
> to those and guarantee that *any* conformant parser can read your
> content. XML is more powerful in that respect than JSON in that you can
> define and enforce schemas. In your case the fuel name, UOM, etc. can be
> validated with standard tools. In JSON all that checking is entirely
> handled by the consuming program(s).

That's not true. You can use "JSON Schema" to create a schema
for validating JSON files, and there appear to be at least four
implementations in Python.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-21 Thread Joe Pfeiffer

r...@zedat.fu-berlin.de (Stefan Ram) writes:

> - S expressions (i.e., LISP notation)

If you're looking at hierarchical data and you don't have some good
reason to use something else, this is very likely to be your simplest
option.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-21 Thread alister via Python-list

On Tue, 21 Sep 2021 14:22:52 -0500, Michael F. Stemper wrote:

> On 21/09/2021 13.49, alister wrote:
>> On Tue, 21 Sep 2021 13:12:10 -0500, Michael F. Stemper wrote:
>> 
>>> On the prolog thread, somebody posted a link to:
>>> 
>>>
>>> One thing that it tangentially says is "XML is not the answer."
>>>
>>> I read this page right when I was about to write an XML parser to get
>>> data into the code for a research project I'm working on.
>>> It seems to me that XML is the right approach for this sort of thing,
>>> especially since the data is hierarchical in nature.
>>>
>>> Does the advice on that page mean that I should find some other way to
>>> get data into my programs, or does it refer to some kind of
>>> misuse/abuse of XML for something that it wasn't designed for?
>>>
>>> If XML is not the way to package data, what is the recommended
>>> approach?
>> 
>> 1st can I say don't write your own XML parser, there are already a
>> number of existing parsers that should do everything you will need. 
>> This is a wheel that does not need re-inventing.
> 
> I was going to build it on top of xml.etree.ElementTree
> 
so not writing a parser, using one, that's ok
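
For illustration, a reader built on top of xml.etree.ElementTree might
look something like this. The XML layout, the element and attribute
names, and the read_generators helper are assumptions, not anything
posted in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout for the generator data; all names are invented.
DOC = """
<generators>
  <generator name="G1">
    <fuel name="coal" uom="ton" heat_content="25.0" price="60.0"/>
    <curve name="summer">
      <point x="100" y="9.5"/>
      <point x="200" y="10.2"/>
    </curve>
  </generator>
</generators>
"""

def read_generators(xml_text):
    """Turn the XML into a dict keyed by generator name."""
    root = ET.fromstring(xml_text)
    result = {}
    for gen in root.findall("generator"):
        fuel = gen.find("fuel")
        result[gen.get("name")] = {
            "fuel": fuel.get("name"),
            "price": float(fuel.get("price")),
            # One list of (x, y) tuples per named curve.
            "curves": {
                curve.get("name"): [(float(p.get("x")), float(p.get("y")))
                                    for p in curve.findall("point")]
                for curve in gen.findall("curve")
            },
        }
    return result

data = read_generators(DOC)
```

The parsing itself is one call; everything else is walking the tree and
converting attribute strings into numbers.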

>> 2nd if you are not generating the data then you have to use whatever
>> data format you are supplied
> 
> It's my own research, so I can give myself the data in any format that I
> like.
> 
>> as far as I can see the main issue with XML is bloat, it tries to do
>> too many things & is a very verbose format, often the quantity of
>> mark-up can easily exceed the data contained within it.
>> 
>> other formats such as JSON & csv have far less overhead, although again
>> not always suitable.
> 
> I've heard of JSON, but never done anything with it.
the python json library makes it simple.
it was originally invented for javascript, it looks very much like the
repr of a list/dictionary but if you are using std libraries you don't
really need to know except for academic interest
> 
> How does CSV handle hierarchical data?
It doesn't, if you have hierarchical data it is not a suitable format
> For instance, I have
> generators[1], each of which has a name, a fuel and one or more
> incremental heat rate curves. Each fuel has a name, UOM, heat content,
> and price. Each incremental cost curve has a name, and a series of
> ordered pairs (representing a piecewise linear curve).
> 
> Can CSV files model this sort of situation?
> 
>> As in all such cases it is a matter of choosing the most appropriate
>> tool for the job in hand.
> 
> Naturally. That's what I'm exploring.
> 
> 
> [1] The kind made of tons of iron and copper, filled with oil, and
> rotating at 1800 rpm.





-- 
Riches cover a multitude of woes.
-- Menander
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-21 Thread Pete Forman

"Michael F. Stemper"  writes:

> On 21/09/2021 13.49, alister wrote:
>> On Tue, 21 Sep 2021 13:12:10 -0500, Michael F. Stemper wrote:
> It's my own research, so I can give myself the data in any format that I
> like.
>
>> as far as I can see the main issue with XML is bloat, it tries to do
>> too many things & is a very verbose format, often the quantity of
>> mark-up can easily exceed the data contained within it. other formats
>> such as JSON & csv have far less overhead, although again not always
>> suitable.
>
> I've heard of JSON, but never done anything with it.

Then you should certainly try to get a basic understanding of it. One
thing JSON shares with XML is that it is best left to machines to
produce and consume. Because both can be viewed in a text editor there
is a common misconception that they are easy to edit. Not so, commas are
a common bugbear in JSON and non-trivial edits in (XML unaware) text
editors are tricky.

Consider what overhead you should worry about. If you are concerned
about file sizes then XML, JSON and CSV should all compress to a similar
size.

> How does CSV handle hierarchical data? For instance, I have
> generators[1], each of which has a name, a fuel and one or more
> incremental heat rate curves. Each fuel has a name, UOM, heat content,
> and price. Each incremental cost curve has a name, and a series of
> ordered pairs (representing a piecewise linear curve).
>
> Can CSV files model this sort of situation?

The short answer is no. CSV files represent spreadsheet row-column
values with nothing fancier such as formulas or other redirections.

CSV is quite good as a lowest common denominator exchange format. I say
quite because I would characterize it by 8 attributes and you need to
pick a dialect such as MS Excel which sets out what those are. XML and
JSON are controlled much better. You can easily verify that you conform
to those and guarantee that *any* conformant parser can read your
content. XML is more powerful in that respect than JSON in that you can
define and enforce schemas. In your case the fuel name, UOM, etc. can be
validated with standard tools. In JSON all that checking is entirely
handled by the consuming program(s).
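
To make the dialect point concrete: Python's csv module documents eight
dialect attributes, which may or may not be the same eight meant above
(that mapping is an assumption). A short sketch that inspects the
built-in "excel" dialect and writes one awkward row with it:

```python
import csv
import io

# The eight attributes documented for csv.Dialect in the standard library.
ATTRIBUTES = ("delimiter", "quotechar", "escapechar", "doublequote",
              "skipinitialspace", "lineterminator", "quoting", "strict")

# Look up the registered "excel" dialect and collect its settings.
excel = csv.get_dialect("excel")
settings = {name: getattr(excel, name) for name in ATTRIBUTES}

# Write a row containing a comma and embedded quotes (made-up values).
buffer = io.StringIO()
writer = csv.writer(buffer, dialect="excel")
writer.writerow(["coal", "short ton, US", 'say "hi"'])
line = buffer.getvalue()
```

A reader using a different dialect (say, a different delimiter or quoting
rule) could split that same line differently, which is why the dialect
has to be agreed on by both ends.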

>> As in all such cases it is a matter of choosing the most appropriate tool
>> for the job in hand.
>
> Naturally. That's what I'm exploring.

You might also like to consider HDF5. It is targeted at large volumes of
scientific data and its capabilities are well above what you need.
MATLAB, Octave and Scilab use it as their native format. PyTables and
h5py provide Python/NumPy bindings to it.

-- 
Pete Forman
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-21 Thread Michael F. Stemper


On 21/09/2021 13.49, alister wrote:

On Tue, 21 Sep 2021 13:12:10 -0500, Michael F. Stemper wrote:


On the prolog thread, somebody posted a link to:


One thing that it tangentially says is "XML is not the answer."

I read this page right when I was about to write an XML parser to get
data into the code for a research project I'm working on.
It seems to me that XML is the right approach for this sort of thing,
especially since the data is hierarchical in nature.

Does the advice on that page mean that I should find some other way to
get data into my programs, or does it refer to some kind of misuse/abuse
of XML for something that it wasn't designed for?

If XML is not the way to package data, what is the recommended approach?


1st can I say don't write your own XML parser, there are already a
number of existing parsers that should do everything you will need.  This
is a wheel that does not need re-inventing.


I was going to build it on top of xml.etree.ElementTree


2nd if you are not generating the data then you have to use whatever data
format you are supplied


It's my own research, so I can give myself the data in any format that I
like.


as far as I can see the main issue with XML is bloat, it tries to do too
many things & is a very verbose format, often the quantity of mark-up can
easily exceed the data contained within it.

other formats such as JSON & csv have far less overhead, although again
not always suitable.


I've heard of JSON, but never done anything with it.

How does CSV handle hierarchical data? For instance, I have
generators[1], each of which has a name, a fuel and one or more
incremental heat rate curves. Each fuel has a name, UOM, heat content,
and price. Each incremental cost curve has a name, and a series of
ordered pairs (representing a piecewise linear curve).

Can CSV files model this sort of situation?


As in all such cases it is a matter of choosing the most appropriate tool
for the job in hand.


Naturally. That's what I'm exploring.


[1] The kind made of tons of iron and copper, filled with oil, and
rotating at 1800 rpm.

--
Michael F. Stemper
This sentence no verb.
--
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-21 Thread Joe Pfeiffer

Eli the Bearded <*@eli.users.panix.com> writes:

> In comp.lang.python, Michael F. Stemper  wrote:
>> I've heard of JSON, but never done anything with it.
>
> You probably have used it inadvertently on a regular basis over the
> past few years. Websites live on it.

If the user has any interaction whatever with the formats being used to
transfer data then something is very, very wrong.  Someone using a
website built on JSON isn't using JSON in any meaningful sense of the
term.

>> How does CSV handle hierarchical data? For instance, I have
>> generators[1], each of which has a name, a fuel and one or more
>> incremental heat rate curves. Each fuel has a name, UOM, heat content,
>> and price. Each incremental cost curve has a name, and a series of
>> ordered pairs (representing a piecewise linear curve).
>> 
>> Can CSV files model this sort of situation?
>
> Can a string of ones and zeros encode the sounds of Bach, the images
> of his sheet music, the details to reproduce his bust in melted plastic
> extruded from nozzle under the control of machines?
>
> Yes, CSV files can model that. But it would not be my first choice of
> data format. (Neither would JSON.) I'd probably use XML.
>
> I rather suspect that all (many) of those genomes that end up in
> Microsoft Excel files get there via a CSV export from a command line
> tool. Once you can model life in CSV, everything seems possible.

Whenever someone asks "can this be done?" in any sort of computer
related question, the real question is "is this practical?"  I have hazy
memories of seeing a Turing Machine implemented in an Excel spreadsheet,
so *anything* can, with sufficiently ridiculous amounts of work.  That's
not really helpful here.

>> [1] The kind made of tons of iron and copper, filled with oil, and
>> rotating at 1800 rpm.
>
> Those are rather hard to model in CSV, too, but I'm sure it could be
> done.

So let's try to point him at representations that are easy.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-21 Thread alister via Python-list

On Tue, 21 Sep 2021 13:12:10 -0500, Michael F. Stemper wrote:

> On the prolog thread, somebody posted a link to:
> 
> 
> One thing that it tangentially says is "XML is not the answer."
> 
> I read this page right when I was about to write an XML parser to get
> data into the code for a research project I'm working on.
> It seems to me that XML is the right approach for this sort of thing,
> especially since the data is hierarchical in nature.
> 
> Does the advice on that page mean that I should find some other way to
> get data into my programs, or does it refer to some kind of misuse/abuse
> of XML for something that it wasn't designed for?
> 
> If XML is not the way to package data, what is the recommended approach?

1st can I say don't write your own XML parser, there are already a
number of existing parsers that should do everything you will need.  This 
is a wheel that does not need re-inventing.

2nd if you are not generating the data then you have to use whatever data 
format you are supplied

as far as I can see the main issue with XML is bloat, it tries to do too 
many things & is a very verbose format, often the quantity of mark-up can 
easily exceed the data contained within it.

other formats such as JSON & csv have far less overhead, although again
not always suitable.

As in all such cases it is a matter of choosing the most appropriate tool
for the job in hand. 

-- 
Antonym, n.:
The opposite of the word you're trying to think of.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful

2021-09-21 Thread Jon Ribbens via Python-list

On 2021-09-21, Michael F. Stemper  wrote:
> On the prolog thread, somebody posted a link to:
>
>
> One thing that it tangentially says is "XML is not the answer."
>
> I read this page right when I was about to write an XML parser
> to get data into the code for a research project I'm working on.
> It seems to me that XML is the right approach for this sort of
> thing, especially since the data is hierarchical in nature.
>
> Does the advice on that page mean that I should find some other
> way to get data into my programs, or does it refer to some kind
> of misuse/abuse of XML for something that it wasn't designed
> for?
>
> If XML is not the way to package data, what is the recommended
> approach?

I'd agree that you should not use XML unless the data is being supplied
already in XML format or perhaps if there is already a schema defined in
XML for exactly your purpose.

If there is nothing pre-existing to build upon then I'd suggest JSON.

If anyone suggests YAML, then you should just back slowly away while
speaking in a low calm voice until you have reached sufficient safe
distance, then turn and run.
-- 
https://mail.python.org/mailman/listinfo/python-list
