Hi Python Tutor folks
This is a rather long post, but i wanted to include all the details &
everything i have tried so far myself, so please bear with me & read the
entire boringly long post.
Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
I am looking for a specific element..there ar
If you can assume a well formatted file I would just parse it linearly, should
be much faster. Read the file in as lines if the XML is already in human
readable form, or just read in blocks and append to a list and do a join() when
you have a whole match.
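A rough sketch of that linear approach, for what it's worth. It assumes the file really is well formed, that the wanted element is never nested inside itself, never self-closing, and that its name is not a prefix of another element name. The function name extract_blocks and the tag "item" are made up for illustration:

```python
def extract_blocks(lines, tag):
    """Yield each <tag>...</tag> block from an iterable of text lines.

    Plain string scanning, no XML parser. Assumptions (not checked):
    the element is never nested in itself, is never self-closing, and
    its name is not a prefix of another element name.
    """
    start, end = "<" + tag, "</" + tag + ">"
    buf = None                          # None means "not inside a block"
    for line in lines:
        while line:
            if buf is None:
                pos = line.find(start)
                if pos == -1:
                    break               # nothing of interest on this line
                buf = []
                line = line[pos:]       # drop text before the start tag
            pos = line.find(end)
            if pos == -1:
                buf.append(line)        # block continues on the next line
                break
            buf.append(line[:pos + len(end)])
            yield "".join(buf)          # join() once we have a whole match
            buf = None
            line = line[pos + len(end):]  # keep scanning the rest of the line
```

Passing an open file object as lines reads it one line at a time, so memory use stays at roughly one matched block, not the whole file.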
ashish makani wrote:
Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
I sympathize with you. I wonder who thought that building a 1GB XML file
was a good thing.
Forget about using any XML parser that reads the entire file into
memory. By the time that 1GB of text is read and pars
On Mon, Dec 20, 2010 at 4:19 PM, Steven D'Aprano wrote:
>> Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
>
> I sympathize with you. I wonder who thought that building a 1GB XML file was
> a good thing.
XML is like violence: if it isn't working, try more.
--
Brett Ritter / SwiftOne
Brett, that was very mischievous.
I wish I could help - am watching this thread with great curiosity, I could
learn something from it myself.
On Mon, Dec 20, 2010 at 11:40 PM, Brett Ritter wrote:
> On Mon, Dec 20, 2010 at 4:19 PM, Steven D'Aprano
> wrote:
> >> Goal : I am trying to parse a
Brett Ritter wrote:
On Mon, Dec 20, 2010 at 4:19 PM, Steven D'Aprano wrote:
Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
I sympathize with you. I wonder who thought that building a 1GB XML file was
a good thing.
XML is like violence: if it isn't working, try more.
I love it
On Mon, Dec 20, 2010 at 5:32 PM, Steven D'Aprano wrote:
>> XML is like violence: if it isn't working, try more.
>
> I love it -- may I quote you?
I can't claim credit for it, I originally saw it on some sigs on
Slashdot a few years ago. It certainly matches the majority of XML
usage I've enc
"ashish makani" wrote
I am looking for a specific element..there are several 10s/100s
occurrences
of that element in the 1gb file.
I need to detect them & then for each 1, i need to copy all the
content b/w
the element's start & end tags & create a smaller xml
This is exactly what sax and
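For the task quoted above (find each occurrence of one element and copy out everything between its start and end tags), a streaming parser keeps memory bounded. Below is a minimal sketch using iterparse from the standard library's xml.etree.ElementTree, which is one streaming option and not necessarily the one meant in this reply. The function name iter_elements and the element name passed in are hypothetical:

```python
import xml.etree.ElementTree as ET

def iter_elements(source, tag):
    """Yield each <tag> element as an XML string, one at a time.

    source is a filename or a binary file object. The tree is pruned
    after every match so the whole document is never held in memory.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)             # first event is the root's start tag
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            elem.tail = None            # do not copy text following the end tag
            yield ET.tostring(elem, encoding="unicode")
            root.clear()                # drop everything parsed so far
```

Each yielded string can then be written to its own smaller XML file. Note that memory is only released when a match is found, so a file with very long stretches between matches would need extra pruning.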
Thanks Luke, Steve, Brett, Lloyd & Alan
for your prompt responses & sharing your wisdom.
I <3 the python community... You(We ?) folks are AWESOME
I cross-posted this query on comp.lang.python
I bet most of you hang @ c.l.p too, but just in case, here is the link to
the discussion at c.l.p
https:/
This isn't XML, it's an abomination of XML. Best to not treat it as XML.
Good thing you're only after one class of tags. Here's what I'd do. I'll
give a general solution, but there are two parameters / four cases that could
make the code simpler, I'll just point them out at the end.
Iterat
Chris
This block of code made my day - especially yummydataaddrs & "here's your
stupid data"
> for start,end in yummydataaddrs:
>     fd.seek(start)
>     print "here's your stupid data:", fd.read(end-start+1)
Nothing is more impressive than solid code, with a good sense of humor.
Thanks for the
[note that this has also been posted to comp.lang.python and discussed
separately over there]
Steven D'Aprano, 20.12.2010 22:19:
ashish makani wrote:
Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
I sympathize with you. I wonder who thought that building a 1GB XML file
was a goo
On Tue, Dec 21, 2010 at 3:44 AM, Stefan Behnel wrote:
> [note that this has also been posted to comp.lang.python and discussed
> separately over there]
>
> Steven D'Aprano, 20.12.2010 22:19:
>>
>> ashish makani wrote:
>>
>>> Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
>>
>> I sympat
Chris Fuller, 21.12.2010 03:27:
This isn't XML, it's an abomination of XML. Best to not treat it as XML.
Good thing you're only after one class of tags. Here's what I'd do. I'll
give a general solution, but there are two parameters / four cases that could
make the code simpler, I'll just point
But then again, maybe it's too much of an optimization for someone not
optimizing for others or a specific application for the hardware, or
it's not part of the standard python library, and therefore,
expendable.
___
Tutor maillist - Tutor@python.org
To
On Tue, Dec 21, 2010 at 3:52 AM, Stefan Behnel wrote:
> Chris Fuller, 21.12.2010 03:27:
>>
>> This isn't XML, it's an abomination of XML. Best to not treat it as XML.
>> Good thing you're only after one class of tags. Here's what I'd do. I'll
>> give a general solution, but there are two parame
On Tue, Dec 21, 2010 at 3:55 AM, David Hutto wrote:
> On Tue, Dec 21, 2010 at 3:52 AM, Stefan Behnel wrote:
>> Chris Fuller, 21.12.2010 03:27:
>>>
>>> This isn't XML, it's an abomination of XML. Best to not treat it as XML.
>>> Good thing you're only after one class of tags. Here's what I'd do.
And from what I recall XML is intended for data transfer in respect to
HTML(from a recent brushup, nothing more), so not having used it, it
sure has been displayed as a data transfer mechanism, I remember this
from using Joomla's framework, and the xml files for menus I think.
David Hutto, 21.12.2010 09:49:
Steven D'Aprano, 20.12.2010 22:19:
ashish makani wrote:
Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
I sympathize with you. I wonder who thought that building a 1GB XML file
was a good thing.
http://gnosis.cx/publish/programming/xml_matters_29.
On Tue, Dec 21, 2010 at 3:59 AM, David Hutto wrote:
> And from what I recall XML is intended for data transfer in respect to
> HTML(from a recent brushup, nothing more),
Apologies that is browser based transfer, (not sure what more,
although I think it means any data transfer)
so not having used
David Hutto, 21.12.2010 09:55:
On Tue, Dec 21, 2010 at 3:52 AM, Stefan Behnel wrote:
Chris Fuller, 21.12.2010 03:27:
This isn't XML, it's an abomination of XML. Best to not treat it as XML.
Good thing you're only after one class of tags. Here's what I'd do. I'll
give a general solution, but
.
I sympathize with you. I wonder who thought that building a 1GB XML file
was a good thing.
If it is:
XML stands for eXtensible Markup Language.
XML is designed to transport and store data.
Then what other file medium would you suggest as the tagging means.
You have a file wit
On Tue, Dec 21, 2010 at 4:10 AM, Stefan Behnel wrote:
> David Hutto, 21.12.2010 09:55:
>>
>> On Tue, Dec 21, 2010 at 3:52 AM, Stefan Behnel wrote:
>>>
>>> Chris Fuller, 21.12.2010 03:27:
This isn't XML, it's an abomination of XML. Best to not treat it as
XML.
Good thing you're
On Tue, Dec 21, 2010 at 4:17 AM, David Hutto wrote:
> On Tue, Dec 21, 2010 at 4:10 AM, Stefan Behnel wrote:
>> David Hutto, 21.12.2010 09:55:
>>>
>>> On Tue, Dec 21, 2010 at 3:52 AM, Stefan Behnel wrote:
Chris Fuller, 21.12.2010 03:27:
>
> This isn't XML, it's an abomination of
Hi,
I wonder why you reply to my e-mail without replying to what I wrote in it.
David Hutto, 21.12.2010 10:12:
.
I sympathize with you. I wonder who thought that building a 1GB XML file
was a good thing.
This was written by Steven D'Aprano.
If it is:
XML stands for eXtensible Markup Lan
File = string
going through string code
finding pieces of the string and marking the territory.
I don't see 'real' optimization other than rolling your own.
David Hutto, 21.12.2010 10:19:
On Tue, Dec 21, 2010 at 4:17 AM, David Hutto wrote:
On Tue, Dec 21, 2010 at 4:10 AM, Stefan Behnel wrote:
Note that it's not unlikely that this is actually *slower* than using a
real XML parser:
Or a 'real' language like C or C++ maybe to increase, or in Python'
On Tue, Dec 21, 2010 at 4:28 AM, Stefan Behnel wrote:
> Hi,
>
> I wonder why you reply to my e-mail without replying to what I wrote in it.
>
>
> David Hutto, 21.12.2010 10:12:
>>
>> .
>>
>> I sympathize with you. I wonder who thought that building a 1GB XML
>> file
>> was a good t
David Hutto, 21.12.2010 10:29:
File = string
going through string code
finding pieces of the string and marking the territory.
I don't see 'real' optimization other than rolling your own.
Reads like a Haiku. Doesn't quite fit the verse, though.
From your behaviour, I get the impression tha
On Tue, Dec 21, 2010 at 4:34 AM, Stefan Behnel wrote:
> David Hutto, 21.12.2010 10:19:
>>
>> On Tue, Dec 21, 2010 at 4:17 AM, David Hutto wrote:
>>>
>>> On Tue, Dec 21, 2010 at 4:10 AM, Stefan Behnel wrote:
>>
>> Note that it's not unlikely that this is actually *slower* than using
>>
"David Hutto" wrote
And from what I recall XML is intended for data transfer in respect
to
HTML(from a recent brushup, nothing more),
Apologies that is browser based transfer,
I'm not sure what that last bit means.
XML is a self-describing data format. It is usually used for files
but can
"David Hutto" wrote
> Note that it's not unlikely that this is actually *slower* than
> using a real
> XML parser:
Or a 'real' language like C or C++ maybe to increase, or in Python's
case, bypass, the interpreter?
Most of the Python xml parsers are written in C - many use the
industry sta
On Tue, Dec 21, 2010 at 4:43 AM, Stefan Behnel wrote:
> David Hutto, 21.12.2010 10:29:
>>
>> File = string
A file is a string of characters encoded in its format
>>
>> going through string code
Code that goes through the file format and the encoding
>>
>> finding pieces of the string and marki
"David Hutto" wrote
I sympathize with you. I wonder who thought that building a 1GB XML
file
was a good thing.
that was just the first listing:
http://www.google.com/search?client=ubuntu&channel=fs&q=parsing+gigabyte+xml+python&ie=utf-8&oe=utf-8
Eeek! One of the listings says:
22 Jan
On Tue, Dec 21, 2010 at 4:46 AM, Alan Gauld wrote:
> "David Hutto" wrote
>
>>> And from what I recall XML is intended for data transfer in respect to
>>> HTML(from a recent brushup, nothing more),
>>
>> Apologies that is browser based transfer,
>
> I'm not sure what that last bit means.
> XML is
On Tue, Dec 21, 2010 at 4:49 AM, Alan Gauld wrote:
>
> "David Hutto" wrote
>
>> > Note that it's not unlikely that this is actually *slower* than > using
>> > a real
>> > XML parser:
>>
>> Or a 'real' language like C or C++ maybe to increase, or in Python's
>> case, bypass, the interpreter?
>
> M
"David Hutto" wrote
XML stands for eXtensible Markup Language.
XML is designed to transport and store data.
Then what other file medium would you suggest as the tagging means.
See my other post but there are many alternatives that are orders
of magnitude more efficient. XML is one of the mo
On Tue, Dec 21, 2010 at 4:58 AM, Alan Gauld wrote:
>
> "David Hutto" wrote
>
>>> I sympathize with you. I wonder who thought that building a 1GB XML file
>>> was a good thing.
>
>> that was just the first listing:
>>
>>
>> http://www.google.com/search?client=ubuntu&channel=fs&q=parsing+gigabyte+x
Alan Gauld, 21.12.2010 10:58:
"David Hutto" wrote
http://www.google.com/search?client=ubuntu&channel=fs&q=parsing+gigabyte+xml+python&ie=utf-8&oe=utf-8
Eeek! One of the listings says:
22 Jan 2009 ... Stripping Illegal Characters from XML in Python >>
... I'd be asking Python to process 6.4
Give me a little time to review this when it's not 5:30 in the morning
and I've been up since 9 am yesterday, and 'relearning' c++:)
But it still seems that you have coding + filetype +
charactersinfileinformat., one long string that has to be parsed by
the C functions.
Alan Gauld, 21.12.2010 10:46:
You don't have to use it for data transfer - eg MS's use
as a document storage format in Office - but frankly if
you use XML to store large volumes of data you are mad,
a database is a much more sensible option being far more
space efficient and faster to work with.
On Tue, Dec 21, 2010 at 5:19 AM, Stefan Behnel wrote:
> Alan Gauld, 21.12.2010 10:58:
>>
>> "David Hutto" wrote
>>>
>>>
>>> http://www.google.com/search?client=ubuntu&channel=fs&q=parsing+gigabyte+xml+python&ie=utf-8&oe=utf-8
>>
>> Eeek! One of the listings says:
>>
>>> 22 Jan 2009 ... Stripping I
"David Hutto" wrote
(*)ASN.1, IDL etc all rely on a shared definition, and
often shared code library, at both sender and receiver.
This I might have to work on, but I rely on experience to
quasi-trust
experience.
These are all data transport formats agreed and standardised
long before XM
"David Hutto" wrote
Somewhat of the fact that python uses C encourages me of that, but I
have still been looking into c++ to optimize, because I've used it
before, and the more languages I learn the more they feel 'similar',
but the same, if you can understand that!
Absolutely! That's why I
"David Hutto" wrote
That's what I'm saying above, that xml seems to be the hog in terms of
its user defined tags. Is that somewhat a confirmation of my hunch,
that it's the length of the user's predefined tags that add to the
above mess, and that maybe a lessened tag system in accordance with
xm
On Tue, Dec 21, 2010 at 5:35 AM, Alan Gauld wrote:
>
> "David Hutto" wrote
>
>> Somewhat of the fact that python uses C encourages me of that, but I
>> have still been looking into c++ to optimize, because I've used it
>> before, and the more languages I learn the more they feel 'similar',
>> but
David Hutto, 21.12.2010 11:29:
On Tue, Dec 21, 2010 at 5:19 AM, Stefan Behnel wrote:
Alan Gauld, 21.12.2010 10:58:
22 Jan 2009 ... Stripping Illegal Characters from XML in Python>>
... I'd be asking Python to process 6.4 gigabytes of CSV into
6.5 gigabytes of XML 1. . In fact, what happen
On Tuesday 21.12.2010 10:12:55 David Hutto wrote:
> Then what other file medium would you suggest as the tagging means.
One of those formats, that are specially designed for large amounts of data,
is HDF5. It is intended for numerical data, but you can store text as well.
There are multiple Pyth
On Tue, Dec 21, 2010 at 5:45 AM, Alan Gauld wrote:
>
> "David Hutto" wrote
>
>> That's what I'm saying above, that xml seems to be the hog in terms of
>> its user defined tags. Is that somewhat a confirmation of my hunch,
>> that it's the length of the user's predefined tags that add to the
>> abov
On Tue, Dec 21, 2010 at 5:49 AM, Stefan Behnel wrote:
> David Hutto, 21.12.2010 11:29:
>>
>> On Tue, Dec 21, 2010 at 5:19 AM, Stefan Behnel wrote:
>>>
>>> Alan Gauld, 21.12.2010 10:58:
>
> 22 Jan 2009 ... Stripping Illegal Characters from XML in Python>>
... I'd be asking Python
David Hutto, 21.12.2010 12:02:
On Tue, Dec 21, 2010 at 5:45 AM, Alan Gauld wrote:
8 bytes to describe an int which could be represented in
a single byte in binary (or even in CSV).
Well, "CSV" indicates that there's at least one separator character
involved, so make that an asymptotic 2 bytes
On Tue, Dec 21, 2010 at 6:19 AM, Stefan Behnel wrote:
> David Hutto, 21.12.2010 12:02:
>>
>> On Tue, Dec 21, 2010 at 5:45 AM, Alan Gauld wrote:
>>>
>>> 8 bytes to describe an int which could be represented in
>>> a single byte in binary (or even in CSV).
>
> Well, "CSV" indicates that there's at l
On Tue, Dec 21, 2010 at 6:41 AM, David Hutto wrote:
> On Tue, Dec 21, 2010 at 6:19 AM, Stefan Behnel wrote:
>> David Hutto, 21.12.2010 12:02:
>>>
>>> On Tue, Dec 21, 2010 at 5:45 AM, Alan Gauld wrote:
8 bytes to describe an int which could be represented in
a single byte in binary
David Hutto, 21.12.2010 12:45:
If file a.xml has simple tagged xml like, and file b.config has
tags that represent the a.xml(i.e. =) as greater tags,
does this pattern optimize the process by limiting the size of the
tags to be parsed in the xml, then converting those simpler tags that
are found
On Tue, Dec 21, 2010 at 6:59 AM, Stefan Behnel wrote:
> David Hutto, 21.12.2010 12:45:
>>>
>>> If file a.xml has simple tagged xml like, and file b.config has
>>> tags that represent the a.xml(i.e. =) as greater tags,
>>> does this pattern optimize the process by limiting the size of the
>>> tags
David Hutto, 21.12.2010 13:09:
On Tue, Dec 21, 2010 at 6:59 AM, Stefan Behnel wrote:
David Hutto, 21.12.2010 12:45:
If file a.xml has simple tagged xml like, and file b.config has
tags that represent the a.xml(i.e.=) as greater tags,
does this pattern optimize the process by limiting the s
Alan Gauld wrote:
XML is a self-describing data format. It is usually used for files
but can be used in data streams or in-memory strings.
Its natural competitors are TLV (Tag, Length, Value) and
CSV (Comma Separated Value) files but neither is as rich
in structure. Alternative options include AS
Stefan Behnel wrote:
David Hutto, 21.12.2010 10:29:
File = string
going through string code
finding pieces of the string and marking the territory.
I don't see 'real' optimization other than rolling your own.
Reads like a Haiku. Doesn't quite fit the verse, though.
From your behaviour, I
"Steven D'Aprano" wrote
Its natural competitors are TLV (Tag, Length, Value) and
CSV (Comma Separated Value) files but neither is as rich
I would have thought that both JSON and YAML are competitors to XML,
Totally agree but I excluded those on the basis that they weren't
around when XML was
"Stefan Behnel" wrote
And I thought a 1G file was extreme... Do these people stop to
think that
with XML as much as 80% of their "data" is just description (ie the
tags).
As I already said, it compresses well. In run-length compressed XML
files, the tags can easily take up a negligible amo
Establish that with fact that initially I didn't have a reason to
be hostile, and that your comment of my kubit kaba here, and your
comment on comp.python.lang about your pystats, after our
conversation, and your reference to it not being "set in stone",
wasn't a reference to our stats argume
Take a look at the flame wars individuals see, comments by programmers
who are sarcastic, and think of the response you might have had to the
initial questions you had , and maybe even a few paranoid delusions
you got hacked.
It's not a rewarding experience not being a college educated
individual
On Tue, Dec 21, 2010 at 9:32 AM, David Hutto wrote:
> Take a look at the flame wars individuals see, comments by programmers
> who are sarcastic, and think of the response you might have had to the
> initial questions you had , and maybe even a few paranoid delusions
> you got hacked.
>
> It's not
David Hutto wrote:
Establish that with fact that initially I didn't have a reason to
be hostile, and that your comment of my kubit kaba here, and your
comment on comp.python.lang about your pystats, after our
conversation, and your reference to it not being "set in stone",
wasn't a reference
And furthermore, I'm not the first, nor the last to get angry and
frustrated on the internet. I'm not the first to get drunk, and type.
And I dare any employer to deny me the right to MY personal time.
On Tue, Dec 21, 2010 at 9:36 AM, Steven D'Aprano wrote:
> David Hutto wrote:
>>
>> Establish that with fact that initially I didn't have a reason to
>> be hostile, and that your comment of my kubit kaba here, and your
>> comment on comp.python.lang about your pystats, after our
>> conversation
On Tue, Dec 21, 2010 at 9:40 AM, David Hutto wrote:
> On Tue, Dec 21, 2010 at 9:36 AM, Steven D'Aprano wrote:
>> David Hutto wrote:
>>>
>>> Establish that with fact that initially I didn't have a reason to
>>> be hostile, and that your comment of my kubit kaba here, and your
>>> comment on com
Me and you, apparently know exactly what i'm talking about...
http://code.activestate.com/lists/python-tutor/79293/
___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
you got nothing of real value.
And a lesson of what you really are to anyone listening.
Alan Gauld, 21.12.2010 15:11:
"Stefan Behnel" wrote
And I thought a 1G file was extreme... Do these people stop to think that
with XML as much as 80% of their "data" is just description (ie the tags).
As I already said, it compresses well. In run-length compressed XML
files, the tags can easil
On Tue, Dec 21, 2010 at 10:03 AM, Stefan Behnel wrote:
> Alan Gauld, 21.12.2010 15:11:
>>
>> "Stefan Behnel" wrote
And I thought a 1G file was extreme... Do these people stop to think
that
with XML as much as 80% of their "data" is just description (ie the
tags).
>>>
>>> A
David Hutto, 21.12.2010 16:11:
On Tue, Dec 21, 2010 at 10:03 AM, Stefan Behnel wrote:
I meant
uncompressing the data *while* parsing it. Just like you have to decode it
for parsing, it's just an additional step to decompress it before decoding.
Depending on the performance relation between I/O s
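To make "uncompressing while parsing" concrete, here is a small sketch under stated assumptions: the data is gzip-compressed (the thread does not say which compression is meant), and the parser is the standard library's iterparse. The function name count_elements_gz and the element name are hypothetical. The parser pulls bytes from the gzip stream as it needs them, so the uncompressed file never exists on disk:

```python
import gzip
import xml.etree.ElementTree as ET

def count_elements_gz(fileobj, tag):
    """Count <tag> elements in gzip-compressed XML, decompressing on the fly.

    fileobj is a binary file object holding the compressed bytes.
    """
    count = 0
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as stream:
        # iterparse calls stream.read(), which decompresses chunk by chunk
        for _, elem in ET.iterparse(stream):    # default: "end" events only
            if elem.tag == tag:
                count += 1
            elem.clear()                        # free the element's content
    return count
```

Whether this beats parsing the uncompressed file depends on how disk read speed compares with decompression speed on the machine in question, so it is worth timing both.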
"Stefan Behnel" wrote
But I don't understand how uncompressing a file before parsing it
can
be faster than parsing the original uncompressed file?
I didn't say "uncompressing a file *before* parsing it". I meant
uncompressing the data *while* parsing it.
Ah, ok that can work, although it
You're not going to win any friends here Dave. Steven is well known on this
list. He is sometimes abrasive but it's rarely if ever malicious. Anytime he's
ever been rude to me it was deserved. Like how I top post from my phone. Or
giving bad advice to newbies.
People are getting irritated becau
On 21 December 2010 14:11, Alan Gauld wrote:
> But I don't understand how uncompressing a file before parsing it can
> be faster than parsing the original uncompressed file?
>
Because of IO overhead/benefits. It's not so much that the parsing aspect
of it is faster of course (it is what it is),
On 21 December 2010 17:57, Alan Gauld wrote:
>
> "Stefan Behnel" wrote
>
> But I don't understand how uncompressing a file before parsing it can
>>> be faster than parsing the original uncompressed file?
>>>
>>
>> I didn't say "uncompressing a file *before* parsing it". I meant
>> uncompressing
On Tue, Dec 21, 2010 at 1:23 PM, Luke Paireepinart
wrote:
> You're not going to win any friends here Dave.
Wasn't trying to.
Steven is well known on this list.
And that means something to you only.
He is sometimes abrasive but it's rarely if ever malicious.
Anytime he's ever been rude to me
Walter Prins, 21.12.2010 22:13:
On 21 December 2010 17:57, Alan Gauld wrote:
"Stefan Behnel" wrote
But I don't understand how uncompressing a file before parsing it can
be faster than parsing the original uncompressed file?
I didn't say "uncompressing a file *before* parsing it". I meant
un