Re: Relaxed, or best-efforts JSON parser for Python?

2015-10-12 Thread victor . hooi
On Monday, October 12, 2015 at 10:02:13 PM UTC+11, Laura Creighton wrote:
> In a message of Sun, 11 Oct 2015 17:56:33 -0700, Victor Hooi writes:
> >Hi,
> >
> >I'm attempting to parse MongoDB loglines.
> >
> >The formatting of these loglines could best be described as JSON-like...
> >
> >For example - arrays 
> >
> >Anyhow, say I had the following logline snippet:
> >
> >{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { 
> > w: 2 } }, Collection: { acquireCount: { w: 1 } }, oplog: { acquireCount: { 
> > w: 1 } } }
> >
> >This won't parse with json.loads() - the main issue is the missing 
> >quotation marks (") around the strings.
> >
> >My question, is there a more lenient, or relaxed JSON parser available for 
> >Python, that will try to do a best-efforts parsing of non-spec JSON?
> >
> >Cheers,
> >Victor
> >-- 
> >https://mail.python.org/mailman/listinfo/python-list
> 
> Won't this 
> http://blog.mongodb.org/post/85123256973/introducing-mtools
> https://github.com/rueckstiess/mtools
> https://pypi.python.org/pypi/mtools/1.1.3
> 
> be better? :)

Hi,

@MRAB - Thanks for the tip. I did actually think of doing that as well - it's 
what we (MongoDB) do internally for a few of our tools, but was really hoping 
to avoid going down the regex route. However, this is what I'm doing for now:

locks = re.sub(r"(\w+):", r'"\g<1>":', locks)
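
(For completeness, a minimal sketch of how that workaround hangs together - the sample locks document below is made up, and this only papers over the unquoted keys, not truncated lines or custom types:)

import json
import re

locks = "{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } } }"
# Quote every bare key before handing the string to the standard parser.
quoted = re.sub(r"(\w+):", r'"\g<1>":', locks)
parsed = json.loads(quoted)
print(parsed['Global']['acquireCount']['r'])  # 2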

@Random832 - No, it's not YAML. The MongoDB log format is sort of JSON, but 
not. IMHO, it's a bit of an ugly mess. So things like string fields aren't 
quoted, you have random custom types, parentheses aren't necessarily balanced 
(e.g. if you have long loglines that get truncated at 10K characters etc.). I 
could go on.

@Laura Creighton - Yup, mtools is actually written by a colleague of mine =). 
Awesome guy. He does a lot of stuff to work around the idiosyncrasies of the 
MongoDB log format. However, there's quite a bit of overhead to using the full 
module for this - for this use case, I just needed to parse a specific "locks" 
document from a logline, so I was hoping for a clean way to just take it and 
parse it - in this case, the only issue that could hit us (AFAIK) is the lack 
of quotes around string fields. If they ever introduced a field with spaces in 
it, I don't know what would happen, lol.

-- 
https://mail.python.org/mailman/listinfo/python-list


Relaxed, or best-efforts JSON parser for Python?

2015-10-11 Thread Victor Hooi
Hi,

I'm attempting to parse MongoDB loglines.

The formatting of these loglines could best be described as JSON-like...

For example - arrays 

Anyhow, say I had the following logline snippet:

{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 
2 } }, Collection: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } 
} }

This won't parse with json.loads() - the main issue is the missing quotation 
marks (") around the strings.

My question, is there a more lenient, or relaxed JSON parser available for 
Python, that will try to do a best-efforts parsing of non-spec JSON?
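
(For reference, one candidate I'm looking at is the third-party json5 package on PyPI, which accepts unquoted object keys - a quick sketch, untested against real loglines:)

import json5  # third-party: pip install json5

snippet = '{ Global: { acquireCount: { r: 2, w: 2 } }, oplog: { acquireCount: { w: 1 } } }'
# JSON5 permits unquoted identifier keys, so this parses where json.loads() fails.
data = json5.loads(snippet)
print(data['Global']['acquireCount']['w'])  # 2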

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Reading in large logfiles, and processing lines in batches - maximising throughput?

2015-09-16 Thread Victor Hooi
I'm using Python to parse metrics out of logfiles.

The logfiles are fairly large (multiple GBs), so I'm keen to do this in a 
reasonably performant way.

The metrics are being sent to an InfluxDB database - so it's better if I can 
batch multiple metrics together, rather than sending them individually.

Currently, I'm using the grouper() recipe from the itertools documentation to 
process multiples lines in "chunks" - I then send the collected points to the 
database:

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

with open(args.input_file, 'r') as f:
    line_counter = 0
    for chunk in grouper(f, args.batch_size):
        json_points = []
        for line in chunk:
            line_counter += 1
            # Do some processing
            json_points.append(some_metrics)
        if json_points:
            write_points(logger, client, json_points, line_counter)

However, not every line will produce metrics - so I'm batching on the number of 
input lines, rather than on the items I send to the database.

My question is, would it make sense to simply have a json_points list that 
accumulates metrics, check its size each iteration, and send the points off 
when it reaches a certain size? E.g.:

BATCH_SIZE = 1000

with open(args.input_file, 'r') as f:
    json_points = []
    for line_number, line in enumerate(f):
        # Do some processing
        json_points.append(some_metrics)
        if len(json_points) >= BATCH_SIZE:
            write_points(logger, client, json_points, line_number)
            json_points = []

Also, I originally used grouper because I thought it better to process lines in 
batches, rather than individually. However, is there actually any throughput 
advantage to doing it this way in Python? Or is there a better way to get 
higher throughput?

We can assume for now that the CPU load of the processing is fairly light 
(mainly string splitting, and date parsing).
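
(A rough timing harness I could use to compare the two approaches - process_line() and 'big.log' are placeholders for the real parsing and input file:)

import time
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

def process_line(line):
    # Placeholder for the real string splitting / date parsing.
    return len(line)

def run_chunked(path, batch_size=1000):
    with open(path) as f:
        for chunk in grouper(f, batch_size):
            for line in chunk:
                if line is not None:
                    process_line(line)

def run_plain(path):
    with open(path) as f:
        for line in f:
            process_line(line)

for func in (run_chunked, run_plain):
    start = time.perf_counter()
    func('big.log')
    print(func.__name__, time.perf_counter() - start)
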
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Using enumerate to get line-numbers with itertools grouper?

2015-09-13 Thread Victor Hooi
On Thursday, 3 September 2015 03:49:05 UTC+10, Terry Reedy  wrote:
> On 9/2/2015 6:04 AM, Victor Hooi wrote:
> > I'm using grouper() to iterate over a textfile in groups of lines:
> >
> > def grouper(iterable, n, fillvalue=None):
> >  "Collect data into fixed-length chunks or blocks"
> >  # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
> >  args = [iter(iterable)] * n
> >  return zip_longest(fillvalue=fillvalue, *args)
> >
> > However, I'd also like to know the line-number that I'm up to, for printing 
> > out in informational or error messages.
> >
> > Is there a way to use enumerate with grouper to achieve this?
> 
> Without a runnable test example, it is hard to be sure what you want. 
> However, I believe replacing 'iter(iterable)' with 'enumerate(iterable, 
> 1)', and taking into account that you will get (line_number, line) 
> tuples instead of lines, will do what you want.
> 
> -- 
> Terry Jan Reedy

Hi,

Hmm,  I've tried that suggestion, but for some reason, it doesn't seem to be 
unpacking the values correctly - in this case, line_number and chunk below just 
give me two successive items from the iterable:

Below is the complete code I'm running:

#!/usr/bin/env python3
from datetime import datetime
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [enumerate(iterable, 1)] * n
    return zip_longest(fillvalue=fillvalue, *args)


def parse_iostat(lines):
    """Parse lines of iostat information, yielding iostat blocks.

    lines should be an iterable yielding separate lines of output
    """
    block = None
    for line in lines:
        line = line.strip()
        try:
            if ' AM' in line or ' PM' in line:  # What happens if their device names have AM or PM?
                tm = datetime.strptime(line, "%m/%d/%Y %I:%M:%S %p")
            else:
                tm = datetime.strptime(line, "%m/%d/%y %H:%M:%S")
            if block: yield block
            block = [tm]
        except ValueError:
            # It's not a new timestamp, so add it to the existing block.
            # We ignore the iostat startup lines (which deals with random
            # restarts of iostat), as well as empty lines.
            if '_x86_64_' not in line:
                block.append(line)
    if block: yield block

with open('iostat_sample_12hr_time', 'r') as f:
    f.__next__()  # Skip the "Linux..." line
    f.__next__()  # Skip the blank line
    for line_number, chunk in grouper(parse_iostat(f), 2):
        print("Line Number: {}".format(line_number))
        print("Chunk: {}".format(chunk))


Here is the input file:

Linux 3.19.0-20-generic (ip-172-31-12-169)  06/25/2015  _x86_64_
(2 CPU)

06/25/2015 07:37:04 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   0.020.000.020.000.00   99.95

Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
xvdap10.00 0.040.030.07 0.00 0.0084.96  
   0.00   30.362.74   42.83   0.53   0.01
xvdb  0.00 0.000.000.00 0.00 0.0011.62  
   0.000.230.192.13   0.16   0.00
xvdf  0.00 0.000.000.00 0.00 0.0010.29  
   0.000.410.410.73   0.38   0.00
xvdg  0.00 0.000.000.00 0.00 0.00 9.12  
   0.000.360.351.20   0.34   0.00
xvdh  0.00 0.000.000.00 0.00 0.0033.35  
   0.001.390.418.91   0.39   0.00
dm-0  0.00 0.000.000.00 0.00 0.0011.66  
   0.000.460.460.00   0.37   0.00

06/25/2015 07:37:05 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   0.500.000.500.000.00   99.01

Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
xvdap10.00 0.000.000.00 0.00 0.00 0.00  
   0.000.000.000.00   0.00   0.00
xvdb  0.00 0.000.000.00 0.00 0.00 0.00  
   0.000.000.000.00   0.00   0.00
xvdf  0.00 0.000.000.00 0.00 0.00 0.00  
   0.000.000.000.00   0.00

Accumulating points in batch for sending off

2015-09-04 Thread Victor Hooi
Hi,

I'm using Python to parse out  metrics from logfiles, and ship them off to a 
database called InfluxDB, using their Python driver 
(https://github.com/influxdb/influxdb-python).

With InfluxDB, it's more efficient if you pack in more points into each message.

Hence, I'm using the grouper() recipe from the itertools documentation 
(https://docs.python.org/3.6/library/itertools.html), to process the data in 
chunks, and then shipping off the points at the end of each chunk:

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

for chunk in grouper(parse_iostat(f), 500):
    json_points = []
    for block in chunk:
        if block:
            try:
                for i, line in enumerate(block):
                    # DO SOME STUFF
            except ValueError as e:
                print("Bad output seen - skipping")
    client.write_points(json_points)
    print("Wrote in {} points to InfluxDB".format(len(json_points)))


However, for some parsers, not every line will yield a datapoint.

I'm wondering if, rather than trying to chunk the input, it might be better to 
just call len() on the points list each time, and send the points off whenever 
it reaches the batch size. E.g.:

#!/usr/bin/env python3

json_points = []
_BATCH_SIZE = 2

for line_number, line in enumerate(open('blah.txt', 'r')):
    if 'cat' in line:
        print('Found cat on line {}'.format(line_number + 1))
        json_points.append(line_number)
        print("json_points contains {} points".format(len(json_points)))
        if len(json_points) >= _BATCH_SIZE:
            # print("json_points contains {} points".format(len(json_points)))
            print('Sending off points!')
            json_points = []

print("Loop finished. json_points contains {} points".format(len(json_points)))
print('Sending off points!')

Does the above seem reasonable? Any issues you see? Or are there any other more 
efficient approaches to doing this?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Using enumerate to get line-numbers with itertools grouper?

2015-09-02 Thread Victor Hooi
Hi Peter,

Hmm, are you sure that will work?

The indexes returned by enumerate will start from zero.

Also, I've realised line_number is a bit of a misnomer here - it's actually the 
index for the chunks that grouper() is returning.

So say I had a 10-line textfile, and I was using a _BATCH_SIZE of 50.

If I do:

print(line_number * _BATCH_SIZE)

I'd just get (0 * 50) = 0 printed out 10 times.

Even if I add one:

print((line_number + 1) * _BATCH_SIZE)

I will just get 50 printed out 10 times.

My understanding is that the file handle f is being passed to grouper, which is 
then passing another iterable to enumerate - I'm just not sure of the best way 
to get the line numbers from the original iterable f, and pass this through the 
chain?

On Wednesday, 2 September 2015 20:37:01 UTC+10, Peter Otten  wrote:
> Victor Hooi wrote:
> 
> > I'm using grouper() to iterate over a textfile in groups of lines:
> > 
> > def grouper(iterable, n, fillvalue=None):
> > "Collect data into fixed-length chunks or blocks"
> > # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
> > args = [iter(iterable)] * n
> > return zip_longest(fillvalue=fillvalue, *args)
> > 
> > However, I'd also like to know the line-number that I'm up to, for
> > printing out in informational or error messages.
> > 
> > Is there a way to use enumerate with grouper to achieve this?
> > 
> > The below won't work, as enumerate will give me the index of the group,
> > rather than of the lines themselves:
> > 
> > _BATCH_SIZE = 50
> > 
> > with open(args.input_file, 'r') as f:
> >     for line_number, chunk in enumerate(grouper(f, _BATCH_SIZE)):
> >         print(line_number)
> > 
> > I'm thinking I could do something to modify grouper, maybe, but I'm sure
> > there's an easier way?
> 
> print(line_number * _BATCH_SIZE)
> 
> Eureka ;)
-- 
https://mail.python.org/mailman/listinfo/python-list


for loop over function that returns a tuple?

2015-09-02 Thread Victor Hooi
I have a function which is meant to return a tuple:

def get_metrics(server_status_json, metrics_to_extract, line_number):

    return ((timestamp, "serverstatus", values, tags))

I also have:

def create_point(timestamp, metric_name, values, tags):
    return {
        "measurement": _MEASUREMENT_PREFIX + metric_name,
        "tags": tags,
        "time": timestamp,
        "fields": values
    }

I am calling get_metrics in a for loop like so:

for metric_data in get_metrics(server_status_json, mmapv1_metrics, line_number):
    json_points.append(create_point(*metric_data))

I was hoping to use tuple unpacking to pass metric_data straight from 
get_metrics through to create_point.

However, in this case, metric_data only contains timestamp.

I suppose I could assign multiple variables like this, and pass them through:

for timestamp, metric_name, value, tags in get_metrics(server_status_json, common_metrics, line_number):

However, that seems unnecessarily verbose, and I'm sure there's a simple way to 
do this with tuple unpacking?
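
(For the record, a minimal sketch of what I think I'm after - since get_metrics() returns a single tuple, there's nothing to loop over, and the result can be unpacked straight into create_point():)

# get_metrics() returns one tuple, so call it once and splat the result:
metric_data = get_metrics(server_status_json, mmapv1_metrics, line_number)
json_points.append(create_point(*metric_data))

# or, equivalently, in one line:
json_points.append(create_point(*get_metrics(server_status_json, mmapv1_metrics, line_number)))
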
-- 
https://mail.python.org/mailman/listinfo/python-list


Using enumerate to get line-numbers with itertools grouper?

2015-09-02 Thread Victor Hooi
I'm using grouper() to iterate over a textfile in groups of lines:

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

However, I'd also like to know the line-number that I'm up to, for printing out 
in informational or error messages.

Is there a way to use enumerate with grouper to achieve this?

The below won't work, as enumerate will give me the index of the group, rather 
than of the lines themselves:

_BATCH_SIZE = 50

with open(args.input_file, 'r') as f:
    for line_number, chunk in enumerate(grouper(f, _BATCH_SIZE)):
        print(line_number)

I'm thinking I could do something to modify grouper, maybe, but I'm sure 
there's an easier way?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Casting to a "number" (both int and float)?

2015-08-28 Thread Victor Hooi
Hi,

Thanks heaps to everybody for their suggestions/advice =).

Currently I'm using this:

def strip_floatApprox_wrapping(field):
    # Extracts an integer value from a field. Workaround for the float_approx wrapping.
    if isinstance(field, dict):
        return field['floatApprox']
    else:
        return field

I was a little hesitant to go down that path (using isinstance()) since it 
seems a bit "un-Pythonic" but it seems to do what I want in a minimal amount of 
code.

Somebody suggested going through and transforming the whole output from 
json.loads() in a single pass - I don't actually need *all* the fields, just a 
fair number of them - is there a particularly strong reason why stripping out 
the floatApprox wrappers first is a better approach?
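
(For concreteness, a rough sketch of what that single-pass transform might look like - strip_float_approx is a made-up name, and it assumes the wrapper dicts only ever contain the single 'floatApprox' key:)

import json

def strip_float_approx(obj):
    # Recursively replace {'floatApprox': value} wrappers with the bare value.
    if isinstance(obj, dict):
        if set(obj) == {'floatApprox'}:
            return obj['floatApprox']
        return {key: strip_float_approx(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [strip_float_approx(value) for value in obj]
    return obj

doc = json.loads('{"pid": {"floatApprox": 18403}, "network": {"bytesIn": 123123}}')
print(strip_float_approx(doc))  # {'pid': 18403, 'network': {'bytesIn': 123123}}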

Also, to answer somebody's question - yes, this is MongoDB, specifically the 
output from db.serverStatus(). The logging and reporting consistency in MongoDB 
is...quirky, shall we say.

Cheers,
Victor

On Friday, 28 August 2015 16:15:21 UTC+10, Jussi Piitulainen  wrote:
> Ben Finney writes:
> 
> > Victor Hooi writes:
> [- -]
> >> For example:
> >>
> >> {
> >> "hostname": "example.com",
> >> "version": "3.0.5",
> >> "pid": {
> >> "floatApprox": 18403
> >> }
> >> "network": {
> >> "bytesIn": 123123,
> >> "bytesOut": {
> >> "floatApprox": 213123123
> >> }
> >> }
> 
> [- -]
> 
> > In JSON there is no distinction at all, the only numeric type is
> > 'float'. What information is there in the input that can be used to
> > know which values should result in an 'int' instance, versus values
> > that should result in a 'float' instance?
> 
> I seem to get ints in the example data.
> 
> >>> json.load(io.StringIO('{"floatApprox":31213}'))
> {'floatApprox': 31213}
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Casting to a "number" (both int and float)?

2015-08-27 Thread Victor Hooi
Actually, I've just realised, if I just test for numeric or try to cast to 
ints, this will break for string fields.

As in, the intention is to call strip_floatApprox_wrapping on all the fields I'm 
parsing, and have it deal with the floatApprox dict wrapping, whether the 
contents are numbers or strings (although strings would not be wrapped in 
floatApprox).

On Friday, 28 August 2015 14:58:01 UTC+10, Victor Hooi  wrote:
> I'm reading JSON output from an input file, and extracting values.
> 
> Many of the fields are meant to be numerical, however, some fields are 
> wrapped in a "floatApprox" dict, which messed with my parsing.
> 
> For example:
> 
> {
>     "hostname": "example.com",
>     "version": "3.0.5",
>     "pid": {
>         "floatApprox": 18403
>     },
>     "network": {
>         "bytesIn": 123123,
>         "bytesOut": {
>             "floatApprox": 213123123
>         }
>     }
> }
> 
> The floatApprox wrapping appears to happen sporadically in the input.
> 
> I'd like to find a way to deal with this robustly.
> 
> For example, I have the following function:
> 
> def strip_floatApprox_wrapping(field):
>     # Extracts an integer value from a field. Workaround for the float_approx wrapping.
>     try:
>         return int(field)
>     except TypeError:
>         return int(field['floatApprox'])
> 
> which I can then call on each field I want to extract.
> 
> However, this relies on casting to int, which will only work for ints - for 
> some fields, they may actually be floats, and I'd like to preserve that if 
> possible.
> 
> (I know there's a isnumber() field - but you can only call that on a string - 
> so if I do hit a floatApprox field, it will trigger a AttributeError 
> exception, which seems a bit clunky to handle).
> 
> def strip_floatApprox_wrapping(field):
>     # Extracts an integer value from a field. Workaround for the float_approx wrapping.
>     try:
>         if field.isnumeric():
>             return field
>     except AttributeError:
>         return field['floatApprox']
> 
> Is there a way to re-write strip_floatApprox_wrapping to handle both 
> ints/floats, and preserve the original format?
> 
> Or is there a more elegant way to deal with the arbitrary nesting with 
> floatApprox?

-- 
https://mail.python.org/mailman/listinfo/python-list


Casting to a "number" (both int and float)?

2015-08-27 Thread Victor Hooi
I'm reading JSON output from an input file, and extracting values.

Many of the fields are meant to be numerical, however, some fields are wrapped 
in a "floatApprox" dict, which messed with my parsing.

For example:

{
    "hostname": "example.com",
    "version": "3.0.5",
    "pid": {
        "floatApprox": 18403
    },
    "network": {
        "bytesIn": 123123,
        "bytesOut": {
            "floatApprox": 213123123
        }
    }
}

The floatApprox wrapping appears to happen sporadically in the input.

I'd like to find a way to deal with this robustly.

For example, I have the following function:

def strip_floatApprox_wrapping(field):
    # Extracts an integer value from a field. Workaround for the float_approx wrapping.
    try:
        return int(field)
    except TypeError:
        return int(field['floatApprox'])

which I can then call on each field I want to extract.

However, this relies on casting to int, which will only work for ints - for 
some fields, they may actually be floats, and I'd like to preserve that if 
possible.

(I know there's an isnumeric() method - but you can only call that on a string - 
so if I do hit a floatApprox field, it will trigger an AttributeError exception, 
which seems a bit clunky to handle).

def strip_floatApprox_wrapping(field):
    # Extracts an integer value from a field. Workaround for the float_approx wrapping.
    try:
        if field.isnumeric():
            return field
    except AttributeError:
        return field['floatApprox']

Is there a way to re-write strip_floatApprox_wrapping to handle both 
ints/floats, and preserve the original format?

Or is there a more elegant way to deal with the arbitrary nesting with 
floatApprox?
-- 
https://mail.python.org/mailman/listinfo/python-list


Storing dictionary locations as a string and using eval - alternatives?

2015-08-19 Thread Victor Hooi
Hi,

I have a textfile with a bunch of JSON objects, one per line.

I'm looking at parsing each of these, and extract some metrics from each line.

I have a dict called "metrics_to_extract", containing the metrics I'm looking 
at extracting. In this, I store a name used to identify the metric, along with 
the location in the parsed JSON object.

Below is my code:

>>>
metrics_to_extract = {
    'current_connections': "server_status_json['connections']['current']",
    'resident_memory': "server_status_json['mem']['resident']"
}


def add_point(name, value, timestamp, tags):
    return {
        "measurement": name,
        "tags": tags,
        # "time": timestamp.isoformat(),
        "time": timestamp,
        "fields": {
            "value": float(value)
        }
    }

with open(input_file, 'r') as f:
    json_points = []
    for line in f:
        if line.startswith("{"):
            server_status_json = json.loads(line)
            # pp.pprint(server_status_json)
            # import ipdb; ipdb.set_trace()
            timestamp = server_status_json['localTime']
            tags = {
                'project': project,
                'hostname': server_status_json['host'],
                'version': server_status_json['version'],
                'storage_engine': server_status_json['storageEngine']['name']
            }

            for key, value in metrics_to_extract.items():
                json_points.append(add_point(key, eval(value), timestamp, tags))
            # client.write_points(json_points)
        else:
            print("non matching line")
>>>

My question is - I'm using "eval" in the above, with the nested location (e.g. 
"server_status_json['mem']['resident']") stored as a string.

I get the feeling this isn't particularly idiomatic or a great way of doing it 
- and would be keen to hear alternative suggestions?
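
(One alternative I'm considering - store each location as a tuple of keys instead of a code string, and walk the parsed dict with a small helper; dig() is just a made-up name, and this slots into the loop above in place of eval():)

from functools import reduce

metrics_to_extract = {
    'current_connections': ('connections', 'current'),
    'resident_memory': ('mem', 'resident'),
}

def dig(document, path):
    # Follow a tuple of keys through a nested dict, e.g. ('mem', 'resident').
    return reduce(lambda node, key: node[key], path, document)

for key, path in metrics_to_extract.items():
    json_points.append(add_point(key, dig(server_status_json, path), timestamp, tags))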

Thanks,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Split on multiple delimiters, and also treat consecutive delimiters as a single delimiter?

2015-07-28 Thread Victor Hooi
On Tuesday, 28 July 2015 23:59:11 UTC+10, m  wrote:
> W dniu 28.07.2015 o 15:55, Victor Hooi pisze:
> > I know the regex library also has a split, unfortunately, that does not 
> > collapse consecutive whitespace:
> > 
> > In [19]: re.split(' |', f)
> 
> Try ' *\|'
> 
> p. m.

Hmm, that seems to be getting closer (it returns a four-element list):

In [23]: re.split(' *\|', f)
Out[23]:
['14 *0330 *0 760   411',
 '0   0   770g  1544g   117g   1414 computedshopcartdb:103.5%  0
  30',
 '0 0',
 '119m97m  1538 ComputedCartRS  PRI   09:40:26']
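
(Following that hint, splitting on runs of whitespace and/or pipes together seems to give what I originally wanted - a quick sketch on a shortened, made-up line:)

import re

line = "14  *0  330|0   0   770g  1544g|1  19m  97m  PRI  09:40:26"
# Treat any run of whitespace and/or pipe characters as a single delimiter.
print(re.split(r'[\s|]+', line))
# ['14', '*0', '330', '0', '0', '770g', '1544g', '1', '19m', '97m', 'PRI', '09:40:26']
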
-- 
https://mail.python.org/mailman/listinfo/python-list


Split on multiple delimiters, and also treat consecutive delimiters as a single delimiter?

2015-07-28 Thread Victor Hooi
I have a line that looks like this:

14 *0330 *0 760   411|0   0   770g  1544g   117g   1414 
computedshopcartdb:103.5%  0  30|0 0|119m97m  1538 
ComputedCartRS  PRI   09:40:26

I'd like to split this line on multiple separators - in this case, consecutive 
whitespace, as well as the pipe symbol (|).

If I run .split() on the line, it will split on consecutive whitespace:

In [17]: f.split()
Out[17]:
['14',
 '*0',
 '330',
 '*0',
 '760',
 '411|0',
 '0',
 '770g',
 '1544g',
 '117g',
 '1414',
 'computedshopcartdb:103.5%',
 '0',
 '30|0',
 '0|1',
 '19m',
 '97m',
 '1538',
 'ComputedCartRS',
 'PRI',
 '09:40:26']

If I try to run .split(' |'), however, I get:

f.split(' |')
Out[18]: ['14 *0330 *0 760   411|0   0   770g  1544g   
117g   1414 computedshopcartdb:103.5%  0  30|0 0|119m
97m  1538 ComputedCartRS  PRI   09:40:26']

I know the re module also has a split; unfortunately, that does not collapse 
consecutive whitespace:

In [19]: re.split(' |', f)
Out[19]:
['',
 '',
 '',
 '',
 '14',
 '',
 '',
 '',
 '',
 '*0',
 '',
 '',
 '',
 '330',
 '',
 '',
 '',
 '',
 '*0',
 '',
 '',
 '',
 '',
 '760',
 '',
 '',
 '411|0',
 '',
 '',
 '',
 '',
 '',
 '',
 '0',
 '',
 '',
 '770g',
 '',
 '1544g',
 '',
 '',
 '117g',
 '',
 '',
 '1414',
 'computedshopcartdb:103.5%',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '0',
 '',
 '',
 '',
 '',
 '',
 '30|0',
 '',
 '',
 '',
 '',
 '0|1',
 '',
 '',
 '',
 '19m',
 '',
 '',
 '',
 '97m',
 '',
 '1538',
 'ComputedCartRS',
 '',
 'PRI',
 '',
 '',
 '09:40:26']

Is there an easy way to split on multiple characters, and also treat 
consecutive delimiters as a single delimiter?
-- 
https://mail.python.org/mailman/listinfo/python-list


Creating JSON from iostat lines ; Adding timezone information to naive datetimes?

2015-07-02 Thread Victor Hooi
I just want to run some things past you guys, to make sure I'm doing it right.

I'm using Python to parse disk metrics out of iostat output. The device lines 
look like this:

Device: rrqm/s   wrqm/s r/s w/s   rsec/s   wsec/s avgrq-sz 
avgqu-sz   await  svctm  %util
sda   0.00 0.000.000.00 0.00 0.00 0.00  
   0.000.00   0.00   0.00

My goal is JSON output for each metric that looks like the below (this is for 
InfluxDB):

{
    "measurement": "read_requests",
    "tags": {
        "project": "SOME_PROJECT",
        "hostname": "server1.newyork.com",
    },
    "time": timestamp.isoformat(),
    "fields": {
        "value": 0.00
    }
}

To create the above, I am using the following code:

disk_stat_headers = ['device', 'read_requests_merged', 'write_requests_merged',
                     'read_requests', 'write_requests', 'read_sectors',
                     'write_sectors', 'average_request_size',
                     'average_queue_length', 'average_wait',
                     'average_service_time', 'utilisation']
..
elif i >= 5 and line:
    disk_stats = {}
    device = line.split()[0]
    disk_stats[device] = dict(zip(disk_stat_headers, line.split()[1:]))

    json_points = []
    for disk_name, metrics in disk_stats.items():
        print(disk_name)
        print(metrics)
        for key, value in metrics.items():
            json_points.append({
                "measurement": key,
                "tags": {
                    "project": project,
                    "hostname": hostname,
                },
                "time": timestamp.isoformat(),
                "fields": {
                    "value": value
                }
            })


Is there any issue with the above? Or can you see a better way to do this?

(I'm calling split() twice, not sure if that's worse than storing it in a 
variable)

Second question - the timestamps in iostat are timezone-naive. I'm using the 
below to add the right timezone (EDT in this case) to them, which depends on 
the pytz library:

  from pytz import timezone
  eastern = timezone('US/Eastern')
  timestamp = eastern.localize(line)

Is the above an appropriate way of doing this? Or is there an easier way just 
using the Python stdlib's?
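
(A minimal sketch of how I'm wiring that up, assuming the naive datetime has already been parsed out of the timestamp line with strptime:)

from datetime import datetime
from pytz import timezone

eastern = timezone('US/Eastern')

line = '06/25/2015 07:37:04 AM'
naive = datetime.strptime(line, '%m/%d/%Y %I:%M:%S %p')
# localize() attaches the zone (including DST rules) to a naive datetime.
timestamp = eastern.localize(naive)
print(timestamp.isoformat())  # 2015-06-25T07:37:04-04:00
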
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Parsing logfile with multi-line loglines, separated by timestamp?

2015-06-30 Thread Victor Hooi
Aha, cool, that's a good idea =) - it seems I should spend some time getting to 
know generators/iterators.

Also, sorry if this is basic, but once I have the "block" list itself, what is 
the best way to parse each relevant line?

In this case, the first line is a timestamp, the next two lines are system 
stats, and then a newline, and then one line for each block device.

I could just hardcode in the lines, but that seems ugly:

  for block in parse_iostat(f):
  for i, line in enumerate(block):
  if i == 0:
  print("timestamp is {}".format(line))
  elif i == 1 or i == 2:
  print("system stats: {}".format(line))
  elif i >= 4:
  print("disk stats: {}".format(line))

Is there a prettier or more Pythonic way of doing this?
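
(One tidier option I'm toying with, given the layout of each block is fixed - timestamp first, two avg-cpu lines, then the device table - is extended unpacking; just a sketch:)

for block in parse_iostat(f):
    timestamp, cpu_header, cpu_values, *rest = block
    # Everything after the avg-cpu lines is the device table (header plus one line per device).
    device_lines = [line for line in rest if line and not line.startswith('Device:')]
    print("timestamp is {}".format(timestamp))
    print("system stats: {} / {}".format(cpu_header, cpu_values))
    for device in device_lines:
        print("disk stats: {}".format(device))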

Thanks,
Victor

On Wednesday, 1 July 2015 02:03:01 UTC+10, Chris Angelico  wrote:
> On Wed, Jul 1, 2015 at 1:47 AM, Skip Montanaro  
> wrote:
> > Maybe define a class which wraps a file-like object. Its next() method (or
> > is it __next__() method?) can just buffer up lines starting with one which
> > successfully parses as a timestamp, accumulates all the rest, until a blank
> > line or EOF is seen, then return that, either as a list of strings, one
> > massive string, or some higher level representation (presumably an instance
> > of another class) which represents one "paragraph" of iostat output.
> 
> next() in Py2, __next__() in Py3. But I'd do it, instead, as a
> generator - that takes care of all the details, and you can simply
> yield useful information whenever you have it. Something like this
> (untested):
> 
> def parse_iostat(lines):
> """Parse lines of iostat information, yielding ... something
> 
> lines should be an iterable yielding separate lines of output
> """
> block = None
> for line in lines:
> line = line.strip()
> try:
> tm = datetime.datetime.strptime(line, "%m/%d/%Y %I:%M:%S %p")
> if block: yield block
> block = [tm]
> except ValueError:
> # It's not a new timestamp, so add it to the existing block
> block.append(line)
> if block: yield block
> 
> This is a fairly classic line-parsing generator. You can pass it a
> file-like object, a list of strings, or anything else that it can
> iterate over; it'll yield some sort of aggregate object representing
> each time's block. In this case, all it does is append strings to a
> list, so this will result in a series of lists of strings, each one
> representing a single timestamp; you can parse the other lines in any
> way you like and aggregate useful data. Usage would be something like
> this:
> 
> with open("logfile") as f:
>     for block in parse_iostat(f):
>         # do stuff with block
> 
> This will work quite happily with an ongoing stream, too, so if you're
> working with a pipe from a currently-running process, it'll pick stuff
> up just fine. (However, since it uses the timestamp as its signature,
> it won't yield anything till it gets the *next* timestamp. If the
> blank line is sufficient to denote the end of a block, you could
> change the loop to look for that instead.)
> 
> Hope that helps!
> 
> ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Parsing logfile with multi-line loglines, separated by timestamp?

2015-06-30 Thread Victor Hooi
Hi,

I'm trying to parse iostat -xt output using Python. The quirk with iostat is 
that the output for each second runs over multiple lines. For example:

06/30/2015 03:09:17 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   0.030.000.030.000.00   99.94

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
xvdap10.00 0.040.020.07 0.30 3.2881.37 
0.00   29.832.74   38.30   0.47   0.00
xvdb  0.00 0.000.000.00 0.00 0.0011.62 
0.000.230.192.13   0.16   0.00
xvdf  0.00 0.000.000.00 0.00 0.0010.29 
0.000.410.410.73   0.38   0.00
xvdg  0.00 0.000.000.00 0.00 0.00 9.12 
0.000.360.351.20   0.34   0.00
xvdh  0.00 0.000.000.00 0.00 0.0033.35 
0.001.390.418.91   0.39   0.00
dm-0  0.00 0.000.000.00 0.00 0.0011.66 
0.000.460.460.00   0.37   0.00

06/30/2015 03:09:18 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   0.000.000.500.000.00   99.50

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
xvdap10.00 0.000.000.00 0.00 0.00 0.00 
0.000.000.000.00   0.00   0.00
xvdb  0.00 0.000.000.00 0.00 0.00 0.00 
0.000.000.000.00   0.00   0.00
xvdf  0.00 0.000.000.00 0.00 0.00 0.00 
0.000.000.000.00   0.00   0.00
xvdg  0.00 0.000.000.00 0.00 0.00 0.00 
0.000.000.000.00   0.00   0.00
xvdh  0.00 0.000.000.00 0.00 0.00 0.00 
0.000.000.000.00   0.00   0.00
dm-0  0.00 0.000.000.00 0.00 0.00 0.00 
0.000.000.000.00   0.00   0.00

06/30/2015 03:09:19 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   0.000.000.500.000.00   99.50

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
xvdap10.00 0.000.000.00 0.00 0.00 0.00 
0.000.000.000.00   0.00   0.00
xvdb  0.00 0.000.000.00 0.00 0.00 0.00 
0.000.000.000.00   0.00   0.00
xvdf  0.00 0.000.000.00 0.00 0.00 0.00 
0.000.000.000.00   0.00   0.00
xvdg  0.00 0.000.000.00 0.00 0.00 0.00 
0.000.000.000.00   0.00   0.00
xvdh  0.00 0.000.000.00 0.00 0.00 0.00 
0.000.000.000.00   0.00   0.00
dm-0  0.00 0.000.000.00 0.00 0.00 0.00 
0.000.000.000.00   0.00   0.00

Essentially I need to parse the output in "chunks", where each chunk is 
separated by a timestamp.

I was looking at itertools.groupby(), but that doesn't seem to quite do what I 
want here - it seems more for grouping lines, where each is united by a common 
key, or something that you can use a function to check for.

Another thought was something like:

for line in f:
    if line.count("/") == 2 and line.count(":") == 2:
        current_time = datetime.strptime(line.strip(), '%m/%d/%y %H:%M:%S')
        while line.count("/") != 2 and line.count(":") != 2:
            print(line)
            continue

But that didn't quite seem to work.

Is there a Pythonic way of parsing the above iostat output, and break it into 
chunks split by the timestamp?

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Using csv DictWriter - add a extra field

2015-03-31 Thread Victor Hooi
Hi,

Aha, yeah, I can add the connection_id as another field in the inner dict - the 
only drawback is that the data is then duplicated. However, I suppose even if 
it's not elegant, it does work.

However, that ChainMap does look interesting =). And yes, I am actually using 
Python 3.x (mainly because of http://bugs.python.org/issue6641).

So if I understand correctly, I can just use ChainMap to join any arbitrary 
number of dicts together - it seems like the right solution here.

Are there any drawbacks to using ChainMap here? (Aside from needing Python 3.x).
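
(For reference, the way I'm picturing it - a minimal sketch that grafts the connection_id onto each row with ChainMap, without copying the inner dict; connections is the same structure as in my original post:)

from collections import ChainMap
from csv import DictWriter

fieldnames = ['connection_id', 'ip_address', 'open_timestamp', 'end_timestamp', 'time_open']

with open('output.csv', 'w') as csvfile:
    writer = DictWriter(csvfile, fieldnames)
    writer.writeheader()
    for connection, values in sorted(connections.items()):
        if 'time_open' in values:
            # ChainMap presents both dicts as a single mapping without copying either.
            writer.writerow(ChainMap({'connection_id': connection}, values))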

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Using csv DictWriter - add a extra field

2015-03-30 Thread Victor Hooi
Hi,

I have a dict named "connections", with items like the following:

In [18]: connections
Out[18]:
{'3424234': {'end_timestamp': datetime.datetime(2015, 3, 25, 5, 31, 30, 406000, 
tzinfo=datetime.timezone(datetime.timedelta(-1, 61200))),
  'ip_address': '10.168.8.36:52440',
  'open_timestamp': datetime.datetime(2015, 3, 25, 5, 31, 0, 383000, 
tzinfo=datetime.timezone(datetime.timedelta(-1, 61200))),
  'time_open': datetime.timedelta(0, 30, 23000)}}

In this case, the key is a connection id (e.g. "3424234"), and the value is 
another dict, which contains things like 'end_timestamp', 'ip_address', etc.

I'm writing the output of "connections" to a CSV file using DictWriter:

fieldnames = ['connection_id', 'ip_address', 'open_timestamp', 'end_timestamp',
              'time_open']

with open('output.csv', 'w') as csvfile:
    writer = DictWriter(csvfile, fieldnames)
    writer.writeheader()
    for connection, values in sorted(connections.items()):
        if 'time_open' in values:
            writer.writerow(values, {'connection_id': connection})
        else:
            pass
            # DO SOME STUFF

The only problem is, I'd also like to output the connection_id field as part of 
each CSV record.

However, connection_id in this case is the key for the parent dict.

Is there a clean way to add an extra field to DictWriter's writerow(), or is it 
just the contents of the dict and that's it?
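
(Two candidate approaches I can think of, sketched below - either build a merged dict per row, or present both mappings as one view:)

# Option 1: a one-off merged dict per row (copies the handful of values).
writer.writerow(dict(values, connection_id=connection))

# Option 2 (Python 3): ChainMap presents both mappings as one, with no copying.
from collections import ChainMap
writer.writerow(ChainMap({'connection_id': connection}, values))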

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Deep comparison of dicts - cmp() versus ==?

2015-03-19 Thread Victor Hooi
Hi Ben,

When I said "deep", I meant, as in, to an arbitrary level of nesting (i.e. 
dicts, containing dicts, containing dicts etc) - sorry if I got the terminology 
wrong.

The two dicts weren't equal by intention - the idea was that a comparison 
operator would return "False" for those two.

I was just curious why cmp() was phased out (as in, were there cases where "==" 
was better) - but if functionally they're the same, and it's just a 
nomenclature thing, that's also fine =).

Finally, so cmp()/== just return true/false for a comparison - I just noticed 
this, which actually prints out diff-style comparisons:

https://pypi.python.org/pypi/datadiff

Cheers,
Victor


On Friday, 20 March 2015 13:33:52 UTC+11, Ben Finney  wrote:
> Victor Hooi  writes:
> 
> > What is the currently most Pythonic way for doing deep comparisons
> > between dicts?
> 
> What distinction do you intend by saying "deep comparison"? As
> contrasted with what?
> 
> > For example, say you have the following two dictionaries
> >
> > a = {
> > 'bob': { 'full_name': 'bob jones', 'age': 4, 'hobbies': ['hockey', 
> > 'tennis'], 'parents': { 'mother': 'mary', 'father', 'mike'}},
> > 'james': { 'full_name': 'james joyce', 'age': 6, 'hobbies': [],}
> > }
> >
> > b = {
> > 'bob': { 'full_name': 'bob jones', 'age': 4, 'hobbies': ['hockey', 
> > 'tennis']},
> > 'james': { 'full_name': 'james joyce', 'age': 5, 'hobbies': []}
> > }
> 
> Those two dicts are not equal. How would your intended "deep comparison"
> behave for those two values?
> 
> > However, this page seems to imply that cmp() is deprecated?
> > https://docs.python.org/3/whatsnew/3.0.html#ordering-comparisons
> 
> It is, yes.
> 
> > Should we just be using the equality operator ("==") instead then? E.g.:
> >
> > a == b
> 
> Yes. That is a comparison that would return False for comparing the
> above two values. Would you expect different behaviour?
> 
> > What is the reason for this?
> 
> I don't really understand. 'cmp' is deprecated, and you can compare two
> dicts with the built-in operators. That's the reason; are you expecting
> some other reason?
> 
> > Or is there a better way to do this?
> 
> I don't really know what it is you want to do. What behaviour different
> from the built-in comparison operators do you want?
> 
> -- 
>  \ "I went over to the neighbor's and asked to borrow a cup of |
>   `\   salt. 'What are you making?' 'A salt lick.'" --Steven Wright |
> _o__)  |
> Ben Finney
-- 
https://mail.python.org/mailman/listinfo/python-list


Deep comparison of dicts - cmp() versus ==?

2015-03-19 Thread Victor Hooi
Hi,

What is the currently most Pythonic way for doing deep comparisons between 
dicts?

For example, say you have the following two dictionaries

a = {
    'bob': {'full_name': 'bob jones', 'age': 4, 'hobbies': ['hockey', 'tennis'],
            'parents': {'mother': 'mary', 'father': 'mike'}},
    'james': {'full_name': 'james joyce', 'age': 6, 'hobbies': []},
}

b = {
    'bob': {'full_name': 'bob jones', 'age': 4, 'hobbies': ['hockey', 'tennis']},
    'james': {'full_name': 'james joyce', 'age': 5, 'hobbies': []},
}

Previously, I thought you could do a cmp():

cmp(a, b)

However, this page seems to imply that cmp() is deprecated?

https://docs.python.org/3/whatsnew/3.0.html#ordering-comparisons

Should we just be using the equality operator ("==") instead then? E.g.:

a == b

What is the reason for this?

Or is there a better way to do this?
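
(For what it's worth, a quick check of what == already does on nested dicts - it compares recursively, element by element:)

>>> {'bob': {'age': 4, 'hobbies': ['hockey']}} == {'bob': {'age': 4, 'hobbies': ['hockey']}}
True
>>> {'bob': {'age': 4}} == {'bob': {'age': 5}}
False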

Regards,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Python and PEP8 - Recommendations on breaking up long lines?

2013-11-27 Thread Victor Hooi
Hi,

Also, forgot two other examples that are causing me grief:

cur.executemany("INSERT INTO foobar_foobar_files VALUES (?)",
                [[os.path.relpath(filename, foobar_input_folder)] for
                 filename in filenames])

I've already broken it up using the parentheses, and I'm not sure what the tidy 
way is to break it up again to fit under 80. In this case, the 80-character mark 
is hitting me around the "for filename" towards the end.

and:

if os.path.join(root, file) not in previously_processed_files and os.path.join(root, file)[:-3] not in previously_processed_files:

In this case, the 80-character mark is actually partway through 
"previously_processed_files" (the first occurrence)...

Cheers,
Victor

On Thursday, 28 November 2013 12:57:13 UTC+11, Victor Hooi  wrote:
> Hi,
> 
> I'm running pep8 across my code, and getting warnings about my long lines (> 
> 80 characters).
> 
> I'm wondering what's the recommended way to handle the below cases, and fit 
> under 80 characters.
> 
> First example - multiple context handlers:
> 
> with open(self.full_path, 'r') as input, open(self.output_csv, 'ab') as output:
> 
> and in my case, with indents, the 80-character mark is just before the 
> ending "as output".
> 
> What's the standard recognised way to split this across multiple lines, so 
> that I'm under 80 characters?
> 
> I can't just split after the "as input," as that isn't valid syntax, and 
> there's no convenient parentheses for me to split over.
> 
> Is there a standard Pythonic way?
> 
> Second example - long error messages:
> 
> self.logger.error('Unable to open input or output file - %s. Please check you have sufficient permissions and the file and parent directory exist.' % e)
> 
> I can use triple quotes:
> 
> self.logger.error(
> """Unable to open input or output file - %s. Please check you
> have sufficient permissions and the file and parent directory
> exist.""" % e)
> 
> However, that will introduce newlines in the message, which I don't want.
> 
> I can use backslashes:
> 
> self.logger.error(
> 'Unable to open input or output file - %s. Please check you\
> have sufficient permissions and the file and parent directory\
> exist.' % e)
> 
> which won't introduce newlines.
> 
> Or I can put them all as separate strings, and trust Python to glue them 
> together:
> 
> self.logger.error(
>     'Unable to open input or output file - %s. Please check you'
>     'have sufficient permissions and the file and parent directory'
>     'exist.' % e)
> 
> Which way is the recommended Pythonic way?
> 
> Third example - long comments:
> 
> """ NB - We can't use Psycopg2's parametised statements here, as
> that automatically wraps everything in single quotes.
> So s3://my_bucket/my_file.csv.gz would become s3://'my_bucket'/'my_file.csv.gz'.
> Hence, we use Python's normal string formating - this could
> potentially exposes us to SQL injection attacks via the config.yaml
> file.
> I'm not aware of any easy ways around this currently though - I'm
> open to suggestions though.
> See
> http://stackoverflow.com/questions/9354392/psycopg2-cursor-execute-with-sql-query-parameter-causes-syntax-error
> for further information. """
> 
> In this case, I'm guessing using triple quotes (""") is a better idea with 
> multi-line comments, right?
> 
> However, I've noticed that I can't seem to put in line-breaks inside the 
> comment without triggering a warning. For example, trying to put in another 
> empty line in between lines 6 and 7 above causes a warning.
> 
> Also, how would I split up the long URLs? Breaking it up makes it annoying to 
> use the URL. Thoughts?
> 
> Cheers,
> Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Python and PEP8 - Recommendations on breaking up long lines?

2013-11-27 Thread Victor Hooi
Hi,

I'm running pep8 across my code, and getting warnings about my long lines (> 80 
characters).

I'm wondering what's the recommended way to handle the below cases, and fit under 
80 characters.

First example - multiple context handlers:

with open(self.full_path, 'r') as input, open(self.output_csv, 'ab') as output:

and in my case, with indents, the 80-character marks is just before the ending 
"as output".

What's the standard recognised way to split this across multiple lines, so that 
I'm under 80 characters?

I can't just split after the "as input," as that isn't valid syntax, and 
there's no convenient parentheses for me to split over.

Is there a standard Pythonic way?
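
(One option I've seen mentioned, sketched on the assumption it fits this code - contextlib.ExitStack lets each open() sit on its own line:)

from contextlib import ExitStack

with ExitStack() as stack:
    input = stack.enter_context(open(self.full_path, 'r'))
    output = stack.enter_context(open(self.output_csv, 'ab'))
    # ... work with input and output exactly as before ...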

Second example - long error messages:

self.logger.error('Unable to open input or output file - %s. Please check you have sufficient permissions and the file and parent directory exist.' % e)

I can use triple quotes:

self.logger.error(
"""Unable to open input or output file - %s. Please check you
have sufficient permissions and the file and parent directory
exist.""" % e)

However, that will introduce newlines in the message, which I don't want.

I can use backslashes:

self.logger.error(
'Unable to open input or output file - %s. Please check you\
have sufficient permissions and the file and parent directory\
exist.' % e)

which won't introduce newlines.

Or I can put them all as separate strings, and trust Python to glue them 
together:

self.logger.error(
    'Unable to open input or output file - %s. Please check you'
    'have sufficient permissions and the file and parent directory'
    'exist.' % e)

Which way is the recommended Pythonic way?

Third example - long comments:

""" NB - We can't use Psycopg2's parametised statements here, as
that automatically wraps everything in single quotes.
So s3://my_bucket/my_file.csv.gz would become 
s3://'my_bucket'/'my_file.csv.gz'.
Hence, we use Python's normal string formating - this could
potentially exposes us to SQL injection attacks via the config.yaml
file.
I'm not aware of any easy ways around this currently though - I'm
open to suggestions though.
See

http://stackoverflow.com/questions/9354392/psycopg2-cursor-execute-with-sql-query-parameter-causes-syntax-error
for further information. """

In this case, I'm guessing using triple quotes (""") is a better idea with 
multi-line comments, right?

However, I've noticed that I can't seem to put in line-breaks inside the 
comment without triggering a warning. For example, trying to put in another 
empty line in between lines 6 and 7 above causes a warning.

Also, how would I split up the long URLs? Breaking it up makes it annoying to 
use the URL. Thoughts?

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Python String Formatting - passing both a dict and string to .format()

2013-11-26 Thread Victor Hooi
Hi,

I'm trying to use Python's new style string formatting with a dict and string 
together.

For example, I have the following dict and string variable:

my_dict = { 'cat': 'ernie', 'dog': 'spot' }
foo = 'lorem ipsum'

If I want to just use the dict, it all works fine:

'{cat} and {dog}'.format(**my_dict)
'ernie and spot'

(I'm also curious how the above ** works in this case).

However, if I try to combine them:

'{cat} and {dog}, {}'.format(**my_dict, foo)
...
SyntaxError: invalid syntax

I also tried with:

'{0['cat']} {1} {0['dog']}'.format(my_dict, foo)
...
SyntaxError: invalid syntax

However, I found that if I take out the single quotes around the keys it then 
works:

'{0[cat]} {1} {0[dog]}'.format(my_dict, foo)
"ernie lorem ipsum spot"

I'm curious - why does this work? Why don't the dictionary keys need quotes 
around them, like when you normally access a dict's elements?

Also, is this the best practice to pass both a dict and string to .format()? Or 
is there another way that avoids needing to use positional indices? ({0}, {1} 
etc.)
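
(The best I've come up with so far, for reference - pass the string as a keyword argument alongside the unpacked dict, so everything is referred to by name and no positional indices are needed; just a sketch:)

my_dict = {'cat': 'ernie', 'dog': 'spot'}
foo = 'lorem ipsum'

# Keyword arguments and **my_dict can be mixed freely, as long as the names don't collide.
print('{cat} and {dog}, {extra}'.format(extra=foo, **my_dict))
# ernie and spot, lorem ipsum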

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Understanding relative imports in package - and running pytest with relative imports?

2013-11-24 Thread Victor Hooi
Hi,

Ok, this is a topic that I've never really understood properly, so I'd like to 
find out what's the "proper" way of doing things.

Say I have a directory structure like this:

furniture/
    __init__.py
    chair/
        __init__.py
        config.yaml
        build_chair.py
    common/
        __init__.py
        shared.py
    table/
        __init__.py
        config.yaml
        create_table.sql
        build_table.py

The package is called furniture, and we have modules chair, common and table 
underneath that.

build_chair.py and build_table.py are supposed to import from common/shared.py 
using relative imports. e.g.:

from ..common.shared import supplies

However, if you then try to run the scripts build_chair.py, or build_table.py, 
they'll complain about:

ValueError: Attempted relative import in non-package

After some Googling:

http://stackoverflow.com/questions/11536764/attempted-relative-import-in-non-package-even-with-init-py
http://stackoverflow.com/questions/72852/how-to-do-relative-imports-in-python
http://stackoverflow.com/questions/1198/getting-attempted-relative-import-in-non-package-error-in-spite-of-having-init
http://stackoverflow.com/questions/14664313/attempted-relative-import-in-non-package-although-packaes-with-init-py-in
http://melitamihaljevic.blogspot.com.au/2013/04/python-relative-imports-hard-way.html

The advice seems to be either to run it from the parent directory of furniture 
with:

python -m furniture.chair.build_chair

Or to have a main.py outside of the package directory and run that, and have it 
import things.

However, I don't see how having a separate single main.py outside my package 
would work with keeping my code tidy/organised, or how it'd work with the other 
files (config.yaml, or create_table.sql) which are associated with each script.

A third way I thought of was just to create a setup.py and install the package 
into site-packages - and then everything will work? However, I don't think that 
solves my problem of understanding how things work, or getting my directory 
structure right.

Although apparently running a script inside a package is an anti-pattern? 
(https://mail.python.org/pipermail/python-3000/2007-April/006793.html)

How would you guys organise the code above?

Also, if I have tests (say with pytest), inside 
furniture/table/tests/test_table.py, how would I run these as well? If I run 
py.test from there, I get the same:

$ py.test
 
 from ..table.build_table import Table
 E   ValueError: Attempted relative import in non-package


(Above is just an extract).

Assuming I use pytest, where should my tests be in the directory structure, and 
how should I be running them?

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Using try-catch to handle multiple possible file types?

2013-11-19 Thread Victor Hooi
Hi,

Is either approach (try-excepts, or using libmagic) considered more idiomatic? 
What would you guys prefer yourselves?

Also, is it possible to use either approach with a context manager ("with"), 
without duplicating lots of code?

For example:

try:
    with gzip.open('blah.txt', 'rb') as f:
        for line in f:
            print(line)
except IOError as e:
    with open('blah.txt', 'rb') as f:
        for line in f:
            print(line)

I'm not sure of how to do this without needing to duplicate the processing 
lines (everything inside the with)?

And using:

try:
    f = gzip.open('blah.txt', 'rb')
except IOError as e:
    f = open('blah.txt', 'rb')
finally:
    for line in f:
        print(line)

won't work, since the exception won't get thrown until you actually try to read 
from the file. Plus, I'm under the impression that I should be using 
context-managers where I can.
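
(One way I can think of to keep a single with block - a small helper that sniffs the two-byte gzip magic number and returns whichever file object is appropriate; open_maybe_gzip is a made-up name and this is only lightly thought through:)

import gzip

def open_maybe_gzip(path, mode='rb'):
    # Gzip files start with the magic bytes 1f 8b; sniff them to pick the opener.
    with open(path, 'rb') as probe:
        is_gzip = probe.read(2) == b'\x1f\x8b'
    opener = gzip.open if is_gzip else open
    return opener(path, mode)

with open_maybe_gzip('blah.txt') as f:
    for line in f:
        print(line)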

Also, on another note, python-magic will return a string as a result, e.g.:

gzip compressed data, was "blah.txt", from Unix, last modified: Wed Nov 20 
10:48:35 2013

I suppose it's enough to just do a?

if "gzip compressed data" in results:

or is there a better way?

Cheers,
Victor

On Tuesday, 19 November 2013 20:36:47 UTC+11, Mark Lawrence  wrote:
> On 19/11/2013 07:13, Victor Hooi wrote:
> 
> >
> 
> > So basically, using exception handling for flow-control.
> 
> >
> 
> > However, is that considered bad practice, or un-Pythonic?
> 
> >
> 
> 
> 
> If it works for you use it, practicality beats purity :)
> 
> 
> 
> -- 
> 
> Python is the second best programming language in the world.
> 
> But the best has yet to be invented.  Christian Tismer
> 
> 
> 
> Mark Lawrence
-- 
https://mail.python.org/mailman/listinfo/python-list


Using try-catch to handle multiple possible file types?

2013-11-18 Thread Victor Hooi
Hi,

I have a script that needs to handle input files of different types 
(uncompressed, gzipped etc.).

My question is regarding how I should handle the different cases.

My first thought was to use a try-catch block and attempt to open it using the 
most common filetype, then if that failed, try the next most common type etc. 
before finally erroring out.

So basically, using exception handling for flow-control.

However, is that considered bad practice, or un-Pythonic?

What other alternative constructs could I also use, and pros and cons?

(I was thinking I could also use python-magic which wraps libmagic, or I can 
just rely on file extensions).

Other thoughts?

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Where to handle try-except - close to the statement, or in outer loop?

2013-11-11 Thread Victor Hooi
Hi,

I have a general question regarding try-except handling in Python.

Previously, I was putting the try-except blocks quite close to where the errors 
occurred:

A somewhat contrived example:

if __name__ == "__main__":
    my_pet = Dog('spot', 5, 'brown')
    my_pet.feed()
    my_pet.shower()

and then, in each of the methods (feed(), shower()), I'd open up files, open 
database connections etc.

And I'd wrap each statement there in its own individual try-except block. (I'm 
guessing I should wrap the whole lot in a single try-except, and handle each 
exception there?)

However, the author here:

http://stackoverflow.com/a/3644618/139137

suggests that it's a bad habit to catch an exception as early as possible, and 
you should handle it at an outer level.

From reading other posts, this seems to be the consensus as well.

However, how does this work if you have multiple methods which can throw the 
same types of exceptions?

For example, if both feed() and shower() above need to write to files, when you 
get your IOError, how do you distinguish where it came from? (e.g. If you 
wanted to print a friendly error message, saying "Error writing to file while 
feeding.", or if you otherwise wanted to handle it different).

Would I wrap all of the calls in a try-except block?

try:
    my_pet.feed()
    my_pet.shower()
except IOError as e:
    # Do something to handle exception?

Can anybody recommend any good examples that show current best practices for 
exception handling, for programs with moderate complexity? (i.e. anything more 
than the examples in the tutorial, basically).
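
(One pattern I've been experimenting with, as a sketch - handle everything at the outer level, but have each method attach its own context by re-raising a wrapper exception, so the outer handler still knows where the failure came from; PetError and the file names are made up:)

class PetError(Exception):
    """Hypothetical wrapper exception carrying a friendly, method-specific message."""

def feed():
    try:
        with open('food_log.txt', 'a') as f:
            f.write('fed\n')
    except IOError as e:
        raise PetError('Error writing to file while feeding.') from e

def shower():
    try:
        with open('shower_log.txt', 'a') as f:
            f.write('showered\n')
    except IOError as e:
        raise PetError('Error writing to file while showering.') from e

try:
    feed()
    shower()
except PetError as e:
    print(e)            # friendly, method-specific message
    print(e.__cause__)  # the underlying IOError, if it's needed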

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Compiling Python 3.3.2 on CentOS 6.4 - unable to find compiled OpenSSL?

2013-11-04 Thread Victor Hooi
Hi,

We have a machine running CentOS 6.4, and we're attempting to compile Python 
3.3.2 on it:

# cat /etc/redhat-release
CentOS release 6.4 (Final)

We've compiled openssl 1.0.1e 11 by hand on this box, and installed it into 
/usr/local/:

# openssl
OpenSSL> version
OpenSSL 1.0.1e 11 Feb 2013
 
# ls /usr/local/include/openssl/
aes.h   blowfish.h  cmac.h  crypto.h   dso.h ec.h  
hmac.h  md4.h  obj_mac.h  pem2.hrand.hsafestack.h  ssl23.h  
symhacks.h   ui.h
asn1.h  bn.hcms.h   des.h  dtls1.h   engine.h  
idea.h  md5.h  ocsp.h pem.h rc2.h seed.h   ssl2.h   
tls1.h   whrlpool.h
asn1_mac.h  buffer.hcomp.h  des_old.h  ebcdic.h  e_os2.h   
krb5_asn.h  mdc2.h opensslconf.h  pkcs12.h  rc4.h sha.hssl3.h   
ts.h x509.h
asn1t.h camellia.h  conf_api.h  dh.h   ecdh.herr.h 
kssl.h  modes.hopensslv.h pkcs7.h   ripemd.h  srp.hssl.h
txt_db.h x509v3.h
bio.h   cast.h  conf.h  dsa.h  ecdsa.h   evp.h 
lhash.h objects.h  ossl_typ.h pqueue.h  rsa.h srtp.h   stack.h  
ui_compat.h  x509_vfy.h

However, when we try to build Python 3.3.2, it can't seem to find the SSL 
installation:

# make
---
Modules/Setup.dist is newer than Modules/Setup;
check to make sure you have all the updates you
need in your Modules/Setup file.
Usually, copying Modules/Setup.dist to Modules/Setup will work.
---
running build
running build_ext
INFO: Can't locate Tcl/Tk libs and/or headers
building '_ssl' extension
gcc -pthread -fPIC -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes 
-I./Include -I. -IInclude -I/usr/local/include -I/root/Python-3.3.2/Include 
-I/root/Python-3.3.2 -c /root/Python-3.3.2/Modules/_ssl.c -o 
build/temp.linux-x86_64-3.3/root/Python-3.3.2/Modules/_ssl.o
gcc -pthread -shared 
build/temp.linux-x86_64-3.3/root/Python-3.3.2/Modules/_ssl.o -L/usr/local/lib 
-lssl -lcrypto -o build/lib.linux-x86_64-3.3/_ssl.cpython-33m.so
*** WARNING: renaming "_ssl" since importing it failed: 
build/lib.linux-x86_64-3.3/_ssl.cpython-33m.so: undefined symbol: 
EC_KEY_new_by_curve_name
 
Python build finished, but the necessary bits to build these modules 
were not found:
_dbm   _gdbm  _lzma  
_tkinter  
To find the necessary bits, look in setup.py in detect_modules() for 
the module's name.
 
 
Failed to build these modules:
_ssl 
 
running build_scripts
copying and adjusting /root/Python-3.3.2/Tools/scripts/pydoc3 -> 
build/scripts-3.3
copying and adjusting /root/Python-3.3.2/Tools/scripts/idle3 -> 
build/scripts-3.3
copying and adjusting /root/Python-3.3.2/Tools/scripts/2to3 -> 
build/scripts-3.3
copying and adjusting /root/Python-3.3.2/Tools/scripts/pyvenv -> 
build/scripts-3.3
changing mode of build/scripts-3.3/pydoc3 from 644 to 755
changing mode of build/scripts-3.3/idle3 from 644 to 755
changing mode of build/scripts-3.3/2to3 from 644 to 755
changing mode of build/scripts-3.3/pyvenv from 644 to 755
renaming build/scripts-3.3/pydoc3 to build/scripts-3.3/pydoc3.3
renaming build/scripts-3.3/idle3 to build/scripts-3.3/idle3.3
renaming build/scripts-3.3/2to3 to build/scripts-3.3/2to3-3.3
renaming build/scripts-3.3/pyvenv to build/scripts-3.3/pyvenv-3.3

I also tried editing the Modules/Setup.dist file, no luck there either.

Any thoughts on what we're doing wrong?

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ValueError: zero length field name in format - Running under Python 2.7.3?

2013-11-04 Thread Victor Hooi
Hi,

You're right - it was sudo playing up with the virtualenv.

The script was in /opt, so I was testing with sudo to get it to run.

I should have setup a service account, and tested it with that =).

$ python sync_bexdb.py
2.7.3 (default, Jan  7 2013, 11:52:52)
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)]

$ sudo python sync_bexdb.py
[sudo] password for victor:
2.6.6 (r266:84292, Jul 10 2013, 22:48:45)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)]

Cheers,
Victor

On Tuesday, 5 November 2013 10:02:50 UTC+11, Chris Angelico  wrote:
> On Tue, Nov 5, 2013 at 9:33 AM, Victor Hooi  wrote:
> 
> > However, when I run this line, I get the following error:
> 
> >
> 
> > Traceback (most recent call last):
> 
> >   File "my_script.py", line 25, in 
> 
> > LOG_FILENAME = 
> > 'my_something_{}.log'.format(datetime.now().strftime('%Y-%d-%m_%H.%M.%S'))
> 
> > ValueError: zero length field name in format
> 
> >
> 
> >
> 
> > The weird thing, when I start a Python REPL and run that line 
> > interactively, it works fine
> 
> 
> 
> Google tells me that that was an issue in Python 2.6, so my first
> 
> check would be to see what `/usr/bin/env python` actually gives you -
> 
> are you running inside an environment that changes your path? Drop a
> 
> "import sys; print(sys.version)" at the top of your script and see
> 
> what it's really running as.
> 
> 
> 
> ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


ValueError: zero length field name in format - Running under Python 2.7.3?

2013-11-04 Thread Victor Hooi
Hi,

I have a Python script that's using a format string without positional 
specifiers. I.e.:

LOG_FILENAME = 
'my_something_{}.log'.format(datetime.now().strftime('%Y-%d-%m_%H.%M.%S'))

I'm running this from within a virtualenv, running under Python 2.7.3.

$ python -V
Python 2.7.3
$ which python
/opt/my_project_venv/bin/python

The first line of the script is:

#!/usr/bin/env python

However, when I run this line, I get the following error:

Traceback (most recent call last):
  File "my_script.py", line 25, in 
LOG_FILENAME = 
'my_something_{}.log'.format(datetime.now().strftime('%Y-%d-%m_%H.%M.%S'))
ValueError: zero length field name in format


The weird thing, when I start a Python REPL and run that line interactively, it 
works fine:

$ python
Python 2.7.3 (default, Jan  7 2013, 11:52:52)
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from datetime import datetime
>>> LOG_FILENAME = 
'my_project_{}.log'.format(datetime.now().strftime('%Y-%d-%m_%H.%M.%S'))
>>> print(LOG_FILENAME)
my_project_2013-05-11_09.29.47.log

My understanding was that in Python 2.7/3.1, you could omit the positional 
specifiers in a format string.
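(If it does turn out that an older Python is being picked up, I believe the simple 
workaround is to put the positional index back in, which works on 2.6 as well as 2.7+ - 
e.g.:)

    from datetime import datetime

    # Explicit positional index - accepted by Python 2.6's str.format() too.
    LOG_FILENAME = 'my_something_{0}.log'.format(
        datetime.now().strftime('%Y-%d-%m_%H.%M.%S'))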



Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Try-except for flow control in reading Sqlite

2013-10-31 Thread Victor Hooi
Hi,

You're right, if the databse doesn't exist, the sqlite3 library will simply 
create it.

Hmm, in that case, what is the Pythonic way to handle this then?

If the database is new, then it won't have the table I need, and it will return 
something like:

sqlite3.OperationalError: no such table: my_table

I suppose I can try the query, and catch OperationalError, and if so, create 
the new schema then?

However, that seems a bit ugly, as I'm guessing OperationalError could be 
raised for a number of other reasons?

Should I perhaps be using some kind of version table as Burak Aslan suggested?
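Just so I understand the pragma suggestion, I'm picturing something like this (rough 
sketch - the table and columns are only placeholders):

    import sqlite3

    conn = sqlite3.connect('my_data.db')   # creates the file if it doesn't already exist
    cur = conn.cursor()

    # pragma table_info returns no rows at all if the table doesn't exist yet
    if not cur.execute("pragma table_info('my_table')").fetchall():
        cur.execute("create table my_table (id integer primary key, value text)")
        conn.commit()

    # ...safe to query my_table from here on...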

Cheers,
victor 

On Tuesday, 29 October 2013 10:43:19 UTC+11, Dennis Lee Bieber  wrote:
> On Sun, 27 Oct 2013 20:43:07 -0700 (PDT), Victor Hooi
> 
>  declaimed the following:
> 
> 
> 
> >Hi,
> 
> >
> 
> >I'd like to double-check something regarding using try-except for 
> >controlling flow.
> 
> >
> 
> >I have a script that needs to lookup things in a SQLite database.
> 
> >
> 
> >If the SQLite database file doesn't exist, I'd like to create an empty 
> >database, and then setup the schema.
> 
> >
> 
> >Is it acceptable to use try-except in order to achieve this? E.g.:
> 
> >
> 
> >try:
> 
> ># Try to open up the SQLite file, and lookup the required entries
> 
> >except OSError:
> 
> ># Open an empty SQLite file, and create the schema
> 
> >
> 
> >
> 
>   In my experience, SQLite will /create/ an empty database file if the
> 
> specified name does not exit. So just executing the connect() call is all
> 
> that is needed. After all, checking for data IN the database will either
> 
> return something or fail at that point in which case you can now populate
> 
> the schema.
> 
> 
> 
> -=-=-=-=-=-
> 
> >>> import sqlite3 as db
> 
> >>> con = db.connect("anUnknown.db")
> 
> >>> cur = con.cursor()
> 
> >>> rst = cur.execute("pragma table_info('aTable')")
> 
> >>> rst
> 
> 
> 
> >>> for ln in rst:
> 
> ...   print ln
> 
> ...   
> 
> >>> for ln in cur:
> 
> ...   print ln
> 
> ...   
> 
> >>> rst = cur.execute("create table aTable ( junk varchar )")
> 
> >>> con.commit()
> 
> >>> rst = cur.execute("pragma table_info('aTable')")
> 
> >>> for ln in rst:
> 
> ...   print ln
> 
> ... 
> 
> (0, u'junk', u'varchar', 0, None, 0)
> 
> >>> 
> 
> 
> 
> 
> 
>   No try/except needed -- just an a conditional testing the length of the
> 
> result returned by the pragma instruction on the table you expect to find
> 
> in the database.
> 
> -- 
> 
>   Wulfraed Dennis Lee Bieber AF6VN
> 
> wlfr...@ix.netcom.comHTTP://wlfraed.home.netcom.com/
-- 
https://mail.python.org/mailman/listinfo/python-list


Sharing common code between multiple scripts?

2013-10-29 Thread Victor Hooi
Hi,

NB - I'm the original poster here - 
https://groups.google.com/d/topic/comp.lang.python/WUuRLEXJP4E/discussion - 
however, that post seems to have diverted, and I suspect my original question 
was poorly worded.

I have several Python scripts that use similar functions.

Currently, these functions are duplicated in each script.

These functions wrap things like connecting to databases, reading in config 
files, writing to CSV etc.

I'd like to pull them out, and move them to a common module for all the scripts 
to import.

Originally, I thought I'd create a package, and have it all work:

my_package/
    __init__.py
    common/
        my_functions.py
    script1/
        __init__.py
        config.yaml
        script1.py
    script2/
        __init__.py
        config.yaml
        script2.py

However, there apparently isn't an easy way to have script1.py and script2.py 
import from common/my_functions.py.

So my new question is - what is the idiomatic way to structure this in Python, 
and easily share common functions between the scripts?

Ideally, I'd like to avoid having everything in a single directory - i.e. 
script1.py should be in it's own directory, as it has it's own config and other 
auxiliary files. However, if this is a bad idea, let me know.

Also, say I have a class in script1.py, and I want it pull in a common method 
as well. For example, I want multiples classes to have the following method:

def gzip_csv_file(self):
    self.gzip_filename = '%s.gz' % self.output_csv
    with open(self.output_csv, 'rb') as uncompressed:
        with gzip.open(self.gzip_filename, 'wb') as compressed:
            compressed.writelines(uncompressed)

    self.logger.debug('Compressed to %s GZIP file.' %
                      humansize(os.path.getsize(self.gzip_filename)))

How could I share this? Mixins? Or is there something better?
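(For context, the mixin version I had in mind looks roughly like this - the class names 
are made up, and it assumes Python 2.7 so gzip.open() works as a context manager:)

    import gzip


    class CsvGzipMixin(object):
        """Mixin that adds gzip_csv_file() to any class that sets self.output_csv."""

        def gzip_csv_file(self):
            self.gzip_filename = '%s.gz' % self.output_csv
            with open(self.output_csv, 'rb') as uncompressed:
                with gzip.open(self.gzip_filename, 'wb') as compressed:
                    compressed.writelines(uncompressed)
            return self.gzip_filename


    class Script1Exporter(CsvGzipMixin):
        def __init__(self, output_csv):
            self.output_csv = output_csv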

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Using "with open(filename, 'ab'):" and calling code only if the file is new?

2013-10-29 Thread Victor Hooi
Hi,

In theory, it *should* just be our script writing to the output CSV file.

However, I wanted it to be robust - e.g. in case somebody spins up two copies 
of this script running concurrently.

I suppose the timing would have to be pretty unlucky to hit a race condition 
there, right?

As in, somebody would have to open the new file and write to it somewhere 
in between the check line (os.path.getsize) and the following line 
(writeheaders).

However, you're saying the only way to be completely safe is some kind of file 
locking?

Another person (Zachary Ware) suggested using .tell() on the file as well - I 
suppose that's similar enough to using os.path.getsize(), right?

But basically, I can call .tell() or os.path.getsize() on the file to see if 
it's zero, and then just call writeheaders() on the following line.
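i.e. something like this (rough sketch - fieldnames/rows are placeholders, I mean 
DictWriter.writeheader(), and the seek is just me being cautious about where append 
mode positions the file):

    import csv

    fieldnames = ('col_a', 'col_b')          # placeholder field names
    rows = [{'col_a': 1, 'col_b': 2}]        # placeholder data

    with open('output.csv', 'ab') as output:
        output.seek(0, 2)                    # make sure we're at the end of the file
        csv_writer = csv.DictWriter(output, fieldnames)
        if output.tell() == 0:               # nothing in the file yet -> write the header
            csv_writer.writeheader()
        csv_writer.writerows(rows)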

In the future - we may be moving to storing results in something like SQLite, 
or MongoDB and outputting a CSV directly from there.

Cheers,
Victor

On Wednesday, 30 October 2013 13:55:53 UTC+11, Joseph L. Casale  wrote:
> > Like Victor says, that opens him up to race conditions.
> 
> 
> 
> Slim chance, it's no more possible than it happening in the time try/except
> 
> takes to recover an alternative procedure.
> 
> 
> 
> with open('in_file') as in_file, open('out_file', 'ab') as outfile_file:
> 
> if os.path.getsize('out_file'):
> 
> print('file not empty')
> 
> else:
> 
> #write header
> 
> print('file was empty')
> 
> 
> 
> And if that's still not acceptable (you did say new) than open the out_file 
> 'r+' an seek
> 
> and read to check for a header.
> 
> 
> 
> But if your file is not new and lacks a header, then what?
> 
> jlc

-- 
https://mail.python.org/mailman/listinfo/python-list


Using "with open(filename, 'ab'):" and calling code only if the file is new?

2013-10-29 Thread Victor Hooi
Hi,

I have a CSV file that I will repeatedly appending to.

I'm using the following to open the file:

with open(self.full_path, 'r') as input, open(self.output_csv, 'ab') as output:
    fieldnames = (...)
    csv_writer = DictWriter(output, fieldnames)
    # Call csv_writer.writeheader() if file is new.
    csv_writer.writerows(my_dict)

I'm wondering what's the best way of calling writeheader() only if the file is 
new?

My understanding is that I don't want to use os.path.exists(), since that opens 
me up to race conditions.

I'm guessing I can't use try-except with IOError, since the open(..., 'ab') 
will work whether the file exists or not.

Is there another way I can execute code only if the file is new?

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Using urlparse.parse_qs() - first element in dict is keyed on URL+key, instead of just key?

2013-10-29 Thread Victor Hooi
Hi,

My bad - PEBKAC - didn't read the docs properly.

I need to use urlparse.urlparse() to extract the query first.

So for anybody searching this, you can use something like:

In [39]: url
Out[39]: 
'https://www.foo.com/cat/dog-13?utm_source=foo1043c&utm_medium=email&utm_campaign=ba^Cn=HC'

In [40]: urlparse.parse_qs(urlparse.urlparse(url).query)
Out[40]:
{'utm_campaign': ['ba^Cn=HC'],
 'utm_medium': ['email'],
 'utm_source': ['foo1043c']}

Cheers,
Victor

On Wednesday, 30 October 2013 09:34:15 UTC+11, Victor Hooi  wrote:
> Hi,
> 
> 
> 
> I'm attempting to use urlparse.parse_qs() to parse the following url:
> 
> 
> 
> 
> https://www.foo.com/cat/dog-13?utm_source=foo1043c&utm_medium=email&utm_campaign=ba^Cn=HC
> 
> 
> 
> However, when I attempt to parse it, I get:
> 
> 
> 
> {'https://www.foo.com/cat/dog-13?utm_source': ['foo1043c'],
> 
>  'utm_campaign': ['ba^Cn=HC'],
> 
>  'utm_medium': ['email']}
> 
> 
> 
> For some reason - the utm_source doesn't appear to have been extracted 
> correctly, and it's keying the result on the url plus utm_source, rather than 
> just 'utm_source'?
> 
> 
> 
> Cheers,
> 
> Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Using urlparse.parse_qs() - first element in dict is keyed on URL+key, instead of just key?

2013-10-29 Thread Victor Hooi
Hi,

I'm attempting to use urlparse.parse_qs() to parse the following url:


https://www.foo.com/cat/dog-13?utm_source=foo1043c&utm_medium=email&utm_campaign=ba^Cn=HC

However, when I attempt to parse it, I get:

{'https://www.foo.com/cat/dog-13?utm_source': ['foo1043c'],
 'utm_campaign': ['ba^Cn=HC'],
 'utm_medium': ['email']}

For some reason - the utm_source doesn't appear to have been extracted 
correctly, and it's keying the result on the url plus utm_source, rather than 
just 'utm_source'?

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Organising packages/modules - importing functions from a common.py in a separate directory?

2013-10-29 Thread Victor Hooi
Hi,

Wait - err, subpackage != module, right? Do you think you could explain what a 
sub-package is please? I tried Googling, and couldn't seem to find the term in 
this context.

Also, so you're saying to put the actual script that I want to invoke *outside* 
the Python package.

Do you mean something like this:

> sync_em.py
> sync_pg.py
> foo_loading/
>     __init__.py
>     common/
>         common_foo.py
>     em_load/
>         __init__.py
>         config.yaml
>         em.py
>     pg_load/
>         __init__.py
>         config.yaml
>         pg.py

and the sync_em.py and sync_pg.py would just be thin wrappers pulling in things 
from em.py and pg.py? Is that a recommended approach to organise the code?
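e.g. would the top-level sync_em.py literally just be a stub like this (assuming em.py 
grows a main() function - that's my assumption, not something from the thread)?

    #!/usr/bin/env python
    # Thin wrapper - all the real logic lives inside the package.
    from foo_loading.em_load.em import main

    if __name__ == '__main__':
        main()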

Would it make any difference if I actually packaged it up so you could install 
it in site-packages? Could I then call modules from other modules within the 
package?

Cheers,
Victor

On Tuesday, 29 October 2013 18:44:47 UTC+11, Peter Otten  wrote:
> Victor Hooi wrote:
> 
> 
> 
> > Hi,
> 
> > 
> 
> > Hmm, this post on SO seems to suggest that importing from another sibling
> 
> > directory in a package ins't actually possibly in Python without some ugly
> 
> > hacks?
> 
> > 
> 
> > http://stackoverflow.com/questions/6323860/sibling-package-imports
> 
> > 
> 
> > Did I read the above correctly?
> 
> 
> 
> Yes.
> 
>  
> 
> > Is there another way I can structure my code so that I can run the
> 
> > sync_em.py and sync_pg.py scripts, and they can pull common functions from
> 
> > somewhere?
> 
> 
> 
> The packages you are trying to access in your original post 
> 
> 
> 
> > foo_loading/
> 
> > __init__.py
> 
> > common/
> 
> > common_foo.py
> 
> > em_load/
> 
> > __init__.py
> 
> > config.yaml
> 
> > sync_em.py
> 
> > pg_load/
> 
> > __init__.py
> 
> > config.yaml
> 
> > sync_pg.py
> 
> 
> 
> 
> 
> aren't actually siblings in the sense of the stackoverflow topic above, they 
> 
> are subpackages of foo_loading, and as you already found out
> 
> 
> 
> > So from within the sync_em.py script, I'm trying to import a function from 
> 
> foo_loading/common/common_foo.py.
> 
> > 
> 
> > from ..common.common_foo import setup_foo_logging
> 
> > 
> 
> > I get the error:
> 
> > 
> 
> > ValueError: Attempted relative import in non-package 
> 
> > 
> 
> > If I change directories to the parent of "foo_loading", then run
> 
> > 
> 
> > python -m foo_loading.em_load.sync_em sync_em.py
> 
> > 
> 
> > it works. However, this seems a bit roundabout, and I suspect I'm not 
> 
> doing things correctly.
> 
> > 
> 
> > Ideally, I want a user to be able to just run sync_em.py from it's own 
> 
> directory, and have it correctly import the logging/config modules from 
> 
> common_foo.py, and just work.
> 
> > 
> 
> > What is the correct way to achieve this?
> 
> 
> 
> you can access them as long as the *parent* directory of foo_loading is in 
> 
> sys.path through PYTHONPATH, or as the working directory, or any other 
> 
> means. However, if you step into the package, e. g.
> 
> 
> 
> $ cd foo_loading
> 
> $ python -c 'import common'
> 
> 
> 
> then from Python's point of view 'common' is a toplevel package rather than 
> 
> the intended 'foo_loading.common', and intra-package imports will break.
> 
> 
> 
> To preserve your sanity I therefore recommend that you 
> 
> 
> 
> (1) avoid to put package directories into sys.path
> 
> (1a) avoid to cd into a package
> 
> (2) put scripts you plan to invoke directly rather than import outside the 
> 
> package.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Organising packages/modules - importing functions from a common.py in a separate directory?

2013-10-29 Thread Victor Hooi
Hi,

Hmm, this post on SO seems to suggest that importing from another sibling 
directory in a package isn't actually possible in Python without some ugly 
hacks?

http://stackoverflow.com/questions/6323860/sibling-package-imports

Did I read the above correctly?

Is there another way I can structure my code so that I can run the sync_em.py 
and sync_pg.py scripts, and they can pull common functions from somewhere?

Cheers,
Victor

On Tuesday, 29 October 2013 12:08:10 UTC+11, Victor Hooi  wrote:
> Hi,
> 
> 
> 
> If I try to use:
> 
> 
> 
> from .common.common_foo import setup_foo_logging
> 
> 
> 
> I get:
> 
> 
> 
> ValueError: Attempted relative import in non-package
> 
> 
> 
> And the absolute imports don't seem to be able to find the right modules.
> 
> 
> 
> Is it something to do with the fact I'm running the sync_em.py script from 
> the "foo_loading/em_load" directory?
> 
> 
> 
> I thought I could just refer to the full path, and it'd find it, but 
> evidently not...hmm.
> 
> 
> 
> Cheers,
> 
> Victor
> 
> 
> 
> On Tuesday, 29 October 2013 12:01:03 UTC+11, Ben Finney  wrote:
> 
> > Victor Hooi  writes:
> 
> > 
> 
> > 
> 
> > 
> 
> > > Ok, so I should be using absolute imports, not relative imports.
> 
> > 
> 
> > 
> 
> > 
> 
> > I'd say it is fine to use relative imports, so long as they are
> 
> > 
> 
> > explicit. (In Python 3, the default for an import is to be absolute, and
> 
> > 
> 
> > the *only* way to do a relative import is to make it explicitly
> 
> > 
> 
> > relative. So you may as well start doing so now.)
> 
> > 
> 
> > 
> 
> > 
> 
> > > Hmm, I just tried to use absolute imports, and it can't seem to locate
> 
> > 
> 
> > > the modules:
> 
> > 
> 
> > >
> 
> > 
> 
> > > In the file "foo_loading/em_load/sync_em.py", I have:
> 
> > 
> 
> > >
> 
> > 
> 
> > > from common.common_bex import setup_foo_logging
> 
> > 
> 
> > 
> 
> > 
> 
> > So I'd recommend this be done with an explicit relative import:
> 
> > 
> 
> > 
> 
> > 
> 
> > from .common.common_bex import setup_foo_logging
> 
> > 
> 
> > 
> 
> > 
> 
> > or, better, import a module:
> 
> > 
> 
> > 
> 
> > 
> 
> > from .common import common_bex
> 
> > 
> 
> > 
> 
> > 
> 
> > or a whole package:
> 
> > 
> 
> > 
> 
> > 
> 
> > from . import common
> 
> > 
> 
> > 
> 
> > 
> 
> > -- 
> 
> > 
> 
> >  \ “I went over to the neighbor's and asked to borrow a cup of |
> 
> 
> >   `\   salt. ‘What are you making?’ ‘A salt lick.’” —Steven Wright |
> 
> > 
> 
> > _o__)  |
> 
> > 
> 
> > Ben Finney
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Organising packages/modules - importing functions from a common.py in a separate directory?

2013-10-28 Thread Victor Hooi
Hi,

If I try to use:

from .common.common_foo import setup_foo_logging

I get:

ValueError: Attempted relative import in non-package

And the absolute imports don't seem to be able to find the right modules.

Is it something to do with the fact I'm running the sync_em.py script from the 
"foo_loading/em_load" directory?

I thought I could just refer to the full path, and it'd find it, but evidently 
not...hmm.

Cheers,
Victor

On Tuesday, 29 October 2013 12:01:03 UTC+11, Ben Finney  wrote:
> Victor Hooi  writes:
> 
> 
> 
> > Ok, so I should be using absolute imports, not relative imports.
> 
> 
> 
> I'd say it is fine to use relative imports, so long as they are
> 
> explicit. (In Python 3, the default for an import is to be absolute, and
> 
> the *only* way to do a relative import is to make it explicitly
> 
> relative. So you may as well start doing so now.)
> 
> 
> 
> > Hmm, I just tried to use absolute imports, and it can't seem to locate
> 
> > the modules:
> 
> >
> 
> > In the file "foo_loading/em_load/sync_em.py", I have:
> 
> >
> 
> > from common.common_bex import setup_foo_logging
> 
> 
> 
> So I'd recommend this be done with an explicit relative import:
> 
> 
> 
> from .common.common_bex import setup_foo_logging
> 
> 
> 
> or, better, import a module:
> 
> 
> 
> from .common import common_bex
> 
> 
> 
> or a whole package:
> 
> 
> 
> from . import common
> 
> 
> 
> -- 
> 
>  \ “I went over to the neighbor's and asked to borrow a cup of |
> 
>   `\   salt. ‘What are you making?’ ‘A salt lick.’” —Steven Wright |
> 
> _o__)  |
> 
> Ben Finney
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Organising packages/modules - importing functions from a common.py in a separate directory?

2013-10-28 Thread Victor Hooi
Hi,

Ok, so I should be using absolute imports, not relative imports.

Hmm, I just tried to use absolute imports, and it can't seem to locate the 
modules:

In the file "foo_loading/em_load/sync_em.py", I have:

from common.common_bex import setup_foo_logging

When I try to run that script:

python sync_em.py

I get:

ImportError: No module named common.common_foo

I've also tried adding "foo_loading" (the package name):

from foo_loading.common.common_bex import setup_foo_logging

Same error:

ImportError: No module named foo_loading.common.bex_common

Any thoughts?

Cheers,
Victor

On Tuesday, 29 October 2013 00:12:58 UTC+11, Jean-Michel Pichavant  wrote:
> - Original Message -
> > Hi,
> > 
> > I have a collection of Python scripts I'm using to load various bits
> > of data into a database.
> > 
> > I'd like to move some of the common functions (e.g. to setup loggers,
> > reading in configuration etc.) into a common file, and import them
> > from there.
> > 
> > I've created empty __init__.py files, and my current directory
> > structure looks something like this:
> > 
> > foo_loading/
> > __init__.py
> > common/
> > common_foo.py
> > em_load/
> > __init__.py
> > config.yaml
> > sync_em.py
> > pg_load/
> > __init__.py
> > config.yaml
> > sync_pg.py
> > 
> > So from within the sync_em.py script, I'm trying to import a function
> > from foo_loading/common/common_foo.py.
> > 
> > from ..common.common_foo import setup_foo_logging
> > 
> > I get the error:
> > 
> > ValueError: Attempted relative import in non-package
> > 
> > If I change directories to the parent of "foo_loading", then run
> > 
> > python -m foo_loading.em_load.sync_em sync_em.py
> > 
> > it works. However, this seems a bit roundabout, and I suspect I'm not
> > doing things correctly.
> > 
> > Ideally, I want a user to be able to just run sync_em.py from it's
> > own directory, and have it correctly import the logging/config
> > modules from common_foo.py, and just work.
> > 
> > What is the correct way to achieve this?
> > 
> > Secondly, if I want to move all of the config.yaml files to a common
> > foo_loading/config.yaml, or even foo_loading/config/config.yaml,
> > what is the correct way to access this from within the scripts?
> > Should I just be using "../", or is there a better way?
> > 
> > Cheers,
> > Victor
> 
> Long story short : use absolute imports.
> 
> name properly your module with a distinct name and import that way, even 
> inside your package:
> 
> import foo_loading.common.common_foo
> 
> Names like common, lib, setup are farely prone to collision with other badly 
> referenced import from other modules. One way to solve this is to use a 
> distinct namespace, in other words, prefix every import with the module name.
> 
> cheers,
> 
> JM
> 
> 
> -- IMPORTANT NOTICE: 
> 
> The contents of this email and any attachments are confidential and may also 
> be privileged. If you are not the intended recipient, please notify the 
> sender immediately and do not disclose the contents to any other person, use 
> it for any purpose, or store or copy the information in any medium. Thank you.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Try-except for flow control in reading Sqlite

2013-10-28 Thread Victor Hooi
Hi,

We're on Python 2.6 (RHEL based system...) - I don't believe this exposes 
FileNotFoundError =(.

Cheers,
Victor

On Monday, 28 October 2013 17:36:05 UTC+11, Chris Angelico  wrote:
> On Mon, Oct 28, 2013 at 2:43 PM, Victor Hooi  wrote:
> 
> > Is it acceptable to use try-except in order to achieve this? E.g.:
> 
> >
> 
> > try:
> 
> > # Try to open up the SQLite file, and lookup the required entries
> 
> > except OSError:
> 
> > # Open an empty SQLite file, and create the schema
> 
> >
> 
> >
> 
> > My thinking is that it is (easier to ask forgiveness than permission), but 
> > I just wanted to check if there is a better way of achieving this?
> 
> 
> 
> That looks fine as a model, but is OSError what you want to be
> 
> catching? I'd go with FileNotFoundError if that's what you're looking
> 
> for - OSError would also catch quite a bit else, like permissions
> 
> errors.
> 
> 
> 
> ChrisA

-- 
https://mail.python.org/mailman/listinfo/python-list


Try-except for flow control in reading Sqlite

2013-10-27 Thread Victor Hooi
Hi,

I'd like to double-check something regarding using try-except for controlling 
flow.

I have a script that needs to lookup things in a SQLite database.

If the SQLite database file doesn't exist, I'd like to create an empty 
database, and then setup the schema.

Is it acceptable to use try-except in order to achieve this? E.g.:

try:
    # Try to open up the SQLite file, and lookup the required entries
except OSError:
    # Open an empty SQLite file, and create the schema


My thinking is that it is (easier to ask forgiveness than permission), but I 
just wanted to check if there is a better way of achieving this?

I'd also be doing the same thing for checking if a file is gzipped or not - we 
try to open it as a gzip, then as an ordinary text file, and if that also 
fails, raise a parsing error.
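For the gzip check, the shape I have in mind is roughly this (sketch only, assuming 
Python 2.7 where GzipFile works as a context manager, and where reading a non-gzip file 
raises IOError):

    import gzip

    def read_lines(path):
        """Return all lines from path, whether it's gzipped or plain text."""
        try:
            with gzip.open(path, 'rb') as f:
                return f.readlines()   # raises IOError if it isn't really gzip
        except IOError:
            with open(path, 'r') as f:
                return f.readlines()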


Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Organising packages/modules - importing functions from a common.py in a separate directory?

2013-10-27 Thread Victor Hooi
Hi,

I have a collection of Python scripts I'm using to load various bits of data 
into a database.

I'd like to move some of the common functions (e.g. to setup loggers, reading 
in configuration etc.) into a common file, and import them from there.

I've created empty __init__.py files, and my current directory structure looks 
something like this:

foo_loading/
    __init__.py
    common/
        common_foo.py
    em_load/
        __init__.py
        config.yaml
        sync_em.py
    pg_load/
        __init__.py
        config.yaml
        sync_pg.py

So from within the sync_em.py script, I'm trying to import a function from 
foo_loading/common/common_foo.py.

from ..common.common_foo import setup_foo_logging

I get the error:

ValueError: Attempted relative import in non-package 

If I change directories to the parent of "foo_loading", then run

python -m foo_loading.em_load.sync_em sync_em.py

it works. However, this seems a bit roundabout, and I suspect I'm not doing 
things correctly.

Ideally, I want a user to be able to just run sync_em.py from it's own 
directory, and have it correctly import the logging/config modules from 
common_foo.py, and just work.

What is the correct way to achieve this?

Secondly, if I want to move all of the config.yaml files to a common 
foo_loading/config.yaml, or even foo_loading/config/config.yaml, what is the 
correct way to access this from within the scripts? Should I just be using 
"../", or is there a better way?

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Processing large CSV files - how to maximise throughput?

2013-10-24 Thread Victor Hooi
Hi,

We have a directory of large CSV files that we'd like to process in Python.

We process each input CSV, then generate a corresponding output CSV file.

input CSV -> munging text, lookups etc. -> output CSV

My question is, what's the most Pythonic way of handling this?

For the reading, I'd do something like:

fieldnames = (...)

with open('input.csv', 'r') as input, open('output.csv', 'w') as output:
    csv_writer = DictWriter(output, fieldnames)
    for line in DictReader(input):
        # Do some processing for that line...
        output_line = process_line(line)
        # Write output to file
        csv_writer.writerow(output_line)

So for the reading, it'll iterates over the lines one by one, and won't read it 
into memory which is good.

For the writing - my understanding is that it writes a line to the file object 
each loop iteration, however, this will only get flushed to disk every now and 
then, based on my system default buffer size, right?

So if the output file is going to get large, there isn't anything I need to 
take into account for conserving memory?

Also, if I'm trying to maximise throughput of the above, is there anything I 
could try? The processing in process_line is quite light - just a bunch of 
string splits and regexes.

If I have multiple large CSV files to deal with, and I'm on a multi-core 
machine, is there anything else I can do to boost throughput?
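(The sort of thing I was considering for the multi-file case - one worker per input file 
via multiprocessing; process_file() is just a placeholder for the per-file read/munge/write 
loop above:)

    import glob
    import multiprocessing


    def process_file(input_csv):
        # Placeholder: read input_csv, munge each line, write the matching output CSV.
        pass


    if __name__ == '__main__':
        pool = multiprocessing.Pool()               # defaults to one worker per core
        pool.map(process_file, glob.glob('*.csv'))  # each file handled in its own process
        pool.close()
        pool.join()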

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Re-raising a RuntimeError - good practice?

2013-10-24 Thread Victor Hooi
Hi,

Thanks to @Stephen D'APrano and @Andrew Berg for your advice.

The advice seems to be that I should move my exception higher up, and try to 
handle it all in one place:

for job in jobs:
    try:
        try:
            job.run_all()
        except Exception as err:  # catch *everything*
            logger.error(err)
            raise
    except (SpamError, EggsError, CheeseError):
        # We expect these exceptions, and ignore them.
        # Everything else is a bug.
        pass

That makes sense, but I'm sorry but I'm still a bit confused.

Essentially, my requirements are:

1. If any job raises an exception, end that particular job, and continue 
with the next job.
2. Be able to differentiate between different exceptions in different 
stages of the job. For example, if I get a IOError in self.export_to_csv() 
versus one in  self.gzip_csv_file(), I want to be able to handle them 
differently. Often this may just result in logging a slightly different 
friendly error message to the logfile.

Am I still able to handle 2. if I handle all exceptions in the "for job in 
jobs" loop? How will I be able to distinguish between the same types of 
exceptions being raised by different methods?
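(One option I'm toying with, to keep a single outer handler but still know which stage 
failed, is to wrap each stage's errors in a stage-specific exception - the class names 
below are just made up for illustration:)

    class ExportError(Exception):
        """Something went wrong in export_to_csv()."""

    class UploadError(Exception):
        """Something went wrong in upload_to_foo()."""


    # Inside the job class, each stage wraps its own low-level errors:
    def export_to_csv(self):
        try:
            with open(self.export_sql_file, 'r') as f:
                self.export_sql_statement = f.read()
        except IOError as e:
            raise ExportError('Error reading %s - %s' % (self.export_sql_file, e))


    # ...and the outer loop only needs one handler, but still knows the stage:
    for job in jobs:
        try:
            job.run_all()
        except (ExportError, UploadError) as e:
            logger.error('Job %s failed: %s' % (job.friendly_name, e))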

Also, @Andrew Berg - you mentioned I'm just swallowing the original exception 
and re-raising a new RuntimeError - I'm guessing this is a bad practice, right? 
If I use just "raise"

except Exception as err:  # catch *everything*
    logger.error(err)
    raise

that will just re-raise the original exception right?

Cheers,
Victor

On Thursday, 24 October 2013 15:42:53 UTC+11, Andrew Berg  wrote:
> On 2013.10.23 22:23, Victor Hooi wrote:
> 
> > For example:
> 
> > 
> 
> > def run_all(self):
> 
> > self.logger.debug('Running loading job for %s' % self.friendly_name)
> 
> > try:
> 
> > self.export_to_csv()
> 
> > self.gzip_csv_file()
> 
> > self.upload_to_foo()
> 
> > self.load_foo_to_bar()
> 
> > except RuntimeError as e:
> 
> > self.logger.error('Error running job %s' % self.friendly_name)
> 
> > ...
> 
> > def export_to_csv(self):
> 
> > ...
> 
> > try:
> 
> > with open(self.export_sql_file, 'r') as f:
> 
> > self.logger.debug('Attempting to read in SQL export 
> > statement from %s' % self.export_sql_file)
> 
> > self.export_sql_statement = f.read()
> 
> > self.logger.debug('Successfully read in SQL export 
> > statement')
> 
> > except Exception as e:
> 
> > self.logger.error('Error reading in %s - %s' % 
> > (self.export_sql_file, e), exc_info=True)
> 
> > raise RuntimeError
> 
> You're not re-raising a RuntimeError. You're swallowing all exceptions and 
> then raising a RuntimeError. Re-raise the original exception in
> 
> export_to_csv() and then handle it higher up. As Steven suggested, it is a 
> good idea to handle exceptions in as few places as possible (and
> 
> as specifically as possible). Also, loggers have an exception method, which 
> can be very helpful in debugging when unexpected things happen,
> 
> especially when you need to catch a wide range of exceptions.
> 
> 
> 
> -- 
> 
> CPython 3.3.2 | Windows NT 6.2.9200 / FreeBSD 10.0
-- 
https://mail.python.org/mailman/listinfo/python-list


Re-raising a RuntimeError - good practice?

2013-10-23 Thread Victor Hooi
Hi,

I have a Python class that represents a loading job.

Each job has a run_all() method that calls a number of other class methods.

I'm calling run_all() on a bunch of jobs.

Some of methods called by run_all() can raise exceptions (e.g. missing files, 
DB connection failures) which I'm catching and logging.

If any of the methods fails, I'd like to terminate running that job, and move 
onto the next job.

I'm currently re-raising a RuntimeError, so that I can break out the run_all() 
and move onto the next job. 

For example:

def run_all(self):
    self.logger.debug('Running loading job for %s' % self.friendly_name)
    try:
        self.export_to_csv()
        self.gzip_csv_file()
        self.upload_to_foo()
        self.load_foo_to_bar()
    except RuntimeError as e:
        self.logger.error('Error running job %s' % self.friendly_name)
    ...

def export_to_csv(self):
    ...
    try:
        with open(self.export_sql_file, 'r') as f:
            self.logger.debug('Attempting to read in SQL export statement from %s' % self.export_sql_file)
            self.export_sql_statement = f.read()
            self.logger.debug('Successfully read in SQL export statement')
    except Exception as e:
        self.logger.error('Error reading in %s - %s' % (self.export_sql_file, e), exc_info=True)
        raise RuntimeError

My question is - is the above Pythonic, or an acceptable practice?

Or is there another way I should be handling errors, and moving on from 
failures, and if so what is it please?

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Using "with" context handler, and catching specific exception?

2013-10-22 Thread Victor Hooi
Hi,

I'm actually on Python 2.7, so we don't have access to any of those nice new 
exceptions in Python 3.3 =(:

http://docs.python.org/2.7/library/exceptions.html#exception-hierarchy

@Ben - Good point about just catching the more general exception, and just 
printing out the string message.

I suppose in most cases, we won't be doing anything special for the different 
types (e.g. file not found, permission error, is a directory etc.) - it'll just 
be going into logs.

Is there anything wrong with me just catching "Exception" in this case of 
opening a file, and printing the message from there?

Cheers,
Victor


On Tuesday, 22 October 2013 14:53:58 UTC+11, Ben Finney  wrote:
> Victor Hooi  writes:
> 
> 
> 
> > Aha, good point about IOError encapsulating other things, I'll use
> 
> > FileNotFoundError, and also add in some other except blocks for the
> 
> > other ones.
> 
> 
> 
> Or not; you can catch OSError, which is the parent of FileNotFoundError
> 
> http://docs.python.org/3/library/exceptions.html#exception-hierarchy>,
> 
> but don't assume in your code that it means anything more specific.
> 
> 
> 
> You should only catch specific exceptions if you're going to do
> 
> something specific with them. If all you want to do is log them and move
> 
> on, then catch a more general class and ask the exception object to
> 
> describe itself (by using it in a string context).
> 
> 
> 
> 
> 
> In versions of Python before 3.3, you have to catch EnvironmentError
> 
> http://docs.python.org/3.2/library/exceptions.html#EnvironmentError>
> 
> and then distinguish the specific errors by their ‘errno’ attribute
> 
> http://docs.python.org/3.2/library/errno.html>::
> 
> 
> 
> import errno
> 
> 
> 
> try:
> 
> with open('somefile.log', 'wb') as f:
> 
> f.write("hello there")
> 
> except EnvironmentError as exc:
> 
> if exc.errno == errno.ENOENT:
> 
> handle_file_not_found_error()
> 
> elif exc.errno == errno.EACCES:
> 
> handle_permission_denied()
> 
> elif exc.errno == errno.EEXIST:
> 
> handle_file_exists()
> 
> …
> 
> else:
> 
> handle_all_other_environment_errors()
> 
> 
> 
> That's much more clumsy, which is why it was improved in the latest
> 
> Python. If you can, code for Python 3.3 or higher.
> 
> 
> 
> -- 
> 
>  \ “Unix is an operating system, OS/2 is half an operating system, |
> 
>   `\Windows is a shell, and DOS is a boot partition virus.” —Peter |
> 
> _o__)H. Coffin |
> 
> Ben Finney
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Using "with" context handler, and catching specific exception?

2013-10-21 Thread Victor Hooi
Hi,

Thanks for the replies =).

Aha, good point about IOError encapsulating other things, I'll use 
FileNotFoundError, and also add in some other except blocks for the other ones.

And yes, I didn't use the exception object in my sample - I just sort. I'd 
probably be doing something like this.

logger.error("Some error message - %s" % e) 

So is the consensus then that I should wrap the "with" in a try-except block?

try:
    with open('somefile.log', 'wb') as f:
        f.write("hello there")
except FileNotFoundError as e:
    logger.error("Uhoh, the file wasn't there - %s" % e)

Cheers,
Victor

On Tuesday, 22 October 2013 14:04:14 UTC+11, Ben Finney  wrote:
> Victor Hooi  writes:
> 
> 
> 
> > try:
> 
> > with open('somefile.log', 'wb' as f:
> 
> > f.write("hello there")
> 
> > except IOError as e:
> 
> > logger.error("Uhoh, the file wasn't there").
> 
> 
> 
> IOError, as Steven D'Aprano points out, is not equivalent to “file not
> 
> found”. Also, you're not doing anything with the exception object, so
> 
> there's no point binding it to the name ‘e’.
> 
> 
> 
> What you want is the specific FileNotFoundError:
> 
> 
> 
> try:
> 
> with open('somefile.log', 'wb' as f:
> 
> f.write("hello there")
> 
> except FileNotFoundError:
> 
> logger.error("Uhoh, the file wasn't there").
> 
> 
> 
> See http://docs.python.org/3/library/exceptions.html#FileNotFoundError>.
> 
> 
> 
> -- 
> 
>  \“Choose mnemonic identifiers. If you can't remember what |
> 
>   `\mnemonic means, you've got a problem.” —Larry Wall |
> 
> _o__)  |
> 
> Ben Finney
-- 
https://mail.python.org/mailman/listinfo/python-list


Using "with" context handler, and catching specific exception?

2013-10-21 Thread Victor Hooi
Hi,

I suspect I'm holding this wrong.

How should I use the "with" context handler as well as handling specific 
exceptions?

For example, for a file:

with open('somefile.log', 'wb') as f:
    f.write("hello there")

How could I specifically catch IOError in the above, and handle that? Should I 
wrap the whole thing in a try-except block?

(For example, if I wanted to try a different location, or if I wanted to print 
a specific error message to the logfile).

try:
    with open('somefile.log', 'wb') as f:
        f.write("hello there")
except IOError as e:
    logger.error("Uhoh, the file wasn't there")

Cheers,
Victor
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Python - forking an external process?

2013-07-02 Thread Victor Hooi
Hi,

Hmm, this script is actually written using the Cliff framework 
(https://github.com/dreamhost/cliff).

I was hoping to keep the whole approach fairly simple, without needing to pull 
in too much external stuff, or set anything up.

There's no way to do it with just Python core is there?

Also, what's this improvement you mentioned?

Cheers,
Victor

On Wednesday, 3 July 2013 13:59:19 UTC+10, rusi  wrote:
> On Wednesday, July 3, 2013 9:17:29 AM UTC+5:30, Victor Hooi wrote:
> 
> > Hi,
> 
> > 
> 
> > I have a Python script where I want to run fork and run an external command 
> 
> > (or set of commands).
> 
> > For example, after doing , I then want to run ssh to a host, handover 
> 
> > control back to the user, and have my script terminate.
> 
> 
> 
> Seen Fabric? 
> 
> http://docs.fabfile.org/en/1.6/
> 
> 
> 
> Recently -- within the last month methinks -- there was someone who posted a 
> supposed improvement to it (forget the name)
-- 
http://mail.python.org/mailman/listinfo/python-list


Python - forking an external process?

2013-07-02 Thread Victor Hooi
Hi,

I have a Python script where I want to run fork and run an external command (or 
set of commands).

For example, after doing , I then want to run ssh to a host, handover 
control back to the user, and have my script terminate.

Or I might want to run ssh to a host, less a certain textfile, then exit.

What's the idiomatic way of doing this within Python? Is it possible to do with 
Subprocess?

Cheers,
Victor

(I did see this SO post - 
http://stackoverflow.com/questions/6011235/run-a-program-from-python-and-have-it-continue-to-run-after-the-script-is-kille,
 but it's a bit older, and I was going to see what the current idiomatic way of 
doing this is).
-- 
http://mail.python.org/mailman/listinfo/python-list


Using re.VERBOSE, and re-using components of regex?

2013-04-16 Thread Victor Hooi
Hi,

I'm trying to compile a regex Python with the re.VERBOSE flag (so that I can 
add some friendly comments).

However, the issue is, I normally use constants to define re-usable bits of the 
regex - however, these doesn't get interpreted inside the triple quotes.

For example:

import re

TIMESTAMP = r'(?P\d{2}:\d{2}:\d{2}.\d{9})'
SPACE = r' '
FOO = r'some_regex'
BAR = r'some_regex'

regexes = {
    'data_sent': re.compile("""
        TIMESTAMP # Timestamp of our log message
        SPACE
        FOO # Some comment
        SPACE
        """, re.VERBOSE),
    'data_received': re.compile("""
        TIMESTAMP # Timestamp of our log message
        SPACE
        BAR # Some comment
        SPACE
        """, re.VERBOSE),
}

Is there a way to use CONSTANTS (or at least re-use fragments of my regex), and 
also use re.VERBOSE so I can comment my regex?
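(The best I've come up with so far is to splice the constants in with %-style substitution 
before compiling - the group name below is only a placeholder, and the space is escaped so 
VERBOSE mode doesn't swallow it. Does this seem reasonable?)

    import re

    TIMESTAMP = r'(?P<timestamp>\d{2}:\d{2}:\d{2}\.\d{9})'   # placeholder group name
    SPACE = r'\ '        # escaped, so re.VERBOSE keeps the literal space
    FOO = r'some_regex'
    BAR = r'some_regex'

    regexes = {
        'data_sent': re.compile(r"""
            %(timestamp)s   # Timestamp of our log message
            %(space)s
            %(foo)s         # Some comment
            %(space)s
            """ % {'timestamp': TIMESTAMP, 'space': SPACE, 'foo': FOO},
            re.VERBOSE),
        'data_received': re.compile(r"""
            %(timestamp)s   # Timestamp of our log message
            %(space)s
            %(bar)s         # Some comment
            %(space)s
            """ % {'timestamp': TIMESTAMP, 'space': SPACE, 'bar': BAR},
            re.VERBOSE),
    }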

Cheers,
Victor
-- 
http://mail.python.org/mailman/listinfo/python-list


Doing both regex match and assignment within a If loop?

2013-03-28 Thread Victor Hooi
Hi,

I have logline that I need to test against multiple regexes. E.g.:

import re

expression1 = re.compile(r'')
expression2 = re.compile(r'')

with open('log.txt') as f:
    for line in f:
        if expression1.match(line):
            # Do something - extract fields from line.
        elif expression2.match(line):
            # Do something else - extract fields from line.
        else:
            # Oh noes! Raise exception.

However, in the "Do something" section - I need access to the match object 
itself, so that I can strip out certain fields from the line.

Is it possible to somehow test for a match, as well as do assignment of the re 
match object to a variable?

if expression1.match(line) = results:
    results.groupsdict()...

Obviously the above won't work - however, is there a Pythonic way to tackle 
this?

What I'm trying to avoid is this:

if expression1.match(line):
    results = expression1.match(line)

which I assume would call the regex match against the line twice - and when I'm 
dealing with a huge amount of log lines, slow things down.
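(The workaround I've seen mentioned elsewhere is a small helper that remembers the last 
match, so the test and the assignment happen in one call - a sketch, with placeholder 
patterns:)

    import re

    class Matcher(object):
        """Stores the last match so test and use can happen in one if/elif chain."""
        def __init__(self):
            self.match = None

        def __call__(self, pattern, line):
            self.match = pattern.match(line)
            return self.match

    m = Matcher()

    expression1 = re.compile(r'(?P<field>\w+)')   # placeholder patterns
    expression2 = re.compile(r'(?P<other>\d+)')

    with open('log.txt') as f:
        for line in f:
            if m(expression1, line):
                fields = m.match.groupdict()
            elif m(expression2, line):
                fields = m.match.groupdict()
            else:
                raise ValueError('Unrecognised line: %r' % line)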

Cheers,
Victor
-- 
http://mail.python.org/mailman/listinfo/python-list


Writing Python framework for declarative checks?

2013-03-18 Thread Victor Hooi
HI,

NB: I've posted this question on Reddit as well (but didn't get many responses 
from Pythonistas) - hope it's ok if I post here as well.

We currently use a collection of custom Python scripts to validate various 
things in our production environment/configuration.

Many of these are simple XML checks (i.e. validate that the value of this XML 
tag here equals the value in that file over there). Others might be to check 
that a host is up, or that this application's crontab start time is within 20 
minutes of X, or that a logfile on a server contains a certain line.

The checks are executed automatically before every production push.

The scripts are written imperatively. E.g.:

SSH into a server
Open up a file
Parse the XML
Search for a XML tag
Store the value in a variable
Compare it to another value somewhere else.

I'd like to look at writing a framework to do these validations in a slightly 
more declarative way - i.e. instead of defining how the server should check 
something, we should just be able to say value should equal foobar - 
and let the framework handle the how.
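(To make "declarative" a bit more concrete, I was picturing the checks being written as 
plain data - everything below is hypothetical, just to show the shape, with the framework 
supplying the "how" for each source type:)

    # Each check is just data - the framework knows how to fetch and compare it.
    CHECKS = [
        {
            'name': 'app start time configured',
            'host': 'app01.example.com',
            'source': {'type': 'xml', 'path': '/etc/myapp/config.xml',
                       'xpath': './schedule/start_time'},
            'expected': '04:00',
        },
        {
            'name': 'logfile contains startup line',
            'host': 'app02.example.com',
            'source': {'type': 'logfile', 'path': '/var/log/myapp.log',
                       'contains': 'Started successfully'},
        },
    ]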

I was thinking we could then schedule the checks and shove the jobs onto a 
queue like Celery.

To stop me from re-inventing the wheel - are there any existing projects that 
do something like this already?

Or has anybody here done something similar, or would be able to offer any 
advice?

(I aware of things like Puppet or Chef - or even Salt Stack - however, these 
are for managing deployments, or actually pushing out configurations. These 
validation scripts are more to ensure that the configuration changes done by 
hand are sane, or don't violate certain basic rules).

Cheers,
Victor
-- 
http://mail.python.org/mailman/listinfo/python-list


Spawn a process, then exit, whilst leaving process running?

2013-02-08 Thread Victor Hooi
Hi,

I have a Python script that I'd like to spawn a separate process (SSH client, 
in this case), and then have the script exit whilst the process continues to 
run.

I looked at Subprocess, however, that leaves the script running, and it's more 
for spawning processes and then dealing with their output.

Somebody mentioned multiprocessing, however, I'm not sure quite sure how that 
would work here.

What's the most Pythonic way of achieving this purpose?
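(Two approaches I'm weighing up, both POSIX-only as far as I know - I'm not sure which is 
considered more idiomatic:)

    import os
    import subprocess

    # Option 1: replace this script with ssh entirely - the user gets ssh's
    # terminal directly, and the Python process is gone.
    os.execvp('ssh', ['ssh', 'somehost'])

    # Option 2 (never reached if Option 1 runs, shown for comparison): start ssh,
    # don't call wait(), and simply let the script run off the end and exit.
    subprocess.Popen(['ssh', 'somehost'])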

Cheers,
Victor
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Searching through two logfiles in parallel?

2013-01-07 Thread Victor Hooi
Hi Oscar,

Thanks for the quick reply =).

I'm trying to understand your code properly, and it seems like for each line in 
logfile1, we loop through all of logfile2?

The idea was that it would remember it's position in logfile2 as well - since 
we can assume that the loglines are in chronological order - we only need to 
search forwards in logfile2 each time, not from the beginning each time.

So for example - logfile1:

05:00:06 Message sent - Value A: 5.6, Value B: 6.2, Value C: 9.9 
05:00:08 Message sent - Value A: 3.3, Value B: 4.3, Value C: 2.3
05:00:14 Message sent - Value A: 1.0, Value B: 0.4, Value C: 5.4

logfile2:

05:00:09 Message received - Value A: 5.6, Value B: 6.2, Value C: 9.9 
05:00:12 Message received - Value A: 3.3, Value B: 4.3, Value C: 2.3
05:00:15 Message received - Value A: 1.0, Value B: 0.4, Value C: 5.4

The idea is that I'd iterate through logfile 1 - I'd get the 05:00:06 logline - 
I'd search through logfile2 and find the 05:00:09 logline.

Then, back in logline1 I'd find the next logline at 05:00:08. Then in logfile2, 
instead of searching back from the beginning, I'd start from the next line, 
which happens to be 5:00:12.

In reality, I'd need to handle missing messages in logfile2, but that's the 
general idea.
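(To make the intent concrete, roughly this - parse_line() is just a stand-in for the real 
field extraction, and I've ignored the lost-message threshold for now:)

    def parse_line(line):
        # Placeholder: split "HH:MM:SS Message ... - Value A: ..." into (timestamp, values)
        timestamp, _, rest = line.partition(' ')
        values = rest.split(' - ', 1)[-1].strip()
        return timestamp, values

    def match_messages(sent_file, received_file):
        received = iter(received_file)      # shared iterator - we never rewind logfile2
        for sent_line in sent_file:
            sent_ts, sent_values = parse_line(sent_line)
            for received_line in received:
                rcvd_ts, rcvd_values = parse_line(received_line)
                if rcvd_values == sent_values:
                    yield (sent_ts, rcvd_ts, sent_values)
                    break
            else:
                yield (sent_ts, None, sent_values)   # logfile2 ran out before a match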

Does that make sense? (There's also a chance I've misunderstood your buf code, 
and it does do this - in that case, I apologies - is there any chance you could 
explain it please?)

Cheers,
Victor

On Tuesday, 8 January 2013 09:58:36 UTC+11, Oscar Benjamin  wrote:
> On 7 January 2013 22:10, Victor Hooi  wrote:
> 
> > Hi,
> 
> >
> 
> > I'm trying to compare two logfiles in Python.
> 
> >
> 
> > One logfile will have lines recording the message being sent:
> 
> >
> 
> > 05:00:06 Message sent - Value A: 5.6, Value B: 6.2, Value C: 9.9
> 
> >
> 
> > the other logfile has line recording the message being received
> 
> >
> 
> > 05:00:09 Message received - Value A: 5.6, Value B: 6.2, Value C: 9.9
> 
> >
> 
> > The goal is to compare the time stamp between the two - we can safely 
> > assume the timestamp on the message being received is later than the 
> > timestamp on transmission.
> 
> >
> 
> > If it was a direct line-by-line, I could probably use itertools.izip(), 
> > right?
> 
> >
> 
> > However, it's not a direct line-by-line comparison of the two files - the 
> > lines I'm looking for are interspersed among other loglines, and the time 
> > difference between sending/receiving is quite variable.
> 
> >
> 
> > So the idea is to iterate through the sending logfile - then iterate 
> > through the receiving logfile from that timestamp forwards, looking for the 
> > matching pair. Obviously I want to minimise the amount of back-forth 
> > through the file.
> 
> >
> 
> > Also, there is a chance that certain messages could get lost - so I assume 
> > there's a threshold after which I want to give up searching for the 
> > matching received message, and then just try to resync to the next sent 
> > message.
> 
> >
> 
> > Is there a Pythonic way, or some kind of idiom that I can use to approach 
> > this problem?
> 
> 
> 
> Assuming that you can impose a maximum time between the send and
> 
> recieve timestamps, something like the following might work
> 
> (untested):
> 
> 
> 
> def find_matching(logfile1, logfile2, maxdelta):
> 
> buf = {}
> 
> logfile2 = iter(logfile2)
> 
> for msg1 in logfile1:
> 
> if msg1.key in buf:
> 
> yield msg1, buf.pop(msg1.key)
> 
> continue
> 
> maxtime = msg1.time + maxdelta
> 
> for msg2 in logfile2:
> 
> if msg2.key == msg1.key:
> 
> yield msg1, msg2
> 
> break
> 
> buf[msg2.key] = msg2
> 
> if msg2.time > maxtime:
> 
> break
> 
> else:
> 
> yield msg1, 'No match'
> 
> 
> 
> 
> 
> Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Searching through two logfiles in parallel?

2013-01-07 Thread Victor Hooi
Hi,

I'm trying to compare two logfiles in Python.

One logfile will have lines recording the message being sent:

05:00:06 Message sent - Value A: 5.6, Value B: 6.2, Value C: 9.9

the other logfile has line recording the message being received

05:00:09 Message received - Value A: 5.6, Value B: 6.2, Value C: 9.9

The goal is to compare the time stamp between the two - we can safely assume 
the timestamp on the message being received is later than the timestamp on 
transmission.

If it was a direct line-by-line, I could probably use itertools.izip(), right?

However, it's not a direct line-by-line comparison of the two files - the lines 
I'm looking for are interspersed among other loglines, and the time difference 
between sending/receiving is quite variable.

So the idea is to iterate through the sending logfile - then iterate through 
the receiving logfile from that timestamp forwards, looking for the matching 
pair. Obviously I want to minimise the amount of back-forth through the file.

Also, there is a chance that certain messages could get lost - so I assume 
there's a threshold after which I want to give up searching for the matching 
received message, and then just try to resync to the next sent message.

Is there a Pythonic way, or some kind of idiom that I can use to approach this 
problem?

Cheers,
Victor
-- 
http://mail.python.org/mailman/listinfo/python-list


Using mktime to convert date to seconds since epoch - omitting elements from the tuple?

2013-01-02 Thread Victor Hooi
Hi,

I'm using pysvn to checkout a specific revision based on date - pysvn will only 
accept a date in terms of seconds since the epoch.

I'm attempting to use time.mktime() to convert a date (e.g. "2012-02-01") to 
seconds since the epoch.

According to the docs, mktime expects a 9-element tuple.

My question is, how should I omit elements from this tuple? And what is the 
expected behaviour when I do that?

For example, (zero-index), element 6 is the day of the week, and element 7 is 
the day in the year, out of 366 - if I specify the earlier elements, then I 
shouldn't really need to specify these.

However, the docs don't seem to talk much about this.

I just tried testing putting garbage numbers for element 6 and 7, whilst 
specifying the earlier elements:

> time.mktime((2012, 5, 5, 23, 59, 59, 23424234, 5234234 ,0 ))

It seems to have no effect what numbers I set 6 and 7 to - is that because the 
earlier elements are set?

How should I properly omit them? Is this all documented somewhere? What is the 
minimum I need to specify? And what happens to the fields I don't specify?
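(The approach I'm leaning towards now is to not build the tuple by hand at all, and let 
datetime fill in the derived fields:)

    import time
    from datetime import datetime

    # strptime computes the weekday/yearday fields for us; mktime then treats
    # the resulting struct_time as local time and returns seconds since the epoch.
    dt = datetime.strptime('2012-02-01', '%Y-%m-%d')
    seconds = time.mktime(dt.timetuple())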

Cheers,
Victor
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Using regexes versus "in" membership test?

2012-12-12 Thread Victor Hooi
Heya,

See my original first post =):

> Would I be substantially better off using a list of strings and using "in" 
> against each line, then using a second pass of regex only on the matched 
> lines? 

Based on what Steven said, and what I know about the logs in question, it's 
definitely better to do it that way.

However, I'd still like to fix up the regex, or fix any glaring issues with it 
as well.

Cheers,
Victor

On Thursday, 13 December 2012 17:19:57 UTC+11, Chris Angelico  wrote:
> On Thu, Dec 13, 2012 at 5:10 PM, Victor Hooi  wrote:
> 
> > Are there any other general pointers you might give for that regex? The 
> > lines I'm trying to match look something like this:
> 
> >
> 
> > 07:40:05.793627975 [Info  ] [SOME_MODULE] [SOME_FUNCTION] 
> > [SOME_OTHER_FLAG] [RequestTag=0 ErrorCode=3 ErrorText="some error message" 
> > ID=0:0x Foo=1 Bar=5 Joe=5]
> 
> >
> 
> > Essentially, I'd want to strip out the timestamp, logging-level, module, 
> > function etc - and possibly the tag-value pairs?
> 
> 
> 
> If possible, can you do a simple test to find out whether or not you
> 
> want a line and then do more complex parsing to get the info you want
> 
> out of it? For instance, perhaps the presence of the word "ErrorCode"
> 
> is all you need to check - it wouldn't hurt if you have a few percent
> 
> of false positives that get discarded during the parse phase, it'll
> 
> still be quicker to do a single string-in-string check than a complex
> 
> regex to figure out if you even need to process the line at all.
> 
> 
> 
> ChrisA

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Using regexes versus "in" membership test?

2012-12-12 Thread Victor Hooi
Hi,

That was actually *one* regex expression...lol.

And yes, it probably is a bit convoluted.

Thanks for the tip about using VERBOSE - I'll use that, and comment my regex - 
that's a useful tip.

Are there any other general pointers you might give for that regex? The lines 
I'm trying to match look something like this:

07:40:05.793627975 [Info  ] [SOME_MODULE] [SOME_FUNCTION] [SOME_OTHER_FLAG] 
[RequestTag=0 ErrorCode=3 ErrorText="some error message" 
ID=0:0x Foo=1 Bar=5 Joe=5]

Essentially, I'd want to strip out the timestamp, logging-level, module, 
function etc - and possibly the tag-value pairs?

And yes, based on what you said, I probably will use the "in" loop first 
outside the regex - the lines I'm searching for are fairly few compared to the 
overall log size.

Cheers,
Victor

On Thursday, 13 December 2012 12:09:33 UTC+11, Steven D'Aprano  wrote:
> On Wed, 12 Dec 2012 14:35:41 -0800, Victor Hooi wrote:
> 
> 
> 
> > Hi,
> 
> > 
> 
> > I have a script that trawls through log files looking for certain error
> 
> > conditions. These are identified via certain keywords (all different) in
> 
> > those lines
> 
> > 
> 
> > I then process those lines using regex groups to extract certain fields.
> 
> [...]
> 
> > Also, my regexs could possibly be tuned, they look something like this:
> 
> > 
> 
> > (?P\d{2}:\d{2}:\d{2}.\d{9})\s*\[(?P\w+)\s*
> 
> \]\s*\[(?P\w+)\s*\]\s*\[{0,1}\]{0,1}\s*\[(?P\w+)\s*\]
> 
> \s*level\(\d\) broadcast\s*\(\[(?P\w+)\]\s*\[(?P\w+)\]
> 
> \s*(?P\w{4}):(?P\w+) failed order: (?P\w+) (?
> 
> P\d+) @ (?P[\d.]+), error on update \(\d+ : Some error 
> 
> string. Active Orders=(?P\d+) Limit=(?P\d+)\)\)
> 
> >
> 
> > (Feel free to suggest any tuning, if you think they need it).
> 
> 
> 
> "Tuning"? I think it needs to be taken out and killed with a stake to the 
> 
> heart, then buried in concrete! :-)
> 
> 
> 
> An appropriate quote:
> 
> 
> 
> Some people, when confronted with a problem, think "I know, 
> 
> I'll use regular expressions." Now they have two problems.
> 
> -- Jamie Zawinski
> 
> 
> 
> Is this actually meant to be a single regex, or did your email somehow 
> 
> mangle multiple regexes into a single line?
> 
> 
> 
> At the very least, you should write your regexes using the VERBOSE flag, 
> 
> so you can use non-significant whitespace and comments. There is no 
> 
> performance cost to using VERBOSE once they are compiled, but a huge 
> 
> maintainability benefit.
> 
> 
> 
> 
> 
> > My question is - I've heard that using the "in" membership operator is
> 
> > substantially faster than using Python regexes.
> 
> > 
> 
> > Is this true? What is the technical explanation for this? And what sort
> 
> > of performance characteristics are there between the two?
> 
> 
> 
> Yes, it is true. The technical explanation is simple:
> 
> 
> 
> * the "in" operator implements simple substring matching, 
> 
>   which is trivial to perform and fast;
> 
> 
> 
> * regexes are an interpreted mini-language which operate via
> 
>   a complex state machine that needs to do a lot of work,
> 
>   which is complicated to perform and slow.
> 
> 
> 
> Python's regex engine is not as finely tuned as (say) Perl's, but even in 
> 
> Perl simple substring matching ought to be faster, simply because you are 
> 
> doing less work to match a substring than to run a regex.
> 
> 
> 
> But the real advantage to using "in" is readability and maintainability.
> 
> 
> 
> As for the performance characteristics, you really need to do your own 
> 
> testing. Performance will depend on what you are searching for, where you 
> 
> are searching for it, whether it is found or not, your version of Python, 
> 
> your operating system, your hardware.
> 
> 
> 
> At some level of complexity, you are better off just using a regex rather 
> 
> than implementing your own buggy, complicated expression matcher: for 
> 
> some matching tasks, there is no reasonable substitute to regexes. But 
> 
> for *simple* uses, you should prefer *simple* code:
> 
> 
> 
> [steve@ando ~]$ python -m timeit \
> 
> > -s "data = 'abcd'*1000 + 'xyz' + 'abcd'*1000" \
> 
> > "'xyz' in data"
> 
> 10 loops, best of 3: 4.17 usec per loop
> 
> 
> 
> [steve@ando ~]$ python -m timeit \
> 
> > -s "data = 'abcd'*100

Using regexes versus "in" membership test?

2012-12-12 Thread Victor Hooi
Hi,

I have a script that trawls through log files looking for certain error 
conditions. These are identified via certain keywords (all different) in those 
lines.

I then process those lines using regex groups to extract certain fields.

Currently, I'm using a for loop to iterate through the file, and a dict of 
regexes:

breaches = {
    'type1': re.compile(r'some_regex_expression'),
    'type2': re.compile(r'some_regex_expression'),
    'type3': re.compile(r'some_regex_expression'),
    'type4': re.compile(r'some_regex_expression'),
    'type5': re.compile(r'some_regex_expression'),
}
...
with open('blah.log', 'r') as f:
    for line in f:
        for breach in breaches:
            results = breaches[breach].search(line)
            if results:
                self.logger.info('We found an error - {0} - {1}'.format(
                    results.group('errorcode'), results.group('errormsg')))
                # We do other things with other regex groups as well.

(This isn't the *exact* code, but it shows the logic/flow fairly closely).

For completeness, the actual regexes look something like this (they could possibly be tuned):


(?P\d{2}:\d{2}:\d{2}.\d{9})\s*\[(?P\w+)\s*\]\s*\[(?P\w+)\s*\]\s*\[{0,1}\]{0,1}\s*\[(?P\w+)\s*\]\s*level\(\d\)
 
broadcast\s*\(\[(?P\w+)\]\s*\[(?P\w+)\]\s*(?P\w{4}):(?P\w+)
 failed order: (?P\w+) (?P\d+) @ (?P[\d.]+), error on 
update \(\d+ : Some error string. Active Orders=(?P\d+) 
Limit=(?P\d+)\)\)

(Feel free to suggest any tuning, if you think they need it).

My question is - I've heard that using the "in" membership operator is 
substantially faster than using Python regexes.

Is this true? What is the technical explanation for this? And what sort of 
performance characteristics are there between the two?

(I couldn't find much in the way of docs for "in", just the brief mention here 
- http://docs.python.org/2/reference/expressions.html#not-in )

Would I be substantially better off using a list of strings and using "in" 
against each line, then using a second pass of regex only on the matched lines?

(Log files are compressed - I'm actually using bz2 to read them in - and the 
uncompressed size is around 40-50 GB).



Cheers,
Victor
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Creating different classes dynamically?

2012-12-09 Thread Victor Hooi
heya,

Dave: Ahah, thanks =).

You're right, my terminology was off, I want to dynamically *instantiate*, not 
create new classes.

And yes, removing the brackets worked =).
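
For reference, this is roughly what the working version looks like (just a sketch - the constructor signature and the use of the stdlib ElementTree here are assumptions for this example):

from xml.etree import ElementTree as etree   # assuming the stdlib parser here

class AnimalConfigurationParser(object):
    def __init__(self, filename):            # made-up constructor for this sketch
        self.filename = filename

class DogConfigurationParser(AnimalConfigurationParser):
    pass

class CatConfigurationParser(AnimalConfigurationParser):
    pass

# Map root tags to the classes themselves (no parentheses), not to instances.
root_tags = {
    'DogRootTag': DogConfigurationParser,
    'CatRootTag': CatConfigurationParser,
}

def make_parser(path):
    tree = etree.parse(path)
    xml_root = tree.getroot()
    # Look up the class by tag, *then* call it to create the instance.
    return root_tags[xml_root.tag](path)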

Cheers,
Victor

On Monday, 10 December 2012 11:53:30 UTC+11, Dave Angel  wrote:
> On 12/09/2012 07:35 PM, Victor Hooi wrote:
> 
> > Hi,
> 
> >
> 
> > I have a directory tree with various XML configuration files.
> 
> >
> 
> > I then have separate classes for each type, which all inherit from a base. 
> > E.g.
> 
> >
> 
> > class AnimalConfigurationParser:
> 
> > ...
> 
> >
> 
> > class DogConfigurationParser(AnimalConfigurationParser):
> 
> > ...
> 
> >
> 
> > class CatConfigurationParser(AnimalConfigurationParser):
> 
> > 
> 
> >
> 
> > I can identify the type of configuration file from the root XML tag.
> 
> >
> 
> > I'd like to walk through the directory tree, and create different objects 
> > based on the type of configuration file:
> 
> >
> 
> > for root, dirs, files in os.walk('./'):
> 
> > for file in files:
> 
> > if file.startswith('ml') and file.endswith('.xml') and 'entity' 
> > not in file:
> 
> > with open(os.path.join(root, file), 'r') as f:
> 
> > try:
> 
> > tree = etree.parse(f)
> 
> > root = tree.getroot()
> 
> > print(f.name)
> 
> > print(root.tag)
> 
> > # Do something to create the appropriate type of 
> > parser
> 
> > except xml.parsers.expat.ExpatError as e:
> 
> > print('Unable to parse file {0} - 
> > {1}'.format(f.name, e.message))
> 
> >
> 
> > I have a dict with the root tags - I was thinking of mapping these directly 
> > to the functions - however, I'm not sure if that's the right way to do it? 
> > Is there a more Pythonic way of doing this?
> 
> >
> 
> > root_tags = {
> 
> >'DogRootTag': DogConfigurationParser(),
> 
> > 'CatRootTag': CatConfigurationParser(),
> 
> > }
> 
> >
> 
> > Cheers,
> 
> > Victor
> 
> 
> 
> Your subject line says you want to create the classes dynamically, but
> 
> that's not what your code implies.  if you just want to decide which
> 
> class to INSTANTIATE dynamically, that's easily done, and you have it
> 
> almost right.  In your dict you should leave off those parentheses.
> 
> 
> 
> 
> 
> 
> 
> Then the parser creation looks something like:
> 
>parser_instance = root_tags[root.tag] (arg1, arg2)
> 
> 
> 
> where the arg1, arg2 are whatever arguments the __init__ of these
> 
> classes expects.
> 
> 
> 
> (untested)
> 
> 
> 
> -- 
> 
> 
> 
> DaveA

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: TypeError: 'in <string>' requires string as left operand, not Element

2012-12-09 Thread Victor Hooi
Hi,

Ignore me - PEBKAC...lol.

I used "root" both for the os.walk, and also for the root XML element.

Thanks anyhow =).

Cheers,
Victor

On Monday, 10 December 2012 11:52:34 UTC+11, Victor Hooi  wrote:
> Hi,
> 
> 
> 
> I'm getting a strange error when I try to run the following:
> 
> 
> 
> for root, dirs, files in os.walk('./'):
> 
> for file in files:
> 
> if file.startswith('ml') and file.endswith('.xml') and 'entity' 
> not in file:
> 
> print(root)
> 
> print(file)
> 
> with open(os.path.join(root, file), 'r') as f:
> 
> print(f.name)
> 
> try:
> 
> tree = etree.parse(f)
> 
> root = tree.getroot()
> 
> print(f.name)
> 
> print(root.tag)
> 
> except xml.parsers.expat.ExpatError as e:
> 
> print('Unable to parse file {0} - {1}'.format(f.name, 
> e.message))
> 
> 
> 
> The error is:
> 
> 
> 
> Traceback (most recent call last):
> 
>   File "foo.py", line 275, in 
> 
> marketlink_configfiles()
> 
>   File "foo.py", line 83, in bar
> 
> with open(os.path.join(root, file), 'r') as f:
> 
>   File "C:\Python27\lib\ntpath.py", line 97, in join
> 
> if path[-1] in "/\\":
> 
> TypeError: 'in <string>' requires string as left operand, not Element
> 
> 
> 
> Cheers,
> 
> Victor

-- 
http://mail.python.org/mailman/listinfo/python-list


TypeError: 'in <string>' requires string as left operand, not Element

2012-12-09 Thread Victor Hooi
Hi,

I'm getting a strange error when I try to run the following:

for root, dirs, files in os.walk('./'):
    for file in files:
        if file.startswith('ml') and file.endswith('.xml') and 'entity' not in file:
            print(root)
            print(file)
            with open(os.path.join(root, file), 'r') as f:
                print(f.name)
                try:
                    tree = etree.parse(f)
                    root = tree.getroot()
                    print(f.name)
                    print(root.tag)
                except xml.parsers.expat.ExpatError as e:
                    print('Unable to parse file {0} - {1}'.format(f.name, e.message))

The error is:

Traceback (most recent call last):
  File "foo.py", line 275, in <module>
    marketlink_configfiles()
  File "foo.py", line 83, in bar
    with open(os.path.join(root, file), 'r') as f:
  File "C:\Python27\lib\ntpath.py", line 97, in join
    if path[-1] in "/\\":
TypeError: 'in <string>' requires string as left operand, not Element

Cheers,
Victor
-- 
http://mail.python.org/mailman/listinfo/python-list


Creating different classes dynamically?

2012-12-09 Thread Victor Hooi
Hi,

I have a directory tree with various XML configuration files.

I then have separate classes for each type, which all inherit from a base. E.g.

class AnimalConfigurationParser:
    ...

class DogConfigurationParser(AnimalConfigurationParser):
    ...

class CatConfigurationParser(AnimalConfigurationParser):
    ...

I can identify the type of configuration file from the root XML tag.

I'd like to walk through the directory tree, and create different objects based 
on the type of configuration file:

for root, dirs, files in os.walk('./'):
    for file in files:
        if file.startswith('ml') and file.endswith('.xml') and 'entity' not in file:
            with open(os.path.join(root, file), 'r') as f:
                try:
                    tree = etree.parse(f)
                    root = tree.getroot()
                    print(f.name)
                    print(root.tag)
                    # Do something to create the appropriate type of parser
                except xml.parsers.expat.ExpatError as e:
                    print('Unable to parse file {0} - {1}'.format(f.name, e.message))

I have a dict with the root tags - I was thinking of mapping these directly to 
the functions - however, I'm not sure if that's the right way to do it? Is 
there a more Pythonic way of doing this?

root_tags = {
    'DogRootTag': DogConfigurationParser(),
    'CatRootTag': CatConfigurationParser(),
}

Cheers,
Victor
-- 
http://mail.python.org/mailman/listinfo/python-list


Using argparse to call method in various Classes?

2011-07-17 Thread Victor Hooi
Hi,

I'm attempting to use argparse to write a simple script to perform operations 
on various types of servers:

manage_servers.py <operation> <type_of_server>

Operations are things like check, build, deploy, configure, verify etc.

Types of server are just different types of inhouse servers we use.

We have a generic server class, and specific types that inherit from that:

class Server:
    def configure_logging(self, logging_file):
        ...
    def check(self):
        ...
    def deploy(self):
        ...
    def configure(self):
        ...
    def __init__(self, hostname):
        self.hostname = hostname
        logging = self.configure_logging(LOG_FILENAME)

class SpamServer(Server):
    def check(self):
        ...

class HamServer(Server):
    def deploy(self):
        ...

My question is how to link that all up to argparse?

Originally, I was using argparse subparsers for the operations (check, build, 
deploy) and another argument for the type.

subparsers = parser.add_subparsers(
    help='The operation that you want to run on the server.')
parser_check = subparsers.add_parser(
    'check', help='Check that the server has been setup correctly.')
parser_build = subparsers.add_parser(
    'build', help='Download and build a copy of the execution stack.')
parser_build.add_argument('-r', '--revision', help='SVN revision to build from.')
...
parser.add_argument('type_of_server', action='store', choices=types_of_servers,
                    help='The type of server you wish to create.')

Normally, you'd link each subparser to a function, and then pass in the 
type_of_server as an argument. However, that's slightly backwards because of the 
classes - I need to create an instance of the appropriate Server class and then 
call the operation method on that instance, not a generic check/build/configure 
function.
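
Roughly the flow I'm trying to end up with is something like this (untested sketch - note I've moved type_of_server in front of the subcommand so it parses cleanly, the --hostname argument is made up, and the classes are cut-down stand-ins for the real ones above):

import argparse

# Cut-down stand-ins for the real classes above, just so the sketch is self-contained.
class Server(object):
    def __init__(self, hostname):
        self.hostname = hostname
    def check(self):
        print('checking {0}'.format(self.hostname))
    def build(self):
        print('building on {0}'.format(self.hostname))

class SpamServer(Server):
    pass

class HamServer(Server):
    pass

types_of_servers = {
    'spam': SpamServer,
    'ham': HamServer,
}

parser = argparse.ArgumentParser()
parser.add_argument('type_of_server', choices=types_of_servers,
                    help='The type of server you wish to create.')
parser.add_argument('--hostname', default='localhost',
                    help='Made-up argument, just for this sketch.')
subparsers = parser.add_subparsers(dest='operation',
                                   help='The operation that you want to run on the server.')
subparsers.add_parser('check', help='Check that the server has been setup correctly.')
parser_build = subparsers.add_parser('build',
                                     help='Download and build a copy of the execution stack.')
parser_build.add_argument('-r', '--revision', help='SVN revision to build from.')

args = parser.parse_args()

# The bit I'm not sure how to wire up cleanly: pick the class from the positional
# argument, instantiate it, then call the method named by the subcommand.
server = types_of_servers[args.type_of_server](args.hostname)
getattr(server, args.operation)()   # e.g. server.check() or server.build()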

Any ideas of how I could achieve the above? Perhaps a different design pattern 
for Servers? Or any way to mould argparse to work with this?

Thanks,
Victor
-- 
http://mail.python.org/mailman/listinfo/python-list


Argparse, and linking to methods in Subclasses

2011-07-17 Thread Victor Hooi
Hi,

I have a simple Python script to perform operations on various types of 
in-house servers:

manage_servers.py <operation> <type_of_server>

Operations are things like check, build, deploy, configure, verify etc.

Types of server are just different types of inhouse servers we use.

We have a generic server class, then specific types that inherit from that:

class Server:
    def configure_logging(self, logging_file):
        ...
    def check(self):
        ...
    def deploy(self):
        ...
    def configure(self):
        ...
    def __init__(self, hostname):
        self.hostname = hostname
        logging = self.configure_logging(LOG_FILENAME)

class SpamServer(Server):
    def check(self):
        ...

class HamServer(Server):
    def deploy(self):
        ...

My question is how to link that all up to argparse?

Originally, I was using argparse subparsers for the operations (check, build, 
deploy) and another argument for the type.

subparsers = parser.add_subparsers(
    help='The operation that you want to run on the server.')
parser_check = subparsers.add_parser(
    'check', help='Check that the server has been setup correctly.')
parser_build = subparsers.add_parser(
    'build', help='Download and build a copy of the execution stack.')
parser_build.add_argument('-r', '--revision', help='SVN revision to build from.')
...
parser.add_argument('type_of_server', action='store', choices=types_of_servers,
                    help='The type of server you wish to create.')

Normally, you'd link each subparser to a function, and then pass in the 
type_of_server as an argument. However, that's slightly backwards because of the 
classes - I need to create an instance of the appropriate Server class and then 
call the operation method on that instance.

Any ideas of how I could achieve the above? Perhaps a different design pattern 
for Servers? Or a way to use argparse in this situation?

Thanks,
Victor
-- 
http://mail.python.org/mailman/listinfo/python-list