[issue24787] csv.Sniffer guesses "M" instead of \t or , as the delimiter

2015-08-08 Thread Peter Otten

Peter Otten added the comment:

Have you considered writing your own little sniffer? Getting it right for your 
actual data is usually easier to achieve than a general solution.

The following simplistic sniffer should work with your samples:

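import csv
import operator

def make_dialect(delimiter):
    # Build a csv.excel subclass that differs only in its delimiter.
    class Dialect(csv.excel):
        pass
    Dialect.delimiter = delimiter
    return Dialect

def sniff(sample):
    # Pick the candidate delimiter that occurs most often in the sample.
    count, delimiter = max(
        ((sample.count(delim), delim) for delim in ",\t|;"),
        key=operator.itemgetter(0))
    if count == 0:
        # None of the candidates occurs; fall back to a space if present.
        if " " in sample:
            delimiter = " "
        else:
            raise csv.Error("Could not determine delimiter")
    return make_dialect(delimiter)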
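
A minimal sketch of how the sniffer above might be plugged into csv.reader. The read_rows helper and the sample rows are made up for illustration and are not taken from the reporter's data; the two functions are repeated only so the snippet runs on its own.

```python
import csv
import io
import operator


def make_dialect(delimiter):
    """Return a csv.excel subclass that uses the given delimiter."""
    class Dialect(csv.excel):
        pass
    Dialect.delimiter = delimiter
    return Dialect


def sniff(sample):
    """Pick the most frequent candidate delimiter, falling back to a space."""
    count, delimiter = max(
        ((sample.count(delim), delim) for delim in ",\t|;"),
        key=operator.itemgetter(0))
    if count == 0:
        if " " in sample:
            delimiter = " "
        else:
            raise csv.Error("Could not determine delimiter")
    return make_dialect(delimiter)


def read_rows(text):
    """Hypothetical helper: sniff the dialect from the text, then parse it."""
    dialect = sniff(text)
    return list(csv.reader(io.StringIO(text, newline=""), dialect))


# Made-up tab-separated sample, only for demonstration.
sample = "RefID\tVehYear\tAuction\n1\t2006\tADESA\n2\t2004\tOTHER\n"
rows = read_rows(sample)
```

Because the candidates are counted over the whole sample, a comma-heavy text column in a tab-separated file could still win, so this is only as reliable as the assumption that the true delimiter is the most frequent candidate.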

Tiago, if you want to follow that path, we should take the discussion to the 
general Python mailing list.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24787] csv.Sniffer guesses "M" instead of \t or , as the delimiter

2015-08-07 Thread Tiago Wright

Tiago Wright added the comment:

Attached is a .py file with 32 test cases for the Sniffer class: 18 that
fail and 14 that pass.

My hope is that these samples can be used to improve the delimiter
detection code.
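
A rough sketch of how samples in this shape could be run through csv.Sniffer, recording "Exception" whenever csv.Error is raised. The check_samples function and the two demo entries are hypothetical and are not the contents of the attached file.

```python
import csv


def check_samples(delimiter_samples):
    """Return (expected, guessed) pairs; guessed is "Exception" on csv.Error."""
    results = []
    sniffer = csv.Sniffer()
    for case in delimiter_samples:
        try:
            guessed = sniffer.sniff(case["sample"]).delimiter
        except csv.Error:
            guessed = "Exception"
        results.append((case["delimiter"], guessed))
    return results


# Two tiny made-up cases with the same keys as the entries in the attachment.
demo_samples = [
    {"delimiter": ",", "sample": "a,b,c\n1,2,3\n4,5,6\n"},
    {"delimiter": "\t", "sample": "a\tb\tc\n1\t2\t3\n4\t5\t6\n"},
]
results = check_samples(demo_samples)

# Keep only the cases where the guess differs from the human-assigned delimiter.
failures = [(want, got) for want, got in results if want != got]
```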

-Tiago

--
Added file: http://bugs.python.org/file40149/testround8.py

___
Python tracker 

___
import csv

def test_delimiters():

    delimiter_samples = [

        { 'delimiter' :"\t", 'sample' :   # error:"Exception"
'''Field Name   Definition
RefID   Unique (sequential) number assigned to 
vehicles
IsBadBuyIdentifies if the kicked vehicle was an 
avoidable purchase 
PurchDate   The Date the vehicle was Purchased at 
Auction
Auction Auction provider at which the  vehicle 
was purchased
VehYear The manufacturer's year of the vehicle
VehicleAge  The Years elapsed since the 
manufacturer's year
''' },

        { 'delimiter' :"\t", 'sample' :   # error:"Exception"
'''rulessupport confidence  lift
1   {Brushes} => {Nail.Polish}  0.149   1   3.57142857142857
2   {Brushes} => {Bronzer}  0.097   0.651006711409396   2.5738856414
3   {Brushes} => {Concealer}0.092   0.61744966442953
1.39694494214826
4   {Lip.liner} => {Concealer}  0.179   0.764957264957265   
1.73067254515218
5   {Bronzer} => {Concealer}0.175   0.627240143369176   
1.41909534698909
6   {Blush} => {Concealer}  0.220.606060606060606   1.37117784176608
''' },

        { 'delimiter' :",", 'sample' :   # error:"Exception"
'''A,B,C,D,E
2000-01-03 
00:00:00,0.980268513777,3.68573087906,-0.364216805298,-1.15973806169,foo
2000-01-04 
00:00:00,1.04791624281,-0.0412318367011,-0.16181208307,0.212549316967,bar
2000-01-05 
00:00:00,0.498580885705,0.731167677815,-0.537677223318,1.34627041952,baz
2000-01-06 
00:00:00,1.12020151869,1.56762092543,0.00364077397681,0.67525259227,qux
2000-01-07 
00:00:00,-0.487094399463,0.571454623474,-1.6116394093,0.103468562917,foo2
''' },

        { 'delimiter' :",", 'sample' :   # error:"Exception"
'''1,699,4751,4158
8,1856
12,4059,5716,4299,4967,2128
16,1928,1176
19,1928,2775,4646,1720,3148,2552,5978,3736,3090
22,4059,1856,4103,4739,4865,4769,621,2874,1637,252
28,5321,4059,4952,1856,4103,699,1976
''' },

        { 'delimiter' :",", 'sample' :   # error:"Exception"
'''���Date,From,To,Flight_Number,Airline,Distance,Duration,Seat,Seat_Type,Class,Reason,Plane,Registration,Trip,Note,From_OID,To_OID,Airline_OID,Plane_OID
2004-08-27,YHZ,YYZ,,Air Canada,801,01:56,,A,Y,L,73,193,330
2004-08-01,YYZ,YHZ,,Air Canada,801,01:56,,A,Y,L,193,73,330
2004-07-30,YHZ,YYZ,,Air Canada,801,01:56,,A,Y,L,73,193,330
2004-05-30,ZRH,MUC,,Lufthansa,162,00:47,,,Y,L,1678,346,3320
2004-05-30,MUC,YYZ,,Air Canada,4131,07:53,,,Y,L,346,193,330
2004-05-30,YYZ,YOW,,Unknown,226,00:54,,,Y,L,193,100,-1
''' },

        { 'delimiter' :"\t", 'sample' :   # error:"Exception"
'''Format version   Start date  End dateSender  Recipient   
Aggregator
5   2010-05-01  2010-05-31  Spotify Udsvxd  Udsvxd
Country Label   Product CurrencyTotal tracksRightholder's tracks
Pro rata share  Revenue share   Number of users Net revenue Payable USD 
RateUSD Payable
XV  Ipstqx Gjivgmn  C   JFG 331264067   0.0020.00   
87845   851092.49   0.045.6647  0.09
JN  Mvcqxv Gjivgmqxd Iv P   JFG 368037889   635611  0.01
40.00   472355  639147.36   506.62  5.6647  562.82
IL  Mvcqxv Gjivgmn  C   JFG 35016   0.0420.00   8   
31.61   0.055.6647  0.05
DW  Mvcqxv  C   DWO 6283654158448   0.0420.00   84344   
330574.21   557.63  5.8230  513.62
''' },

        { 'delimiter' :",", 'sample' :   # error:"Exception"
'''age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,1iclass
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, 
Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, 
Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, 
White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, 
Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, 
Black, Female, 0, 0, 40, Cuba, <=50K
37, Private, 284582, Masters, 14, Married-civ

[issue24787] csv.Sniffer guesses "M" instead of \t or , as the delimiter

2015-08-06 Thread Tiago Wright

Tiago Wright added the comment:

I've run the Sniffer against the same data set, but varied the size of the
sample given to the code. Feeding it more data actually seems to make the
results less accurate. Table attached.

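
One way such an experiment might be set up is sketched below, assuming the sample is simply the first N lines of each file. The function name sniff_by_sample_size and the generated data are made up; the line counts 3, 7, 70 and 700 are taken from the column labels of the attached table.

```python
import csv


def sniff_by_sample_size(text, line_counts=(3, 7, 70, 700)):
    """Map each line count to the sniffed delimiter, or "Exception"."""
    lines = text.splitlines(keepends=True)
    outcome = {}
    for n in line_counts:
        # Use only the first n lines as the sample handed to the Sniffer.
        sample = "".join(lines[:n])
        try:
            outcome[n] = csv.Sniffer().sniff(sample).delimiter
        except csv.Error:
            outcome[n] = "Exception"
    return outcome


# Made-up data: 10 comma-separated rows, so sizes 70 and 700 see the same text.
text = "".join("%d,%d,%d\n" % (i, i * 2, i * 3) for i in range(10))
outcome = sniff_by_sample_size(text)
```

Tallying these outcomes per human-assigned delimiter over many files would give a table of the same shape as the attachment.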
On Thu, Aug 6, 2015 at 12:29 PM R. David Murray 
wrote:

>
> R. David Murray added the comment:
>
> Yes, much better :)
>
> --
>
> ___
> Python tracker 
> 
> ___
>

--
Added file: http://bugs.python.org/file40141/csvsniffertest5.txt

___
Python tracker 

___

human  Sniff      lines3  lines7  lines70  lines700
,      ,             490     487      424       393
       A               1       0        0         0
       Exception       6       8        4         4
       c               1       1        1         1
       g               1       0        0         0
       h               1       0        0         0
       space           0       0        9         7
       y               0       0        1         1
;      ;               1       1        1         1
\t     \t            918     917      929       706
       *               0       0        6         7
       ,               6       3        2         1
       -               0       0        0         5
       :               0       2        2         2
       D               5       0        0         0
       E               0       0       10        10
       Exception      52      91       18        18
       M               1       1        0         0
       c               2       0        0         0
       m               2       0        0         0
       p              61      27       22        22
       s               0       0        2         2
       space           1       6       51       125
bar    bar            33      33       20         9
space  Exception       0       1        1         1
       e               4       4        4         4
       space          10       9        9         9



[issue24787] csv.Sniffer guesses "M" instead of \t or , as the delimiter

2015-08-06 Thread R. David Murray

R. David Murray added the comment:

Yes, much better :)




[issue24787] csv.Sniffer guesses "M" instead of \t or , as the delimiter

2015-08-06 Thread Tiago Wright

Tiago Wright added the comment:

I apologize; it seems the text table got line-wrapped. Here it is again as a
TXT attachment.

-Tiago

On Thu, Aug 6, 2015 at 12:22 PM Tiago Wright  wrote:

>
> Tiago Wright added the comment:
>
>
>

--
Added file: http://bugs.python.org/file40140/csvsniffertest3.txt

___
Python tracker 

___

      | Sniffer
Human |   , |   ; |  \t |   \ | space | Except |   : |   ) |   c |   e |   M |   p | Total | %Error
---------------------------------------------------------------------------------------------------
,     | 498 |     |     |   2 |     1 |     10 |     |     |   1 |     |     |     |   512 |   2.7%
;     |     |   1 |     |     |       |        |     |     |     |     |     |     |     1 |   0.0%
\t    |   3 |     | 922 |     |     6 |     91 |   2 |   1 |     |     |   2 |  27 |  1054 |  12.5%
|     |     |     |     |  33 |       |        |     |     |     |     |     |     |    33 |   0.0%
space |     |     |     |     |     9 |      1 |     |     |     |   4 |     |     |    14 |  35.7%
---------------------------------------------------------------------------------------------------
Total | 501 |   1 | 922 |  35 |    16 |    102 |   2 |   1 |   1 |   4 |   2 |  27 |  1614 |
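
The %Error column appears to be the share of files in each row whose guess differs from the human-assigned delimiter. A small sketch of that arithmetic, using counts copied from the table; the error_rate function and the dictionary layout are assumptions made for illustration.

```python
def error_rate(row_counts, correct_key):
    """Fraction of guesses in a row that differ from the expected delimiter."""
    total = sum(row_counts.values())
    wrong = total - row_counts.get(correct_key, 0)
    return wrong / total


# Non-empty cells of three rows, keyed by the column labels used in the table.
comma_row = {",": 498, "\\": 2, "space": 1, "Except": 10, "c": 1}
tab_row = {",": 3, "\\t": 922, "space": 6, "Except": 91,
           ":": 2, ")": 1, "M": 2, "p": 27}
space_row = {"space": 9, "Except": 1, "e": 4}

comma_pct = round(error_rate(comma_row, ",") * 100, 1)       # 14 of 512
tab_pct = round(error_rate(tab_row, "\\t") * 100, 1)         # 132 of 1054
space_pct = round(error_rate(space_row, "space") * 100, 1)   # 5 of 14
```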



[issue24787] csv.Sniffer guesses "M" instead of \t or , as the delimiter

2015-08-06 Thread R. David Murray

R. David Murray added the comment:

Your best bet is to attach an ascii text file as an uploaded file.




[issue24787] csv.Sniffer guesses "M" instead of \t or , as the delimiter

2015-08-06 Thread Tiago Wright

Tiago Wright added the comment:

It seems the HTML file did not come through correctly. Here is a text
version; please view it in a monospace font:

      | Sniffer
Human |   , |   ; |  \t |   \ | space | Except |   : |   ) |   c |   e |   M |   p | Total | %Error
---------------------------------------------------------------------------------------------------
,     | 498 |     |     |   2 |     1 |     10 |     |     |   1 |     |     |     |   512 |   2.7%
;     |     |   1 |     |     |       |        |     |     |     |     |     |     |     1 |   0.0%
\t    |   3 |     | 922 |     |     6 |     91 |   2 |   1 |     |     |   2 |  27 |  1054 |  12.5%
|     |     |     |     |  33 |       |        |     |     |     |     |     |     |    33 |   0.0%
space |     |     |     |     |     9 |      1 |     |     |     |   4 |     |     |    14 |  35.7%
---------------------------------------------------------------------------------------------------
Total | 501 |   1 | 922 |  35 |    16 |    102 |   2 |   1 |   1 |   4 |   2 |  27 |  1614 |

On Thu, Aug 6, 2015 at 8:54 AM Tiago Wright  wrote:

>
> Tiago Wright added the comment:
>
> Table attached.
>
> -Tiago
>
> On Wed, Aug 5, 2015 at 8:14 PM Skip Montanaro 
> wrote:
>
> >
> > Skip Montanaro added the comment:
> >
> > Tiago, sorry, but your last post with results is completely
> > unintelligible. Can you toss the table in a file and attach it instead?
> >
> > --
> >
> > ___
> > Python tracker 
> > 
> > ___
> >
>
> --
> Added file: http://bugs.python.org/file40138/csvsniffertest3.htm
>
> ___
> Python tracker 
> 
> ___




[issue24787] csv.Sniffer guesses "M" instead of \t or , as the delimiter

2015-08-06 Thread Tiago Wright

Tiago Wright added the comment:

Table attached.

-Tiago

On Wed, Aug 5, 2015 at 8:14 PM Skip Montanaro 
wrote:

>
> Skip Montanaro added the comment:
>
> Tiago, sorry, but your last post with results is completely
> unintelligible. Can you toss the table in a file and attach it instead?
>
> --
>
> ___
> Python tracker 
> 
> ___
>

--
Added file: http://bugs.python.org/file40138/csvsniffertest3.htm

___
Python tracker 

___