On Thursday, September 30, 2021 at 9:20:37 AM UTC+8, hongy...@gmail.com wrote:
> On Thursday, September 30, 2021 at 5:20:04 AM UTC+8, Peter J. Holzer wrote: 
> > On 2021-09-29 01:22:03 -0700, hongy...@gmail.com wrote: 
> > > I tried to convert a xls file into csv with the following command, but 
> > > failed: 
> > > 
> > > $ in2csv --sheet 'Sheet1' 2021-2022-1.xls 
> > > XLRDError: Unsupported format, or corrupt file: Expected BOF record; 
> > > found b'\r\n\r\n\r\n\r\n' 
> > > 
> > > The test file is available here [1]. 
> > > 
> > > [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls 
> > Why is that file name .xls when it's obviously an HTML file?
> Good catch! Thank you for pointing this out. This file is automatically 
> exported from my university's teaching management system, and it was assigned 
> the .xls extension by default. 

Following the comment above, once I changed the extension to .html, the
following Python code does the trick:


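import sys
import pandas as pd

if len(sys.argv) != 2:
    print('Usage: ' + sys.argv[0] + ' input-file')
    sys.exit(1)

# read_html returns a list of DataFrames, one per <table> found in the file.
myhtml_pd = pd.read_html(sys.argv[1])
#In [25]: len(myhtml_pd)
#Out[25]: 3

# The data is in the third table. Skip row 0 and columns 0 and 1,
# then print every non-empty cell.
table = myhtml_pd[2]
for i in table.index:
    if i > 0:
        for j in table.columns:
            if j > 1 and not pd.isnull(table.loc[i, j]):
                print(table.loc[i, j])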
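
As a side note, a mislabeled export like this one can be spotted before
handing it to in2csv by looking at the first bytes of the file. The sketch
below is only an illustration: sniff_format and sniff_file are hypothetical
helper names, and the list of HTML start tags is my assumption about what such
exports usually begin with. The magic numbers are the well-known OLE2
compound-file signature (used by most legacy .xls files) and the zip signature
(used by .xlsx).

```python
# Leading bytes of common spreadsheet containers.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file (legacy .xls)
ZIP_MAGIC = b"PK\x03\x04"                        # zip archive (.xlsx)

# Assumption: tags an HTML export is likely to start with.
HTML_STARTS = (b"<!doctype html", b"<html", b"<head", b"<meta", b"<table")


def sniff_format(head: bytes) -> str:
    """Guess the real format from the first bytes of a file (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # HTML exports may begin with a UTF-8 BOM or blank lines before the markup,
    # which matches the b'\r\n\r\n\r\n\r\n' reported in the error above.
    text = head.lstrip(b"\xef\xbb\xbf").lstrip().lower()
    if text.startswith(HTML_STARTS):
        return "html"
    return "unknown"


def sniff_file(path: str, nbytes: int = 512) -> str:
    """Read the first nbytes of the file at path and classify them."""
    with open(path, "rb") as fh:
        return sniff_format(fh.read(nbytes))
```

Calling sniff_file('2021-2022-1.xls') should report 'html' if the file starts
with blank lines followed by one of the tags listed above. A result of
'unknown' only means that none of these simple checks matched.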

HZ
-- 
https://mail.python.org/mailman/listinfo/python-list
