On Wednesday, September 29, 2021 at 8:12:08 PM UTC+8, J.O. Aho wrote: > On 29/09/2021 13.10, hongy...@gmail.com wrote: > > On Wednesday, September 29, 2021 at 5:40:58 PM UTC+8, J.O. Aho wrote: > >> On 29/09/2021 10.22, hongy...@gmail.com wrote: > >>> I tried to convert a xls file into csv with the following command, but > >>> failed: > >>> > >>> $ in2csv --sheet 'Sheet1' 2021-2022-1.xls > >>> XLRDError: Unsupported format, or corrupt file: Expected BOF record; > >>> found b'\r\n\r\n\r\n\r\n' > >>> > >>> The above testing file is located at here [1]. > >>> > >>> [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls > >>> > >>> Any hints for fixing this problem? > >> You need to delete the 13 first lines in the file > > > > Yes. After deleting the top 3 lines, the problem has been fixed. > > > >> or you see to that your code does first trim the data before start xml > >> parse it. > > > > Yes. I really want to do this trick programmatically, but how do I do it > > without manually editing the file? > You could do something like loading the XML into a string (myxmlstr)
How to do this operation? As you have seen, the file refused to be loaded at all. > and then find the fist < in that string > > xmlstart = myxmlstr.find('<') > > xmlstr = myxmlstr[xmlstart:] > > then use the xmlstr in the xml parser, sure not as convenient as loading > the file directly to the xml parser. > > I don't say this is the best way of doing it, I'm sure some python wiz > here would have a smarter solution. Another very strange thing: I trimmed the first 3 lines in the original file and saved it into a new one named as 2021-2022-1-trimmed-top-3-lines.xls. [1] Then I read the file with the following python script named as pandas-excel.py: ------ import pandas as pd excel_file='2021-2022-1-trimmed-top-3-lines.xls' #print(pd.ExcelFile(excel_file).sheet_names) newpd=pd.read_excel(excel_file, sheet_name='Sheet1') for i in newpd.index: if i >1: for j in newpd.columns: if int(j.split()[1]) > 2: if not pd.isnull(newpd.loc[i][j]): print(newpd.loc[i][j]) ------ $ python pandas-excel.py | sort -u 汽车实用英语 [1-8]周 1-4节 38 汽车楼413基础电气实训室II 汽修1932 汽车车载网络系统的检测与修复 [1-12]周 1-4节 38 汽车楼416安全、舒适系统实训室 汽修1932 OTOH, I also tried to read the file with in2csv as follows: $ in2csv --sheet Sheet1 2021-2022-1-trimmed-top-3-lines.xls 2>/dev/null |tr ',' '\n' | \ sed -re '/^$/d' | sort -u | awk '{print length($0),$0}' | sort -k1n | tail -3 | cut -d ' ' -f2- 汽车实用英语 [1-8]周 1-4节 38 汽车楼413基础电气实训室II 汽修1932 智能网联汽车概论 [1-8]周 6-9节 45 汽车楼511汽车营销策划实训室 汽销1931 汽车车载网络系统的检测与修复 [1-12]周 1-4节 38 汽车楼416安全、舒适系统实训室 汽修1932 As you can see, the above two methods give different results. I'm very puzzled by this phenomenon. Any hints/tips/comments will be greatly appreciated. [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1-trimmed-top-3-lines.xls Regards, HZ -- https://mail.python.org/mailman/listinfo/python-list