On 2017-08-02 19:05, MRAB wrote:
> On 2017-08-02 16:05, Daiyue Weng wrote:
>> Hi, I am trying to remove extra quotes from a large set of strings (a
>> list of strings), so for each original string, it looks like,
>>
>> """str_value1"",""str_value2"",""str_value3"",1,""str_value4"""
>>
>> I like to remove the start and end quotes and extra pairs of quotes on
>> each string value, so the result will look like,
>>
>> "str_value1","str_value2","str_value3",1,"str_value4"
>>
>> and then join each string by a new line.
>>
>> I have tried the following code,
>>
>> for line in str_lines[1:]:
>>     strip_start_end_quotes = line[1:-1]
>>     splited_line_rem_quotes = strip_start_end_quotes.replace('\"\"', '"')
>>     str_lines[str_lines.index(line)] = splited_line_rem_quotes
>>
>> for_pandas_new_headers_str = '\n'.join(splited_lines)
>>
>> but it is really slow (running for ages) if the list contains over 1
>> million string lines. I am thinking about a fast way to do that.
>>
> [snip]
>
> The problem is the line:
>
>     str_lines[str_lines.index(line)]
>
> It does a linear search through str_lines until it finds a match for
> the line.
>
> To find the 10th line it must search through the first 10 lines.
>
> To find the 100th line it must search through the first 100 lines.
>
> To find the 1000th line it must search through the first 1000 lines.
>
> And so on.
>
> In Big-O notation, the performance is O(n**2).
>
> The Pythonic way of doing it is to put the results into a new list:
>
> new_str_lines = str_lines[:1]
>
> for line in str_lines[1:]:
>     strip_start_end_quotes = line[1:-1]
>     splited_line_rem_quotes = strip_start_end_quotes.replace('\"\"', '"')
>     new_str_lines.append(splited_line_rem_quotes)
>
> In Big-O notation, the performance is O(n).
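To make the quoted single-pass idea concrete, here is a minimal sketch. The helper name strip_extra_quotes and the sample data (including the made-up header line) are assumptions for illustration only; the slicing and replace() logic are taken from the code in the thread.

```python
def strip_extra_quotes(line):
    """Drop the outer pair of quotes, then collapse each doubled quote."""
    return line[1:-1].replace('""', '"')


# Hypothetical sample data: a header line followed by over-quoted rows.
str_lines = [
    'col1,col2,col3,col4,col5',
    '"""str_value1"",""str_value2"",""str_value3"",1,""str_value4"""',
    '"""a"",""b"",""c"",2,""d"""',
]

# One pass, no str_lines.index() lookup: keep the first line untouched
# (as the original loop did) and fix every line after it.
new_str_lines = str_lines[:1] + [strip_extra_quotes(line) for line in str_lines[1:]]

for_pandas_new_headers_str = '\n'.join(new_str_lines)
print(for_pandas_new_headers_str)
```

Besides being slow, list.index() returns the position of the first matching element, so if the same line occurs twice the original loop would look up the wrong slot. Building a new list avoids that as well.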

Sometimes it's desirable to modify the list in-place (such as in this case,
where you don't really want to double the memory use):

    # fixed() stands for whatever function cleans up a single line
    for idx, line in enumerate(str_lines):
        str_lines[idx] = fixed(line)

The most Pythonic way to process a large "list" of data is often to not use
a list at all, but to use iterators. Whether it's feasible to access the
strings one-by-one will depend on where they come from and where they're
going. Something like this may or may not be useful:

    def remove_quotes_from_all(lines):
        for line in lines:
            # Lines read from a file still end in '\n'; drop it first so
            # that [1:-1] removes the two outer quotes.
            yield line.rstrip('\n')[1:-1].replace('""', '"')

    with open('weird_file.txt', 'r') as infile:
        with open('not_so_weird_file.txt', 'w') as outfile:
            for fixed_line in remove_quotes_from_all(infile):
                outfile.write(f'{fixed_line}\n')

-- Thomas