Re: [julia-users] Re: Crashing while parsing large XML file

Brandon Booth Sat, 30 Jan 2016 11:09:12 -0800

I'm a moron, but that's a different issue. I fixed the readline/eachline 
issue, but that didn't address the crashing problem. I did some 
experimenting though and I think I fixed the problem.
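
For anyone who finds this later, here is a small sketch of why mixing the 
two was wrong. It runs on an in-memory buffer rather than my real file, is 
written for a recent Julia version, and the variable names are made up for 
the demo:

```julia
# Stand-in for the real file: four lines in an in-memory buffer.
data = "a\nb\nc\nd\n"

# Correct: eachline alone hands back every line exactly once.
seen = String[]
for l in eachline(IOBuffer(data))
    push!(seen, l)
end

# Buggy hybrid: eachline consumes one line for the loop variable and
# readline then consumes another, so some lines never get processed.
io = IOBuffer(data)
mixed = String[]
for line in eachline(io)
    push!(mixed, readline(io))
end

println(length(seen), " lines with eachline alone")
println(length(mixed), " lines with the eachline/readline hybrid")
```

The hybrid processes fewer lines than the file contains, which is the 
skipping Stefan described.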


I added free(str) at the end of each loop iteration to release the memory 
allocated by parse_string. I was parsing each line, and for some reason my 
program was hanging onto the results, so memory usage slowly crept up until 
the program crashed. Adding free(str) kept memory usage flat, and the 
program ran through the entire file.
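
A minimal sketch of the pattern, assuming the XML functions come from the 
LightXML package (parse_string, root, name, content and free all exist 
there). The helper name field_from_line is made up for illustration and is 
not part of my original code:

```julia
using LightXML  # assumption: the package that provides parse_string and free

# Hypothetical helper: parse one "<field>value</field>" line and return
# the tag name and its text content as ordinary Julia strings.
function field_from_line(l::AbstractString)
    xdoc = parse_string(l)   # document memory is managed by libxml2, not by Julia's GC
    try
        r = root(xdoc)
        return name(r), content(r)
    finally
        free(xdoc)           # release the parsed document on every path out of the function
    end
end

tag, value = field_from_line("<field1>value</field1>")
println(tag, " => ", value)
```

Putting free in a finally block means the document is released even if 
something between parsing and freeing throws an error.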



On Thursday, January 28, 2016 at 3:38:45 PM UTC-5, Stefan Karpinski wrote:
>
> At best, you'll only see every other line, right? At worst, eachline may 
> do some IO lookahead (i.e. read one line ahead) and this will do something 
> even more confusing.
>
> On Thu, Jan 28, 2016 at 3:35 PM, Brandon Booth <etu...@gmail.com> wrote:
>
>> No real reason. I was going back and forth between eachline(f) and for i 
>> = 1:n to see if it worked for 1000 rows, then 10,000 rows, etc. I ended up 
>> with a hybrid of the two. Will that matter much?
>>
>>
>> On Thursday, January 28, 2016 at 1:32:09 PM UTC-5, Diego Javier Zea wrote:
>>>
>>> Hi! 
>>>
>>> Why are you using 
>>>
>>> for line in eachline(f)
>>>   l = readline(f)
>>>
>>>
>>> instead of
>>>
>>> for l in eachline(f)
>>>
>>>
>>> ?
>>>
>>> Best
>>>
>>> El jueves, 28 de enero de 2016, 12:42:35 (UTC-3), Brandon Booth escribió:
>>>>
>>>> I'm parsing an XML file that's about 30 GB and wrote the loop below to 
>>>> parse it line by line. My code cycles through each line and builds a 
>>>> 1x200 dataframe that is appended to a larger dataframe. When the larger 
>>>> dataframe gets to 1000 rows I stream it to an SQLite table. The code 
>>>> works for the first 25 million or so lines (which equates to 125,000 or 
>>>> so records in the SQLite table) and then freezes. I've tried it without 
>>>> the larger dataframe but that didn't help.
>>>>
>>>> Any suggestions to avoid crashing?
>>>>
>>>> Thanks.
>>>>
>>>> Brandon
>>>>
>>>>
>>>>
>>>> The XML structure:
>>>> <doc>
>>>> <field1>value</field1>
>>>> <field2>value</field2>
>>>> ...
>>>> </doc>
>>>> <doc>
>>>> <field1>value</field1>
>>>> <field2>value</field2>
>>>> ...
>>>> </doc>
>>>>
>>>>
>>>> My loop:
>>>>
>>>> f = open("contracts.xml","r")
>>>> readline(f)
>>>> n = countlines(f)
>>>> tic()
>>>> for line in eachline(f)
>>>>   l = readline(f)
>>>>   if startswith(l,"<doc")
>>>>     df = DataFrame(df_types,df_names, 1)
>>>>   elseif startswith(l,"</doc")
>>>>     append!(df1,df)
>>>>     if size(df1,1) == 1000
>>>>       source = convertdf(df1)
>>>>       Data.stream!(source,sink)
>>>>       deleterows!(df1,1:1000)
>>>>     end
>>>>   else
>>>>     str = parse_string(l)
>>>>     r = root(str)
>>>>     df[symbol(name(r))] = string(content(r))
>>>>   end
>>>> end
>>>>
>>>> close(f)
>>>>
>>>>
>
