Re: [julia-users] Re: PEG Parser

John Myles White Tue, 27 May 2014 15:59:36 -0700

I'd be really interested to see how this parser compares with DataFrames. 
There's a bunch of test files in the DataFrames.jl/test directory.


 -- John

On May 27, 2014, at 3:49 PM, Abe Schneider <abe.schnei...@gmail.com> wrote:

> I don't know how the speed of the parser will compare to DataFrames -- 
> I've done absolutely no work to date on profiling the code, but I thought 
> writing a CSV parser was a good way to test out code (and helped find a bunch 
> of bugs).
> 
> I've also committed (under examples/) the CSV parser. The grammar (from the 
> RFC) is:
> 
> @grammar csv begin
>   start = data
>   data = record + *(crlf + record)
>   record = field + *(comma + field)
>   field = escaped_field | unescaped_field
>   escaped_field = dquote + *(textdata | comma | cr | lf | dquote2) + dquote
>   unescaped_field = textdata
>   textdata = r"[ !#$%&'()*+\-./0-~]+"
>   cr = '\r'
>   lf = '\n'
>   crlf = cr + lf
>   dquote = '"'
>   dquote2 = "\"\""
>   comma = ','
> end
> 
> and the actions are:
> 
> tr["crlf"] = (node, children) -> nothing
> tr["comma"] = (node, children) -> nothing
> 
> tr["escaped_field"] = (node, children) -> node.children[2].value
> tr["unescaped_field"] = (node, children) -> node.children[1].value
> tr["field"] = (node, children) -> children
> tr["record"] = (node, children) -> unroll(children)
> tr["data"] = (node, children) -> unroll(children)
> tr["textdata"] = (node, children) -> node.value
> 
> 
> Given the data:
> 
> parse_data = """1,2,3\r\nthis is,a test,of csv\r\n"these","are","quotes ("")""""
> 
> and running the parser:
> 
> (node, pos, error) = parse(csv, parse_data)
> result = transform(tr, node)
> 
> I get:
> 
> {{"1","2","3"},{"this is","a test","of csv"},{"these","are","quotes (\"\")"}}
> 
> 
> 
> 
> 
> On Monday, May 26, 2014 3:41:26 AM UTC-4, harven wrote:
> Nice!
> 
> If you are interested in testing your library on a concrete problem, you may 
> want to parse comma-separated value (CSV) files. The BNF is in the 
> specification, RFC 4180: http://tools.ietf.org/html/rfc4180
> 
> AFAIK, the readcsv function provided in Base does not handle quoted fields 
> well, whereas the CSV parser in DataFrames is slow, so Julia does not yet 
> have an efficient native way to parse CSV files.
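
For anyone comparing against the grammar above, here is a minimal hand-written sketch of the quoting rules in RFC 4180, in current Julia syntax. It is not the PEG library's API and not how DataFrames does it; `parse_csv` is a hypothetical name made up for this example. It assumes records are separated by CRLF, as in the RFC, and treats a lone CR or LF outside quotes as ordinary text. Unlike the transform output quoted above, it collapses a doubled quote inside a quoted field to a single quote, which is how the RFC describes escaping.

```julia
# Hand-written RFC 4180 reader (illustrative sketch; `parse_csv` is hypothetical).
function parse_csv(text::AbstractString)
    records = Vector{Vector{String}}()
    record = String[]          # fields of the record being read
    buf = Char[]               # characters of the field being read
    in_quotes = false          # true while inside a quoted field
    pending = false            # true once the current record has consumed input
    chars = collect(text)
    n = length(chars)
    i = 1
    while i <= n
        c = chars[i]
        if in_quotes
            if c == '"'
                if i < n && chars[i + 1] == '"'
                    push!(buf, '"')    # doubled quote is one escaped quote
                    i += 1
                else
                    in_quotes = false  # closing quote
                end
            else
                push!(buf, c)          # comma, CR and LF are literal in quotes
            end
            pending = true
        elseif c == '\r' && i < n && chars[i + 1] == '\n'
            push!(record, join(buf))   # CRLF ends the field and the record
            empty!(buf)
            push!(records, record)
            record = String[]
            pending = false
            i += 1
        else
            if c == '"'
                in_quotes = true       # opening quote
            elseif c == ','
                push!(record, join(buf))
                empty!(buf)
            else
                push!(buf, c)
            end
            pending = true
        end
        i += 1
    end
    if pending                         # last record had no trailing CRLF
        push!(record, join(buf))
        push!(records, record)
    end
    return records
end

data = "1,2,3\r\nthis is,a test,of csv\r\n\"these\",\"are\",\"quotes (\"\")\""
rows = parse_csv(data)
# rows[3][3] == "quotes (\")"
```

Something this small could serve as a baseline when checking the grammar's output on the DataFrames test files, though it makes no attempt at speed or error reporting.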
