Since there is no key to group by when assembling the records, I would suggest writing this in RDD land and then converting to a DataFrame. You can use sc.wholeTextFiles to read each text file as a whole and run a state machine over its contents.
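
A minimal sketch of that idea, not a tested implementation. The names parse_warc_headers, warc_headers_to_dataframe and COLUMNS are hypothetical, and the choice of columns is only an illustration. It assumes a record starts at a line beginning with "WARC/" and that its headers end at the first blank line. It does not honor Content-Length, so a payload line that happens to start with "WARC/" would be misread as a new record.

```python
# Header fields to keep as columns (an assumption; pick whatever you need).
COLUMNS = ["warc-type", "warc-date", "warc-record-id", "warc-target-uri"]


def parse_warc_headers(text):
    """Yield one dict of lower-cased WARC header fields per record in `text`."""
    record = None       # headers of the record currently being assembled
    in_headers = False  # the "flag": True while reading a record's WARC headers
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("WARC/"):
            # A version line opens a new record; emit the previous one first.
            if record is not None:
                yield record
            record = {}
            in_headers = True
        elif in_headers:
            if line == "":
                # The first blank line ends the headers; the payload follows.
                in_headers = False
            else:
                key, sep, value = line.partition(":")
                if sep:
                    record[key.strip().lower()] = value.strip()
        # Payload lines (in_headers is False) are skipped.
    if record is not None:
        yield record


def warc_headers_to_dataframe(spark, path, columns=COLUMNS):
    """Build a DataFrame with one row per WARC record found under `path`."""
    # Imported here so the parser above can be used without Spark installed.
    from pyspark.sql.types import StringType, StructField, StructType

    schema = StructType(
        [StructField(c.replace("-", "_"), StringType(), True) for c in columns]
    )
    rows = (
        spark.sparkContext
        .wholeTextFiles(path)                           # (filename, contents) pairs
        .flatMap(lambda kv: parse_warc_headers(kv[1]))  # one dict per record
        .map(lambda rec: tuple(rec.get(c) for c in columns))
    )
    return spark.createDataFrame(rows, schema)
```

With a SparkSession named spark, calling warc_headers_to_dataframe(spark, "/path/to/files") (a placeholder path) should give one row per WARC record, with nulls for headers a record does not have. Keep in mind that wholeTextFiles reads each file as a single value, so every input file has to fit in memory on one executor.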

On Feb 4, 2017 16:25, "Paul Tremblay" <paulhtremb...@gmail.com> wrote:

I am using pyspark 2.1 and am wondering how to convert a flat file, with one record per row, into a columnar format.

Here is an example of the data:

[u'WARC/1.0',
 u'WARC-Type: warcinfo',
 u'WARC-Date: 2016-12-08T13:00:23Z',
 u'WARC-Record-ID: <urn:uuid:f609f246-df68-46ef-a1c5-2f66e833ffd6>',
 u'Content-Length: 344',
 u'Content-Type: application/warc-fields',
 u'WARC-Filename: CC-MAIN-20161202170900-00000-ip-10-31-129-80.ec2.internal.warc.gz',
 u'',
 u'robots: classic',
 u'hostname: ip-10-31-129-80.ec2.internal',
 u'software: Nutch 1.6 (CC)/CC WarcExport 1.0',
 u'isPartOf: CC-MAIN-2016-50',
 u'operator: CommonCrawl Admin',
 u'description: Wide crawl of the web for November 2016',
 u'publisher: CommonCrawl',
 u'format: WARC File Format 1.0',
 u'conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf',
 u'',
 u'',
 u'WARC/1.0',
 u'WARC-Type: request',
 u'WARC-Date: 2016-12-02T17:54:09Z',
 u'WARC-Record-ID: <urn:uuid:cc7ddf8b-4646-4440-a70a-e253818cf10b>',
 u'Content-Length: 220',
 u'Content-Type: application/http; msgtype=request',
 u'WARC-Warcinfo-ID: <urn:uuid:f609f246-df68-46ef-a1c5-2f66e833ffd6>',
 u'WARC-IP-Address: 217.197.115.133',
 u'WARC-Target-URI: http://1018201.vkrugudruzei.ru/blog/',
 u'',
 u'GET /blog/ HTTP/1.0',
 u'Host: 1018201.vkrugudruzei.ru',
 u'Accept-Encoding: x-gzip, gzip, deflate',
 u'User-Agent: CCBot/2.0 (http://commoncrawl.org/faq/)',
 u'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 u'',
 u'',
 u'',
 u'WARC/1.0',
 u'WARC-Type: response',
 u'WARC-Date: 2016-12-02T17:54:09Z',
 u'WARC-Record-ID: <urn:uuid:4c5e6d1a-e64f-4b6e-8101-c5e46feb84a0>',
 u'Content-Length: 577',
 u'Content-Type: application/http; msgtype=response',
 u'WARC-Warcinfo-ID: <urn:uuid:f609f246-df68-46ef-a1c5-2f66e833ffd6>',
 u'WARC-Concurrent-To: <urn:uuid:cc7ddf8b-4646-4440-a70a-e253818cf10b>',
 u'WARC-IP-Address: 217.197.115.133',
 u'WARC-Target-URI: http://1018201.vkrugudruzei.ru/blog/',
 u'WARC-Payload-Digest: sha1:Y4TZFLB6UTXHU4HUVONBXC5NZQW2LYMM',
 u'WARC-Block-Digest: sha1:3J7HHBMWTSC7W53DDB7BHTUVPM26QS4B',
 u'']

I want to convert it to something like:

{warc-type='request', warc-date='2016-12-02', warc-record-id='<urn:uuid:cc7ddf8b-4646-4440-a70a-e253818cf10b....}

In plain Python I would simply set a flag and read line by line (create a state machine). You can't do this in Spark, though.

Thanks

Henry

--
Paul Henry Tremblay
Robert Half Technology