[ https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yi Zhou updated SPARK-10310:
----------------------------
    Summary: [Spark SQL] All result records will be popluated into ONE line during the script transform due to missing the correct line/filed delimiter  (was: [Spark SQL] All result records will be popluated into ONE line during the script transform due to missing the correct line/filed delimeter)

> [Spark SQL] All result records will be popluated into ONE line during the
> script transform due to missing the correct line/filed delimiter
> ------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10310
>                 URL: https://issues.apache.org/jira/browse/SPARK-10310
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yi Zhou
>            Priority: Critical
>
> We have a real use case that runs a Python streaming script from a Spark SQL
> query. All result records from the "select" pipeline were written to the
> script's input as ONE line, so the script cannot identify individual records.
> In addition, the field separator in Spark SQL is '^A' ('\001'), which is
> inconsistent and incompatible with the '\t' used by the Hive implementation.
> Key query:
> {code:sql}
> CREATE VIEW temp1 AS
> SELECT *
> FROM
> (
>   FROM
>   (
>     SELECT
>       c.wcs_user_sk,
>       w.wp_type,
>       (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
>     FROM web_clickstreams c, web_page w
>     WHERE c.wcs_web_page_sk = w.wp_web_page_sk
>     AND   c.wcs_web_page_sk IS NOT NULL
>     AND   c.wcs_user_sk     IS NOT NULL
>     AND   c.wcs_sales_sk    IS NULL --abandoned implies: no sale
>     DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
>   ) clicksAnWebPageType
>   REDUCE
>     wcs_user_sk,
>     tstamp_inSec,
>     wp_type
>   USING 'python sessionize.py 3600'
>   AS (
>     wp_type STRING,
>     tstamp BIGINT,
>     sessionid STRING)
> ) sessionized
> {code}
> Key Python script:
> {noformat}
> for line in sys.stdin:
>     user_sk, tstamp_str, value = line.strip().split("\t")
> {noformat}
> Sample SELECT result:
> {noformat}
> ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
> {noformat}
> Expected result:
> {noformat}
> 31 3237764860 feedback
> 31 3237769106 dynamic
> 31 3237779027 review
> {noformat}
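
The failure described in the report can be reproduced outside Spark with a minimal Python sketch. It applies the parsing statement from the script excerpt to one well-formed record and to the reported sample. `parse_record` is a hypothetical helper introduced here for illustration and is not part of sessionize.py. The `reported` string is a transcription of the caret notation in the sample, assuming `^A` is `\x01` and `^V`, `^U`, `^T` are `\x16`, `\x15`, `\x14`.

```python
def parse_record(line):
    """Split one input line into three tab-separated fields, as the script excerpt does.

    Hypothetical helper for illustration only; not part of sessionize.py.
    """
    user_sk, tstamp_str, value = line.strip().split("\t")
    return user_sk, tstamp_str, value


# Expected input: one record per line, fields separated by tabs.
expected = "31\t3237764860\tfeedback\n"
print(parse_record(expected))

# Reported input (assumed transcription of the sample): no tabs and no newlines,
# so the whole stream arrives as a single "line" with a single "field".
reported = (
    "\x1631\x013237764860\x01feedback"
    "\x1531\x013237769106\x01dynamic"
    "\x1431\x013237779027\x01review"
)
try:
    parse_record(reported)
except ValueError as exc:
    # The three-way unpacking fails because split("\t") returns one element.
    print("cannot parse reported input:", exc)
```

With tab-separated fields and one record per line, the unpacking succeeds. With the reported input, `split("\t")` returns a single element and the three-way unpacking raises `ValueError`, which matches the symptom that the script cannot identify each record.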