[ https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14733195#comment-14733195 ]
Yi Zhou commented on SPARK-10310:
---------------------------------

Hi [~marmbrus],

Could you please review and evaluate this critical Spark SQL issue to see whether it can be fixed in Spark 1.5.0? (I saw that the code is ready.) The issue causes records to be extracted incorrectly because the line/field delimiters are missing, and it is blocking conformance validation. Thanks in advance!

> [Spark SQL] All result records will be populated into ONE line during the
> script transform due to missing the correct line/field delimiter
> -------------------------------------------------------------------------
>
>                 Key: SPARK-10310
>                 URL: https://issues.apache.org/jira/browse/SPARK-10310
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yi Zhou
>            Priority: Critical
>
> There is a real case that uses a Python streaming script in a Spark SQL query.
> We found that all result records were written on ONE line as the input from
> the "select" pipeline to the Python script, so the script cannot identify
> each record. In addition, the field separator in Spark SQL is '^A' ('\001'),
> which is inconsistent/incompatible with the '\t' used in the Hive
> implementation.
> Key query:
> {code:sql}
> CREATE VIEW temp1 AS
> SELECT *
> FROM
> (
>   FROM
>   (
>     SELECT
>       c.wcs_user_sk,
>       w.wp_type,
>       (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
>     FROM web_clickstreams c, web_page w
>     WHERE c.wcs_web_page_sk = w.wp_web_page_sk
>     AND c.wcs_web_page_sk IS NOT NULL
>     AND c.wcs_user_sk IS NOT NULL
>     AND c.wcs_sales_sk IS NULL --abandoned implies: no sale
>     DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
>   ) clicksAnWebPageType
>   REDUCE
>     wcs_user_sk,
>     tstamp_inSec,
>     wp_type
>   USING 'python sessionize.py 3600'
>   AS (
>     wp_type STRING,
>     tstamp BIGINT,
>     sessionid STRING)
> ) sessionized
> {code}
> Key Python script:
> {noformat}
> for line in sys.stdin:
>     user_sk, tstamp_str, value = line.strip().split("\t")
> {noformat}
> Sample SELECT result:
> {noformat}
> ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
> {noformat}
> Expected result:
> {noformat}
> 31 3237764860 feedback
> 31 3237769106 dynamic
> 31 3237779027 review
> {noformat}
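
To make the failure in the report concrete, here is a minimal sketch that feeds both the expected input and the sample input through the same parsing logic as the quoted script. It rests on several assumptions that are not stated in the report: the caret notation (^A, ^T, ^U, ^V) is taken to mean the control bytes 0x01, 0x14, 0x15, and 0x16; the expected input is taken to be tab-separated with one record per line, as in Hive; and the names parse_like_script and decode_sample are hypothetical helpers, not part of sessionize.py or Spark. The decoder relies only on an observation about this one sample, namely that each leading control byte equals the length of the record that follows it (22, 21, 20). It says nothing definitive about how Spark serializes rows.

```python
import io

# Assumption: ^V, ^U, ^T, ^A in the report are the bytes 0x16, 0x15, 0x14, 0x01.
SAMPLE = ("\x1631\x013237764860\x01feedback"
          "\x1531\x013237769106\x01dynamic"
          "\x1431\x013237779027\x01review")

# Assumption: Hive-style input, tab between fields and newline between records.
EXPECTED = ("31\t3237764860\tfeedback\n"
            "31\t3237769106\tdynamic\n"
            "31\t3237779027\treview\n")


def parse_like_script(stream_text):
    """Mimic the quoted loop: one record per line, fields split on tab."""
    records = []
    for line in io.StringIO(stream_text):  # stands in for sys.stdin
        user_sk, tstamp_str, value = line.strip().split("\t")
        records.append((user_sk, tstamp_str, value))
    return records


def decode_sample(data):
    """Hypothetical helper: read one length byte, then that many characters,
    and split the record body on 0x01. Derived from the sample only."""
    records = []
    pos = 0
    while pos < len(data):
        length = ord(data[pos])
        body = data[pos + 1:pos + 1 + length]
        records.append(tuple(body.split("\x01")))
        pos += 1 + length
    return records


if __name__ == "__main__":
    print(parse_like_script(EXPECTED))
    try:
        parse_like_script(SAMPLE)
    except ValueError as exc:
        # The whole stream arrives as a single "line" with no tab in it.
        print("script fails on sample input:", exc)
    print(decode_sample(SAMPLE))
```

With the expected input the loop yields three records. With the sample input the entire stream is read as one line that contains no tab, so unpacking into three names raises ValueError. The decoder is shown only to clarify what the sample bytes appear to contain; the script should not need it once the transform emits '\t' and '\n' delimiters.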