[ https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yi Zhou updated SPARK-10310:
----------------------------
    Description:
This is a real case that uses a Python streaming script in a Spark SQL query. We found that all result records from the "select" are written on ONE line as input to the Python script, so the script cannot identify each record. In addition, the field separator in Spark SQL is '^A' (that is, '\001'), which is inconsistent/incompatible with the '\t' used in the Hive implementation.

#################Key Query####################:
CREATE VIEW temp1 AS
SELECT *
FROM
(
  FROM
  (
    SELECT
      c.wcs_user_sk,
      w.wp_type,
      (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
    FROM web_clickstreams c, web_page w
    WHERE c.wcs_web_page_sk = w.wp_web_page_sk
    AND   c.wcs_web_page_sk IS NOT NULL
    AND   c.wcs_user_sk     IS NOT NULL
    AND   c.wcs_sales_sk    IS NULL --abandoned implies: no sale
    DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
  ) clicksAnWebPageType
  REDUCE
    wcs_user_sk,
    tstamp_inSec,
    wp_type
  USING 'python sessionize.py 3600'
  AS (
    wp_type STRING,
    tstamp BIGINT,
    sessionid STRING)
) sessionized

#############Key Python Script#################
for line in sys.stdin:
    user_sk, tstamp_str, value = line.strip().split("\t")

############Result Records example from 'select' ##################
^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview

############Result Records example in format######################
31 3237764860 feedback
31 3237769106 dynamic
31 3237779027 review

  was:
This is a real case that uses a Python streaming script in a Spark SQL query. We found that all result records from the "select" are written on ONE line as input to the Python script, so the script cannot identify each record. In addition, the field separator in Spark SQL is '^A' (that is, '\001'), which is inconsistent with the '\t' used in the Hive implementation.

#################Key Query####################:
CREATE VIEW temp1 AS
SELECT *
FROM
(
  FROM
  (
    SELECT
      c.wcs_user_sk,
      w.wp_type,
      (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
    FROM web_clickstreams c, web_page w
    WHERE c.wcs_web_page_sk = w.wp_web_page_sk
    AND   c.wcs_web_page_sk IS NOT NULL
    AND   c.wcs_user_sk     IS NOT NULL
    AND   c.wcs_sales_sk    IS NULL --abandoned implies: no sale
    DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
  ) clicksAnWebPageType
  REDUCE
    wcs_user_sk,
    tstamp_inSec,
    wp_type
  USING 'python sessionize.py 3600'
  AS (
    wp_type STRING,
    tstamp BIGINT,
    sessionid STRING)
) sessionized

#############Key Python Script#################
for line in sys.stdin:
    user_sk, tstamp_str, value = line.strip().split("\t")

############Result Records example from 'select' ##################
^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview

############Result Records example in format######################
31 3237764860 feedback
31 3237769106 dynamic
31 3237779027 review


> [Spark SQL] All result records will be populated in ONE line due to missing
> the correct line/field separator
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10310
>                 URL: https://issues.apache.org/jira/browse/SPARK-10310
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yi Zhou
>            Priority: Blocker
>
> This is a real case that uses a Python streaming script in a Spark SQL query.
> We found that all result records from the "select" are written on ONE line as
> input to the Python script, so the script cannot identify each record. In
> addition, the field separator in Spark SQL is '^A' (that is, '\001'), which is
> inconsistent/incompatible with the '\t' used in the Hive implementation.
> #################Key Query####################:
> CREATE VIEW temp1 AS
> SELECT *
> FROM
> (
>   FROM
>   (
>     SELECT
>       c.wcs_user_sk,
>       w.wp_type,
>       (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
>     FROM web_clickstreams c, web_page w
>     WHERE c.wcs_web_page_sk = w.wp_web_page_sk
>     AND   c.wcs_web_page_sk IS NOT NULL
>     AND   c.wcs_user_sk     IS NOT NULL
>     AND   c.wcs_sales_sk    IS NULL --abandoned implies: no sale
>     DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
>   ) clicksAnWebPageType
>   REDUCE
>     wcs_user_sk,
>     tstamp_inSec,
>     wp_type
>   USING 'python sessionize.py 3600'
>   AS (
>     wp_type STRING,
>     tstamp BIGINT,
>     sessionid STRING)
> ) sessionized
> #############Key Python Script#################
> for line in sys.stdin:
>     user_sk, tstamp_str, value = line.strip().split("\t")
> ############Result Records example from 'select' ##################
> ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
> ############Result Records example in format######################
> 31 3237764860 feedback
> 31 3237769106 dynamic
> 31 3237779027 review

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
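
To make the mismatch in the sample records concrete, here is a minimal Python sketch that turns the reported single-line output into the tab-separated, one-record-per-line form the script expects. It rests on an observation about the sample, not on anything confirmed by Spark: the control characters in front of each record (^V = 22, ^U = 21, ^T = 20) equal the length of the record that follows, so the sketch assumes a single length byte precedes each record. The helper names (split_length_prefixed, to_hive_lines) are hypothetical and are not part of Spark, Hive, or sessionize.py.

```python
SPARK_FIELD_SEP = "\x01"  # '^A', the separator reported in the issue
HIVE_FIELD_SEP = "\t"     # the separator the Python script splits on


def split_length_prefixed(blob: bytes):
    """Yield records, ASSUMING each one is preceded by a single length byte.

    This framing is inferred from the sample in the report only.
    """
    pos = 0
    while pos < len(blob):
        length = blob[pos]
        record = blob[pos + 1 : pos + 1 + length]
        if len(record) != length:
            raise ValueError("truncated record at offset %d" % pos)
        yield record.decode("utf-8")
        pos += 1 + length


def to_hive_lines(blob: bytes):
    """Return one tab-separated string per record."""
    return [
        HIVE_FIELD_SEP.join(rec.split(SPARK_FIELD_SEP))
        for rec in split_length_prefixed(blob)
    ]


# Rebuild the sample from the report: three records joined by '\x01',
# each prefixed with its own length as one byte.
SOH = b"\x01"
records = [
    SOH.join([b"31", b"3237764860", b"feedback"]),
    SOH.join([b"31", b"3237769106", b"dynamic"]),
    SOH.join([b"31", b"3237779027", b"review"]),
]
sample = b"".join(bytes([len(r)]) + r for r in records)

if __name__ == "__main__":
    for line in to_hive_lines(sample):
        print(line)
```

The length bytes built here come out as 0x16, 0x15, and 0x14, which matches the ^V, ^U, and ^T seen in the report. This only illustrates the symptom; it is not a substitute for Spark emitting '\t' and newline separators itself.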