[jira] [Commented] (SPARK-10310) [Spark SQL] All result records will be populated into ONE line during the script transform due to missing the correct line/field delimiter

Yi Zhou (JIRA) Sun, 06 Sep 2015 18:50:18 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14733195#comment-14733195
 ]


Yi Zhou commented on SPARK-10310:
---------------------------------

Hi [~marmbrus]
Could you please help review and evaluate this critical Spark SQL issue to 
see whether it can be fixed in Spark 1.5.0 (I saw that the code is ready)? 
The issue causes records to be extracted incorrectly because the line/field 
delimiters are missing, and it is blocking conformance validation. Thanks in 
advance!

> [Spark SQL] All result records will be populated into ONE line during the 
> script transform due to missing the correct line/field delimiter
> ------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10310
>                 URL: https://issues.apache.org/jira/browse/SPARK-10310
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yi Zhou
>            Priority: Critical
>
> We have a real use case that runs a Python streaming script from a Spark SQL 
> query. We found that all result records from the "select" pipeline were 
> written on ONE line as input to the Python script, so the script cannot 
> identify individual records. In addition, the field separator in Spark SQL 
> is '^A' ('\001'), which is inconsistent/incompatible with the '\t' used in 
> the Hive implementation.
> Key query:
> {code:sql}
> CREATE VIEW temp1 AS
> SELECT *
> FROM
> (
>   FROM
>   (
>     SELECT
>       c.wcs_user_sk,
>       w.wp_type,
>       (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
>     FROM web_clickstreams c, web_page w
>     WHERE c.wcs_web_page_sk = w.wp_web_page_sk
>     AND   c.wcs_web_page_sk IS NOT NULL
>     AND   c.wcs_user_sk     IS NOT NULL
>     AND   c.wcs_sales_sk    IS NULL --abandoned implies: no sale
>     DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
>   ) clicksAnWebPageType
>   REDUCE
>     wcs_user_sk,
>     tstamp_inSec,
>     wp_type
>   USING 'python sessionize.py 3600'
>   AS (
>     wp_type STRING,
>     tstamp BIGINT, 
>     sessionid STRING)
> ) sessionized
> {code}
> Key Python script:
> {noformat}
> import sys
> # Each input line is expected to hold three tab-separated fields.
> for line in sys.stdin:
>     user_sk, tstamp_str, value = line.strip().split("\t")
> {noformat}
> Sample SELECT result:
> {noformat}
> ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
> {noformat}
> Expected result:
> {noformat}
> 31   3237764860   feedback
> 31   3237769106   dynamic
> 31   3237779027   review
> {noformat}
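
To make the mismatch concrete, here is a minimal Python sketch that feeds the
sample above to the same parsing step the script uses. It assumes the caret
notation stands for raw control bytes (^A = 0x01, ^T = 0x14, ^U = 0x15,
^V = 0x16). The leading byte of each record happens to equal the byte length
of the record that follows (22, 21, 20), which suggests a length-prefixed
encoding; that is an inference from the sample, not something the report
states. The helper names script_split and decode_length_prefixed are
hypothetical and are not part of Spark or of sessionize.py.

```python
# Sample from the report with caret notation expanded to raw bytes
# (assumption: ^A = 0x01, ^T = 0x14, ^U = 0x15, ^V = 0x16).
SAMPLE = (
    b"\x1631\x013237764860\x01feedback"
    b"\x1531\x013237769106\x01dynamic"
    b"\x1431\x013237779027\x01review"
)


def script_split(line):
    """Mimic the script's parsing step: exactly three tab-separated fields."""
    user_sk, tstamp_str, value = line.strip().split("\t")
    return user_sk, tstamp_str, value


def decode_length_prefixed(data):
    """Hypothetical helper: split data into records, assuming each record is
    preceded by a single byte holding its length and uses 0x01 between fields."""
    records = []
    pos = 0
    while pos < len(data):
        length = data[pos]  # indexing bytes yields an int in Python 3
        pos += 1
        payload = data[pos:pos + length]
        records.append(payload.decode("ascii").split("\x01"))
        pos += length
    return records


if __name__ == "__main__":
    # The script's own parsing fails on the sample: there is no tab to split on.
    try:
        script_split(SAMPLE.decode("ascii"))
    except ValueError as exc:
        print("script_split failed:", exc)

    # Re-emit the records in the tab/newline layout the script expects.
    for fields in decode_length_prefixed(SAMPLE):
        print("\t".join(fields))
```

Run as a script, this reports the unpacking failure and then prints the three
records one per line with tab-separated fields, which is the layout shown
under "Expected result". It only illustrates the symptom; the actual fix
belongs in how Spark SQL serializes rows for the script transform.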



