[jira] [Updated] (SPARK-10310) [Spark SQL] All result records will be populated in ONE line due to missing the correct line/field delimiter

Yi Zhou (JIRA) Thu, 27 Aug 2015 01:14:19 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yi Zhou updated SPARK-10310:
----------------------------
    Description: 
We hit this in a real workload that uses a Python streaming script in a Spark 
SQL query. All result records from the "select" pipeline were written on ONE 
line as input to the Python script, so the script cannot identify individual 
records. In addition, the field separator in Spark SQL is '^A' ('\001'), which 
is inconsistent/incompatible with the '\t' used in the Hive implementation.

#################Key  Query####################:
CREATE VIEW temp1 AS
SELECT *
FROM
(
  FROM
  (
    SELECT
      c.wcs_user_sk,
      w.wp_type,
      (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
    FROM web_clickstreams c, web_page w
    WHERE c.wcs_web_page_sk = w.wp_web_page_sk
    AND   c.wcs_web_page_sk IS NOT NULL
    AND   c.wcs_user_sk     IS NOT NULL
    AND   c.wcs_sales_sk    IS NULL --abandoned implies: no sale
    DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
  ) clicksAnWebPageType
  REDUCE
    wcs_user_sk,
    tstamp_inSec,
    wp_type
  USING 'python sessionize.py 3600'
  AS (
    wp_type STRING,
    tstamp BIGINT, 
    sessionid STRING)
) sessionized

#############Key Python Script#################
import sys
for line in sys.stdin:
    # Each input line is expected to hold one tab-separated record.
    user_sk, tstamp_str, value = line.strip().split("\t")

############Result Records example from 'select' ##################
^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
############Expected format of result records######################
31   3237764860   feedback
31   3237769106   dynamic
31   3237779027   review


  was:
We hit this in a real workload that uses a Python streaming script in a Spark 
SQL query. All result records from "select" were written on ONE line as input 
to the Python script, so the script cannot identify individual records. In 
addition, the field separator in Spark SQL is '^A' ('\001'), which is 
inconsistent/incompatible with the '\t' used in the Hive implementation.

#################Key  Query####################:
CREATE VIEW temp1 AS
SELECT *
FROM
(
  FROM
  (
    SELECT
      c.wcs_user_sk,
      w.wp_type,
      (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
    FROM web_clickstreams c, web_page w
    WHERE c.wcs_web_page_sk = w.wp_web_page_sk
    AND   c.wcs_web_page_sk IS NOT NULL
    AND   c.wcs_user_sk     IS NOT NULL
    AND   c.wcs_sales_sk    IS NULL --abandoned implies: no sale
    DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
  ) clicksAnWebPageType
  REDUCE
    wcs_user_sk,
    tstamp_inSec,
    wp_type
  USING 'python sessionize.py 3600'
  AS (
    wp_type STRING,
    tstamp BIGINT, 
    sessionid STRING)
) sessionized

#############Key Python Script#################
import sys
for line in sys.stdin:
    # Each input line is expected to hold one tab-separated record.
    user_sk, tstamp_str, value = line.strip().split("\t")

############Result Records example from 'select' ##################
^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
############Expected format of result records######################
31   3237764860   feedback
31   3237769106   dynamic
31   3237779027   review



> [Spark SQL] All result records will be populated in ONE line due to missing 
> the correct line/field delimiter
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10310
>                 URL: https://issues.apache.org/jira/browse/SPARK-10310
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yi Zhou
>            Priority: Blocker



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
