[ 
https://issues.apache.org/jira/browse/SPARK-41790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mattshma updated SPARK-41790:
-----------------------------
    Description: 
TRANSFORM returns wrong data when ROW FORMAT DELIMITED is specified for only 
the reader or only the writer. The cause is that the wrong format is used to 
feed data to, and fetch data from, the running script. In theory, the writer 
uses inFormat to feed input data into the running script and the reader uses 
outFormat to read the output from the running script, but inFormat and 
outFormat are currently set to the wrong values in the following code:
{code:java}
val (inFormat, inSerdeClass, inSerdeProps, reader) =
  format(
    inRowFormat, "hive.script.recordreader",
    "org.apache.hadoop.hive.ql.exec.TextRecordReader")

val (outFormat, outSerdeClass, outSerdeProps, writer) =
  format(
    outRowFormat, "hive.script.recordwriter",
    "org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code}
 

Example SQL:
{code:java}
spark-sql> CREATE TABLE t1 (a string, b string); 

spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");

spark-sql> SELECT TRANSFORM(a, b)
         >   ROW FORMAT DELIMITED
         >   FIELDS TERMINATED BY ','
         >   USING 'cat'
         >   AS (c)
         > FROM t1;
c

spark-sql> SELECT TRANSFORM(a, b)
         >   USING 'cat'
         >   AS (c)
         >   ROW FORMAT DELIMITED
         >   FIELDS TERMINATED BY ','
         > FROM t1;
c
1    23    4{code}
 

The same SQL in Hive:
{code:java}
hive> SELECT TRANSFORM(a, b)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ','
    >   USING 'cat'
    >   AS (c)
    > FROM t1;
c
1,2
3,4

hive> SELECT TRANSFORM(a, b)
    >   USING 'cat'
    >   AS (c)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ','
    > FROM t1;
c
1    2
3    4 {code}
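
The intended data flow can be illustrated with a small toy model. This is only a sketch, not Spark's or Hive's implementation: the object TransformFlowSketch and its helpers feed, cat, fetch and transform are hypothetical names, the tab default delimiter is an assumption, and USING 'cat' is modelled as an identity function. The writer side joins columns with the input delimiter, and the reader side splits the script output on the output delimiter. Under those assumptions the model reproduces the two Hive results shown above.

```scala
// Toy model of the no-serde TRANSFORM data flow; not Spark or Hive code.
object TransformFlowSketch {
  // Assumption: tab is the field delimiter when no ROW FORMAT clause is given.
  val DefaultDelim = "\t"

  // Writer side: join each row's columns with the input delimiter to
  // produce the lines that are fed to the script's stdin.
  def feed(rows: Seq[Seq[String]], inDelim: String): Seq[String] =
    rows.map(_.mkString(inDelim))

  // Stand-in for USING 'cat': every line is echoed unchanged.
  def cat(lines: Seq[String]): Seq[String] = lines

  // Reader side: split each output line on the output delimiter and keep
  // at most as many fields as the AS clause declares.
  def fetch(lines: Seq[String], outDelim: String, numCols: Int): Seq[Seq[String]] =
    lines.map { line =>
      line.split(java.util.regex.Pattern.quote(outDelim), -1).toSeq.take(numCols)
    }

  def transform(
      rows: Seq[Seq[String]],
      inDelim: String,
      outDelim: String,
      numCols: Int): Seq[Seq[String]] =
    fetch(cat(feed(rows, inDelim)), outDelim, numCols)

  def main(args: Array[String]): Unit = {
    val t1 = Seq(Seq("1", "2"), Seq("3", "4"))

    // ROW FORMAT DELIMITED before USING: only the input delimiter is ','.
    // The reader splits on tab, so c holds "1,2" and "3,4".
    transform(t1, ",", DefaultDelim, 1).foreach(r => println(r.mkString(DefaultDelim)))

    // ROW FORMAT DELIMITED after AS: only the output delimiter is ','.
    // The script sees tab-joined lines, so c holds "1<TAB>2" and "3<TAB>4".
    transform(t1, DefaultDelim, ",", 1).foreach(r => println(r.mkString(DefaultDelim)))
  }
}
```

In this model the delimiter from the clause before USING only affects what the script receives, and the delimiter from the clause after AS only affects how its output is split, which is the pairing the description says is currently mixed up.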
 

  was:
TRANSFORM returns wrong data when ROW FORMAT DELIMITED is specified for only 
the reader or only the writer. The cause is that the wrong format is used to 
feed data to, and fetch data from, the running script: the writer uses inFormat 
to feed input data into the running script, and the reader uses outFormat to 
read the output from the running script. However, inFormat and outFormat are 
currently set to the wrong values by the following code:

 
{code:java}
val (inFormat, inSerdeClass, inSerdeProps, reader) =
  format(
    inRowFormat, "hive.script.recordreader",
    "org.apache.hadoop.hive.ql.exec.TextRecordReader")

val (outFormat, outSerdeClass, outSerdeProps, writer) =
  format(
    outRowFormat, "hive.script.recordwriter",
    "org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code}
 

Example SQL:
{code:java}
spark-sql> CREATE TABLE t1 (a string, b string); 

spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");

spark-sql> SELECT TRANSFORM(a, b)
         >   ROW FORMAT DELIMITED
         >   FIELDS TERMINATED BY ','
         >   USING 'cat'
         >   AS (c)
         > FROM t1;
c

spark-sql> SELECT TRANSFORM(a, b)
         >   USING 'cat'
         >   AS (c)
         >   ROW FORMAT DELIMITED
         >   FIELDS TERMINATED BY ','
         > FROM t1;
c
1    23    4{code}
 

The same SQL in Hive:
{code:java}
hive> SELECT TRANSFORM(a, b)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ','
    >   USING 'cat'
    >   AS (c)
    > FROM t1;
c
1,2
3,4

hive> SELECT TRANSFORM(a, b)
    >   USING 'cat'
    >   AS (c)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ','
    > FROM t1;
c
1    2
3    4 {code}
 


> Transform returns wrong data when only the reader's or the writer's row 
> format delimited is specified
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-41790
>                 URL: https://issues.apache.org/jira/browse/SPARK-41790
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.1
>            Reporter: mattshma
>            Priority: Major


