[ https://issues.apache.org/jira/browse/SPARK-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385235#comment-15385235 ]

Tejas Patil commented on SPARK-16613:
-------------------------------------

[~srowen] , [~rxin] : I feel that invoking the pipe command even for empty 
partitions is bad. Users should not be exposed to Spark partitions. Leaving the 
existing behavior as it is means that if the number of partitions or the 
partitioning scheme changes, the results generated for the same input data can 
differ, which is annoying. One would expect Spark's behavior to match running 
the pipe command standalone in a terminal, in which case there are no empty 
partitions, since a partition is a Spark-internal concept.
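
To make the non-determinism concrete, here is an illustrative session (len.sh is 
the reporter's script from the description below; the exact placement of the 
zeros depends on which partitions happen to be empty):

{code}
// Same input data, different partition counts => different pipe output today,
// because each empty partition still spawns the script once and it echoes "0".
val data = List("hi", "hello", "how", "are", "you")
sc.parallelize(data, 5).pipe("./len.sh").collect()  // "2", "5", "3", "3", "3"
sc.parallelize(data, 8).pipe("./len.sh").collect()  // three extra "0"s from the empty partitions
{code}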

Hive's ScriptOperator invokes the user binary only when it sees a row : 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/ScriptOperator.java#L339
 

Spark's ScriptTransformation follows the same convention as well:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformation.scala#L244
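
A minimal sketch of what that convention could look like for RDD.pipe 
(hypothetical names and wiring, not the actual PipedRDD internals): spawn the 
external process only once the partition iterator proves non-empty.

{code}
import java.io.PrintWriter
import scala.collection.JavaConverters._
import scala.io.Source

// Sketch: start the child process only for non-empty partitions.
def pipePartition(command: Seq[String], rows: Iterator[String]): Iterator[String] = {
  if (!rows.hasNext) return Iterator.empty  // the convention: no process for an empty partition

  val proc = new ProcessBuilder(command.asJava).start()
  // Feed rows to the child's stdin on a separate thread so a full pipe
  // buffer cannot deadlock the reader below.
  new Thread(new Runnable {
    override def run(): Unit = {
      val out = new PrintWriter(proc.getOutputStream)
      try rows.foreach(line => out.println(line)) finally out.close()
    }
  }).start()
  Source.fromInputStream(proc.getInputStream).getLines()
}
{code}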

> RDD.pipe returns values for empty partitions
> --------------------------------------------
>
>                 Key: SPARK-16613
>                 URL: https://issues.apache.org/jira/browse/SPARK-16613
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Alex Krasnyansky
>
> Suppose we have the following Spark code:
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
>
> object PipeExample {
>   def main(args: Array[String]) {
>     // create the SparkContext that the snippet below relies on
>     val conf = new SparkConf().setAppName("PipeExample").setMaster("local[*]")
>     val sc = new SparkContext(conf)
>     val fstRdd = sc.parallelize(List("hi", "hello", "how", "are", "you"))
>     val pipeRdd = fstRdd.pipe("/Users/finkel/spark-pipe-example/src/main/resources/len.sh")
>     pipeRdd.collect.foreach(println)
>     sc.stop()
>   }
> }
> {code}
> It uses a bash script to convert a string to its length.
> {code}
> #!/bin/sh
> read input
> len=${#input}
> echo $len
> {code}
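> As an aside, a variant of the script that loops over every line of stdin 
> (illustrative, not from the original report) would print nothing for an empty 
> partition and also handle partitions holding more than one element:
> {code}
> #!/bin/sh
> # Echo the length of every input line; empty stdin produces no output.
> while read input; do
>   echo ${#input}
> done
> {code}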
> So far so good, but when I run the code, it prints incorrect output. For 
> example:
> {code}
> 0
> 2
> 0
> 5
> 3
> 0
> 3
> 3
> {code}
> I expect to see
> {code}
> 2
> 5
> 3
> 3
> 3
> {code}
> which is the correct output for the app. I think this is a bug: the output 
> should contain only positive integers, with no zeros.
> Environment:
> 1. Spark version is 1.6.2
> 2. Scala version is 2.11.6


