[ https://issues.apache.org/jira/browse/SPARK-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385235#comment-15385235 ]
Tejas Patil commented on SPARK-16613:
-------------------------------------

[~srowen], [~rxin]: I feel that invoking the pipe command even for empty partitions is bad; users should not be exposed to Spark partitions. Leaving the existing behavior as it is means that if the number of partitions changes, or the partitioning scheme changes, the results generated for the same input data can differ, which is surprising. One would expect Spark's behavior to match running the pipe command standalone in a terminal, in which case there are no empty partitions, since a partition is a Spark-internal concept. Hive's ScriptOperator invokes the user binary only when it sees a row: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/ScriptOperator.java#L339 Spark's ScriptTransformation follows the same convention: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformation.scala#L244

> RDD.pipe returns values for empty partitions
> --------------------------------------------
>
>                 Key: SPARK-16613
>                 URL: https://issues.apache.org/jira/browse/SPARK-16613
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Alex Krasnyansky
>
> Suppose we have the following Spark code:
> {code}
> object PipeExample {
>   def main(args: Array[String]) {
>     val fstRdd = sc.parallelize(List("hi", "hello", "how", "are", "you"))
>     val pipeRdd = fstRdd.pipe("/Users/finkel/spark-pipe-example/src/main/resources/len.sh")
>     pipeRdd.collect.foreach(println)
>   }
> }
> {code}
> It uses a bash script to convert a string to its length:
> {code}
> #!/bin/sh
> read input
> len=${#input}
> echo $len
> {code}
> So far so good, but when I run the code, it prints incorrect output. For example:
> {code}
> 0
> 2
> 0
> 5
> 3
> 0
> 3
> 3
> {code}
> I expect to see
> {code}
> 2
> 5
> 3
> 3
> 3
> {code}
> which is the correct output for the app. I think it's a bug: the output should contain only positive integers, with no zeros.
> Environment:
> 1. Spark version is 1.6.2
> 2. Scala version is 2.11.6

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
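Incidentally, the `len.sh` script in the report reads only the first line of each partition's input, so a partition holding multiple rows would also produce wrong results, independently of the empty-partition issue (the zeros appear because `read` sees end-of-input immediately on an empty partition and `${#input}` is 0). A minimal sketch of a script that handles every row and emits nothing for empty input:

```shell
#!/bin/sh
# Spark's pipe() feeds all rows of a partition to one invocation of
# the command, one row per line on stdin, so the script must loop.
# An empty partition delivers no lines, so this loop body never runs
# and no spurious "0" is printed.
while IFS= read -r input; do
  echo ${#input}
done
```

With this version, running `printf 'hi\nhello\n' | ./len.sh` in a terminal prints `2` and `5`, and empty input prints nothing, matching the standalone behavior the comment above argues Spark should preserve.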