[ 
https://issues.apache.org/jira/browse/HIVE-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336431#comment-16336431
 ] 

liyunzhang commented on HIVE-8436:
----------------------------------

[~csun]: thanks for the reply.

{quote}
without the copying function, the RDD cache will cache *references*
{quote}

I have not found this in the Spark 
[documentation|https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-persistence] 
— that the RDD cache stores references rather than values. If you have time, could 
you provide a link that explains it?
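If the claim holds, the pitfall would look like this: Hadoop record readers reuse a single mutable Writable instance per record, so caching the references without copying stores the same (repeatedly overwritten) object many times. A minimal sketch under that assumption — no Spark/Hadoop dependencies, and {{MutableRecord}} is a hypothetical stand-in for a reused Writable:

{code:java}
import java.util.ArrayList;
import java.util.List;

public class ReuseDemo {
  // Hypothetical stand-in for a mutable Writable reused by a record reader.
  static class MutableRecord { String value; }

  // Simulates "caching" each record by reference while the reader
  // overwrites one shared instance; returns the cached values joined.
  static String cachedValues() {
    MutableRecord shared = new MutableRecord();
    List<MutableRecord> cache = new ArrayList<>();
    for (String v : new String[]{"a", "b", "c"}) {
      shared.value = v;   // reader overwrites the same instance
      cache.add(shared);  // cache stores the reference, not a copy
    }
    StringBuilder sb = new StringBuilder();
    for (MutableRecord r : cache) {
      sb.append(r.value);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Every cached entry shows the last value read: "ccc", not "abc".
    System.out.println(cachedValues());
  }
}
{code}

This is exactly what a per-record copy (like {{CopyFunction}} below) would prevent: cloning before caching makes each entry an independent value.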
{code:java}
private static class CopyFunction implements
    PairFunction<Tuple2<WritableComparable, Writable>, WritableComparable, Writable> {

  private transient Configuration conf;

  @Override
  public Tuple2<WritableComparable, Writable> call(
      Tuple2<WritableComparable, Writable> tuple) throws Exception {
    if (conf == null) {
      conf = new Configuration();
    }
    return new Tuple2<WritableComparable, Writable>(tuple._1(),
        WritableUtils.clone(tuple._2(), conf));
  }
}
{code}

{{WritableUtils.clone(tuple._2(), conf)}} is used to clone {{tuple._2()}} into a new 
variable, which means {{tuple._2()}} must be an instance of a class that can be 
cloned. For the Text type this is fine, but for the orc/parquet formats it is not, 
because of [HIVE-18289|https://issues.apache.org/jira/browse/HIVE-18289]: OrcStruct 
does not have an empty constructor, so it fails when ReflectionUtils.newInstance is 
called, and the parquet format fails for a similar reason. Is there any way to solve 
this?
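A minimal sketch of that failure mode, assuming the clone path relies on reflective no-arg instantiation of the value's class ({{NoDefaultCtor}} is a hypothetical stand-in for OrcStruct, not the real class):

{code:java}
public class CloneDemo {
  // Hypothetical stand-in for OrcStruct: its only constructor takes an argument.
  static class NoDefaultCtor {
    NoDefaultCtor(int fieldCount) { }
  }

  // Mirrors the reflective step: try to create an instance via the
  // no-arg constructor, as ReflectionUtils.newInstance ultimately does.
  static boolean canReflectivelyCreate(Class<?> cls) {
    try {
      cls.getDeclaredConstructor().newInstance();
      return true;
    } catch (ReflectiveOperationException e) {
      // NoSuchMethodException: no no-arg constructor to call
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(canReflectivelyCreate(String.class));        // true
    System.out.println(canReflectivelyCreate(NoDefaultCtor.class)); // false
  }
}
{code}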

> Modify SparkWork to split works with multiple child works [Spark Branch]
> ------------------------------------------------------------------------
>
>                 Key: HIVE-8436
>                 URL: https://issues.apache.org/jira/browse/HIVE-8436
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Chao Sun
>            Priority: Major
>             Fix For: 1.1.0
>
>         Attachments: HIVE-8436.1-spark.patch, HIVE-8436.10-spark.patch, 
> HIVE-8436.11-spark.patch, HIVE-8436.2-spark.patch, HIVE-8436.3-spark.patch, 
> HIVE-8436.4-spark.patch, HIVE-8436.5-spark.patch, HIVE-8436.6-spark.patch, 
> HIVE-8436.7-spark.patch, HIVE-8436.8-spark.patch, HIVE-8436.9-spark.patch
>
>
> Based on the design doc, we need to split the operator tree of a work in 
> SparkWork if the work is connected to multiple child works. The splitting of the 
> operator tree is performed by cloning the original work and removing
> unwanted branches in the operator tree. Please refer to the design doc for 
> details.
> This process should be done right before we generate SparkPlan. We should 
> have a utility method that takes the original SparkWork and returns a modified 
> SparkWork.
> This process should also keep the information about the original work and its 
> clones. Such information will be needed during SparkPlan generation 
> (HIVE-8437).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
