[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception

2014-11-17 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214635#comment-14214635
 ] 

Vijay commented on SPARK-4402:
--

Thanks for the explanation.
It is clear now.

> Output path validation of an action statement resulting in runtime exception
> 
>
> Key: SPARK-4402
> URL: https://issues.apache.org/jira/browse/SPARK-4402
> Project: Spark
> Issue Type: Wish
> Reporter: Vijay
> Priority: Minor
>
> The output path is validated only when the action statement executes, as 
> part of its lazy evaluation. If the path already exists, a runtime 
> exception is thrown, so all processing completed up to that point is lost, 
> wasting resources (processing time and CPU usage).
> If this I/O-related validation were done before the RDD action operations, 
> this runtime exception could be avoided.
> I believe a similar validation feature is implemented in Hadoop as well.
> Example:
> SchemaRDD.saveAsTextFile() evaluates the path at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception

2014-11-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214486#comment-14214486
 ] 

Sean Owen commented on SPARK-4402:
--

Can the Spark code go back and check this before any of it is called, at the 
start of your program? No, that isn't possible. At the outset it wouldn't even 
know which RDDs may be executed, and it couldn't be sure that your code won't 
clear the output dir before output happens.

Here it seems to happen before the output operation starts, which is about as 
early as possible. I suggest this is the correct behavior and is the current 
behavior. It's even configurable whether it overwrites or fails when the output 
dir exists.

Of course you can and should check the output directory in your program. In 
fact your program is in a better position to know whether it should be an 
error, warning, or whether you should just overwrite the output.
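
As a rough illustration of checking the output directory from the program 
itself, the sketch below runs a pre-flight check before any Spark work is 
started. It uses only java.io.File, so it covers local paths; for HDFS or other 
Hadoop-backed storage the same logic would presumably go through Hadoop's 
FileSystem API instead. The names OutputPreflight, Policy, FailFast, Overwrite 
and prepare are made up for this example and are not part of Spark.

```scala
import java.io.File

object OutputPreflight {
  /** What the program wants to do when the output path already exists. */
  sealed trait Policy
  case object FailFast extends Policy
  case object Overwrite extends Policy

  /** Recursively delete a local file or directory tree. */
  private def deleteRecursively(f: File): Unit = {
    // listFiles() returns null for non-directories, hence the Option wrapper.
    Option(f.listFiles()).foreach(_.foreach(deleteRecursively))
    f.delete()
  }

  /** Apply the chosen policy to the output path before any job is submitted. */
  def prepare(output: String, policy: Policy): Unit = {
    val dir = new File(output)
    if (dir.exists()) {
      policy match {
        case FailFast =>
          throw new IllegalStateException(s"Output path already exists: $output")
        case Overwrite =>
          deleteRecursively(dir)
      }
    }
  }
}
```

Calling OutputPreflight.prepare(outputPath, OutputPreflight.FailFast) as the 
first statement of main stops the program before any job is submitted; passing 
Overwrite clears the directory instead. Something else could still create the 
directory between this check and the save, so Spark's own check at output time 
remains useful.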




[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception

2014-11-16 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214320#comment-14214320
 ] 

Vijay commented on SPARK-4402:
--

Yes, the output path is being validated in PairRDDFunctions.saveAsHadoopDataset. 
Please find the exception details below.
So the output path is validated only during the execution of 
saveAsHadoopDataset, after all the preceding statements have completed. 

My question is whether it is possible to perform this validation up front, when 
the program execution starts.

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/home/HadoopUser/eclipse-scala/test/output1 already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:968)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:878)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:792)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1159)
    at test.OutputTest$.main(OutputTest.scala:19)
    at test.OutputTest.main(OutputTest.scala)




[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception

2014-11-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213739#comment-14213739
 ] 

Sean Owen commented on SPARK-4402:
--

Look at the code in PairRDDFunctions.saveAsHadoopDataset, which is what 
ultimately gets called. You'll see it try to check the output configuration 
upfront:

{code}
if (self.conf.getBoolean("spark.hadoop.validateOutputSpecs", true)) {
  // FileOutputFormat ignores the filesystem parameter
  val ignoredFs = FileSystem.get(hadoopConf)
  hadoopConf.getOutputFormat.checkOutputSpecs(ignoredFs, hadoopConf)
}
{code}

It's enabled by default. I wonder if the code path is somehow using a 
nonstandard OutputFormat that doesn't check?
But this should cause an exception before the output starts if the output path 
exists; the check was committed in SPARK-1100 for 1.0.
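
To make the ordering concrete, here is a toy model in plain Scala (not Spark 
code): a lazy Iterator stands in for the RDD lineage, and a boolean flag stands 
in for the result of checkOutputSpecs. LazyOrder, run and log are hypothetical 
names used for this illustration only. It shows that a check placed at the 
start of the save step fires before any element of the lazy pipeline has been 
computed.

```scala
import scala.collection.mutable.ArrayBuffer

object LazyOrder {
  // Records every piece of work that actually runs.
  val log: ArrayBuffer[String] = ArrayBuffer.empty[String]

  def run(outputExists: Boolean): Unit = {
    // Lazy transformation: the function body does not run until the
    // iterator is consumed.
    val mapped = Iterator(1, 2, 3).map { x => log += s"map $x"; x * 2 }

    // Stand-in for the upfront output check made at the start of the save step.
    if (outputExists) {
      throw new IllegalStateException("Output directory already exists")
    }

    // The "save" consumes the iterator, which is when the map work happens.
    mapped.foreach { x => log += s"write $x" }
  }
}
```

In the failing case the log stays empty, which mirrors the point that the 
pending save itself has done no work when the exception is raised; only work 
triggered by earlier actions has already been spent.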




[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception

2014-11-15 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213729#comment-14213729
 ] 

Vijay commented on SPARK-4402:
--

Thanks for the reply [~srowen]

This is a different scenario from issue SPARK-1100.

Issue SPARK-1100 says that the output directory is overwritten if it exists.
I think that fix works fine.

But my concern is that Spark throws a runtime exception if the output 
directory exists. This happens after all the previous action statements have 
executed, and it results in abrupt termination of the program. The results of 
the previous action statements are lost.

Please confirm whether this abrupt program termination is expected.




[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception

2014-11-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212376#comment-14212376
 ] 

Sean Owen commented on SPARK-4402:
--

Is this not the same issue resolved by 
https://issues.apache.org/jira/browse/SPARK-1100 ? I think the behavior 
implemented there is the intended behavior here.
