Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-25 Thread Steve Loughran


With Hadoop 2.7 or later, set

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true

This switches to a no-rename version of the file output committer, which is
faster all round. You are still at risk of things going wrong on failure,
though, and when speculation is enabled.
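
For anyone setting this programmatically rather than in spark-defaults.conf,
here is a minimal sketch, assuming Spark 2.0 with a Hadoop 2.7+ client on the
classpath; the app name and s3a:// path are placeholders. The "spark.hadoop."
prefix tells Spark to copy the setting into the underlying Hadoop
Configuration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3-committer-example")
      // v2 commit algorithm: tasks rename their output straight into the
      // final destination, skipping the extra rename that the v1 algorithm
      // performs at job commit.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped", "true")
      .getOrCreate()

    spark.range(1000).write.parquet("s3a://my-bucket/output")  // hypothetical path

Given the failure caveat above, note that speculation is off by default
(spark.speculation false).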



On 25 Aug 2016, at 13:16, Tal Grynbaum wrote:

Is/was there an option similar to DirectParquetOutputCommitter to write JSON
files to S3?



Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-25 Thread Takeshi Yamamuro
AFAIK, no.

// maropu

On Thu, Aug 25, 2016 at 9:16 PM, Tal Grynbaum wrote:

> Is/was there an option similar to DirectParquetOutputCommitter to write
> JSON files to S3?



-- 
---
Takeshi Yamamuro


Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-25 Thread Tal Grynbaum
Is/was there an option similar to DirectParquetOutputCommitter to write
JSON files to S3?

On Thu, Aug 25, 2016 at 2:56 PM, Takeshi Yamamuro wrote:

> Hi,
>
> Seems this just prevents writers from leaving partial data in a
> destination dir when jobs fail.
> In previous versions of Spark there was a way to write data directly to
> the destination, but Spark 2.0+ removed it because of a critical issue
> on S3 (see: SPARK-10063).
>
> // maropu



-- 
*Tal Grynbaum* / *CTO & co-founder*

m# +972-54-7875797

mobile retention done right


Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-25 Thread Takeshi Yamamuro
Hi,

Seems this just prevents writers from leaving partial data in a destination
dir when jobs fail.
In previous versions of Spark there was a way to write data directly to the
destination, but Spark 2.0+ removed it because of a critical issue on S3
(see: SPARK-10063).

// maropu
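
For context, the Spark 1.x option being referred to was
DirectParquetOutputCommitter, selected through the
spark.sql.parquet.output.committer.class setting. A rough 1.x-era sketch
(the committer class's package moved between 1.x releases, the output path
is a placeholder, and it only ever covered Parquet, hence the question above
about JSON):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Spark 1.x only; removed in 2.0 by SPARK-10063 because a failed or
    // speculative task could leave corrupt output with no way to roll back.
    val conf = new SparkConf()
      .setAppName("direct-committer-example")
      .set("spark.sql.parquet.output.committer.class",
        "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Parquet files are then written straight to the destination,
    // with no _temporary directory and no rename at commit.
    sqlContext.range(1000).write.parquet("s3a://my-bucket/output")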


On Thu, Aug 25, 2016 at 2:40 PM, Tal Grynbaum wrote:

> I read somewhere that it's because S3 has to know the size of the file
> upfront.
> I don't really understand this: why is it OK not to know it for the
> temp files but not OK for the final files?
> The delete permission is the minor disadvantage from my side; the worst
> thing is that I have a cluster of 100 machines sitting idle for 15 minutes
> waiting for the copy to end.
>
> Any suggestions how to avoid that?


-- 
---
Takeshi Yamamuro


Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-24 Thread Tal Grynbaum
I read somewhere that it's because S3 has to know the size of the file
upfront.
I don't really understand this: why is it OK not to know it for the temp
files but not OK for the final files?
The delete permission is the minor disadvantage from my side; the worst
thing is that I have a cluster of 100 machines sitting idle for 15 minutes
waiting for the copy to end.

Any suggestions how to avoid that?

On Thu, Aug 25, 2016, 08:21 Aseem Bansal wrote:

> Hi
>
> When Spark saves anything to S3 it creates temporary files. Why? Asking
> because this requires the access credentials to be given delete
> permissions along with write permissions.
>


spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-24 Thread Aseem Bansal
Hi

When Spark saves anything to S3 it creates temporary files. Why? Asking
because this requires the access credentials to be given delete permissions
along with write permissions.
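
The temporary files come from Hadoop's FileOutputCommitter: each task
attempt writes under a _temporary directory, and the output is renamed into
place at task and job commit. An illustrative layout (bucket, job, and
attempt IDs are made up):

    s3://my-bucket/output/_temporary/0/_temporary/attempt_201608250821_0001_m_000000_0/part-00000
    s3://my-bucket/output/part-00000    (after commit)
    s3://my-bucket/output/_SUCCESS

The rename step is why delete permission is needed, and because S3 has no
real rename (each "rename" is a copy plus a delete), it is also why the
commit phase discussed above is so slow.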