[jira] [Commented] (HDFS-13117) Proposal to support writing replications to HDFS asynchronously

2018-02-10 Thread xuchuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359764#comment-16359764
 ] 

xuchuanyin commented on HDFS-13117:
---

[~jojochuang] Actually I've tested it in a 3-node cluster. The time to copy a 
file from local disk to HDFS with 3 replicas is about *300ms*, while changing 
the HDFS file from 1 replica to 3 replicas costs about *10ms* or less. (These 
figures exclude the time to write the local disk and the time to write the 
first replica to HDFS.)

 

Besides, skipping the write to local disk will save about 33% of the disk 
write I/O (the local copy is an extra write on top of the three HDFS replica 
writes, i.e. roughly one third more).
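
For reference, a minimal sketch of the approach measured above, assuming the 
replication change is made with the existing FileSystem.setReplication call; 
the path, block size, and payload are illustrative only:

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleReplicaThenReplicate {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    final FileSystem fs = FileSystem.get(conf);
    final Path path = new Path("/tmp/data.bin"); // illustrative path

    // Write with a single replica so close() returns once one copy is done.
    try (FSDataOutputStream out =
        fs.create(path, true, 4096, (short) 1, 128L * 1024 * 1024)) {
      out.write(new byte[]{1, 2, 3});
    }

    // Raise the replication factor from another thread; the NameNode
    // schedules the extra replicas in the background, so this returns fast.
    new Thread(() -> {
      try {
        fs.setReplication(path, (short) 3);
      } catch (Exception e) {
        e.printStackTrace();
      }
    }).start();
  }
}
```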



[jira] [Commented] (HDFS-13117) Proposal to support writing replications to HDFS asynchronously

2018-02-07 Thread xuchuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356592#comment-16356592
 ] 

xuchuanyin commented on HDFS-13117:
---

[~jojochuang] [~kihwal] Thanks for your response.

I haven't measured the consequences of writing files directly to HDFS in our 
apps. Actually we haven't tested it yet, since we already knew that when a 
client writes data to HDFS, the call returns only after the last block of the 
last replica is done, so at least the time between finishing the last block of 
the first replica and finishing the last block of the last replica could be 
saved.

 

Our apps are high-performance in-memory computing processes (written in C). 
Their performance reading a local file is about 3~4x better than reading an 
HDFS file, so we worry that write performance will run into the same problem.

 

Now we want to strike a balance between efficiency and disk writes:

Having the process write a temporary local file and copy it to HDFS in another 
thread certainly keeps the process fast, but it causes more disk writes. Hence 
the proposal above.

 

I'm not sure if I have made my idea clear...



[jira] [Created] (HDFS-13117) Proposal to support writing replications to HDFS asynchronously

2018-02-06 Thread xuchuanyin (JIRA)
xuchuanyin created HDFS-13117:
-

 Summary: Proposal to support writing replications to HDFS 
asynchronously
 Key: HDFS-13117
 URL: https://issues.apache.org/jira/browse/HDFS-13117
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: xuchuanyin


My initial question was as below:

```

I've learned that when we write data to HDFS using an interface provided by 
HDFS such as 'FileSystem.create', our client will block until all the blocks 
and their replicas are done. This causes an efficiency problem if we use HDFS 
as our final data storage. As a workaround, many of my colleagues write the 
data to local disk in the main thread and copy it to HDFS in another thread. 
Obviously, this increases the disk I/O.
 
So, is there a way to optimize this usage? I don't want to increase the disk 
I/O, nor do I want to be blocked while the extra replicas are written.

How about writing to HDFS with only one replica in the main thread and setting 
the actual replication factor in another thread? Or is there a better way to 
do this?

```

 

So my proposal here is to support writing the extra replicas to HDFS 
asynchronously. Users can set a minimum replication factor as the acceptable 
number of replicas (less than the default or expected replication factor). 
When writing to HDFS, the user is only blocked until the minimum number of 
replicas has been written, and HDFS continues to complete the extra replicas 
in the background. Since HDFS already periodically checks the integrity of all 
replicas, we can also leave this work to HDFS itself.

 

There are two ways to provide the interface:

1. Create variants of the current interfaces by adding an 
`acceptableReplication` parameter, as below:

```
Before:
FSDataOutputStream create(Path f,
  boolean overwrite,
  int bufferSize,
  short replication,
  long blockSize
) throws IOException

After:
FSDataOutputStream create(Path f,
  boolean overwrite,
  int bufferSize,
  short replication,
  short acceptableReplication, // minimum number of replicas to finish before returning
  long blockSize
) throws IOException
```
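
If variant 1 were adopted, client code might look like the following sketch. 
Note this overload does not exist in HDFS today; the path and size values are 
illustrative only:

```
// Hypothetical usage of the proposed overload (not an existing HDFS API).
FileSystem fs = FileSystem.get(new Configuration());
FSDataOutputStream out = fs.create(new Path("/tmp/data.bin"),
    true,                 // overwrite
    4096,                 // bufferSize
    (short) 3,            // replication: eventual number of replicas
    (short) 1,            // acceptableReplication: return after one replica
    128L * 1024 * 1024);  // blockSize
out.write(new byte[]{1, 2, 3});
out.close(); // blocks only until the first replica is complete
```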

 

2. Add `acceptableReplication` and `asynchronous` options to the runtime (or 
default) configuration, so users will not have to change any interface calls 
and will still benefit from this feature.
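
Similarly for variant 2, a minimal sketch; the configuration keys below are 
purely hypothetical and do not exist in HDFS today:

```
// Hypothetical configuration keys -- names are illustrative only.
Configuration conf = new Configuration();
conf.setInt("dfs.client.write.acceptable-replication", 1);
conf.setBoolean("dfs.client.write.async-replication", true);
FileSystem fs = FileSystem.get(conf);
// Unchanged create() calls would then return once one replica is written,
// and HDFS would complete the remaining replicas in the background.
```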

 

What do you think about this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org