[jira] [Updated] (SOLR-9961) RestoreCore needs the option to download files in parallel.

2019-07-16 Thread Mikhail Khludnev (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated SOLR-9961:
---
Attachment: SOLR-9961.patch

> RestoreCore needs the option to download files in parallel.
> ---
>
> Key: SOLR-9961
> URL: https://issues.apache.org/jira/browse/SOLR-9961
> Project: Solr
> Issue Type: Improvement
> Components: Backup/Restore
> Affects Versions: 6.2.1
> Reporter: Timothy Potter
> Priority: Major
> Attachments: SOLR-9961.patch, SOLR-9961.patch, SOLR-9961.patch, SOLR-9961.patch, SOLR-9961.patch
>
>
> My backup to cloud storage (Google cloud storage in this case, but I think 
> this is a general problem) takes 8 minutes ... the restore of the same core 
> takes hours. The restore loop in RestoreCore is serial and doesn't allow me 
> to parallelize the expensive part of this operation (the IO from the remote 
> cloud storage service). We need the option to parallelize the download (like 
> distcp). 
> Also, I tried downloading the same directory using gsutil and it was very 
> fast, like 2 minutes. So I know it's not the pipe that's limiting perf here.
> Here's a very rough patch that does the parallelization. We may also want to 
> consider a two-step approach: 1) download in parallel to a temp dir, 2) 
> perform all of the checksum validation against the local temp dir. That 
> will save round trips to the remote cloud storage.
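
The two-step approach proposed in the description could be sketched roughly as below. This is only an illustration under stated assumptions, not the attached patch: it does not use Solr's BackupRepository or RestoreCore APIs, a local directory (remoteDir) stands in for the remote cloud storage, the class and method names are hypothetical, and CRC32 with a caller-supplied map of expected values is an assumed stand-in for whatever checksum validation the restore actually performs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.CRC32;

// Hypothetical sketch; names are illustrative and not part of Solr.
class TwoStepRestoreSketch {

    /** Computes the CRC32 of a local file, reading it in 8 KiB chunks. */
    static long crc32(Path file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                crc.update(buf, 0, n);
            }
        }
        return crc.getValue();
    }

    /**
     * Step 1: copy every file named in expectedCrc from remoteDir to tempDir in parallel.
     * Step 2: validate checksums against the local copies only, so no further
     * round trips to remoteDir are needed.
     * Returns the names of files whose local checksum does not match.
     */
    static List<String> restore(Path remoteDir, Path tempDir,
                                Map<String, Long> expectedCrc, int threads)
            throws IOException, InterruptedException {
        Files.createDirectories(tempDir);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Path>> downloads = new ArrayList<>();
            for (String name : expectedCrc.keySet()) {
                downloads.add(() -> Files.copy(remoteDir.resolve(name),
                        tempDir.resolve(name), StandardCopyOption.REPLACE_EXISTING));
            }
            // invokeAll blocks until every download has finished or failed.
            for (Future<Path> f : pool.invokeAll(downloads)) {
                try {
                    f.get();
                } catch (ExecutionException e) {
                    throw new IOException("download failed", e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }

        List<String> mismatched = new ArrayList<>();
        for (Map.Entry<String, Long> entry : expectedCrc.entrySet()) {
            if (crc32(tempDir.resolve(entry.getKey())) != entry.getValue()) {
                mismatched.add(entry.getKey());
            }
        }
        return mismatched;
    }
}
```

An empty list from restore(...) would mean every downloaded file matched its expected checksum. Validation is kept serial here because it only touches local disk; it could be parallelized the same way if it turned out to matter.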



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-9961) RestoreCore needs the option to download files in parallel.

2019-06-29 Thread Mikhail Khludnev (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated SOLR-9961:
---
Attachment: SOLR-9961.patch

> RestoreCore needs the option to download files in parallel.
> ---
>
> Key: SOLR-9961
> URL: https://issues.apache.org/jira/browse/SOLR-9961
> Project: Solr
> Issue Type: Improvement
> Components: Backup/Restore
> Affects Versions: 6.2.1
> Reporter: Timothy Potter
> Priority: Major
> Attachments: SOLR-9961.patch, SOLR-9961.patch, SOLR-9961.patch, SOLR-9961.patch
>
>






[jira] [Updated] (SOLR-9961) RestoreCore needs the option to download files in parallel.

2019-06-28 Thread Mikhail Khludnev (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated SOLR-9961:
---
Status: Patch Available  (was: Open)

> RestoreCore needs the option to download files in parallel.
> ---
>
> Key: SOLR-9961
> URL: https://issues.apache.org/jira/browse/SOLR-9961
> Project: Solr
> Issue Type: Improvement
> Components: Backup/Restore
> Affects Versions: 6.2.1
> Reporter: Timothy Potter
> Priority: Major
> Attachments: SOLR-9961.patch, SOLR-9961.patch, SOLR-9961.patch
>
>






[jira] [Updated] (SOLR-9961) RestoreCore needs the option to download files in parallel.

2019-06-27 Thread Mikhail Khludnev (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated SOLR-9961:
---
Attachment: SOLR-9961.patch

> RestoreCore needs the option to download files in parallel.
> ---
>
> Key: SOLR-9961
> URL: https://issues.apache.org/jira/browse/SOLR-9961
> Project: Solr
> Issue Type: Improvement
> Components: Backup/Restore
> Affects Versions: 6.2.1
> Reporter: Timothy Potter
> Priority: Major
> Attachments: SOLR-9961.patch, SOLR-9961.patch, SOLR-9961.patch
>
>






[jira] [Updated] (SOLR-9961) RestoreCore needs the option to download files in parallel.

2019-06-27 Thread Mikhail Khludnev (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated SOLR-9961:
---
Attachment: (was: SOLR-9961.patch)

> RestoreCore needs the option to download files in parallel.
> ---
>
> Key: SOLR-9961
> URL: https://issues.apache.org/jira/browse/SOLR-9961
> Project: Solr
> Issue Type: Improvement
> Components: Backup/Restore
> Affects Versions: 6.2.1
> Reporter: Timothy Potter
> Priority: Major
> Attachments: SOLR-9961.patch, SOLR-9961.patch
>
>






[jira] [Updated] (SOLR-9961) RestoreCore needs the option to download files in parallel.

2019-06-27 Thread Mikhail Khludnev (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated SOLR-9961:
---
Attachment: SOLR-9961.patch

> RestoreCore needs the option to download files in parallel.
> ---
>
> Key: SOLR-9961
> URL: https://issues.apache.org/jira/browse/SOLR-9961
> Project: Solr
> Issue Type: Improvement
> Components: Backup/Restore
> Affects Versions: 6.2.1
> Reporter: Timothy Potter
> Priority: Major
> Attachments: SOLR-9961.patch, SOLR-9961.patch, SOLR-9961.patch
>
>






[jira] [Updated] (SOLR-9961) RestoreCore needs the option to download files in parallel.

2017-01-13 Thread Timothy Potter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Potter updated SOLR-9961:
-
Attachment: SOLR-9961.patch

Here's an updated patch (created against the 6.3.0 tag, -p0 style) that adds an 
option for {{BackupRepository}} implementations to download in parallel using a 
thread pool. But as stated in the description, this now causes various 
"FileSystem already closed" errors, so it would need to be used with the HDFS 
cache disabled.

I've tested this on a 10G index in Google cloud storage and it completed in ~30 
mins vs. hours or not at all.
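
For readers without the patch at hand, the general shape of a thread-pool download can be sketched as follows. This is a hedged illustration, not the code in SOLR-9961.patch: it does not touch {{BackupRepository}}, a local source directory stands in for the remote repository, and the class name, method name, and thread-count parameter are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; names are illustrative and not part of Solr.
class ParallelCopySketch {

    /**
     * Copies every regular file directly under sourceDir into targetDir,
     * running up to `threads` copies at once. Returns the number of files copied.
     */
    static int copyInParallel(Path sourceDir, Path targetDir, int threads)
            throws IOException, InterruptedException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(sourceDir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Files.createDirectories(targetDir);

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Path file : files) {
                // Returning null makes the lambda a Callable, so the checked
                // IOException from Files.copy can propagate through the Future.
                pending.add(pool.submit(() -> {
                    Files.copy(file, targetDir.resolve(file.getFileName()),
                            StandardCopyOption.REPLACE_EXISTING);
                    return null;
                }));
            }
            // Wait for every copy; surface the first failure as an IOException.
            for (Future<?> f : pending) {
                try {
                    f.get();
                } catch (ExecutionException e) {
                    throw new IOException("copy failed", e.getCause());
                }
            }
            return files.size();
        } finally {
            // Cancels any copies still running if we are exiting on a failure.
            pool.shutdownNow();
        }
    }
}
```

A call such as ParallelCopySketch.copyInParallel(backupDir, restoreDir, 8) would copy with eight worker threads; a suitable pool size depends on the storage service and is something to measure rather than assume.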

> RestoreCore needs the option to download files in parallel.
> ---
>
> Key: SOLR-9961
> URL: https://issues.apache.org/jira/browse/SOLR-9961
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Components: Backup/Restore
> Affects Versions: 6.2.1
> Reporter: Timothy Potter
> Attachments: SOLR-9961.patch, SOLR-9961.patch
>
>






[jira] [Updated] (SOLR-9961) RestoreCore needs the option to download files in parallel.

2017-01-13 Thread Timothy Potter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Potter updated SOLR-9961:
-
Attachment: SOLR-9961.patch

> RestoreCore needs the option to download files in parallel.
> ---
>
> Key: SOLR-9961
> URL: https://issues.apache.org/jira/browse/SOLR-9961
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Components: Backup/Restore
> Affects Versions: 6.2.1
> Reporter: Timothy Potter
> Attachments: SOLR-9961.patch
>
>


