Just FYI: A separate discussion was started in the GCS connector issue tracker to come up with a way to support Accumulo. See https://github.com/GoogleCloudPlatform/bigdata-interop/issues/104
It'd be great to draw more attention to the issue, so if you're interested, please add a thumbs-up :)

Maxim

On Fri, Jun 22, 2018 at 4:09 PM Maxim Kolchin <kolchin...@gmail.com> wrote:

> > If somebody is interested in using Accumulo on GCS, I'd like to encourage them to submit any bugs they encounter, and any patches (if they are able) which resolve those bugs.
>
> I'd like to contribute a fix, but I don't know where to start. We tried to get help from Google Support about [1] over email, but they just say that GCS doesn't support such a write pattern. In the end, we can only guess how to adjust Accumulo's behaviour to minimise broken connections to GCS.
>
> BTW, although we observe this exception, the tablet server doesn't fail, so after some retries it is able to write WALs to GCS.
>
> @Stephen,
>
> > as discussions with MS engineers have suggested, similar to the GCS thread, small writes at high volume are, at best, suboptimal for ADLS.
>
> Did you try to adjust any Accumulo properties to do bigger writes less frequently, or something like that?
>
> [1]: https://github.com/GoogleCloudPlatform/bigdata-interop/issues/103
>
> Maxim
>
> On Thu, Jun 21, 2018 at 7:17 AM Stephen Meyles <smey...@gmail.com> wrote:
>
>> I think we're seeing something similar, but in our case we're trying to run Accumulo atop ADLS.
>> When we generate sufficient write load, we start to see stack traces like the following:
>>
>> [log.DfsLogger] ERROR: Failed to write log entries
>> java.io.IOException: attempting to write to a closed stream
>>   at com.microsoft.azure.datalake.store.ADLFileOutputStream.write(ADLFileOutputStream.java:88)
>>   at com.microsoft.azure.datalake.store.ADLFileOutputStream.write(ADLFileOutputStream.java:77)
>>   at org.apache.hadoop.fs.adl.AdlFsOutputStream.write(AdlFsOutputStream.java:57)
>>   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:48)
>>   at java.io.DataOutputStream.write(DataOutputStream.java:88)
>>   at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
>>   at org.apache.accumulo.tserver.logger.LogFileKey.write(LogFileKey.java:87)
>>   at org.apache.accumulo.tserver.log.DfsLogger.write(DfsLogger.java:537)
>>
>> We have developed a rudimentary LogCloser implementation that allows us to recover from this, but overall performance is significantly impacted.
>>
>> > As for the WAL closing issue on GCS, I recall a previous thread about that
>>
>> I searched more for this but wasn't able to find anything, nor anything similar re: ADL. I am also curious about the earlier question:
>>
>> > Does Accumulo have a specific write pattern [to WALs], so that file system may not support it?
>>
>> as discussions with MS engineers have suggested, similar to the GCS thread, small writes at high volume are, at best, suboptimal for ADLS.
>>
>> Regards
>>
>> Stephen
>>
>> On Wed, Jun 20, 2018 at 11:20 AM, Christopher <ctubb...@apache.org> wrote:
>>
>>> For what it's worth, this is an Apache project, not a Sqrrl project. Amazon is free to contribute to Accumulo to improve its support of their platform, just as anybody is free to do. Amazon may start contributing more as a result of their acquisition... or they may not.
>>> There is no reason to expect that their acquisition will have any impact whatsoever on the platforms Accumulo supports, because Accumulo is not, and has never been, a Sqrrl project (although some Sqrrl employees have contributed), and thus will not become an Amazon project. It has been, and will remain, a vendor-neutral Apache project. Regardless, we welcome contributions from anybody which would improve Accumulo's support of any additional platform alternatives to HDFS, whether it be GCS, S3, or something else.
>>>
>>> As for the WAL closing issue on GCS, I recall a previous thread about that... I think a simple patch might solve that issue, but to date, nobody has contributed a fix. If somebody is interested in using Accumulo on GCS, I'd like to encourage them to submit any bugs they encounter, and any patches (if they are able) which resolve those bugs. If they need help submitting a fix, please ask on the dev@ list.
>>>
>>> On Wed, Jun 20, 2018 at 8:21 AM Geoffry Roberts <threadedb...@gmail.com> wrote:
>>>
>>>> Maxim,
>>>>
>>>> Interesting that you were able to run Accumulo on GCS. I never thought of that--good to know.
>>>>
>>>> Since I am now an AWS guy (at least for the time being), in light of the fact that Amazon purchased Sqrrl, I am interested to see what develops.
>>>>
>>>> On Wed, Jun 20, 2018 at 5:15 AM, Maxim Kolchin <kolchin...@gmail.com> wrote:
>>>>
>>>>> Hi Geoffry,
>>>>>
>>>>> Thank you for the feedback!
>>>>>
>>>>> Thanks to [1, 2], I was able to run an Accumulo cluster on Google VMs with GCS instead of HDFS, and I used Google Dataproc to run Hadoop jobs on Accumulo. Almost everything worked well until I faced some connection issues with GCS. Quite often, the connection to GCS breaks on writing or closing WALs.
>>>>>
>>>>> To all,
>>>>>
>>>>> Does Accumulo have a specific write pattern, such that a file system may not support it? Are there Accumulo properties which I can play with to adjust the write pattern?
>>>>>
>>>>> [1]: https://github.com/cybermaggedon/accumulo-gs
>>>>> [2]: https://github.com/cybermaggedon/accumulo-docker
>>>>>
>>>>> Thank you!
>>>>> Maxim
>>>>>
>>>>> On Tue, Jun 19, 2018 at 10:31 PM Geoffry Roberts <threadedb...@gmail.com> wrote:
>>>>>
>>>>>> I tried running Accumulo on Google. I first tried running it on Google's pre-made Hadoop. I found the various file paths one must contend with are different on Google than on a straight download from Apache; it seems they moved things around. To counter this, I installed my own Hadoop along with Zookeeper and Accumulo on a Google node. All went well until one fine day when I could no longer log in. It seems Google had pushed out some changes overnight that broke my client-side Google Cloud installation. Google referred the affected to a lengthy, error-prone procedure for resolving the issue.
>>>>>>
>>>>>> I decided life was too short for this kind of thing and switched to Amazon.
>>>>>>
>>>>>> On Tue, Jun 19, 2018 at 7:34 AM, Maxim Kolchin <kolchin...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Does anyone have experience running Accumulo on top of Google Cloud Storage instead of HDFS? See [1] for details if you haven't heard about this feature.
>>>>>>>
>>>>>>> I see some discussion (see [2], [3]) around this topic, but it looks to me that this isn't as popular as, I believe, it should be.
>>>>>>>
>>>>>>> [1]: https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
>>>>>>> [2]: https://github.com/apache/accumulo/issues/428
>>>>>>> [3]: https://github.com/GoogleCloudPlatform/bigdata-interop/issues/103
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Maxim
>>>>>>
>>>>>> --
>>>>>> There are ways and there are ways,
>>>>>>
>>>>>> Geoffry Roberts
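[Editor's note] The recurring question in this thread is whether Accumulo's WAL write pattern (many small, synchronous writes) is a poor fit for object stores such as GCS and ADLS. As a rough, self-contained illustration of why "bigger writes less frequently" helps, here is a minimal Java sketch. It is not Accumulo, GCS, or ADLS code: `CountingOutputStream` is a hypothetical stand-in for a remote object-store stream, and the 64 KiB buffer size is an arbitrary assumption.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical stand-in for a remote object-store stream: it records how
// many write() calls reach the "remote" side (each one modeling a round trip).
class CountingOutputStream extends OutputStream {
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    int remoteWrites = 0;

    @Override
    public void write(int b) {
        remoteWrites++;
        data.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) {
        remoteWrites++;
        data.write(b, off, len);
    }

    int size() { return data.size(); }
}

public class CoalescedWalDemo {
    public static void main(String[] args) throws IOException {
        CountingOutputStream direct = new CountingOutputStream();
        CountingOutputStream coalesced = new CountingOutputStream();

        // 10,000 tiny "log entries" written directly: one remote call each.
        for (int i = 0; i < 10_000; i++) {
            direct.write(new byte[] {1, 2, 3, 4});
        }

        // The same entries through a 64 KiB buffer: the small writes are
        // coalesced into far fewer, much larger remote calls.
        try (OutputStream wal = new BufferedOutputStream(coalesced, 64 * 1024)) {
            for (int i = 0; i < 10_000; i++) {
                wal.write(new byte[] {1, 2, 3, 4});
            }
        }

        System.out.println("direct remote writes:    " + direct.remoteWrites);
        System.out.println("coalesced remote writes: " + coalesced.remoteWrites);
    }
}
```

Both streams receive the same 40,000 bytes, but the buffered path collapses thousands of tiny calls into a handful of large flushes, which is the kind of batching the thread suggests object stores prefer. The trade-off for a WAL is durability: data sitting in a buffer is not yet persisted, so any real tuning has to respect Accumulo's sync/flush guarantees.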