Hi,

Yes, you can use Alluxio with Spark to read/write to S3. Here is a blog
post on Spark + Alluxio + S3
<https://www.alluxio.com/blog/accelerating-on-demand-data-analytics-with-alluxio>,
and here is some documentation for configuring Alluxio + S3
<http://www.alluxio.org/docs/master/en/Configuring-Alluxio-with-S3.html>
and configuring Spark + Alluxio
<http://www.alluxio.org/docs/master/en/Running-Spark-on-Alluxio.html>.
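
If it helps, reading and writing Parquet through Alluxio from pyspark is
mostly a path change. Here is a minimal sketch -- the master hostname and
paths below are made up, and it assumes the Alluxio client jar is on the
Spark classpath and your S3 bucket is mounted into the Alluxio namespace:

from pyspark.sql import SparkSession

# Hypothetical Alluxio master address and paths -- adjust to your deployment.
ALLUXIO_ROOT = "alluxio://alluxio-master:19998"

spark = SparkSession.builder.appName("alluxio-s3-demo").getOrCreate()

# Read parquet that Alluxio serves (and caches) from the mounted S3 bucket.
df = spark.read.parquet(ALLUXIO_ROOT + "/input/events")

# Write back through Alluxio; whether and when the data is persisted to S3
# depends on the Alluxio write type you configure (e.g. CACHE_THROUGH).
df.write.mode("overwrite").parquet(ALLUXIO_ROOT + "/output/events")

The docs linked above cover mounting the bucket into the Alluxio namespace
and getting the client jar onto the driver and executor classpaths.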

You mentioned that it required a lot of effort to get working. May I ask
what you ran into, and how you got it to work?

Thanks,
Gene

On Thu, May 11, 2017 at 11:55 AM, Miguel Morales <therevolti...@gmail.com>
wrote:

> Might want to try using gzip as opposed to parquet.  The only way I
> ever reliably got parquet to work on S3 is by using Alluxio as a
> buffer, but it's a decent amount of work.
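>
> For illustration only (Spark 2.x pyspark; the bucket and paths below are
> made up), switching a write from parquet to gzipped CSV is roughly:
>
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(100)  # stand-in for whatever DataFrame your job builds
>
> # Write gzip-compressed CSV instead of parquet.
> df.write.option("compression", "gzip").csv("s3a://my-bucket/output-csv/")
>
> # Read it back; Spark decompresses the .gz part files transparently.
> spark.read.csv("s3a://my-bucket/output-csv/").show()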
>
> On Thu, May 11, 2017 at 11:50 AM, lucas.g...@gmail.com
> <lucas.g...@gmail.com> wrote:
> > Also, and this is unrelated to the actual question... Why don't these
> > messages show up in the archive?
> >
> > http://apache-spark-user-list.1001560.n3.nabble.com/
> >
> > Ideally I'd like to link to these questions from our internal wiki, but I
> > can't find them in the archive.
> >
> > On 11 May 2017 at 07:16, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
> >>
> >> Looks like this isn't viable in Spark 2.0.0 (and greater, I presume).  I'm
> >> pretty sure I came across this blog and ignored it due to that.
> >>
> >> Any other thoughts?  The linked tickets in:
> >> https://issues.apache.org/jira/browse/SPARK-10063
> >> https://issues.apache.org/jira/browse/HADOOP-13786
> >> https://issues.apache.org/jira/browse/HADOOP-9565
> >> look relevant too.
> >>
> >> On 10 May 2017 at 22:24, Miguel Morales <therevolti...@gmail.com> wrote:
> >>>
> >>> Try using the DirectParquetOutputCommitter:
> >>> http://dev.sortable.com/spark-directparquetoutputcommitter/
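> >>>
> >>> A rough sketch of wiring that up, assuming Spark 1.6.x (the committer was
> >>> removed in Spark 2.0, see SPARK-10063, and the exact class path below is
> >>> an assumption from that era):
> >>>
> >>> # Spark 1.6.x pyspark sketch; direct committers are not safe with
> >>> # speculative execution, so it is disabled here.
> >>> from pyspark import SparkConf, SparkContext
> >>> from pyspark.sql import SQLContext
> >>>
> >>> conf = SparkConf().set("spark.speculation", "false")
> >>> sc = SparkContext(conf=conf)
> >>> sqlContext = SQLContext(sc)
> >>> sqlContext.setConf(
> >>>     "spark.sql.parquet.output.committer.class",
> >>>     "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")
> >>>
> >>> # Parquet writes then go directly to S3, skipping the rename out of _temporary/.
> >>> sqlContext.range(0, 100).write.parquet("s3a://my-bucket/direct-output/")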
> >>>
> >>> On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com
> >>> <lucas.g...@gmail.com> wrote:
> >>> > Hi users, we have a bunch of pyspark jobs that use S3 for loading, for
> >>> > intermediate steps, and for the final output of parquet files.
> >>> >
> >>> > We're running into the following issues on a semi-regular basis:
> >>> > * These are intermittent errors, i.e. we have about 300 jobs that run
> >>> > nightly... and a fairly random but small-ish percentage of them fail
> >>> > with the following classes of errors.
> >>> >
> >>> > S3 write errors
> >>> >
> >>> >> "ERROR Utils: Aborting task
> >>> >> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404,
> >>> >> AWS Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null,
> >>> >> AWS Error Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
> >>> >
> >>> >
> >>> >>
> >>> >> "Py4JJavaError: An error occurred while calling o43.parquet.
> >>> >> : com.amazonaws.services.s3.model.MultiObjectDeleteException: Status
> >>> >> Code: 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null,
> >>> >> AWS Error Message: One or more objects could not be deleted, S3
> >>> >> Extended Request ID: null"
> >>> >
> >>> >
> >>> >
> >>> > S3 Read Errors:
> >>> >
> >>> >> [Stage 1:=================================================>  (27 + 4) / 31]
> >>> >> 17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage 1.0 (TID 11)
> >>> >> java.net.SocketException: Connection reset
> >>> >> at java.net.SocketInputStream.read(SocketInputStream.java:196)
> >>> >> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> >>> >> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
> >>> >> at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
> >>> >> at sun.security.ssl.InputRecord.read(InputRecord.java:509)
> >>> >> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
> >>> >> at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
> >>> >> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
> >>> >> at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
> >>> >> at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
> >>> >> at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
> >>> >> at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
> >>> >> at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:168)
> >>> >> at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
> >>> >> at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
> >>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> >>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> >>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> >>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> >>> >> at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
> >>> >> at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:187)
> >>> >
> >>> >
> >>> >
> >>> > We have literally tons of logs we can add, but they would make the email
> >>> > unwieldy.  If it would be helpful I'll drop them in a pastebin or
> >>> > something.
> >>> >
> >>> > Our config is along the lines of:
> >>> >
> >>> > spark-2.1.0-bin-hadoop2.7
> >>> > '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
> >>> >
> >>> > Given the Stack Overflow reading / googling I've been doing, I know we're
> >>> > not the only org with these issues, but I haven't found a good set of
> >>> > solutions in those spaces yet.
> >>> >
> >>> > Thanks!
> >>> >
> >>> > Gary Lucas
> >>
> >>
> >
>
