Are you running EMR?

On Sun, May 14, 2017 at 4:59 AM, Miguel Morales <therevolti...@gmail.com> wrote:
> Some things just didn't work as I had first expected. For example,
> writing from a Spark collection to an Alluxio destination didn't
> persist the files to S3 automatically.
>
> I remember having to use the Alluxio library directly to force the
> files to persist to S3 after Spark finished writing to Alluxio.
>
> On Fri, May 12, 2017 at 6:52 AM, Gene Pang <gene.p...@gmail.com> wrote:
> > Hi,
> >
> > Yes, you can use Alluxio with Spark to read/write to S3. Here is a blog post
> > on Spark + Alluxio + S3, and here is some documentation for configuring
> > Alluxio + S3 and configuring Spark + Alluxio.
> >
> > You mentioned that it required a lot of effort to get working. May I ask
> > what you ran into, and how you got it to work?
> >
> > Thanks,
> > Gene
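For readers hitting the same thing: below is a minimal pyspark sketch of the write-through setup being discussed. It assumes Alluxio 1.x, where the client-side property alluxio.user.file.writetype.default controls whether writes propagate to the under store (the default, MUST_CACHE, writes only to Alluxio memory, which matches the behavior Miguel describes). It also assumes the Alluxio client jar is on the Spark classpath; the master hostname, port, and paths are placeholders.

    from pyspark.sql import SparkSession

    # CACHE_THROUGH writes synchronously to Alluxio *and* the mounted
    # under store (S3 here); the default MUST_CACHE writes only to
    # Alluxio memory, leaving nothing in S3 without a separate persist.
    write_type = "-Dalluxio.user.file.writetype.default=CACHE_THROUGH"

    spark = (
        SparkSession.builder
        .appName("alluxio-write-through")
        .config("spark.driver.extraJavaOptions", write_type)
        .config("spark.executor.extraJavaOptions", write_type)
        .getOrCreate()
    )

    df = spark.range(1000)
    # With CACHE_THROUGH this should land in the S3 bucket mounted at
    # /mnt/s3 without calling the Alluxio API directly to persist it.
    df.write.parquet("alluxio://alluxio-master:19998/mnt/s3/output")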
> > On Thu, May 11, 2017 at 11:55 AM, Miguel Morales <therevolti...@gmail.com> wrote:
> >> Might want to try gzip as opposed to parquet. The only way I ever
> >> reliably got parquet to work on S3 is by using Alluxio as a buffer,
> >> but it's a decent amount of work.
> >>
> >> On Thu, May 11, 2017 at 11:50 AM, lucas.g...@gmail.com
> >> <lucas.g...@gmail.com> wrote:
> >> > Also, and this is unrelated to the actual question... why don't these
> >> > messages show up in the archive?
> >> >
> >> > http://apache-spark-user-list.1001560.n3.nabble.com/
> >> >
> >> > Ideally I'd want to post a link to our internal wiki for these
> >> > questions, but I can't find them in the archive.
> >> >
> >> > On 11 May 2017 at 07:16, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
> >> >> Looks like this isn't viable in Spark 2.0.0 (and greater, I presume).
> >> >> I'm pretty sure I came across this blog and ignored it for that reason.
> >> >>
> >> >> Any other thoughts? The linked tickets in
> >> >> https://issues.apache.org/jira/browse/SPARK-10063,
> >> >> https://issues.apache.org/jira/browse/HADOOP-13786, and
> >> >> https://issues.apache.org/jira/browse/HADOOP-9565 look relevant too.
> >> >>
> >> >> On 10 May 2017 at 22:24, Miguel Morales <therevolti...@gmail.com> wrote:
> >> >>> Try using the DirectParquetOutputCommitter:
> >> >>> http://dev.sortable.com/spark-directparquetoutputcommitter/
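For anyone finding this thread later: on Spark 1.x the direct committer was enabled with a SQL conf, roughly as sketched below. This is only the pre-2.0 approach (the committer was removed in 2.0.0 per SPARK-10063, as noted above), and the fully qualified class name moved between 1.x releases, so treat both the class name and the exact behavior as assumptions to verify against your version.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="direct-parquet-committer")
    sqlContext = SQLContext(sc)

    # The direct committer writes task output straight to the final
    # destination, skipping the copy/rename step that is slow and
    # non-atomic on S3. Class name as of Spark 1.4/1.5; it moved to
    # o.a.s.sql.execution.datasources.parquet in later 1.x releases.
    sqlContext.setConf(
        "spark.sql.parquet.output.committer.class",
        "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

    # Known caveat: unsafe with speculative execution or task retries,
    # because failed attempts can leave partial files in place.
    df = sqlContext.range(1000)
    df.write.parquet("s3a://some-bucket/output")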
> >> >>> On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com
> >> >>> <lucas.g...@gmail.com> wrote:
> >> >>> > Hi users, we have a bunch of pyspark jobs that are using S3 for
> >> >>> > loading / intermediate steps and final output of parquet files.
> >> >>> >
> >> >>> > We're running into the following issues on a semi-regular basis:
> >> >>> > these are intermittent errors, i.e. we have about 300 jobs that run
> >> >>> > nightly, and a fairly random but small-ish percentage of them fail
> >> >>> > with the following classes of errors.
> >> >>> >
> >> >>> > S3 write errors:
> >> >>> >
> >> >>> >> "ERROR Utils: Aborting task
> >> >>> >> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404,
> >> >>> >> AWS Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null,
> >> >>> >> AWS Error Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
> >> >>> >
> >> >>> >> "Py4JJavaError: An error occurred while calling o43.parquet.
> >> >>> >> : com.amazonaws.services.s3.model.MultiObjectDeleteException:
> >> >>> >> Status Code: 0, AWS Service: null, AWS Request ID: null,
> >> >>> >> AWS Error Code: null, AWS Error Message: One or more objects
> >> >>> >> could not be deleted, S3 Extended Request ID: null"
> >> >>> >
> >> >>> > S3 read errors:
> >> >>> >
> >> >>> >> [Stage 1:=================================================> (27 + 4) / 31]
> >> >>> >> 17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage 1.0 (TID 11)
> >> >>> >> java.net.SocketException: Connection reset
> >> >>> >>   at java.net.SocketInputStream.read(SocketInputStream.java:196)
> >> >>> >>   at java.net.SocketInputStream.read(SocketInputStream.java:122)
> >> >>> >>   at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
> >> >>> >>   at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
> >> >>> >>   at sun.security.ssl.InputRecord.read(InputRecord.java:509)
> >> >>> >>   at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
> >> >>> >>   at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
> >> >>> >>   at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
> >> >>> >>   at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
> >> >>> >>   at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
> >> >>> >>   at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
> >> >>> >>   at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
> >> >>> >>   at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:168)
> >> >>> >>   at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
> >> >>> >>   at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
> >> >>> >>   at java.io.FilterInputStream.close(FilterInputStream.java:181)
> >> >>> >>   at java.io.FilterInputStream.close(FilterInputStream.java:181)
> >> >>> >>   at java.io.FilterInputStream.close(FilterInputStream.java:181)
> >> >>> >>   at java.io.FilterInputStream.close(FilterInputStream.java:181)
> >> >>> >>   at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
> >> >>> >>   at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:187)
> >> >>> >
> >> >>> > We have literally tons of logs we can add, but they would make the
> >> >>> > email unwieldy. If it would be helpful I'll drop them in a pastebin
> >> >>> > or something.
> >> >>> >
> >> >>> > Our config is along the lines of:
> >> >>> >
> >> >>> > spark-2.1.0-bin-hadoop2.7
> >> >>> > '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
> >> >>> >
> >> >>> > Given the Stack Overflow threads and googling I've been doing, I
> >> >>> > know we're not the only org with these issues, but I haven't found
> >> >>> > a good set of solutions in those spaces yet.
> >> >>> >
> >> >>> > Thanks!
> >> >>> >
> >> >>> > Gary Lucas
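For concreteness, the quoted config string is the usual PYSPARK_SUBMIT_ARGS form. A sketch of how a string like that is typically wired into a standalone pyspark script follows; the script logic and bucket names are placeholders, and the package versions are copied verbatim from Gary's message.

    import os

    # Must be set before the JVM gateway starts, i.e. before the first
    # SparkContext/SparkSession is created in this process.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages com.amazonaws:aws-java-sdk:1.10.34,"
        "org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell"
    )

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3a-parquet-job").getOrCreate()

    # The failing pattern from the report: parquet reads and writes
    # against s3a:// paths for inputs, intermediates, and final output.
    df = spark.read.parquet("s3a://some-bucket/input")
    df.write.parquet("s3a://some-bucket/output")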