Fwd: Missing output partition file in S3

2016-09-19 Thread Richard Catlin


> Begin forwarded message:
> 
> From: "Chen, Kevin" 
> Subject: Re: Missing output partition file in S3
> Date: September 19, 2016 at 10:54:44 AM PDT
> To: Steve Loughran 
> Cc: "user@spark.apache.org" 
> 
> Hi Steve,
> 
> Our S3 is on US East, but this issue also occurred when we were using an S3 
> bucket on US West. We are using s3n. We use a Spark standalone deployment, and 
> we run the job in EC2. The datasets are about 25GB. We did not have speculative 
> execution turned on, and we did not use the DirectCommitter.
> 
> Thanks,
> Kevin
> 
> From: Steve Loughran <ste...@hortonworks.com>
> Date: Friday, September 16, 2016 at 3:46 AM
> To: Chen Kevin <kevin.c...@neustar.biz>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: Missing output partition file in S3
> 
> 
>> On 15 Sep 2016, at 19:37, Chen, Kevin wrote:
>> 
>> Hi,
>> 
>> Has anyone encountered an issue of a missing output partition file in S3? My 
>> Spark job writes output to an S3 location. Occasionally, I notice one 
>> partition file is missing, and as a result one chunk of data is lost. If I 
>> rerun the same job, the problem usually goes away. This happens pretty 
>> randomly; I observe it once or twice a week on a daily job. I am 
>> using Spark 1.2.1.
>> 
>> Any input or suggested fix/workaround would be very much appreciated.
>> 
>> 
>> 
> 
> This doesn't sound good.
> 
> Without making any promises about being able to fix this, I would like to 
> understand the setup to see if there is something that could be done to 
> address it.
> Which S3 installation? US East or elsewhere?
> Which S3 client: s3n or s3a? If on Hadoop 2.7+, can you switch to s3a if you 
> haven't already (exception: if you are using AWS EMR you have to stick with 
> their s3:// client)?
> Are you running in-EC2 or remotely?
> How big are the datasets being generated?
> Do you have speculative execution turned on?
> Which committer? Is it the external "DirectCommitter", or the classic Hadoop 
> FileOutputCommitter? If the latter, and you are using Hadoop 2.7.x, can you 
> try the v2 algorithm (mapreduce.fileoutputcommitter.algorithm.version = 2)?
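> 
> For example, with Spark the v2 setting can usually be passed through to Hadoop 
> with the spark.hadoop. configuration prefix (a sketch, not tied to any 
> particular deployment):
> 
>     spark-submit --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 ...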
> 
> I should warn that the stance of myself and colleagues is "don't commit direct 
> to S3": write to HDFS and do a distcp when you finally copy out the data. S3 
> itself doesn't have enough consistency for committing output to work in the 
> presence of all race conditions and failure modes. At least here you've 
> noticed the problem; the thing people fear is not noticing that a problem has 
> arisen.
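> 
> A minimal sketch of that pattern (paths are placeholders; the DataFrame write 
> API shown is Spark 1.4+):
> 
>     df.write.parquet("hdfs:///tmp/job-output")
> 
> followed by a copy outside Spark:
> 
>     hadoop distcp hdfs:///tmp/job-output s3a://my-bucket/job-output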
> 
> -Steve



Re: off heap to alluxio/tachyon in Spark 2

2016-09-19 Thread Richard Catlin
Here is my understanding.

Spark used Tachyon as an off-heap storage solution for RDDs.  In certain 
situations, it would alleviate garbage collection of the RDDs.

Tungsten, Spark 2's off-heap (columnar) format, is much more efficient and is 
used by default.  Alluxio no longer makes sense for this use.


You can still use Tachyon/Alluxio to bring your files into memory, which is 
quicker for Spark to access than your DFS (HDFS or S3).

Alluxio actually supports a “tiered filesystem” and automatically brings the 
“hotter” files into the fastest storage (memory, SSD).  You can configure it 
with memory, SSD, and/or HDDs, with the DFS as the persistent store, called the 
under filesystem.
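
For example, once a file is in Alluxio, Spark can read it through the alluxio:// 
scheme like any other Hadoop-compatible filesystem (a sketch; it assumes the 
Alluxio client jar is on Spark's classpath, the master host is a placeholder, 
and 19998 is Alluxio's default master port):

    val df = spark.read.parquet("alluxio://alluxio-master:19998/data/events.parquet")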

Hope this helps.

Richard Catlin

> On Sep 19, 2016, at 7:56 AM, aka.fe2s  wrote:
> 
> Hi folks,
> 
> What has happened with Tachyon / Alluxio in Spark 2? The docs no longer 
> mention it.
> 
> --
> Oleksiy Dyagilev



Re: difference between dataframe and dataframewriter

2016-06-16 Thread Richard Catlin
I believe it depends on your Spark application.

To write to Hive, use
dataframe.write.saveAsTable("table_name")

To write to S3, use
dataframe.write.parquet("s3://bucket/path")
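
Both calls go through a DataFrameWriter: dataframe.write returns one, and it is 
what actually performs the save. A minimal sketch (the table and bucket names 
are placeholders):

    dataframe.write.mode("overwrite").saveAsTable("my_hive_table")        // managed Hive table
    dataframe.write.mode("overwrite").parquet("s3a://my-bucket/output/")  // Parquet files on S3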

Hope this helps.
Richard

> On Jun 16, 2016, at 9:54 AM, Natu Lauchande  wrote:
> 
> Does



RE: Nested DataFrames

2015-06-25 Thread Richard Catlin
I am looking to do something similar to this Postgres query in HiveQL.  If
I have a DataFrame student and a DataFrame grade, is this possible?

I read in Learning Spark: Lightning-Fast Big Data Analysis that it should
be possible.  It says in Chapter 9
"SchemaRDDs can store several basic types, as well as structures and arrays
of these types.  They use the HiveQL syntax for type definitions.
(see Table 9-1)."
and goes on to say
"The last type, structures, is simply represented as other Rows in Spark
SQL.  All of these types can also be nested within each other; for example,
you can have arrays of structs, or maps that contain structs"

Here is the URL of the page that describes the Postgres approach:
http://stackoverflow.com/questions/10928210/postgresql-aggregate-array

SELECT s.name, array_agg(g.Mark) AS marks
FROM student s
LEFT JOIN Grade g ON g.Student_id = s.Id
GROUP BY s.Id
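
Presumably the HiveQL analogue would use collect_list in place of array_agg, 
something like this (a sketch; unlike Postgres, HiveQL also needs s.name in the 
GROUP BY):

SELECT s.name, collect_list(g.Mark) AS marks
FROM student s
LEFT JOIN Grade g ON g.Student_id = s.Id
GROUP BY s.Id, s.name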

Thank you.

Richard Catlin


Nesting DataFrames and saving to Parquet

2015-06-24 Thread Richard Catlin
I have two Dataframes.  A "users" DF, and an "investments" DF.  The
"investments" DF has a column that matches the "users" id.  I would like to
nest the collection of investments for each user and save to a parquet file.

Is there a straightforward way to do this?

Thanks.
Richard Catlin


Re: Nested DataFrame(SchemaRDD)

2015-06-24 Thread Richard Catlin
Michael,

I have two Dataframes.  A "users" DF, and an "investments" DF.  The
"investments" DF has a column that matches the "users" id.  I would like to
nest the collection of investments for each user and save to a parquet file.

Is there a straightforward way to do this?

Thanks.
Richard Catlin

On Tue, Jun 23, 2015 at 4:57 PM, Michael Armbrust 
wrote:

> You can also do this using a sequence of case classes (in the example
> stored in a tuple, though the outer container could also be a case class):
>
> case class MyRecord(name: String, location: String)
> val df = Seq((1, Seq(MyRecord("Michael", "Berkeley"), MyRecord("Andy",
> "Oakland")))).toDF("id", "people")
>
> df.printSchema
>
> root
>  |-- id: integer (nullable = false)
>  |-- people: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- name: string (nullable = true)
>  |    |    |-- location: string (nullable = true)
>
> If this dataframe is saved to parquet the nesting will be preserved.
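>
> For example (a sketch; the path is a placeholder):
>
> df.write.parquet("/tmp/nested-people.parquet")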
>
> On Tue, Jun 23, 2015 at 4:35 PM, Roberto Congiu 
> wrote:
>
>> I wrote a brief howto on building nested records in spark and storing
>> them in parquet here:
>> http://www.congiu.com/creating-nested-data-parquet-in-spark-sql/
>>
>> 2015-06-23 16:12 GMT-07:00 Richard Catlin :
>>
>>> How do I create a DataFrame(SchemaRDD) with a nested array of Rows in a
>>> column?  Is there an example?  Will this store as a nested parquet file?
>>>
>>> Thanks.
>>>
>>> Richard Catlin
>>>
>>
>>
>


RE: Nested DataFrame(SchemaRDD)

2015-06-23 Thread Richard Catlin
How do I create a DataFrame(SchemaRDD) with a nested array of Rows in a
column?  Is there an example?  Will this store as a nested parquet file?

Thanks.

Richard Catlin


Can a Spark App run with spark-submit write pdf files to HDFS

2015-06-09 Thread Richard Catlin
I would like to write PDF files using PDFBox to HDFS from my Spark
application.  Can this be done?
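
For instance, would something along these lines work (a sketch, assuming PDFBox 
and the Hadoop client are on the classpath; the path and document contents are 
placeholders)?

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.pdfbox.pdmodel.{PDDocument, PDPage}

    // Build a PDF with PDFBox, then stream it to HDFS through the Hadoop
    // FileSystem API; PDDocument.save accepts any OutputStream.
    val fs = FileSystem.get(new Configuration())
    val doc = new PDDocument()
    doc.addPage(new PDPage())   // a single blank page, for illustration
    val out = fs.create(new Path("hdfs:///tmp/example.pdf"))
    try doc.save(out) finally { doc.close(); out.close() }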


