Re: [spark-sql] JsonRDD

2015-02-03 Thread Daniil Osipov
Thanks Reynold,

Case sensitivity issues are definitely orthogonal. I'll submit a bug or PR.

Is there a way to rename the object to eliminate the confusion? I'm not sure
how locked down the API is at this time, but it seems like a potential
point of confusion for developers.

On Mon, Feb 2, 2015 at 4:30 PM, Reynold Xin  wrote:

> It's bad naming - JsonRDD is actually not an RDD. It is just a set of util
> methods.
>
> The case sensitivity issues seem orthogonal, and would be great to be able
> to control that with a flag.
>
>
> On Mon, Feb 2, 2015 at 4:16 PM, Daniil Osipov 
> wrote:
>
>> Hey Spark developers,
>>
>> Is there a good reason for JsonRDD being a Scala object as opposed to a
>> class? Most other RDDs seem to be classes, and can be extended.
>>
>> The reason I'm asking is that there is a problem with Hive
>> interoperability with JSON DataFrames: jsonFile generates a
>> case-sensitive schema, while Hive expects case-insensitive names and
>> fails with an exception during saveAsTable if there are two columns
>> with the same name in different case.
>>
>> I'm trying to resolve the problem, but that requires me to extend JsonRDD,
>> which I can't do. Other RDDs are subclass-friendly; why is JsonRDD
>> different?
>>
>> Dan
>>
>
>
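A minimal sketch of the clash being discussed (pure Python for illustration; the helper function and sample schema are hypothetical, not Spark API):

```python
from collections import defaultdict

def case_insensitive_collisions(columns):
    """Group column names that collide when compared case-insensitively,
    the way Hive compares them; return only groups with >1 spelling."""
    groups = defaultdict(list)
    for name in columns:
        groups[name.lower()].append(name)
    return {k: v for k, v in groups.items() if len(v) > 1}

# jsonFile happily infers both spellings; Hive's saveAsTable rejects them.
schema = ["userId", "userid", "name"]
print(case_insensitive_collisions(schema))  # {'userid': ['userId', 'userid']}
```

A flag like the one suggested above could lowercase (or reject) such columns at schema-inference time instead of failing later inside saveAsTable.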


[spark-sql] JsonRDD

2015-02-02 Thread Daniil Osipov
Hey Spark developers,

Is there a good reason for JsonRDD being a Scala object as opposed to a
class? Most other RDDs seem to be classes, and can be extended.

The reason I'm asking is that there is a problem with Hive interoperability
with JSON DataFrames: jsonFile generates a case-sensitive schema, while
Hive expects case-insensitive names and fails with an exception during
saveAsTable if there are two columns with the same name in different case.

I'm trying to resolve the problem, but that requires me to extend JsonRDD,
which I can't do. Other RDDs are subclass-friendly; why is JsonRDD
different?

Dan


Re: EC2 clusters ready in launch time + 30 seconds

2014-10-06 Thread Daniil Osipov
I've also been looking at this. Basically, the Spark EC2 script is
excellent for small development clusters of several nodes, but isn't
suitable for production. It handles instance setup in a single-threaded
manner, though setup could easily be parallelized. It also doesn't handle
failure well, e.g. when an instance fails to start or takes too long to
respond.

Our goal was an equivalent of the Amazon EMR[1] API that would trigger
Spark jobs, including the specified cluster setup. I've done some work
toward that end, and it would benefit greatly from an updated AMI.

Dan

[1]
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html
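The single-threaded setup described above can be parallelized with simple retry handling. A sketch, assuming a `setup_instance` callable that stands in for the real per-instance provisioning step (hypothetical; the worker and retry counts are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def setup_cluster(instances, setup_instance, workers=8, retries=2):
    """Run setup_instance(host) across all instances in parallel instead of
    one at a time, retrying hosts that fail up to `retries` attempts."""
    remaining = {host: retries for host in instances}
    done, failed = [], []
    while remaining:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(setup_instance, h): h for h in remaining}
            for fut in as_completed(futures):
                host = futures[fut]
                try:
                    fut.result()
                    done.append(host)
                    del remaining[host]
                except Exception:
                    remaining[host] -= 1
                    if remaining[host] == 0:
                        failed.append(host)
                        del remaining[host]
    return done, failed
```

Hosts that exhaust their attempts are reported rather than hanging the whole launch, which addresses the "fails to start or takes too long" case.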

On Sat, Oct 4, 2014 at 7:28 AM, Nicholas Chammas  wrote:

> Thanks for posting that script, Patrick. It looks like a good place to
> start.
>
> Regarding Docker vs. Packer, as I understand it you can use Packer to
> create Docker containers at the same time as AMIs and other image types.
>
> Nick
>
>
> On Sat, Oct 4, 2014 at 2:49 AM, Patrick Wendell 
> wrote:
>
> > Hey All,
> >
> > Just a couple notes. I recently posted a shell script for creating the
> > AMIs from a clean Amazon Linux AMI.
> >
> > https://github.com/mesos/spark-ec2/blob/v3/create_image.sh
> >
> > I think I will update the AMIs soon to get the most recent security
> > updates. For spark-ec2's purposes this is probably sufficient (we'll
> > only need to re-create them every few months).
> >
> > However, it would be cool if someone wanted to tackle providing a more
> > general mechanism for defining Spark-friendly "images" that can be
> > used more broadly. I had thought that Docker might be a good way to go
> > for something like this, but maybe Packer is a good fit too.
> >
> > For one thing, if we had a standard image we could use it to create
> > containers for running Spark's unit tests, which would be really cool.
> > This would help a lot with the random issues around port and filesystem
> > contention that we have in unit tests.
> >
> > I'm not sure if the long term place for this would be inside the spark
> > codebase or a community library or what. But it would definitely be
> > very valuable to have if someone wanted to take it on.
> >
> > - Patrick
> >
> > On Fri, Oct 3, 2014 at 5:20 PM, Nicholas Chammas
> >  wrote:
> > > FYI: There is an existing issue -- SPARK-3314 -- about scripting the
> > > creation of Spark AMIs.
> > >
> > > With Packer, it looks like we may be able to script the creation of
> > > multiple image types (VMware, GCE, AMI, Docker, etc.) at once from a
> > > single Packer template. That's very cool.
> > >
> > > I'll be looking into this.
> > >
> > > Nick
> > >
> > >
> > > On Thu, Oct 2, 2014 at 8:23 PM, Nicholas Chammas <
> > > nicholas.cham...@gmail.com> wrote:
> > >
> > >> Thanks for the update, Nate. I'm looking forward to seeing how these
> > >> projects turn out.
> > >>
> > >> David, Packer looks very, very interesting. I'm gonna look into it
> > >> more next week.
> > >>
> > >> Nick
> > >>
> > >>
> > >> On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico  wrote:
> > >>
> > >>> A bit of progress on our end, and a bit of lagging as well. Our guy
> > >>> leading the effort got a little bogged down on a client project to
> > >>> update a Hive/SQL testbed to the latest Spark/Spark SQL, and we're
> > >>> also launching a public service, so we have been a bit scattered
> > >>> recently.
> > >>>
> > >>> We should have some more updates after next week. We are planning on
> > >>> taking our client work around Hive/Spark, plus taking over the Bigtop
> > >>> automation work, to modernize it and get it fit for human consumption
> > >>> outside our org. All our work and Puppet modules will be open sourced
> > >>> and documented; hopefully we can start to rally some other folks
> > >>> around the effort who find it useful.
> > >>>
> > >>> Side note: another effort we are looking into is Gradle test support.
> > >>> We have been leveraging Serverspec for some basic infrastructure
> > >>> tests, but with Bigtop switching over to a Gradle build/testing setup
> > >>> in 0.8 we want to include support for that in our own efforts; some
> > >>> of this can probably be learned from and leveraged in the Spark world
> > >>> for repeatable, tested infrastructure.
> > >>>
> > >>> If anyone has automation questions specific to your environment, you
> > >>> can drop me a line directly; I'll try to help out as best I can.
> > >>> Otherwise I will post an update to the dev list once we get on top of
> > >>> our own product release and the Bigtop work.
> > >>>
> > >>> Nate
> > >>>
> > >>>
> > >>> -Original Message-
> > >>> From: David Rowe [mailto:davidr...@gmail.com]
> > >>> Sent: Thursday, October 02, 2014 4:44 PM
> > >>> To: Nicholas Chammas
> > >>> Cc: dev; Shivaram Venkataraman
> > >>> Subject: Re: EC2 clusters ready in launch time + 30 seconds
> > >>>
> > >>> I think this is exactly what Packer is for. See e.g.
> > >>> http://www.packer.i