Re: HiveContext: cache table not supported for partitioned table?

2014-10-02 Thread Cheng Lian

Cache table does work with partitioned tables.

I guess you’re experimenting with a default local metastore and the 
metastore_db directory doesn’t exist in the first place. In that case, 
none of the metastore tables/views exist yet, and the error message you 
saw is thrown when the Hive client accesses the PARTITIONS metastore 
table for the first time. However, you should also see this line 
before the error:


   14/10/03 10:51:30 ERROR ObjectStore: Direct SQL failed, falling back
   to ORM

The table is then created on the fly, and the cache operation proceeds 
normally. You can verify this by selecting from the table and checking 
the Spark UI for cached RDDs. If you uncache the table and cache it 
again, you won’t see this error anymore.


Normally, in a production environment you won’t see this error because 
the metastore database is usually set up ahead of time.
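
As a concrete sketch (the table and column names here are hypothetical, 
not from Du's report), the sequence looks like this in a HiveContext SQL 
session; the first CACHE TABLE may log the ObjectStore error above once, 
while uncaching and caching again should be clean:

```sql
-- Hypothetical partitioned table; the first statement that touches the
-- PARTITIONS metastore table may trigger the "Direct SQL failed" fallback.
CREATE TABLE logs (line STRING) PARTITIONED BY (dt STRING);

CACHE TABLE logs;    -- may log the ObjectStore error once, then succeed
UNCACHE TABLE logs;
CACHE TABLE logs;    -- no error this time: the metastore table now exists
```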


On 10/3/14 3:39 AM, Du Li wrote:


Hi,

In Spark 1.1 HiveContext, I ran a create-table command for a partitioned 
table followed by a cache table command and got 
a java.sql.SQLSyntaxErrorException: Table/View 'PARTITIONS' does not 
exist. But cache table worked fine when the table was not partitioned.


Can anybody confirm that caching a partitioned table is not yet 
supported in the current version?


Thanks,
Du




Re: What is the best way to build my developing Spark for testing on EC2?

2014-10-02 Thread Evan Sparks
I recommend using the data generators provided with MLlib to generate synthetic 
data for your scalability tests, provided they're well suited to your 
algorithms. They let you control things like the number of examples, the 
dimensionality of your dataset, and the number of partitions. 
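
MLlib's generators are Scala/Spark code; as a rough stand-alone illustration of the same idea (the function name and defaults below are my own, not MLlib's API), a synthetic-data generator only needs to expose the number of examples, the dimensionality, and a partition count:

```python
import random

def generate_linear_data(num_examples, num_features, num_partitions, seed=42):
    """Generate (label, features) pairs from a random linear model,
    split into num_partitions roughly equal chunks (mimicking RDD partitions)."""
    rng = random.Random(seed)
    weights = [rng.gauss(0, 1) for _ in range(num_features)]
    rows = []
    for _ in range(num_examples):
        x = [rng.gauss(0, 1) for _ in range(num_features)]
        label = sum(w * xi for w, xi in zip(weights, x)) + rng.gauss(0, 0.1)
        rows.append((label, x))
    size = -(-num_examples // num_partitions)  # ceiling division
    return [rows[i:i + size] for i in range(0, num_examples, size)]

partitions = generate_linear_data(num_examples=1000, num_features=10, num_partitions=4)
print(len(partitions), sum(len(p) for p in partitions))  # 4 1000
```

Scaling a sweep is then just a matter of calling this with different argument combinations.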

As far as cluster set up goes, I usually launch spot instances with the 
spark-ec2 scripts, and then check out a repo which contains a simple driver 
application for my code. Then I have something crude like bash scripts running 
my program and collecting output. 
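
The "crude scripts" approach can be sketched in a few lines of Python as well. Here a placeholder command stands in for the real driver (the parameter names and the results file are illustrative, not spark-perf's conventions): the harness runs each point of a small parameter grid and collects output into a CSV.

```python
import csv
import itertools
import subprocess

# Hypothetical parameter grid for a scalability sweep.
grid = {"num_examples": [10000, 100000], "num_partitions": [4, 16]}

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["num_examples", "num_partitions", "output"])
    for n, p in itertools.product(grid["num_examples"], grid["num_partitions"]):
        # Placeholder driver: replace `echo` with e.g. a spark-submit invocation.
        result = subprocess.run(
            ["echo", "n=%d p=%d" % (n, p)],
            capture_output=True, text=True, check=True,
        )
        writer.writerow([n, p, result.stdout.strip()])

with open("results.csv") as f:
    print(len(f.read().splitlines()))  # header + 4 runs = 5
```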

You could have a look at the spark-perf repo if you want something a little 
more principled/automated. 

- Evan

> On Oct 2, 2014, at 5:37 PM, Yu Ishikawa  wrote:
> 
> Hi all, 
> 
> I am trying to contribute some machine learning algorithms to MLlib. 
> I need to evaluate their performance on a cluster, varying the input data 
> size, the number of CPU cores, and other parameters.
> 
> I would like to build my development version of Spark on EC2 automatically. 
> Is there already a build script for a development version, like the spark-ec2
> script?
> Or, if you have a good way to evaluate the performance of a developing 
> MLlib algorithm on a Spark cluster such as EC2, could you share it?
> 
> Best,
> 
> 
> 
> -
> -- Yu Ishikawa
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-the-best-way-to-build-my-developing-Spark-for-testing-on-EC2-tp8638.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 




What is the best way to build my developing Spark for testing on EC2?

2014-10-02 Thread Yu Ishikawa
Hi all, 

I am trying to contribute some machine learning algorithms to MLlib. 
I need to evaluate their performance on a cluster, varying the input data 
size, the number of CPU cores, and other parameters.

I would like to build my development version of Spark on EC2 automatically. 
Is there already a build script for a development version, like the spark-ec2
script?
Or, if you have a good way to evaluate the performance of a developing 
MLlib algorithm on a Spark cluster such as EC2, could you share it?

Best,



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-the-best-way-to-build-my-developing-Spark-for-testing-on-EC2-tp8638.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: EC2 clusters ready in launch time + 30 seconds

2014-10-02 Thread Nicholas Chammas
Thanks for the update, Nate. I'm looking forward to seeing how these
projects turn out.

David, Packer looks very, very interesting. I'm gonna look into it more
next week.

Nick


On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico  wrote:

> Bit of progress on our end, a bit of lagging as well. The guy leading the
> effort got a little bogged down on a client project updating our hive/sql testbed
> to the latest spark/sparkSQL; we're also launching a public service, so we have
> been a bit scattered recently.
>
> Will have some more updates, probably after next week. We are planning on
> taking our client work around hive/spark, plus taking over the bigtop
> automation work, to modernize it and get it fit for human consumption outside
> our org. All our work and puppet modules will be open sourced and documented;
> hopefully we'll start to rally some other folks around the effort who find it useful.
>
> Side note: another effort we are looking into is gradle test support. We
> have been leveraging serverspec for some basic infrastructure tests, but
> with bigtop switching over to a gradle build/testing setup in 0.8 we want to
> include support for that in our own efforts; probably some of that can be
> learned from and leveraged in the spark world for repeatable/tested infrastructure.
>
> If anyone has any specific automation questions about your environment, you
> can drop me a line directly; I'll try to help out as best I can. Otherwise I'll
> post an update to the dev list once we get on top of our own product release and
> the bigtop work.
>
> Nate
>
>
> -Original Message-
> From: David Rowe [mailto:davidr...@gmail.com]
> Sent: Thursday, October 02, 2014 4:44 PM
> To: Nicholas Chammas
> Cc: dev; Shivaram Venkataraman
> Subject: Re: EC2 clusters ready in launch time + 30 seconds
>
> I think this is exactly what packer is for. See e.g.
> http://www.packer.io/intro/getting-started/build-image.html
>
> On a related note, the current AMI for hvm systems (e.g. m3.*, r3.*) has a
> bad package for httpd, which causes ganglia not to start. For some reason I
> can't get access to the raw AMI to fix it.
>
> On Fri, Oct 3, 2014 at 9:30 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com
> > wrote:
>
> > Is there perhaps a way to define an AMI programmatically? Like, a
> > collection of base AMI id + list of required stuff to be installed +
> > list of required configuration changes. I’m guessing that’s what
> > people use things like Puppet, Ansible, or maybe also AWS CloudFormation
> for, right?
> >
> > If we could do something like that, then with every new release of
> > Spark we could quickly and easily create new AMIs that have everything
> we need.
> > spark-ec2 would only have to bring up the instances and do a minimal
> > amount of configuration, and the only thing we’d need to track in the
> > Spark repo is the code that defines what goes on the AMI, as well as a
> > list of the AMI ids specific to each release.
> >
> > I’m just thinking out loud here. Does this make sense?
> >
> > Nate,
> >
> > Any progress on your end with this work?
> >
> > Nick
> >
> > On Sun, Jul 13, 2014 at 8:53 PM, Shivaram Venkataraman <
> > shiva...@eecs.berkeley.edu> wrote:
> >
> > > It should be possible to improve cluster launch time if we are
> > > careful about what commands we run during setup. One way to do this
> > > would be to walk down the list of things we do for cluster
> > > initialization and see if there is anything we can do to make things
> > > faster. Unfortunately this might
> > be
> > > pretty time consuming, but I don't know of a better strategy. The
> > > place
> > to
> > > start would be the setup.sh file at
> > > https://github.com/mesos/spark-ec2/blob/v3/setup.sh
> > >
> > > Here are some things that take a lot of time and could be improved:
> > > 1. Creating swap partitions on all machines. We could check if there
> > > is a way to get EC2 to always mount a swap partition 2. Copying /
> > > syncing things across slaves. The copy-dir script is called too many
> > > times right now and each time it pauses for a few milliseconds
> > > between slaves [1]. This could be improved by removing unnecessary
> > > copies 3. We could make less frequently used modules like Tachyon,
> > > persistent
> > hdfs
> > > not a part of the default setup.
> > >
> > > [1] https://github.com/mesos/spark-ec2/blob/v3/copy-dir.sh#L42
> > >
> > > Thanks
> > > Shivaram
> > >
> > >
> > >
> > >
> > > On Sat, Jul 12, 2014 at 7:02 PM, Nicholas Chammas <
> > > nicholas.cham...@gmail.com> wrote:
> > >
> > > > On Thu, Jul 10, 2014 at 8:10 PM, Nate D'Amico 
> > wrote:
> > > >
> > > > > Starting to work through some automation/config stuff for spark
> > > > > stack
> > > on
> > > > > EC2 with a project, will be focusing the work through the apache
> > bigtop
> > > > > effort to start, can then share with spark community directly as
> > things
> > > > > progress if people are interested
> > > >
> > > >
> > > > Let us know how that goes. I'm definitely interested in hearing more.
> > > >
> > > > Nick
>

RE: EC2 clusters ready in launch time + 30 seconds

2014-10-02 Thread Nate D'Amico
Bit of progress on our end, a bit of lagging as well. The guy leading the effort 
got a little bogged down on a client project updating our hive/sql testbed to the 
latest spark/sparkSQL; we're also launching a public service, so we have been a bit 
scattered recently.

Will have some more updates, probably after next week. We are planning on 
taking our client work around hive/spark, plus taking over the bigtop 
automation work, to modernize it and get it fit for human consumption outside 
our org. All our work and puppet modules will be open sourced and documented; 
hopefully we'll start to rally some other folks around the effort who find it useful.

Side note: another effort we are looking into is gradle test support. We have 
been leveraging serverspec for some basic infrastructure tests, but with bigtop 
switching over to a gradle build/testing setup in 0.8 we want to include support 
for that in our own efforts; probably some of that can be learned from and 
leveraged in the spark world for repeatable/tested infrastructure. 

If anyone has any specific automation questions about your environment, you can 
drop me a line directly; I'll try to help out as best I can. Otherwise I'll post 
an update to the dev list once we get on top of our own product release and the 
bigtop work.

Nate


-Original Message-
From: David Rowe [mailto:davidr...@gmail.com] 
Sent: Thursday, October 02, 2014 4:44 PM
To: Nicholas Chammas
Cc: dev; Shivaram Venkataraman
Subject: Re: EC2 clusters ready in launch time + 30 seconds

I think this is exactly what packer is for. See e.g.
http://www.packer.io/intro/getting-started/build-image.html

On a related note, the current AMI for hvm systems (e.g. m3.*, r3.*) has a bad 
package for httpd, which causes ganglia not to start. For some reason I can't 
get access to the raw AMI to fix it.

On Fri, Oct 3, 2014 at 9:30 AM, Nicholas Chammas  wrote:

> Is there perhaps a way to define an AMI programmatically? Like, a 
> collection of base AMI id + list of required stuff to be installed + 
> list of required configuration changes. I’m guessing that’s what 
> people use things like Puppet, Ansible, or maybe also AWS CloudFormation for, 
> right?
>
> If we could do something like that, then with every new release of 
> Spark we could quickly and easily create new AMIs that have everything we 
> need.
> spark-ec2 would only have to bring up the instances and do a minimal 
> amount of configuration, and the only thing we’d need to track in the 
> Spark repo is the code that defines what goes on the AMI, as well as a 
> list of the AMI ids specific to each release.
>
> I’m just thinking out loud here. Does this make sense?
>
> Nate,
>
> Any progress on your end with this work?
>
> Nick
>
> On Sun, Jul 13, 2014 at 8:53 PM, Shivaram Venkataraman < 
> shiva...@eecs.berkeley.edu> wrote:
>
> > It should be possible to improve cluster launch time if we are 
> > careful about what commands we run during setup. One way to do this 
> > would be to walk down the list of things we do for cluster 
> > initialization and see if there is anything we can do to make things 
> > faster. Unfortunately this might
> be
> > pretty time consuming, but I don't know of a better strategy. The 
> > place
> to
> > start would be the setup.sh file at
> > https://github.com/mesos/spark-ec2/blob/v3/setup.sh
> >
> > Here are some things that take a lot of time and could be improved:
> > 1. Creating swap partitions on all machines. We could check if there 
> > is a way to get EC2 to always mount a swap partition 2. Copying / 
> > syncing things across slaves. The copy-dir script is called too many 
> > times right now and each time it pauses for a few milliseconds 
> > between slaves [1]. This could be improved by removing unnecessary 
> > copies 3. We could make less frequently used modules like Tachyon, 
> > persistent
> hdfs
> > not a part of the default setup.
> >
> > [1] https://github.com/mesos/spark-ec2/blob/v3/copy-dir.sh#L42
> >
> > Thanks
> > Shivaram
> >
> >
> >
> >
> > On Sat, Jul 12, 2014 at 7:02 PM, Nicholas Chammas < 
> > nicholas.cham...@gmail.com> wrote:
> >
> > > On Thu, Jul 10, 2014 at 8:10 PM, Nate D'Amico 
> wrote:
> > >
> > > > Starting to work through some automation/config stuff for spark 
> > > > stack
> > on
> > > > EC2 with a project, will be focusing the work through the apache
> bigtop
> > > > effort to start, can then share with spark community directly as
> things
> > > > progress if people are interested
> > >
> > >
> > > Let us know how that goes. I'm definitely interested in hearing more.
> > >
> > > Nick
> > >
> >
>





Re: EC2 clusters ready in launch time + 30 seconds

2014-10-02 Thread David Rowe
I think this is exactly what packer is for. See e.g.
http://www.packer.io/intro/getting-started/build-image.html

On a related note, the current AMI for hvm systems (e.g. m3.*, r3.*) has a
bad package for httpd, which causes ganglia not to start. For some reason I
can't get access to the raw AMI to fix it.

On Fri, Oct 3, 2014 at 9:30 AM, Nicholas Chammas  wrote:

> Is there perhaps a way to define an AMI programmatically? Like, a
> collection of base AMI id + list of required stuff to be installed + list
> of required configuration changes. I’m guessing that’s what people use
> things like Puppet, Ansible, or maybe also AWS CloudFormation for, right?
>
> If we could do something like that, then with every new release of Spark we
> could quickly and easily create new AMIs that have everything we need.
> spark-ec2 would only have to bring up the instances and do a minimal amount
> of configuration, and the only thing we’d need to track in the Spark repo
> is the code that defines what goes on the AMI, as well as a list of the AMI
> ids specific to each release.
>
> I’m just thinking out loud here. Does this make sense?
>
> Nate,
>
> Any progress on your end with this work?
>
> Nick
>
> On Sun, Jul 13, 2014 at 8:53 PM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
> > It should be possible to improve cluster launch time if we are careful
> > about what commands we run during setup. One way to do this would be to
> > walk down the list of things we do for cluster initialization and see if
> > there is anything we can do to make things faster. Unfortunately this might
> be
> > pretty time consuming, but I don't know of a better strategy. The place
> to
> > start would be the setup.sh file at
> > https://github.com/mesos/spark-ec2/blob/v3/setup.sh
> >
> > Here are some things that take a lot of time and could be improved:
> > 1. Creating swap partitions on all machines. We could check if there is a
> > way to get EC2 to always mount a swap partition
> > 2. Copying / syncing things across slaves. The copy-dir script is called
> > too many times right now and each time it pauses for a few milliseconds
> > between slaves [1]. This could be improved by removing unnecessary copies
> > 3. We could make less frequently used modules like Tachyon, persistent
> hdfs
> > not a part of the default setup.
> >
> > [1] https://github.com/mesos/spark-ec2/blob/v3/copy-dir.sh#L42
> >
> > Thanks
> > Shivaram
> >
> >
> >
> >
> > On Sat, Jul 12, 2014 at 7:02 PM, Nicholas Chammas <
> > nicholas.cham...@gmail.com> wrote:
> >
> > > On Thu, Jul 10, 2014 at 8:10 PM, Nate D'Amico 
> wrote:
> > >
> > > > Starting to work through some automation/config stuff for spark stack
> > on
> > > > EC2 with a project, will be focusing the work through the apache
> bigtop
> > > > effort to start, can then share with spark community directly as
> things
> > > > progress if people are interested
> > >
> > >
> > > Let us know how that goes. I'm definitely interested in hearing more.
> > >
> > > Nick
> > >
> >
>


Re: EC2 clusters ready in launch time + 30 seconds

2014-10-02 Thread Nicholas Chammas
Is there perhaps a way to define an AMI programmatically? Like, a
collection of base AMI id + list of required stuff to be installed + list
of required configuration changes. I’m guessing that’s what people use
things like Puppet, Ansible, or maybe also AWS CloudFormation for, right?

If we could do something like that, then with every new release of Spark we
could quickly and easily create new AMIs that have everything we need.
spark-ec2 would only have to bring up the instances and do a minimal amount
of configuration, and the only thing we’d need to track in the Spark repo
is the code that defines what goes on the AMI, as well as a list of the AMI
ids specific to each release.

I’m just thinking out loud here. Does this make sense?

Nate,

Any progress on your end with this work?

Nick

On Sun, Jul 13, 2014 at 8:53 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> It should be possible to improve cluster launch time if we are careful
> about what commands we run during setup. One way to do this would be to
> walk down the list of things we do for cluster initialization and see if
> there is anything we can do to make things faster. Unfortunately this might be
> pretty time consuming, but I don't know of a better strategy. The place to
> start would be the setup.sh file at
> https://github.com/mesos/spark-ec2/blob/v3/setup.sh
>
> Here are some things that take a lot of time and could be improved:
> 1. Creating swap partitions on all machines. We could check if there is a
> way to get EC2 to always mount a swap partition
> 2. Copying / syncing things across slaves. The copy-dir script is called
> too many times right now and each time it pauses for a few milliseconds
> between slaves [1]. This could be improved by removing unnecessary copies
> 3. We could make less frequently used modules like Tachyon, persistent hdfs
> not a part of the default setup.
>
> [1] https://github.com/mesos/spark-ec2/blob/v3/copy-dir.sh#L42
>
> Thanks
> Shivaram
>
>
>
>
> On Sat, Jul 12, 2014 at 7:02 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> > On Thu, Jul 10, 2014 at 8:10 PM, Nate D'Amico  wrote:
> >
> > > Starting to work through some automation/config stuff for spark stack
> on
> > > EC2 with a project, will be focusing the work through the apache bigtop
> > > effort to start, can then share with spark community directly as things
> > > progress if people are interested
> >
> >
> > Let us know how that goes. I'm definitely interested in hearing more.
> >
> > Nick
> >
>


HiveContext: cache table not supported for partitioned table?

2014-10-02 Thread Du Li
Hi,

In Spark 1.1 HiveContext, I ran a create-table command for a partitioned table 
followed by a cache table command and got a java.sql.SQLSyntaxErrorException: 
Table/View 'PARTITIONS' does not exist. But cache table worked fine when the 
table was not partitioned.

Can anybody confirm that caching a partitioned table is not yet supported in 
the current version?

Thanks,
Du