We use both Databricks and EMR. We use Databricks for our exploratory / ad hoc
use cases because their notebook is pretty badass and better than Zeppelin IMHO.
We use EMR for our production machine learning and ETL tasks. The nice thing
about EMR is that you can use applications other than Spark.
If you are running on Amazon, then network performance is always a crapshoot as well.
M
> On Dec 21, 2015, at 4:41 PM, Josh Rosen wrote:
>
> @Eran, are Server 1 and Server 2 both part of the same cluster / do they have
> similar positions in the network topology w.r.t the Spark
If you are running on AWS, I would recommend using S3 instead of HDFS as a
general practice if you are maintaining state or data there. This way you can
treat your Spark clusters as ephemeral compute resources that you can swap out
easily -- e.g. if something breaks, just spin up a fresh cluster.
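To make the "ephemeral cluster" idea concrete, here is a hedged sketch using the AWS CLI. The bucket names, instance sizes, and release label are placeholders (not from this thread), and the script only prints the command it would run rather than executing it:

```shell
#!/bin/sh
# Durable state lives in S3; the cluster itself is disposable.
LOG_URI="s3://my-bucket/emr-logs/"      # placeholder bucket
DATA_URI="s3://my-bucket/data/"         # input/output stays here, not in HDFS

# Sketch of the launch command -- echoed, not executed.
CMD="aws emr create-cluster \
  --release-label emr-4.2.0 \
  --applications Name=Spark \
  --instance-type m3.xlarge --instance-count 3 \
  --log-uri ${LOG_URI}"

echo "$CMD"
echo "Durable data stays at ${DATA_URI}; terminate the cluster when the job finishes."
```

Because nothing the job needs survives only on the cluster, a broken cluster can simply be terminated and replaced.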
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
M
> On Dec 4, 2015, at 8:21 AM, Sateesh Karuturi
> wrote:
>
> Hello Spark experts...
> I am new to Apache Spark. Can anyone send me the proper documentation to learn
> RDD functions?
>
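The programming guide linked above centers on the split between lazy transformations and eager actions. As a rough plain-Python analogy (this is not Spark code -- generators simply behave the same way):

```python
# Plain-Python analogy for RDD laziness: "transformations" only compose a
# plan; an "action" forces evaluation. Illustrative only -- real RDDs also
# add partitioning and fault tolerance.

data = [1, 2, 3, 4, 5]

# "Transformations": nothing is computed yet.
mapped = (x * 2 for x in data)            # like rdd.map(lambda x: x * 2)
filtered = (x for x in mapped if x > 4)   # like rdd.filter(lambda x: x > 4)

# "Action": consuming the pipeline triggers all the work at once.
result = list(filtered)                   # like rdd.collect()
print(result)  # [6, 8, 10]
```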
We were looking into this as well; the answer looks like "no".
Here's the ticket:
https://issues.apache.org/jira/browse/HADOOP-9680
m
On Fri, Dec 4, 2015 at 1:41 PM, Lin, Hao wrote:
> Hi,
>
>
>
> Does anyone know if Spark run in AWS is supported by temporary access
>
You should consider Presto for this use case. If you want fast "first query"
times, it is a better fit.
I think Spark SQL will catch up at some point, but if you are not doing multiple
queries against data cached in RDDs and need low latency, it may not be a good
fit.
M
> On Dec 1, 2015, at 7:23
fig options replace SPARK_CLASSPATH. Further info in the
> docs. I've had good luck with these options, and for ease of use I just set
> them in the spark defaults config.
>
> https://spark.apache.org/docs/latest/configuration.html
>
>> On Tue, 17 Nov 2015 at 21:06 Michal Klos <michal.klo...@gmail.com> wrote:
>> Hi,
>>
>> We are running a Spark Standalone cluster on EMR
Hi,
We are running a Spark Standalone cluster on EMR (note: not using YARN) and
are trying to use S3 w/ EmrFS as our event logging directory.
We are having difficulties with a ClassNotFoundException on EmrFileSystem
when we navigate to the event log screen. This is to be expected as the
EmrFs
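For reference, a hedged sketch of the spark-defaults entries involved. The bucket name and the EmrFS jar locations below are assumptions (the jar path varies by EMR release), not something confirmed in this thread:

```
spark.eventLog.enabled        true
spark.eventLog.dir            s3://my-bucket/spark-event-logs/
# The event-log UI needs the EmrFS classes on its classpath, e.g.:
spark.driver.extraClassPath   /usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
```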
You must add the partitions to the Hive table with something like "alter table
your_table add if not exists partition (country='us');".
If you have dynamic partitioning turned on, you can do 'msck repair table
your_table' to recover the partitions.
I would recommend reviewing the Hive
Hi,
We would like to be added to the Powered by Spark list:
organization name: Localytics
URL: http://eng.localytics.com/
a list of which Spark components you are using: Spark, Spark Streaming,
MLlib
a short description of your use case: Batch, real-time, and predictive
analytics driving our
-effects are idempotent, or
use some transactional locking. Spark's own output operations, such as
saving to Hadoop, use such mechanisms. For instance, in the case of
Hadoop it uses the OutputCommitter classes.
- Patrick
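A minimal pure-Python sketch of the write-to-temp-then-atomic-rename idea behind OutputCommitter-style idempotent output (the file names and layout here are made up for illustration). A retried partition either finds the final file already committed and skips, or redoes the whole write; there is never a partial result:

```python
import os
import tempfile

def commit_partition(output_dir, partition_id, records):
    """Idempotent 'output commit': write to a temp file, then atomically
    rename to the final name, so retries of the same partition are safe."""
    final_path = os.path.join(output_dir, f"part-{partition_id:05d}")
    if os.path.exists(final_path):        # a previous attempt already committed
        return final_path
    fd, tmp_path = tempfile.mkstemp(dir=output_dir)
    with os.fdopen(fd, "w") as f:
        for r in records:
            f.write(f"{r}\n")
    os.rename(tmp_path, final_path)       # atomic rename on POSIX
    return final_path

out_dir = tempfile.mkdtemp()
p1 = commit_partition(out_dir, 0, [1, 2, 3])
p2 = commit_partition(out_dir, 0, [1, 2, 3])   # simulated retry: a no-op
print(p1 == p2)  # True
```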
On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos michal.klo
Hi Spark group,
We haven't been able to find clear descriptions of how Spark handles the
resiliency of RDDs in relationship to executing actions with side-effects.
If you do an `rdd.foreach(someSideEffect)`, then you are doing a
side-effect for each element in the RDD. If a partition goes down --
Hi,
We have a Scala application and we want it to programmatically submit Spark
jobs to a Spark-YARN cluster in yarn-client mode.
We're running into a lot of classpath issues, e.g. once submitted it looks
for jars in our parent Scala application's local directory, jars that it
shouldn't need.
PM, Michal Klos michal.klo...@gmail.com
wrote:
Hi Spark Community,
We would like to define exception handling behavior on RDD instantiation /
build. Since the RDD is lazily evaluated, it seems like we are forced to
put all exception handling in the first action call
an RDD
action. So you can refactor that way too.
On Wed, Mar 11, 2015 at 2:39 PM, Michal Klos michal.klo...@gmail.com
wrote:
Is there a way to have the exception handling go lazily along with the
definition? E.g. we define it on the RDD, but then our exception handling code
gets triggered
Hi Spark Community,
We would like to define exception handling behavior on RDD instantiation /
build. Since the RDD is lazily evaluated, it seems like we are forced to
put all exception handling in the first action call?
This is an example of something that would be nice:
def myRDD = {
Try {
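To see why the handler ends up at the action, here is a plain-Python analogue (generators standing in for a lazy RDD; this is not the Spark API). The exception raised inside the "transformation" only surfaces when the pipeline is consumed, so wrapping the construction itself catches nothing:

```python
def build_pipeline(data):
    # "Transformation": the failing expression is not evaluated yet.
    return (1 // x for x in data)   # ZeroDivisionError lurks if 0 is in data

pipeline = build_pipeline([1, 2, 0])    # no exception here -- it's lazy

try:
    result = list(pipeline)            # "action": evaluation happens now
except ZeroDivisionError:
    result = "failed at action time"

print(result)  # failed at action time
```

So the Try has to wrap the action (or a helper that runs the action), not the RDD definition.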
, Jörn Franke jornfra...@gmail.com wrote:
What database are you using?
On Feb 28, 2015, at 18:15, Michal Klos michal.klo...@gmail.com wrote:
Hi Spark community,
We have a use case where we need to pull huge amounts of data from a SQL
query against a database into Spark. We need to execute
Hi Spark community,
We have a use case where we need to pull huge amounts of data from a SQL
query against a database into Spark. We need to execute the query against
our huge database and not a substitute (SparkSQL, Hive, etc) because of a
couple of factors including custom functions used in the
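One common way to parallelize a big JDBC pull is to split the key range into non-overlapping WHERE predicates and run one query per partition. This mirrors the lowerBound / upperBound / numPartitions scheme Spark's JDBC source uses, but the helper below is a hypothetical sketch for illustration, not Spark API:

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper) on a numeric key into WHERE clauses,
    one per partition, covering every row exactly once."""
    stride = (upper - lower) // num_partitions
    preds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        if i == num_partitions - 1:
            preds.append(f"{column} >= {lo}")   # last partition is open-ended
        else:
            preds.append(f"{column} >= {lo} AND {column} < {lo + stride}")
    return preds

for p in partition_predicates("id", 0, 1000, 4):
    print(p)
# id >= 0 AND id < 250
# id >= 250 AND id < 500
# id >= 500 AND id < 750
# id >= 750
```

Each predicate then becomes one partition's query against the database, so the pull runs with num_partitions concurrent connections instead of one.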