Re: Databricks Cloud vs AWS EMR

2016-01-28 Thread Michal Klos
We use both Databricks and EMR. We use Databricks for our exploratory / ad hoc use cases because their notebook is pretty badass and better than Zeppelin IMHO. We use EMR for our production machine learning and ETL tasks. The nice thing about EMR is you can use applications other than Spark.

Re: fishing for help!

2015-12-21 Thread Michal Klos
If you are running on Amazon, then it's always a crapshoot as well.
M
> On Dec 21, 2015, at 4:41 PM, Josh Rosen wrote:
>
> @Eran, are Server 1 and Server 2 both part of the same cluster / do they have
> similar positions in the network topology w.r.t the Spark

Re: newbie best practices: is spark-ec2 intended to be used to manage long-lasting infrastructure ?

2015-12-04 Thread Michal Klos
If you are running on AWS, I would recommend using S3 instead of HDFS as a general practice if you are maintaining state or data there. This way you can treat your Spark clusters as ephemeral compute resources that you can swap out easily -- e.g., if something breaks, just spin up a fresh cluster

Re: RDD functions

2015-12-04 Thread Michal Klos
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
M
> On Dec 4, 2015, at 8:21 AM, Sateesh Karuturi wrote:
>
> Hello Spark experts...
> I am new to Apache Spark. Can anyone send me the proper documentation to learn
> RDD functions?

Re: Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) support by Spark?

2015-12-04 Thread Michal Klos
We were looking into this as well --- the answer looks like "no". Here's the ticket: https://issues.apache.org/jira/browse/HADOOP-9680
m
On Fri, Dec 4, 2015 at 1:41 PM, Lin, Hao wrote:
> Hi,
>
> Does anyone know if Spark run in AWS is supported by temporary access

Re: Low Latency SQL query

2015-12-01 Thread Michal Klos
You should consider Presto for this use case. If you want fast "first query" times, it is a better fit. I think Spark SQL will catch up at some point, but if you are not doing multiple queries against data cached in RDDs and need low latency, it may not be a good fit.
M
> On Dec 1, 2015, at 7:23

Re: Additional Master daemon classpath

2015-11-18 Thread Michal Klos
fig options replace SPARK_CLASSPATH. Further info in the
> docs. I've had good luck with these options, and for ease of use I just set
> them in the spark defaults config.
>
> https://spark.apache.org/docs/latest/configuration.html
>
>> On Tue, 17 Nov 2015 at 21:06 Michal Klos

Re: Additional Master daemon classpath

2015-11-18 Thread Michal Klos
t set
> them in the spark defaults config.
>
> https://spark.apache.org/docs/latest/configuration.html
>
>> On Tue, 17 Nov 2015 at 21:06 Michal Klos <michal.klo...@gmail.com> wrote:
>> Hi,
>>
>> We are running a Spark Standalone cluster on EMR

Additional Master daemon classpath

2015-11-17 Thread Michal Klos
Hi, We are running a Spark Standalone cluster on EMR (note: not using YARN) and are trying to use S3 with EMRFS as our event logging directory. We are hitting a ClassNotFoundException on EmrFileSystem when we navigate to the event log screen. This is to be expected as the EMRFS

Re: Partitioned Parquet based external table

2015-11-12 Thread Michal Klos
You must add the partitions to the Hive table with something like "alter table your_table add if not exists partition (country='us');". If you have dynamic partitioning turned on, you can do 'msck repair table your_table' to recover the partitions. I would recommend reviewing the Hive
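A minimal sketch of generating the kind of statement quoted above for several partition values at once. The helper name `add_partition_ddl` and the `country` values are hypothetical, values are not escaped, and the resulting strings would still have to be executed through Hive or a Hive-enabled Spark SQL context.

```python
def add_partition_ddl(table, spec):
    """Build a Hive ADD PARTITION statement that is safe to re-run.

    `spec` maps partition column names to string values (not escaped here).
    """
    clause = ", ".join("{}='{}'".format(col, val) for col, val in spec.items())
    return "ALTER TABLE {} ADD IF NOT EXISTS PARTITION ({})".format(table, clause)


# Hypothetical table name and partition values, for illustration only.
statements = [add_partition_ddl("your_table", {"country": c}) for c in ["us", "ca"]]
```

Because of `IF NOT EXISTS`, running the same statements again after a new directory lands should leave already-registered partitions untouched.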

Add to Powered by Spark page

2015-05-19 Thread Michal Klos
Hi, We would like to be added to the Powered by Spark list:
organization name: Localytics
URL: http://eng.localytics.com/
a list of which Spark components you are using: Spark, Spark Streaming, MLlib
a short description of your use case: Batch, real-time, and predictive analytics driving our

Re: RDD resiliency -- does it keep state?

2015-03-28 Thread Michal Klos
-effects are idempotent, or use some transactional locking. Spark's own output operations, such as saving to Hadoop, use such mechanisms. For instance, in the case of Hadoop it uses the OutputCommitter classes.
- Patrick
On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos michal.klo

RDD resiliency -- does it keep state?

2015-03-27 Thread Michal Klos
Hi Spark group, We haven't been able to find clear descriptions of how Spark handles the resiliency of RDDs in relation to executing actions with side effects. If you do an `rdd.foreach(someSideEffect)`, then you are performing a side effect for each element in the RDD. If a partition goes down --
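One way to reason about the question, sketched in plain Python rather than Spark: assume that when a partition's task is re-run, the side effect is replayed for every element of that partition. Under that assumption an idempotent write (keyed overwrite) ends in the same state, while an append duplicates records. All names here are hypothetical.

```python
def apply_partition(partition, sink):
    """Write each (key, value) record to `sink`; a replay overwrites in place."""
    for key, value in partition:
        sink[key] = value  # idempotent: same key and value on every attempt


def append_partition(partition, log):
    """Non-idempotent variant: a replay appends the same records again."""
    for key, value in partition:
        log.append((key, value))


partition = [("a", 1), ("b", 2)]

sink, log = {}, []
for _attempt in range(2):  # simulate the task running twice after a failure
    apply_partition(partition, sink)
    append_partition(partition, log)
```

After the simulated retry, `sink` holds one entry per key, whereas `log` holds every record twice.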

Spark yarn-client submission example?

2015-03-17 Thread Michal Klos
Hi, We have a Scala application and we want it to programmatically submit Spark jobs to a Spark-YARN cluster in yarn-client mode. We're running into a lot of classpath issues, e.g. once submitted it looks for jars in our parent Scala application's local directory, jars that it shouldn't need.

Re: Define exception handling on lazy elements?

2015-03-11 Thread Michal Klos
PM, Michal Klos michal.klo...@gmail.com wrote:
Hi Spark Community,
We would like to define exception handling behavior on RDD instantiation / build. Since the RDD is lazily evaluated, it seems like we are forced to put all exception handling in the first action call

Re: Define exception handling on lazy elements?

2015-03-11 Thread Michal Klos
an RDD action. So you can refactor that way too.
On Wed, Mar 11, 2015 at 2:39 PM, Michal Klos michal.klo...@gmail.com wrote:
Is there a way to have the exception handling go lazily along with the definition? e.g... we define it on the RDD but then our exception handling code gets triggered

Define exception handling on lazy elements?

2015-03-11 Thread Michal Klos
Hi Spark Community, We would like to define exception handling behavior on RDD instantiation / build. Since the RDD is lazily evaluated, it seems like we are forced to put all exception handling in the first action call? This is an example of something that would be nice: def myRDD = { Try {
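The idea of attaching the handler to the definition can be illustrated outside Spark. This is a plain-Python sketch using a generator as a stand-in for lazy evaluation; it is not Spark API, and `lazy_map`, `attempt`, and `parse` are hypothetical names. Each element's computation is wrapped so that a failure becomes a value, which means the handling is declared up front but only runs when the pipeline is consumed.

```python
def lazy_map(func, items):
    """Lazily apply `func`; nothing runs until the result is iterated."""
    for item in items:
        yield func(item)


def attempt(func):
    """Wrap `func` so failures become ("error", exc) values instead of raising."""
    def wrapped(item):
        try:
            return ("ok", func(item))
        except Exception as exc:  # deliberately broad for this sketch
            return ("error", exc)
    return wrapped


def parse(text):
    return int(text)


# Defining the pipeline runs nothing, so no exception can surface here.
pipeline = lazy_map(attempt(parse), ["1", "x", "3"])

# The handler travels with the definition and fires only at evaluation time.
results = list(pipeline)
```

The same shape would apply to a per-element function passed to a lazy transformation: the wrapper is part of the definition, and the consumer decides what to do with the error values.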

Re: Scalable JDBCRDD

2015-03-02 Thread Michal Klos
, Jörn Franke jornfra...@gmail.com wrote:
What database are you using?
On 28 Feb 2015 at 18:15, Michal Klos michal.klo...@gmail.com wrote:
Hi Spark community,
We have a use case where we need to pull huge amounts of data from a SQL query against a database into Spark. We need to execute

Scalable JDBCRDD

2015-02-28 Thread Michal Klos
Hi Spark community, We have a use case where we need to pull huge amounts of data from a SQL query against a database into Spark. We need to execute the query against our huge database and not a substitute (Spark SQL, Hive, etc.) because of a couple of factors including custom functions used in the
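One common way to parallelize such a pull, sketched here under the assumption that the query can be bounded on a numeric key column (the truncated message does not say whether that holds): split an inclusive key range into contiguous sub-ranges and issue one bounded query per sub-range. The table name `events`, the column `id`, and the helper `split_range` are hypothetical.

```python
def split_range(lower, upper, num_partitions):
    """Split the inclusive range [lower, upper] into contiguous inclusive sub-ranges."""
    length = upper - lower + 1
    bounds = []
    for i in range(num_partitions):
        start = lower + (i * length) // num_partitions
        end = lower + ((i + 1) * length) // num_partitions - 1
        bounds.append((start, end))
    return bounds


# Hypothetical query template and key column, for illustration only.
template = "SELECT * FROM events WHERE id >= {} AND id <= {}"
queries = [template.format(lo, hi) for lo, hi in split_range(1, 100, 4)]
```

If `num_partitions` exceeds the number of keys, some sub-ranges come out empty (start greater than end), so a caller may want to cap the partition count.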