spark persistence doubt

2016-09-28 Thread Shushant Arora
Hi, I have a flow like below:
1. rdd1 = someSource.transform()
2. transformedRdd1 = rdd1.transform(..)
3. transformedRdd2 = rdd1.transform(..)
4. transformedRdd1.action()
Do I need to persist rdd1 to optimise steps 2 and 3, or will it work without persist since there is no lineage breakage?
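A minimal sketch of that flow (source path and transforms are made up): without persist, any action that later touches transformedRdd2 re-computes rdd1 from the source; persisting rdd1 lets both branches reuse one computed copy.

    val rdd1 = sc.textFile("data.txt").map(_.toUpperCase)  // step 1 (hypothetical)
    rdd1.persist()                                         // or .cache()
    val transformedRdd1 = rdd1.map(_.length)               // step 2
    val transformedRdd2 = rdd1.filter(_.nonEmpty)          // step 3
    transformedRdd1.count()  // first action: computes rdd1 and caches it
    transformedRdd2.count()  // second action: reuses the cached rdd1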

Re: MLlib Documentation Update Needed

2016-09-28 Thread Tobi Bosede
OK, I've opened a jira. https://issues.apache.org/jira/browse/SPARK-17718 And ok, I forgot the loss is summed in the objective function provided. My mistake. On a tangentially related topic, why is there a half in front of the squared loss? Similarly, the L2 regularizer has a half. It's just a
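For what it's worth, the thread leaves this one open, but the usual convention is that the 1/2 exists purely so the factor of 2 from differentiating the square cancels, giving a clean gradient; it rescales the objective by a constant and does not change the minimizer:

    L(w) = \frac{1}{2}\big(w^{T}x - y\big)^{2}
    \quad\Longrightarrow\quad
    \nabla_{w} L = \big(w^{T}x - y\big)\,x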

Re: Problems with new experimental Kafka Consumer for 0.10

2016-09-28 Thread Cody Koeninger
Well, I'd start at the first thing suggested by the error, namely that the group has rebalanced. Is that stream using a unique group id? On Wed, Sep 28, 2016 at 5:17 AM, Matthias Niehoff wrote: > Hi, > > the stacktrace: > >
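A hedged sketch of what a unique group id looks like with the 0.10 consumer (broker address and app name are placeholders):

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
      "value.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
      // one group.id per stream, so separate streams never trigger
      // rebalances against each other
      "group.id" -> s"my-app-${java.util.UUID.randomUUID()}"
    )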

Re: spark / mesos and GPU resources

2016-09-28 Thread Timothy Chen
Hi Jackie, That doesn't work because GPU is a first-class resource for Mesos starting with 1.0, and my patch to enable it in Spark is still in PR. I did a demo at the last Spark Summit SF about Spark/Mesos/GPU; you can watch the video to see how it works. Feel free to try out the PR

spark / mesos and GPU resources

2016-09-28 Thread Jackie Tung
Hi, Does Spark support GPU resource reservation (much like it does for CPU and memory) on a Mesos cluster manager? Mesos recently added GPU as a first-class resource type. I optimistically tried the spark.mesos.constraints variable with “gpu:1”, which did not work for me. This is really one

Submit and Monitor standalone cluster application

2016-09-28 Thread Mariano Semelman
Hello everybody, I'm developing an application to submit batch and streaming apps in a fault-tolerant fashion. For that I need a programmatic way to submit and monitor my apps and relaunch them in case of failure. Right now I'm using Spark standalone (1.6.x) and submitting in cluster mode.
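One route worth checking, as a sketch rather than a recommendation from the thread: the SparkLauncher API that ships with Spark 1.6 can submit an app programmatically and hand back a handle whose state you can watch to relaunch on failure. The jar path, main class, and master URL below are placeholders.

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    val handle = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")   // placeholder
      .setMainClass("com.example.MyApp")       // placeholder
      .setMaster("spark://master:7077")        // placeholder
      .setDeployMode("cluster")
      .startApplication()

    handle.addListener(new SparkAppHandle.Listener {
      def stateChanged(h: SparkAppHandle): Unit =
        if (h.getState.isFinal) { /* relaunch logic goes here */ }
      def infoChanged(h: SparkAppHandle): Unit = ()
    })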

Re: Issue with rogue data in csv file used in Spark application

2016-09-28 Thread Mich Talebzadeh
Thanks guys. This seems to be working after declaring all columns as Strings to start with and using the filters below to avoid rogue characters. The second filter ensures that there were trade volumes on that date. val rs = df2.filter($"Open" !== "-").filter($"Volume".cast("Integer") >
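The message is truncated above; a guessed reconstruction of the complete filter (the comparison value is an assumption) would be:

    // needs: import spark.implicits._ for the $"..." syntax
    val rs = df2
      .filter($"Open" !== "-")                 // drop rows with no open price
      .filter($"Volume".cast("Integer") > 0)   // assumed: trades happened that day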

Re: Dataset doesn't have partitioner after a repartition on one of the columns

2016-09-28 Thread Igor Berman
Michael, can you please explain why bucketBy is supported when writing parquet via saveAsTable() but not with parquet()? Is that the only difference between the table API and the dataframe/dataset API, or are there others? org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now; at

Re: Treating NaN fields in Spark

2016-09-28 Thread Marco Mistroni
Hi Dr Mich, how about reading all csv columns as strings and then applying a UDF, sort of like this?

    import scala.util.control.Exception.allCatch

    def getDouble(doubleStr: String): Double =
      allCatch opt doubleStr.toDouble match {
        case Some(doubleNum) => doubleNum
        case _ => Double.NaN
      }
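A usage sketch for the function above (DataFrame and column name are hypothetical): wrap it as a UDF and apply it to a string column, so anything that fails to parse comes out as NaN.

    import org.apache.spark.sql.functions.udf
    import spark.implicits._   // for the $"..." syntax (spark = SparkSession)

    val toDouble = udf(getDouble _)
    val parsed = df.withColumn("OpenNum", toDouble($"Open"))  // "Open" is an example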

Re: Spark ML Decision Trees Algorithm

2016-09-28 Thread janardhan shetty
Is there a reference to the research paper which is implemented in Spark 2.0? On Wed, Sep 28, 2016 at 9:52 AM, janardhan shetty wrote: > Which algorithm is used under the covers when doing decision trees in > Spark? > for example: scikit-learn (python) uses an

Re: Dataset doesn't have partitioner after a repartition on one of the columns

2016-09-28 Thread Michael Armbrust
Hi Darin, In SQL we have finer grained information about partitioning, so we don't use the RDD Partitioner. Here's a notebook that walks

Re: Spark Executor Lost issue

2016-09-28 Thread Sushrut Ikhar
Can you add more details? E.g. are you using RDDs/Datasets/SQL; are you doing group by / joins; is your input splittable? BTW, you can pass the config the same way you are passing memoryOverhead, e.g. --conf spark.default.parallelism=1000, or through the SparkContext in code. Regards, Sushrut Ikhar
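The in-code variant mentioned above, as a minimal sketch (the app name is a placeholder and the value 1000 is illustrative, not a recommendation):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")                        // placeholder
      .set("spark.default.parallelism", "1000")    // illustrative value
    val sc = new SparkContext(conf)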

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-28 Thread Sean Owen
I guess I'm claiming the artifacts wouldn't even be different in the first place, because the Hadoop APIs that are used are all the same across these versions. That would be the thing that makes you need multiple versions of the artifact under multiple classifiers. On Wed, Sep 28, 2016 at 1:16

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-28 Thread Olivier Girardot
ok, don't you think it could be published with just different classifiers: hadoop-2.6, hadoop-2.4, hadoop-2.2 (the current default)? So for now, I should just override Spark 2.0.0's dependencies with the ones defined in the pom profile. On Thu, Sep 22, 2016 11:17 AM, Sean Owen
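A sketch of that override in sbt terms (coordinates are the usual ones, but verify the versions against the pom profile you are copying from):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.0.0",
      "org.apache.spark" %% "spark-sql"  % "2.0.0",
      // pin the Hadoop client to 2.6 instead of what Spark pulls in by default
      "org.apache.hadoop" % "hadoop-client" % "2.6.0"
    )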

Re: New to spark.

2016-09-28 Thread Bryan Cutler
Hi Anirudh, All types of contributions are welcome, from code to documentation. Please check out the page at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark for some info, specifically keep a watch out for starter JIRAs here

Spark Summit CfP Closes Sunday

2016-09-28 Thread Jules Damji
Fellow Sparkers, The Spark Summit East 2017 CfP closes Sunday. If you have an abstract, don’t miss the deadline https://spark-summit.org/east-2017/ Thank you & see you in Boston! cheers Jules -- Simplicity precludes neither profundity nor power. Jules

Spark ML Decision Trees Algorithm

2016-09-28 Thread janardhan shetty
Which algorithm is used under the covers when doing decision trees in Spark? For example, scikit-learn (Python) uses an optimised version of the CART algorithm.

New to spark.

2016-09-28 Thread Anirudh Muhnot
Hello everyone, I'm Anirudh. I'm fairly new to Spark; I've done an online specialisation from UC Berkeley. I know how to code in Python but have little to no idea about Scala. I want to contribute to Spark: where do I start, and how? I'm reading the pull requests on GitHub but I'm barely able

Re: Broadcast big dataset

2016-09-28 Thread Takeshi Yamamuro
Hi, # I dropped dev and added user because this is more suitable for the user mailing list. I think you need to describe more about your environment, e.g. Spark version, executor memory, and so on. // maropu On Wed, Sep 28, 2016 at 11:03 PM, WangJianfei < wangjianfe...@otcaix.iscas.ac.cn> wrote:

Re: Treating NaN fields in Spark

2016-09-28 Thread Peter Figliozzi
In Scala, x.isNaN returns true for Double.NaN, but false for any character. I guess the `isnan` function you are using works by ultimately calling x.isNaN. On Wed, Sep 28, 2016 at 5:56 AM, Mich Talebzadeh wrote: > > This is an issue in most databases. Specifically
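A quick plain-Scala check of that claim (no Spark needed):

    Double.NaN.isNaN     // true
    (0.0 / 0.0).isNaN    // true: NaN arises from undefined floating-point ops
    'a'.toDouble.isNaN   // false: the character just widens to 97.0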

Re: Access S3 buckets in multiple accounts

2016-09-28 Thread Steve Loughran
On 27 Sep 2016, at 15:53, Daniel Siegmann wrote: I am running Spark on Amazon EMR and writing data to an S3 bucket. However, the data is read from an S3 bucket in a separate AWS account. Setting the fs.s3a.access.key and

Re: Spark Executor Lost issue

2016-09-28 Thread Aditya
Hi All, Any updates on this? On Wednesday 28 September 2016 12:22 PM, Sushrut Ikhar wrote: Try increasing the parallelism by repartitioning, and you may also increase spark.default.parallelism. You can also try decreasing num-executor cores. Basically, this happens when the executor

Re: Trying to fetch S3 data

2016-09-28 Thread Steve Loughran
On 28 Sep 2016, at 06:28, Hitesh Goyal wrote: Hi team, I want to fetch data from an Amazon S3 bucket. For this, I am trying to access it using Scala. I have tried the basic wordcount application in Scala. Now I want to retrieve s3

Re: Issue with rogue data in csv file used in Spark application

2016-09-28 Thread Bedrytski Aliaksandr
Hi Mich, if I understood you well, you may cast the value to float; it will yield null if the value is not a correct float:

    Seq(("-", 5), ("1", 6), (",", 7), ("8.6", 7)).toDF("value", "id").createOrReplaceTempView("lines")
    spark.sql("SELECT cast(value as FLOAT) from lines").show()

Treating NaN fields in Spark

2016-09-28 Thread Mich Talebzadeh
This is an issue in most databases, specifically when a field is NaN (NaN, standing for "not a number", is a numeric data type value representing an undefined or unrepresentable value, especially in floating-point calculations). There is a method called isnan() in Spark that is supposed to
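For reference, a minimal sketch of that function in use (assuming a DataFrame df with a numeric column named "price"):

    import org.apache.spark.sql.functions.isnan
    import spark.implicits._   // for the $"..." syntax

    val clean = df.filter(!isnan($"price"))   // keep only rows where price is a number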

Re: Problems with new experimental Kafka Consumer for 0.10

2016-09-28 Thread Matthias Niehoff
Hi, the stacktrace: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured
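The consumer settings that govern that timing can be passed straight through the kafkaParams map; a hedged sketch (values are illustrative, names per the 0.10.0 consumer docs):

    val kafkaParams = Map[String, Object](
      "session.timeout.ms"    -> "30000",  // how long before the coordinator declares
                                           // this consumer dead and rebalances
      "heartbeat.interval.ms" -> "10000",
      "max.poll.records"      -> "500"     // smaller batches => poll() is called
                                           // (and heartbeats sent) more often
    )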

Re: Access S3 buckets in multiple accounts

2016-09-28 Thread Eike von Seggern
Hi Teng, 2016-09-28 10:42 GMT+02:00 Teng Qiu : > hmm, i do not believe security group can control s3 bucket access... is > this something new? or you mean IAM role? > You're right, it's not security groups but you can configure a VPC endpoint for the EMR-Cluster and grant

Re: Access S3 buckets in multiple accounts

2016-09-28 Thread Teng Qiu
hmm, i do not believe security group can control s3 bucket access... is this something new? or you mean IAM role? @Daniel, using spark on EMR, you should be able to use IAM role to access AWS resources, you do not need to specify fs.s3a.access.key or fs.s3a.secret.key at all. S3A is able to use
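A sketch under that assumption (bucket name is a placeholder): with an instance profile that grants cross-account access, read via s3a:// without setting any key properties; S3A falls back to the instance-profile credentials.

    // no fs.s3a.access.key / fs.s3a.secret.key configured anywhere
    val lines = sc.textFile("s3a://bucket-in-other-account/data/")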

Re: Access S3 buckets in multiple accounts

2016-09-28 Thread Eike von Seggern
Hi Daniel, you can start your EMR Cluster in a dedicated security group and configure the foreign bucket's policy to allow read-write access from that SG. Best Eike 2016-09-27 16:53 GMT+02:00 Daniel Siegmann : > I am running Spark on Amazon EMR and writing data

Re: Issue with rogue data in csv file used in Spark application

2016-09-28 Thread Mich Talebzadeh
Thanks all. This is the csv schema, with all columns mapped to String:

    scala> df2.printSchema
    root
     |-- Stock: string (nullable = true)
     |-- Ticker: string (nullable = true)
     |-- TradeDate: string (nullable = true)
     |-- Open: string (nullable = true)
     |-- High: string (nullable = true)
     |-- Low: string

Re: Spark Executor Lost issue

2016-09-28 Thread Aditya
Thanks Sushrut for the reply. Currently I have not defined the spark.default.parallelism property. Can you let me know what I should set it to? Regards, Aditya Calangutkar On Wednesday 28 September 2016 12:22 PM, Sushrut Ikhar wrote: Try increasing the parallelism by repartitioning

Spark Executor Lost issue

2016-09-28 Thread Aditya
I have a Spark job which runs fine for small data, but when the data grows it gives an executor lost error. My executor and driver memory are set at their highest point. I have also tried increasing --conf spark.yarn.executor.memoryOverhead=600 but still can't fix the problem. Is there any other