weightCol doesn't seem to be handled properly in PySpark

2016-09-07 Thread evanzamir
When I am trying to use LinearRegression, it seems that unless there is a column specified with weights, it will raise a py4j error. Seems odd because supposedly the default is weightCol=None, but when I specifically pass in weightCol=None to LinearRegression, I get this error.

Re: How to write data into CouchBase using Spark & Scala?

2016-09-07 Thread Devi P.V
Thanks. Now it is working. On Thu, Sep 8, 2016 at 12:57 AM, aka.fe2s wrote: > Most likely you are missing an import statement that enables some Scala > implicits. I haven't used this connector, but looks like you need "import > com.couchbase.spark._" > > -- > Oleksiy Dyagilev

Forecasting algorithms in spark ML

2016-09-07 Thread Madabhattula Rajesh Kumar
Hi, please let me know the supported forecasting algorithms in Spark ML. Regards, Rajesh

Re: MLib : Non Linear Optimization

2016-09-07 Thread nsareen
Any answer to this question from the group? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLib-Non-Linear-Optimization-tp27645p27676.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: distribute work (files)

2016-09-07 Thread Peter Figliozzi
It works! Hmm, smells like some kind of Linux permissions issue. Checking this, the owner & group are the same all around, and there is global read permission as well. So I have no clue why it would not work with an sshfs-mounted volume. Back to the OP's question... use Spark's CSV data source
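For reference, a minimal sketch of that suggestion, assuming a Spark 2.0 SparkSession named spark and a directory that exists with identical contents at the same local path on the driver and on every worker; the path and options are illustrative, not from this thread:

```scala
// Read CSV files from a local directory that is present on every node.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("file:///data/datashare")   // hypothetical shared local path

df.show()
```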

Re: distribute work (files)

2016-09-07 Thread ayan guha
So, can you try to simulate the same without sshfs? I.e., create a folder /tmp/datashare, copy your files to it on all the machines, and point sc.textFile to that folder? On Thu, Sep 8, 2016 at 11:26 AM, Peter Figliozzi wrote: > All (three) of them. It's kind of cool--

Re: No SparkR on Mesos?

2016-09-07 Thread Rodrick Brown
We've been using SparkR on Mesos for quite some time with no issues. [fedora@prod-rstudio-1 ~]$ /opt/spark-1.6.1/bin/sparkR R version 3.3.0 (2016-05-03) -- "Supposedly Educational" Copyright (C) 2016 The R Foundation for Statistical Computing Platform: x86_64-redhat-linux-gnu (64-bit) R is free

Re: distribute work (files)

2016-09-07 Thread Peter Figliozzi
All (three) of them. It's kind of cool-- when I re-run collect() a different executor will show up as first to encounter the error. On Wed, Sep 7, 2016 at 8:20 PM, ayan guha wrote: > Hi > > Is it happening on all executors or one? > > On Thu, Sep 8, 2016 at 10:46 AM, Peter

Re: distribute work (files)

2016-09-07 Thread ayan guha
Hi Is it happening on all executors or one? On Thu, Sep 8, 2016 at 10:46 AM, Peter Figliozzi wrote: > > Yes indeed (see below). Just to reiterate, I am not running Hadoop. The > "curly" node name mentioned in the stacktrace is the name of one of the > worker nodes.

Fwd: distribute work (files)

2016-09-07 Thread Peter Figliozzi
Yes indeed (see below). Just to reiterate, I am not running Hadoop. The "curly" node name mentioned in the stacktrace is the name of one of the worker nodes. I've mounted the same directory "datashare" with two text files to all worker nodes with sshfs. The Spark documentation suggests that

year out of range

2016-09-07 Thread Daniel Lopes
Hi, I'm importing a few CSVs with the spark-csv package. When I run a select on each one individually, it looks OK, but when I join them with sqlContext.sql I get the error below. All the tables have timestamp fields, but the joins are not on those date columns. Py4JJavaError: An error occurred while calling o643.showString. :

collect_set without nulls (1.6 vs 2.0)

2016-09-07 Thread Lee Becker
Hello everyone, Consider this toy example: case class Foo(x: String, y: String) val df = sparkSession.createDataFrame(Array(Foo(null), Foo("a"), Foo("b")) df.select(collect_set($"x")).show In Spark 2.0.0 I get the following results: +--+ |collect_set(x)| +--+ | [null,
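A minimal sketch of one way to get the 1.6-style result (no null in the collected set) on Spark 2.0, by dropping nulls before aggregating; it assumes a SparkSession named spark (sparkSession in the original) and the single-column case class is illustrative:

```scala
import org.apache.spark.sql.functions.collect_set
import spark.implicits._

case class Foo(x: String)
val df = Seq(Foo(null), Foo("a"), Foo("b")).toDF()

// Filter the nulls out before collect_set.
df.filter($"x".isNotNull)
  .select(collect_set($"x"))
  .show()   // expected: [a, b] (element order is not guaranteed)
```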

Re: dstream.foreachRDD iteration

2016-09-07 Thread Mich Talebzadeh
I am sure a few Spark gurus can explain this much better than me, so here we go. A DStream is an abstraction that breaks a continuous stream of data into small chunks. This is called "micro-batching", and Spark Streaming is all about micro-batching. You have the batch interval, window length and
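To illustrate the pattern being described: each batch interval produces exactly one RDD, and foreachRDD hands you that RDD so you can work with its individual records. A minimal sketch, assuming an existing SparkContext sc; the socket source and 2-second batch interval are illustrative:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))        // batch interval = 2 seconds
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical source

lines.foreachRDD { rdd =>
  // one RDD per batch interval; no extra looping is needed
  rdd.foreach(line => println(line))
}

ssc.start()
ssc.awaitTermination()
```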

Re: No SparkR on Mesos?

2016-09-07 Thread Timothy Chen
Python should be supported; I tested it, and the patches should already be merged in 1.6.2. Tim > On Sep 8, 2016, at 1:20 AM, Michael Gummelt wrote: > > Quite possibly. I've never used it. I know Python was "unsupported" for a > while, which turned out to mean there was a

Re: Spark 2.0 with Kafka 0.10 exception

2016-09-07 Thread Cody Koeninger
It's a really noticeable overhead; without the cache you're basically pulling every message twice due to prefetching. On Wed, Sep 7, 2016 at 3:23 PM, Srikanth wrote: > Yea, disabling cache was not going to be my permanent solution either. > I was going to ask how big an

Re: Spark 2.0 with Kafka 0.10 exception

2016-09-07 Thread Srikanth
Yea, disabling the cache was not going to be my permanent solution either. I was going to ask how big an overhead that is. It happens intermittently, and each time it happens the retry is successful. Srikanth On Wed, Sep 7, 2016 at 3:55 PM, Cody Koeninger wrote: > That's not what I

Re: Spark 2.0 with Kafka 0.10 exception

2016-09-07 Thread Cody Koeninger
That's not what I would have expected to happen with a lower cache setting, but in general disabling the cache isn't something you want to do with the new kafka consumer. As far as the original issue, are you seeing those polling errors intermittently, or consistently? From your description, it

Re: Spark 2.0 with Kafka 0.10 exception

2016-09-07 Thread Srikanth
Setting those two results in the exception below. The no. of executors is less than the no. of partitions; could that be triggering this? 16/09/07 15:33:13 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 9) java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access at

Re: How to write data into CouchBase using Spark & Scala?

2016-09-07 Thread aka.fe2s
Most likely you are missing an import statement that enables some Scala implicits. I haven't used this connector, but it looks like you need "import com.couchbase.spark._" -- Oleksiy Dyagilev On Wed, Sep 7, 2016 at 9:42 AM, Devi P.V wrote: > I am newbie in CouchBase.I am
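A minimal sketch of what the fix might look like, combining the setup code from the original question (further down in this digest) with the suggested import. The JsonDocument usage and the saveToCouchbase() call are assumptions based on the Couchbase Spark connector's typical API, not something confirmed in this thread:

```scala
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.document.json.JsonObject
import com.couchbase.spark._
import org.apache.spark.{SparkConf, SparkContext}

val cfg = new SparkConf()
  .setAppName("couchbaseQuickstart")
  .setMaster("local[*]")
  .set("com.couchbase.bucket.MyBucket", "pwd")
val sc = new SparkContext(cfg)

val doc1 = JsonDocument.create("doc-1", JsonObject.create().put("name", "spark"))
// saveToCouchbase() becomes available on the RDD via the com.couchbase.spark._ implicits
sc.parallelize(Seq(doc1)).saveToCouchbase()
```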

Re: Spark 2.0 with Kafka 0.10 exception

2016-09-07 Thread Cody Koeninger
You could try setting spark.streaming.kafka.consumer.cache.initialCapacity and spark.streaming.kafka.consumer.cache.maxCapacity to 1. On Wed, Sep 7, 2016 at 2:02 PM, Srikanth wrote: > I had a look at the executor logs and noticed that this exception happens > only when using
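A minimal sketch of that suggestion; the two keys are the ones named above and could equally go in spark-defaults.conf or on the spark-submit command line, while the app name is illustrative:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-010-stream")
  // shrink the cached-consumer pool to a single entry
  .set("spark.streaming.kafka.consumer.cache.initialCapacity", "1")
  .set("spark.streaming.kafka.consumer.cache.maxCapacity", "1")
```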

Re: LabeledPoint creation

2016-09-07 Thread aka.fe2s
It has 4 categories: a = [1 0 0], b = [0 0 0], c = [0 1 0], d = [0 0 1]. -- Oleksiy Dyagilev On Wed, Sep 7, 2016 at 10:42 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > Any help on above mail use case ? > > Regards, > Rajesh > > On Tue, Sep 6, 2016 at 5:40 PM, Madabhattula Rajesh
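A minimal sketch of that encoding with Spark ML's StringIndexer plus OneHotEncoder (dropLast is true by default, so one category maps to the all-zero vector). It assumes a SparkSession named spark; which category ends up all-zero depends on StringIndexer's frequency ordering, so the exact mapping may differ from the listing above:

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = spark.createDataFrame(Seq(
  (0, "a"), (1, "b"), (2, "c"), (3, "d")
)).toDF("id", "category")

// Map category strings to numeric indices.
val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
  .transform(df)

// One-hot encode the indices into sparse vectors.
val encoded = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .transform(indexed)

encoded.select("category", "categoryVec").show()
```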

Re: Spark 2.0 with Kafka 0.10 exception

2016-09-07 Thread Srikanth
I had a look at the executor logs and noticed that this exception happens only when using the cached consumer. Every retry is successful. This is consistent. One possibility is that the cached consumer is causing the failure as retry clears it. Is there a way to disable cache and test this? Again,

Re: Split RDD by key and save to different files

2016-09-07 Thread Dhaval Patel
In order to do that, first you need to key the RDD by key, and then use saveAsHadoopFile like this: saveAsHadoopFile(location, classOf[KeyClass], classOf[ValueClass], classOf[PartitionOutputFormat]), where PartitionOutputFormat extends MultipleTextOutputFormat. Sample for
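A minimal sketch of that pattern (class, path, and file names are illustrative, not from this thread): key the RDD by customerId, then use a MultipleTextOutputFormat subclass that routes each record to a file named after its key:

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class KeyBasedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // name the output file after the key
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString
  // drop the key from the written record
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

val keyed = sc.textFile("customers.csv")                 // hypothetical input
  .map(line => (line.split(",")(0).trim, line))          // key by customerId

keyed.saveAsHadoopFile(
  "/output/by-customer",                                  // hypothetical location
  classOf[String],
  classOf[String],
  classOf[KeyBasedOutputFormat])
```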

Error while storing datetime read from MySQL back to MySQL

2016-09-07 Thread Dhaval Patel
I am facing an error while trying to save a DataFrame containing a datetime field into a MySQL table. What I am doing is: 1. Read data from a MySQL table which has fields of type datetime. 2. Process the DataFrame. 3. Store/save the DataFrame back into another MySQL table. While creating the table, spark
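One common workaround, offered here only as an assumption since the message is truncated: create the target MySQL table yourself with proper DATETIME columns, then append to it rather than letting Spark create the table. A minimal sketch, assuming a SparkSession named spark; connection details are illustrative:

```scala
val props = new java.util.Properties()
props.setProperty("user", "dbuser")        // hypothetical credentials
props.setProperty("password", "dbpass")

val url = "jdbc:mysql://dbhost:3306/mydb"  // hypothetical database
val df = spark.read.jdbc(url, "source_table", props)

// ... process the DataFrame ...

// target_table already exists with DATETIME columns, so append instead of create
df.write.mode("append").jdbc(url, "target_table", props)
```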

Re: No SparkR on Mesos?

2016-09-07 Thread Felix Cheung
This is correct - SparkR is not quite working completely on Mesos. JIRAs and contributions welcome! On Wed, Sep 7, 2016 at 10:21 AM -0700, "Michael Gummelt" > wrote: Quite possibly. I've never used it. I know Python was "unsupported"

Split RDD by key and save to different files

2016-09-07 Thread Vikash Kumar
I need to split an RDD[(key, Iterable[Value])] to save each key into a different file. E.g. I have records like: customerId, name, age, sex 111,abc,34,M 122,xyz,32,F 111,def,31,F 122,trp,30,F 133,jkl,35,M I need to write 3 different files based on customerId. file1: 111,abc,34,M 111,def,31,F file2:

Re: Spark Java Heap Error

2016-09-07 Thread neil90
If you're in local mode, just allocate all the memory you want to use to your driver (which acts as the executor in local mode); don't even bother changing the executor memory. So your new settings should look like this... spark.driver.memory 16g spark.driver.maxResultSize 2g

Re: No SparkR on Mesos?

2016-09-07 Thread Michael Gummelt
Quite possibly. I've never used it. I know Python was "unsupported" for a while, which turned out to mean there was a silly conditional that would fail the submission, even though all the support was there. Could be the same for R. Can you submit a JIRA? On Wed, Sep 7, 2016 at 5:02 AM, Peter

Re: Spark Metrics: custom source/sink configurations not getting recognized

2016-09-07 Thread map reduced
Thanks for the reply, I wish it did. We have an internal metrics system that we need to submit to. I am sure that the ways I've tried work with YARN deployment, but not with standalone. Thanks, KP On Tue, Sep 6, 2016 at 11:36 PM, Benjamin Kim wrote: > We use

Re: Mesos coarse-grained problem with spark.shuffle.service.enabled

2016-09-07 Thread Michael Gummelt
The shuffle service is run out of band from any specific Spark job, and you only run one on any given node. You need to get the Spark distribution on each node somehow, then run the shuffle service out of that distribution. The most common way I see people doing this is via Marathon (using the

Re: dstream.foreachRDD iteration

2016-09-07 Thread Ashok Kumar
I have checked that doc, sir. My understanding is that every batch interval of data always generates one RDD, so why do we need to use foreachRDD when there is only one? Sorry for this question, but it is a bit confusing to me. Thanks On Wednesday, 7 September 2016, 18:05, Mich Talebzadeh

Re: dstream.foreachRDD iteration

2016-09-07 Thread Mich Talebzadeh
Hi, What is so confusing about RDDs? Have you checked this doc? http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Reset auto.offset.reset in Kafka 0.10 integ

2016-09-07 Thread Cody Koeninger
Kafka as yet doesn't have a good way to distinguish between "it's ok to reset at the beginning of a job" and "it's ok to reset any time offsets are out of range" https://issues.apache.org/jira/browse/KAFKA-3370 Because of that, it's better IMHO to err on the side of obvious failures, as opposed
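For context, a minimal sketch of how auto.offset.reset is usually passed to the Kafka 0.10 direct stream; it assumes an existing StreamingContext ssc, and broker, group, and topic names are illustrative. The discussion above is about what should happen when the committed offsets for that group have already aged out of retention:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "my-streaming-group",
  "auto.offset.reset"  -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))
```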

Re: Reset auto.offset.reset in Kafka 0.10 integ

2016-09-07 Thread Srikanth
Yes that's right. I understand this is a data loss. The restart doesn't have to be all that silent. It requires us to set a flag. I thought auto.offset.reset is that flag. But there isn't much I can do at this point given that retention has cleaned things up. The app has to start. Let admins

Re: spark 1.6.0 web console shows running application in a "waiting" status, but it's acutally running

2016-09-07 Thread arlindo santos
Assuming you configured Spark to use ZooKeeper for HA, when the master fails over to another node, the workers will automatically attach themselves to the newly elected master, and this works fine. My issue is that once I go over to the new master's web GUI (I see all the workers attached just

Re: Reset auto.offset.reset in Kafka 0.10 integ

2016-09-07 Thread Cody Koeninger
Just so I'm clear on what's happening... - you're running a job that auto-commits offsets to kafka. - you stop that job for longer than your retention - you start that job back up, and it errors because the last committed offset is no longer available - you think that instead of erroring, the job

Re: Reset auto.offset.reset in Kafka 0.10 integ

2016-09-07 Thread Srikanth
My retention is 1d, which isn't terribly low. The problem is that every time I restart after retention expiry, I get this exception instead of it honoring auto.offset.reset. It isn't a corner case where retention expired after the driver created a batch. It's easily reproducible and consistent. On Tue, Sep 6,

Re: spark 1.6.0 web console shows running application in a "waiting" status, but it's acutally running

2016-09-07 Thread Mich Talebzadeh
I just tested it. If you start the master on the original host, the workers on that host won't respond. They will stay stale. So there is no heartbeat between workers and master except the initial handshake. The only way is to stop the workers (if they are still running), restart the master and

Re: spark 1.6.0 web console shows running application in a "waiting" status, but it's acutally running

2016-09-07 Thread arlindo santos
Port 7077 is for "client" mode connections to the master. In "cluster" mode it's 6066, and this means the "driver" runs on the Spark cluster on a node Spark chooses. The command I use to deploy my Spark app (including the driver) is below: spark-submit --deploy-mode cluster --master

Re: distribute work (files)

2016-09-07 Thread Yong Zhang
What error do you get? FileNotFoundException? Please paste the stacktrace here. Yong From: Peter Figliozzi Sent: Wednesday, September 7, 2016 10:18 AM To: ayan guha Cc: Lydia Ickler; user.spark Subject: Re: distribute work (files)

Managing Dataset API Partitions - Spark 2.0

2016-09-07 Thread ANDREA SPINA
Hi everyone, I'd like to test some algorithms with the Dataset API offered by Spark 2.0.0. So I was wondering, which is the best way to manage Dataset partitions? E.g. in the data reading phase, what I usually do is the following: // RDD // if I want to set a custom minimum number of partitions
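A minimal sketch of the knobs usually used for this with the Dataset API (an assumption about what the truncated question is after): there is no minPartitions argument on the reader, so partitioning is typically controlled with spark.sql.shuffle.partitions plus explicit repartition/coalesce calls. Paths and numbers are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dataset-partitions")
  .config("spark.sql.shuffle.partitions", "64")   // partitions used after shuffles
  .getOrCreate()

val ds = spark.read.textFile("hdfs:///data/input")
println(ds.rdd.getNumPartitions)                  // partitions chosen by the reader

val wide = ds.repartition(64)    // full shuffle into 64 partitions
val narrow = wide.coalesce(16)   // narrow dependency down to 16 partitions
```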

Re: spark 1.6.0 web console shows running application in a "waiting" status, but it's acutally running

2016-09-07 Thread Mich Talebzadeh
This is my take. When you issue spark-submit on any node, it starts the GUI on port 4040 by default. Otherwise you can specify the port yourself with --conf "spark.ui.port=" As I understand it, in standalone mode executors run on workers. $SPARK_HOME/sbin/start-slave.sh spark://::7077 That port 7077 is the

Re: spark 1.6.0 web console shows running application in a "waiting" status, but it's acutally running

2016-09-07 Thread arlindo santos
Yes, refreshed a few times. Running in cluster mode. FYI, I can duplicate this easily now. Our setup consists of 3 nodes running standalone Spark, a master and worker on each, with ZooKeeper doing master leader election. If I kill a master on any node, the master shifts to another node and that is

Re: distribute work (files)

2016-09-07 Thread Peter Figliozzi
That's failing for me. Can someone please try this-- is this even supposed to work: - create a directory somewhere and add two text files to it - mount that directory on the Spark worker machines with sshfs - read the text files into one data structure using a file URL with a

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-07 Thread Peter Figliozzi
Here's a decent GitHub book: Mastering Apache Spark. I'm new at Scala too. I found it very helpful to study the Scala language without Spark. The documentation found here is

Failed to open native connection to Cassandra at

2016-09-07 Thread muhammet pakyürek
How do I solve the problem below? py4j.protocol.Py4JJavaError: An error occurred while calling o33.load. : java.io.IOException: Failed to open native connection to Cassandra at {127.0.1.1}:9042

Re: spark 1.6.0 web console shows running application in a "waiting" status, but it's acutally running

2016-09-07 Thread Mich Talebzadeh
Have you refreshed the Spark UI page? What mode are you running your Spark app in? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: I noticed LinearRegression sometimes produces negative R^2 values

2016-09-07 Thread Evan Zamir
Yes, it's on a hold out segment from the data set being fitted. On Wed, Sep 7, 2016 at 1:02 AM Sean Owen wrote: > Yes, should be. > It's also not necessarily nonnegative if you evaluate R^2 on a > different data set than you fit it to. Is that the case? > > On Tue, Sep 6,

How to find the partitioner for a Dataset

2016-09-07 Thread Darin McBeath
I have a Dataset (om) which I created and repartitioned (and cached) using one of the fields (docId). Reading the Spark documentation, I would assume the om Dataset should be hash partitioned. But, how can I verify this? When I do om.rdd.partitioner I get Option[org.apache.spark.Partitioner]

No SparkR on Mesos?

2016-09-07 Thread Peter Griessl
Hello, does SparkR really not work (yet?) on Mesos (Spark 2.0 on Mesos 1.0)? $ /opt/spark/bin/sparkR R version 3.3.1 (2016-06-21) -- "Bug in Your Hair" Copyright (C) 2016 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit) Launching java with spark-submit command

dstream.foreachRDD iteration

2016-09-07 Thread Ashok Kumar
Hi, this is a bit confusing to me. How many layers are involved in DStream.foreachRDD? Do I need to loop over it more than once? I mean DStream.foreachRDD { rdd => } I am trying to get the individual lines in the RDD. Thanks

Re: call() function being called 3 times

2016-09-07 Thread Kevin Tran
It turns out that call() function runs in different stages ... 2016-09-07 20:37:21,086 [Executor task launch worker-0] INFO org.apache.spark.executor.Executor - Running task 0.0 in stage 11.0 (TID 11) 2016-09-07 20:37:21,087 [Executor task launch worker-0] DEBUG

Re[10]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-07 Thread Сергей Романов
Thank you, Yong, it looks great. I added the following lines to spark-defaults.conf and now my original SQL query runs much faster. spark.executor.extraJavaOptions -XX:-DontCompileHugeMethods spark.driver.extraJavaOptions -XX:-DontCompileHugeMethods Can you recommend these configuration

Mesos coarse-grained problem with spark.shuffle.service.enabled

2016-09-07 Thread Tamas Szuromi
Hello, For a while now, we've been using Spark on Mesos with fine-grained mode in production. Since Spark 2.0 the fine-grained mode is deprecated, so we'd like to shift to dynamic allocation. When I tried to set up dynamic allocation I ran into the following problem: So I set spark.shuffle.service.enabled =
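A minimal sketch of the configuration being described, offered as an assumption since the message is truncated: dynamic allocation requires the external shuffle service, which must already be running on every node (see the reply earlier in this digest about launching it via Marathon). The app name is illustrative:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("mesos-dynamic-allocation")
  // both settings are needed together; the shuffle service itself runs outside the job
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")
```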

Re: I noticed LinearRegression sometimes produces negative R^2 values

2016-09-07 Thread Sean Owen
Yes, should be. It's also not necessarily nonnegative if you evaluate R^2 on a different data set than you fit it to. Is that the case? On Tue, Sep 6, 2016 at 11:15 PM, Evan Zamir wrote: > I am using the default setting for setting fitIntercept, which *should* be > TRUE
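For reference, the usual definition makes it clear why this can happen on a held-out set:

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

If the model's squared error on the held-out data exceeds the variance of that data around its own mean, the fraction exceeds 1 and R^2 comes out negative.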

Re: LabeledPoint creation

2016-09-07 Thread Madabhattula Rajesh Kumar
Hi, Any help on the above use case? Regards, Rajesh On Tue, Sep 6, 2016 at 5:40 PM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I am new to Spark ML, trying to create a LabeledPoint from a categorical > dataset (example code from spark). For this, I am using One-hot

Re: Getting figures from spark streaming

2016-09-07 Thread Ashok Kumar
Any help on this warmly appreciated. On Tuesday, 6 September 2016, 21:31, Ashok Kumar wrote: Hello Gurus, I am creating some figures and feeding them into Kafka and then Spark Streaming. It works OK, but I have the following issue. For now, as a test, I sent 5

call() function being called 3 times

2016-09-07 Thread Kevin Tran
Hi Everyone, Does anyone know why the call() function is being called *3 times* for each message that arrives? JavaDStream<String> message = messagesDStream.map(new Function<Tuple2<String, String>, String>() { @Override public String call(Tuple2<String, String> tuple2) { return tuple2._2(); } });

SparkStreaming is not working with SparkLauncher

2016-09-07 Thread aditya barve
Hello Team, I am new to Spark. I tried to create a sample application and submitted it to Spark. D:\spark-1.6.2-bin-hadoop2.6\bin\spark-submit --class "main.java.SimpleApp" --master local[4] target/simple-project-1.0-jar-with-dependencies.jar server1 1234 It worked fine. Here server1 is my netcat

How to write data into CouchBase using Spark & Scala?

2016-09-07 Thread Devi P.V
I am a newbie in CouchBase. I am trying to write data into CouchBase. My sample code is the following: val cfg = new SparkConf() .setAppName("couchbaseQuickstart") .setMaster("local[*]") .set("com.couchbase.bucket.MyBucket","pwd") val sc = new SparkContext(cfg) val doc1 =

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-07 Thread Yan Facai
Hi Peter, I'm familiar with Pandas / NumPy in Python, while Spark / Scala is totally new to me. Pandas provides detailed documentation, e.g. how to slice data, parse files, and use the apply and filter functions. Does Spark have similarly detailed documentation? On Tue, Sep 6, 2016 at 9:58 PM, Peter Figliozzi

Re: Spark Metrics: custom source/sink configurations not getting recognized

2016-09-07 Thread Benjamin Kim
We use Graphite/Grafana for custom metrics. We found Spark’s metrics not to be customizable. So, we write directly using Graphite’s API, which was very easy to do using Java’s socket library in Scala. It works great for us, and we are going one step further using Sensu to alert us if there is
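A minimal sketch of the approach described: write a metric straight to Graphite's plaintext protocol over a plain socket. Host, port, and metric path are illustrative:

```scala
import java.io.PrintWriter
import java.net.Socket

def sendToGraphite(host: String, port: Int, path: String, value: Double): Unit = {
  val socket = new Socket(host, port)
  try {
    val out = new PrintWriter(socket.getOutputStream, true)
    // Graphite plaintext format: "<metric.path> <value> <epoch-seconds>"
    out.println(s"$path $value ${System.currentTimeMillis / 1000}")
    out.flush()
  } finally {
    socket.close()
  }
}

sendToGraphite("graphite.internal", 2003, "myapp.batch.records", 1234.0)
```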