Newbie Q: Issue related to connecting Spark Master Standalone through Scala app

2016-09-26 Thread Reth RM
Hi, I have an issue connecting to the Spark master, receiving a RuntimeException: java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage. I followed the steps mentioned below. Can you please point me to where I am going wrong? 1. Downloaded spark (version spark-2.0.0-bin-hadoop2.7) 2.
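An InvalidClassException on org.apache.spark.rpc.netty.RequestMessage usually points to a Spark version mismatch between the application's classpath and the standalone master. A minimal sketch, assuming the driver depends on the same spark-2.0.0 artifacts that the master runs (the master host name is hypothetical):

    import org.apache.spark.sql.SparkSession

    object ConnectTest {
      def main(args: Array[String]): Unit = {
        // The spark-core/spark-sql version on the classpath must match the 2.0.0 master,
        // otherwise the netty RPC messages deserialize with mismatched serialVersionUIDs.
        val spark = SparkSession.builder()
          .appName("standalone-connect-test")
          .master("spark://master-host:7077") // hypothetical master URL
          .getOrCreate()
        println(spark.sparkContext.parallelize(1 to 100).sum())
        spark.stop()
      }
    }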

Large-scale matrix inverse in Spark

2016-09-26 Thread Cooper
How is the problem of large-scale matrix inversion approached in Apache Spark? This linear algebra operation is obviously the basis of a lot of other algorithms (regression, classification, etc.). However, I have not been able to find a Spark API for a parallel implementation of matrix
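Spark does not ship a distributed matrix inverse. One common workaround, not from this thread and only practical when the number of columns is modest, is an SVD-based pseudo-inverse on a RowMatrix (sc is assumed to be an existing SparkContext):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Hypothetical input: the matrix A as an RDD of rows.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0),
      Vectors.dense(3.0, 4.0),
      Vectors.dense(5.0, 6.0)))
    val mat = new RowMatrix(rows)

    // Full SVD: A = U * diag(s) * V^T
    val svd = mat.computeSVD(mat.numCols().toInt, computeU = true)
    val u = svd.U // distributed RowMatrix of left singular vectors
    val s = svd.s // local vector of singular values
    val v = svd.V // local matrix of right singular vectors
    // The Moore-Penrose pseudo-inverse is A+ = V * diag(1/s) * U^T; forming it locally
    // is only feasible when the resulting (cols x rows) matrix fits in driver memory.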

Re: Access Amazon s3 data

2016-09-26 Thread Jagadeesan A.S.
Hi Hitesh, the couple of links below will help you start a spark application with amazon s3. https://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_s3.html https://www.supergloo.com/fieldnotes/apache-spark-amazon-s3-examples-of-text-files/ Cheers Jagadeesan A S On Tue, Sep 27,
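In the same spirit as those links, a minimal sketch (not from the thread) of reading S3 data and querying it with Spark SQL; the s3a scheme, the bucket/path, and the presence of the hadoop-aws and AWS SDK jars on the classpath are assumptions, and the original question asks for Java, where the same calls exist on the Java API:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("s3-sql-example").getOrCreate()

    // Credentials pulled from environment variables here; an IAM role or core-site.xml works too.
    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    hc.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    val df = spark.read.json("s3a://my-bucket/events/") // hypothetical bucket and path
    df.createOrReplaceTempView("events")
    spark.sql("SELECT count(*) FROM events").show()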

Access Amazon s3 data

2016-09-26 Thread Hitesh Goyal
Hi team, I have data in amazon s3. I want to access the data using an apache spark application. I am new to it. Please tell me how I can build an application in java so that I can run spark sql queries on the s3 data. -Hitesh Goyal

Re: median of groups

2016-09-26 Thread ayan guha
I have used the percentile_approx (with 0.5) function from hive, using sqlContext sql commands. On Tue, Sep 27, 2016 at 10:52 AM, Peter Figliozzi wrote: > I'm trying to figure out a nice way to get the median of a DataFrame > column *once it is grouped. * > > It's easy
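A minimal sketch of that approach, assuming df is the DataFrame from Peter's question and a Hive-enabled sqlContext:

    // Register the DataFrame and use Hive's percentile_approx with 0.5 for the median.
    df.registerTempTable("t") // createOrReplaceTempView("t") on Spark 2.0
    sqlContext.sql(
      """SELECT foo, bar, percentile_approx(column1, 0.5) AS median
         FROM t
         GROUP BY foo, bar""").show()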

median of groups

2016-09-26 Thread Peter Figliozzi
I'm trying to figure out a nice way to get the median of a DataFrame column *once it is grouped. * It's easy enough now to get the min, max, mean, and other things that are part of spark.sql.functions: df.groupBy("foo", "bar").agg(mean($"column1")) And it's easy enough to get the median of a

Re: Tutorial error - zeppelin 0.6.2 built with spark 2.0 and mapr

2016-09-26 Thread Nirav Patel
FYI, it works when I use the MapR-configured Spark 2.0, i.e. export SPARK_HOME=/opt/mapr/spark/spark-2.0.0-bin-without-hadoop Thanks Nirav On Mon, Sep 26, 2016 at 3:45 PM, Nirav Patel wrote: > Hi, > > I built the zeppelin 0.6 branch with spark 2.0 using the following mvn : > >

Tutorial error - zeppelin 0.6.2 built with spark 2.0 and mapr

2016-09-26 Thread Nirav Patel
Hi, I built the zeppelin 0.6 branch with spark 2.0 using the following mvn command: mvn clean package -Pbuild-distr -Pmapr41 -Pyarn -Pspark-2.0 -Pscala-2.11 -DskipTests The build was successful. I only have the following set in zeppelin-conf.sh: export HADOOP_HOME=/opt/mapr/hadoop/hadoop-2.5.1/ export

Re: Slow Shuffle Operation on Empty Batch

2016-09-26 Thread Cody Koeninger
Do you have a minimal example of how to reproduce the problem, that doesn't depend on Cassandra? On Mon, Sep 26, 2016 at 4:10 PM, Erwan ALLAIN wrote: > Hi > > I'm working with > - Kafka 0.8.2 > - Spark Streaming (2.0) direct input stream. > - cassandra 3.0 > > My batch

Re: spark-submit failing but job running from scala ide

2016-09-26 Thread Marco Mistroni
Hi Vr, your code works fine for me, running on Windows 10 against Spark 1.6.1. I'm guessing your Spark installation could be busted? That would explain why it works in your IDE, as you are just importing jars into your project. The java.io.IOException: Failed to connect to error is misleading; I have

Slow Shuffle Operation on Empty Batch

2016-09-26 Thread Erwan ALLAIN
Hi, I'm working with - Kafka 0.8.2 - Spark Streaming (2.0) direct input stream - cassandra 3.0. My batch interval is 1s. When I use map, filter, or even saveToCassandra functions, the processing time is around 50ms on empty batches => this is fine. As soon as I use some reduceByKey, the

Re: using SparkILoop.run

2016-09-26 Thread Vadim Semenov
Add "-Dspark.master=local[*]" to the VM properties of your test run. On Mon, Sep 26, 2016 at 2:25 PM, Mohit Jaggi wrote: > I want to use the following API SparkILoop.run(...). I am writing a test > case as that passes some scala code to spark interpreter and receives >

Re: Is Spark 2.0 master node compatible with Spark 1.5 worker node?

2016-09-26 Thread Koert Kuipers
Oh, I forgot: in step 1 you will have to modify spark's pom.xml to include the cloudera repo so it can find the cloudera artifacts. Anyhow, we found this process to be pretty easy, and we stopped using the spark versions bundled with the distros. On Mon, Sep 26, 2016 at 3:57 PM, Koert Kuipers

Re: Is Spark 2.0 master node compatible with Spark 1.5 worker node?

2016-09-26 Thread Koert Kuipers
It is also easy to launch many different spark versions on yarn by simply having them installed side-by-side. 1) Build spark for your cdh version. For example, for cdh 5 I do: $ git checkout v2.0.0 $ dev/make-distribution.sh --name cdh5.4-hive --tgz -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.4.4

Re: udf forces usage of Row for complex types?

2016-09-26 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-17668 On Mon, Sep 26, 2016 at 3:40 PM, Koert Kuipers wrote: > ok will create jira > > On Mon, Sep 26, 2016 at 3:27 PM, Michael Armbrust > wrote: > >> I agree this should work. We just haven't finished

Re: udf forces usage of Row for complex types?

2016-09-26 Thread Koert Kuipers
ok will create jira On Mon, Sep 26, 2016 at 3:27 PM, Michael Armbrust wrote: > I agree this should work. We just haven't finished killing the old > reflection based conversion logic now that we have more powerful/efficient > encoders. Please open a JIRA. > > On Sun,

Re: Spark 2.0 Structured Streaming: sc.parallelize in foreach sink cause Task not serializable error

2016-09-26 Thread Michael Armbrust
The code in ForeachWriter runs on the executors, which means that you are not allowed to use the SparkContext. This is probably why you are seeing that exception. On Sun, Sep 25, 2016 at 3:20 PM, Jianshi wrote: > Dear all: > > I am trying out the new released feature of
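A minimal sketch of a ForeachWriter that stays within those constraints; the streaming Dataset and the sink are placeholders, and anything needed inside open/process must be creatable on the executor:

    import org.apache.spark.sql.{Dataset, ForeachWriter}

    // streamingDs is assumed to be a streaming Dataset[String] built elsewhere.
    def attachSink(streamingDs: Dataset[String]) = {
      val writer = new ForeachWriter[String] {
        // Open per-partition resources here (e.g. a DB connection), never the SparkContext.
        override def open(partitionId: Long, version: Long): Boolean = true
        override def process(value: String): Unit = println(value) // replace with a real executor-side sink
        override def close(errorOrNull: Throwable): Unit = ()
      }
      streamingDs.writeStream.foreach(writer).start()
    }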

Re: udf forces usage of Row for complex types?

2016-09-26 Thread Michael Armbrust
I agree this should work. We just haven't finished killing the old reflection based conversion logic now that we have more powerful/efficient encoders. Please open a JIRA. On Sun, Sep 25, 2016 at 2:41 PM, Koert Kuipers wrote: > after having gotten used to have case classes

Native libraries using only one core in standalone spark cluster

2016-09-26 Thread guangweiyu
Hi, I'm trying to run a spark job that uses multiple cpu cores per spark executor. Specifically, it runs the gemm matrix-multiply routine from each partition on a large matrix that cannot be distributed. For test purposes, I have a machine with 8 cores running standalone spark. I

using SparkILoop.run

2016-09-26 Thread Mohit Jaggi
I want to use the following API: SparkILoop.run(...). I am writing a test case that passes some scala code to the spark interpreter and receives the result as a string. I couldn't figure out how to pass the right settings into the run() method. I get an error about "master" not being set. object

Re: Is Spark 2.0 master node compatible with Spark 1.5 worker node?

2016-09-26 Thread Piotr Smoliński
In YARN you submit the whole application. This way, unless the distribution provider does strange classpath "optimisations", you may just submit a Spark 2 application alongside Spark 1.5 or 1.6. It is YARN's responsibility to deliver the application files and the spark assembly to the workers. What's more,

Non-linear regression of exponential form in Spark

2016-09-26 Thread Cooper
Is it possible to perform exponential regression in Apache Spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Non-linear-regression-of-exponential-form-in-Spark-tp27794.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Pyspark ML - Unable to finish cross validation

2016-09-26 Thread Simone
Hello, I am using pyspark to train a Logistic Regression model using cross validation with ML. My dataset is - for testing purposes - very small, no more than 50 records for training. On the other hand, my "feature" column has a very large size - i.e., 1500+ columns. I am running on yarn

Re: spark-submit failing but job running from scala ide

2016-09-26 Thread vr spark
Hi Jacek/All, I restarted my terminal and then tried spark-submit, and I am again getting those errors. How do I see how many "runtimes" are running, and how do I have only one? Somehow my spark 1.6 and spark 2.0 are conflicting. How do I fix it? I installed spark 1.6 earlier using these steps

Re: Is Spark 2.0 master node compatible with Spark 1.5 worker node?

2016-09-26 Thread Rex X
Yes, I have a cloudera cluster with Yarn. Any more details on how to make this work with an uber jar? Thank you. On Sun, Sep 18, 2016 at 2:13 PM, Felix Cheung wrote: > Well, uber jar works in YARN, but not with standalone ;) > > > > > > On Sun, Sep 18, 2016 at 12:44 PM

Re: Running jobs against remote cluster from scala eclipse ide

2016-09-26 Thread Jacek Laskowski
Hi, Remove .setMaster("spark://spark-437-1-5963003:7077").set("spark.driver.host","11.104.29.106") and start over. Can you also run the following command to check out Spark Standalone: run-example --master spark://spark-437-1-5963003:7077 SparkPi Regards, Jacek Laskowski
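A minimal sketch of the trimmed configuration being suggested, under the assumption that the master and driver host are then supplied by spark-submit or the IDE run configuration rather than hard-coded:

    import org.apache.spark.{SparkConf, SparkContext}

    // No .setMaster(...) and no spark.driver.host override; pass --master to spark-submit instead.
    val conf = new SparkConf().setAppName("remote-cluster-job")
    val sc = new SparkContext(conf)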

Running jobs against remote cluster from scala eclipse ide

2016-09-26 Thread vr spark
Hi, I use Scala IDE for Eclipse. I usually run jobs against my local spark installed on my mac, then export the jars, copy them to my company's spark cluster, and run spark-submit there. This works fine. But I want to run the jobs from the Scala IDE directly against my company's spark cluster.

unsubscribe

2016-09-26 Thread Karthikeyan Vasuki Balasubramaniam
unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: how to find NaN values in each row of a spark dataframe to decide whether the row is dropped or not

2016-09-26 Thread Peyman Mohajerian
Also take a look at this API: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions On Mon, Sep 26, 2016 at 1:09 AM, Bedrytski Aliaksandr wrote: > Hi Muhammet, > > python also supports sql queries
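A minimal sketch of that API, assuming df is the DataFrame from the question; drop() removes rows containing null or NaN values:

    // Drop any row that has a null/NaN in one of the listed columns...
    val cleaned = df.na.drop(Seq("field1", "field2", "field3"))
    // ...or keep only rows with at least 2 non-null/non-NaN values across all columns.
    val atLeastTwo = df.na.drop(minNonNulls = 2)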

Re: SparkLauncher not receiving events

2016-09-26 Thread Mariano Semelman
Solved. tl;dr: I was using port 6066 instead of 7077. I got confused because of this message in the log when I submitted to the legacy port: [info] - org.apache.spark.launcher.app.ActivitiesSortingAggregateJob - 16/09/26 11:43:27 WARN RestSubmissionClient: Unable to connect to server
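A minimal sketch of the working setup being described (the jar path, class name, and host are hypothetical): the launcher points at the legacy submission port 7077 so the SparkAppHandle listener receives state updates:

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    val handle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")     // hypothetical application jar
      .setMainClass("com.example.MainJob")    // hypothetical main class
      .setMaster("spark://master-host:7077")  // legacy port, not the 6066 REST port
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit = println(s"state: ${h.getState}")
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })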

SparkLauncher not receiving events

2016-09-26 Thread Mariano Semelman
Hello, I'm having problems receiving events from the submitted app. The app successfully submits, but the listener I'm passing to SparkLauncher is not receiving events. Spark version: 1.6.1 (both client app and master). Here are the relevant snippets I'm using in my code:

Please unsubscribe me from this mailing list

2016-09-26 Thread Hogancamp, Aaron
Please unsubscribe aaron.t.hoganc...@leidos.com from this mailing list. Thanks, Aaron Hogancamp Data Scientist (615) 431-3229 (desk) (615) 617-7160 (mobile)

Re: how to decide which part of the process should use spark dataframes and which pandas dataframes?

2016-09-26 Thread Peyman Mohajerian
A simple way to do that is to collect the data in the driver when you need to use Python pandas. On Monday, September 26, 2016, muhammet pakyürek wrote: > > > is there a clear guide to decide the above? >

Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-26 Thread Peter Figliozzi
Thanks again Piotr. It's good to know there are a number of options. Once again I'm glad I put all my workers on the same ethernet switch, as unanticipated shuffling isn't so bad. Sincerely, Pete On Mon, Sep 26, 2016 at 8:35 AM, Piotr Smoliński < piotr.smolinski...@gmail.com> wrote: > Best,

Re: udf forces usage of Row for complex types?

2016-09-26 Thread Koert Kuipers
Case classes are serializable by default (they extend java Serializable trait) I am not using RDD or Dataset because I need to transform one column out of 200 or so. Dataset has the mechanisms to convert rows to case classes as needed (and make sure it's consistent with the schema). Why would

Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-26 Thread Piotr Smoliński
The best option is to write to HDFS; or, when you test the product with no HDFS available, just create a shared filesystem (windows shares, nfs, etc.) where the data will be written. You'll still end up with many files, but this time there will be only one directory tree. You may reduce the number of
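A minimal sketch of one common way to cut the file count (not necessarily what the truncated sentence was about to suggest; df is the DataFrame being written and the output path is hypothetical): coalesce before writing, at the cost of write parallelism. Note that coalesce(1) still produces a directory containing a single part file, not one flat file.

    // One output partition => one part-* file inside the output directory.
    df.coalesce(1).write.csv("hdfs:///user/pete/out")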

Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-26 Thread Peter Figliozzi
Thank you Piotr, that's what happened. In fact, there are about 100 files on each worker node in a directory corresponding to the write. Any way to tone that down a bit (maybe 1 file per worker)? Or, write a single file somewhere? On Mon, Sep 26, 2016 at 12:44 AM, Piotr Smoliński <

Re: Can Spark Streaming 2.0 work with Kafka 0.10?

2016-09-26 Thread Cody Koeninger
Either artifact should work with 0.10 brokers. The 0.10 integration has more features but is still marked experimental. On Sep 26, 2016 3:41 AM, "Haopu Wang" wrote: > Hi, in the official integration guide, it says "Spark Streaming 2.0.0 is > compatible with Kafka 0.8.2.1."
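A minimal sketch of the 0-10 direct stream Cody refers to; the broker list, topic, and group id are placeholders, and ssc is assumed to be an existing StreamingContext:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092,broker2:9092", // hypothetical brokers
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group",
      "auto.offset.reset"  -> "latest"
    )

    // New-consumer-based direct stream from spark-streaming-kafka-0-10.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("topicA"), kafkaParams))

    stream.map(record => (record.key, record.value)).print()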

RE: udf forces usage of Row for complex types?

2016-09-26 Thread ming.he
It should be UserDefinedType. You can refer to https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala From: Koert Kuipers [mailto:ko...@tresata.com] Sent: Monday, September 26, 2016 5:42 AM To: user@spark.apache.org Subject: udf

Re: Subscribe

2016-09-26 Thread Amit Sela
Please Subscribe via the mailing list as described here: http://beam.incubator.apache.org/use/mailing-lists/ On Mon, Sep 26, 2016, 12:11 Lakshmi Rajagopalan wrote: > >

Subscribe

2016-09-26 Thread Lakshmi Rajagopalan

Can Spark Streaming 2.0 work with Kafka 0.10?

2016-09-26 Thread Haopu Wang
Hi, in the official integration guide, it says "Spark Streaming 2.0.0 is compatible with Kafka 0.8.2.1." However, in the maven repository, I can get "spark-streaming-kafka-0-10_2.11", which depends on Kafka 0.10.0.0. Is this artifact stable enough? Thank you!

Re: MLib Documentation Update Needed

2016-09-26 Thread Sean Owen
Yes, I think that footnote could be a lot more prominent, or pulled up right under the table. I also think it would be fine to present the {0,1} formulation. It's actually more recognizable, I think, for log-loss in that form. It's probably less recognizable for hinge loss, but consistency is
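For reference, a sketch of the {0,1} formulation being discussed (the standard logistic log-loss, written from memory rather than copied from the docs):

    L(w; x, y) = -\left[ y \log\hat{p} + (1-y)\log(1-\hat{p}) \right],
    \qquad \hat{p} = \frac{1}{1 + e^{-w^{\top}x}},
    \qquad y \in \{0, 1\}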

Re: how to find NaN values in each row of a spark dataframe to decide whether the row is dropped or not

2016-09-26 Thread Bedrytski Aliaksandr
Hi Muhammet, python also supports sql queries http://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-queries-programmatically Regards, -- Bedrytski Aliaksandr sp...@bedryt.ski On Mon, Sep 26, 2016, at 10:01, muhammet pakyürek wrote: > > > > but my requst is related to

how to decide which part of the process should use spark dataframes and which pandas dataframes?

2016-09-26 Thread muhammet pakyürek
is there a clear guide to decide the above?

Re: Extract timestamp from Kafka message

2016-09-26 Thread Alonso Isidoro Roman
Hum, I think you have to embed the timestamp within the message... Alonso Isidoro Roman about.me/alonso.isidoro.roman 2016-09-26 0:59 GMT+02:00 Kevin Tran

Re: how to find NaN values in each row of a spark dataframe to decide whether the row is dropped or not

2016-09-26 Thread Bedrytski Aliaksandr
Hi Muhammet, have you tried to use sql queries?
> spark.sql("""
>   SELECT field1, field2, field3
>   FROM table1
>   WHERE field1 != 'Nan'
>     AND field2 != 'Nan'
>     AND field3 != 'Nan'
> """)
This query filters rows containing Nan for a table

how to find NaN values in each row of a spark dataframe to decide whether the row is dropped or not

2016-09-26 Thread muhammet pakyürek
Is there any way to do this directly? If not, is there any way to do this indirectly using other data structures of spark?