Newbie Q: Issue related to connecting Spark Master Standalone through Scala app

2016-09-26 Thread Reth RM
Hi,

 I have an issue connecting to the Spark master; I receive a RuntimeException:
java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage.

I followed the steps below. Can you please point out where I am going wrong?

1. Downloaded Spark (version spark-2.0.0-bin-hadoop2.7)
2. Have Scala installed (version 2.11.8)
3. Navigated to /spark-2.0.0-bin-hadoop2.7/sbin
4. ./start-master.sh
5. ./start-slave.sh spark://http://host:7077/
6. IntelliJ has a simple two-line Scala program, as linked here


Error
https://jpst.it/NOUE
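
For reference, a minimal standalone-master connection in Scala looks roughly
like the sketch below (this is not necessarily the poster's actual two-line
program; the master host is a placeholder). An InvalidClassException on
RequestMessage usually means the Spark version on the application's classpath
does not match the cluster's, as noted elsewhere on this list.

import org.apache.spark.{SparkConf, SparkContext}

object ConnectTest {
  def main(args: Array[String]): Unit = {
    // Master host is a placeholder; the spark-core dependency in the project
    // must match the cluster (2.0.0 / Scala 2.11 here).
    val conf = new SparkConf().setAppName("ConnectTest").setMaster("spark://host:7077")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).count())  // trivial action to exercise the cluster
    sc.stop()
  }
}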


Large-scale matrix inverse in Spark

2016-09-26 Thread Cooper
How is the problem of large-scale matrix inversion approached in Apache Spark?

This linear algebra operation is obviously at the base of a lot of other
algorithms (regression, classification, etc.). However, I have not been able
to find a Spark API for a parallel implementation of matrix inversion. Can you
please clarify how to approach this operation with the Spark internals?

Here is a paper on parallelized matrix inversion in Spark; however, I am
trying to use existing code instead of implementing one from scratch, if
available.
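
A minimal sketch of one approach, assuming MLlib's distributed RowMatrix:
compute a (pseudo-)inverse through computeSVD as A+ = V * S^-1 * U^T. Note
that U is collected to the driver here, so this only helps when the matrix is
too big for a purely local solver but its inverse still fits on the driver;
for a truly huge matrix, a block-partitioned method like the one in the paper
is needed.

import org.apache.spark.mllib.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Pseudo-inverse of a tall RowMatrix A via SVD: A+ = V * S^-1 * U^T.
def pseudoInverse(a: RowMatrix): DenseMatrix = {
  val n = a.numCols().toInt
  val svd = a.computeSVD(n, computeU = true)
  require(svd.s.size == n, "matrix is (numerically) rank deficient")
  val invS = DenseMatrix.diag(new DenseVector(svd.s.toArray.map(1.0 / _)))
  // U comes back row by row, so assemble the local copy in row-major order
  val u = new DenseMatrix(svd.U.numRows().toInt, svd.U.numCols().toInt,
    svd.U.rows.collect().flatMap(_.toArray), true)
  svd.V.multiply(invS).multiply(u.transpose)
}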



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-matrix-inverse-in-Spark-tp27796.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Access Amazon s3 data

2016-09-26 Thread Jagadeesan A.S.
Hi Hitesh,

The couple of links below will help you get started with a Spark application
and Amazon S3.

https://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_s3.html

https://www.supergloo.com/fieldnotes/apache-spark-amazon-s3-examples-of-text-files/
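
A minimal sketch of the application side (shown in Scala; the equivalent calls
exist in the Java API). The bucket, path, and credential values are
placeholders, and the s3a:// scheme needs the hadoop-aws and AWS SDK jars on
the classpath:

import org.apache.spark.sql.SparkSession

// Read JSON stored in S3 and run a Spark SQL query on it.
val spark = SparkSession.builder().appName("S3Example").getOrCreate()
// With IAM roles on EC2/EMR these two settings can usually be dropped.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "ACCESS_KEY")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "SECRET_KEY")

val df = spark.read.json("s3a://my-bucket/path/to/data/")
df.createOrReplaceTempView("events")
spark.sql("SELECT count(*) FROM events").show()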

Cheers
Jagadeesan A S


On Tue, Sep 27, 2016 at 7:20 AM, Hitesh Goyal 
wrote:

>
>
> Hi team,
>
> I have data in the amazon s3. I want to access the data using apache spark
> application. I am new to it.
> Please tell me how can i make application in java so that i can be able to
> apply spark sql queries on the s3.
>
>
>
> -Hitesh Goyal
>


Access Amazon s3 data

2016-09-26 Thread Hitesh Goyal


Hi team,

I have data in Amazon S3. I want to access the data using an Apache Spark
application. I am new to it.
Please tell me how I can make an application in Java so that I can apply
Spark SQL queries on the S3 data.



-Hitesh Goyal


Re: median of groups

2016-09-26 Thread ayan guha
I have used the percentile_approx function (with 0.5) from Hive, using
sqlContext SQL commands.
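
A minimal sketch of that approach, using the column names from the question
below. percentile_approx comes from Hive, so the session needs Hive support
(enableHiveSupport on Spark 2.0, or a HiveContext plus registerTempTable on
1.x); spark and df are assumed to be in scope:

// Approximate median per group via Hive's percentile_approx UDAF.
df.createOrReplaceTempView("t")
val medians = spark.sql(
  """SELECT foo, bar, percentile_approx(column1, 0.5) AS median_column1
    |FROM t
    |GROUP BY foo, bar""".stripMargin)
medians.show()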

On Tue, Sep 27, 2016 at 10:52 AM, Peter Figliozzi 
wrote:

> I'm trying to figure out a nice way to get the median of a DataFrame
> column *once it is grouped.  *
>
> It's easy enough now to get the min, max, mean, and other things that are
> part of spark.sql.functions:
>
> df.groupBy("foo", "bar").agg(mean($"column1"))
>
> And it's easy enough to get the median of a column before grouping, using
> approxQuantile.
>
> However approxQuantile is part of DataFrame.stat i.e. a
> DataFrameStatFunctions.
>
> Is there a way to use it inside the .agg?
>
> Or do we need a user defined aggregation function?
>
> Or some other way?
> Stack Overflow version of the question here
> 
> .
>
> Thanks,
>
> Pete
>
>


-- 
Best Regards,
Ayan Guha


median of groups

2016-09-26 Thread Peter Figliozzi
I'm trying to figure out a nice way to get the median of a DataFrame
column *once it is grouped*.

It's easy enough now to get the min, max, mean, and other things that are
part of spark.sql.functions:

df.groupBy("foo", "bar").agg(mean($"column1"))

And it's easy enough to get the median of a column before grouping, using
approxQuantile.

However, approxQuantile is part of DataFrame.stat, i.e.,
DataFrameStatFunctions.

Is there a way to use it inside the .agg?

Or do we need a user defined aggregation function?

Or some other way?
Stack Overflow version of the question here.

Thanks,

Pete


Re: Tutorial error - zeppelin 0.6.2 built with spark 2.0 and mapr

2016-09-26 Thread Nirav Patel
FYI, it works when I use the MapR-configured Spark 2.0, i.e.

export SPARK_HOME=/opt/mapr/spark/spark-2.0.0-bin-without-hadoop


Thanks

Nirav

On Mon, Sep 26, 2016 at 3:45 PM, Nirav Patel  wrote:

> Hi,
>
> I built zeppeling 0.6 branch using spark 2.0 using following mvn :
>
> mvn clean package -Pbuild-distr -Pmapr41 -Pyarn -Pspark-2.0 -Pscala-2.11
> -DskipTests
>
> Built went successful.
> I only have following set in zeppelin-conf.sh
>
> export HADOOP_HOME=/opt/mapr/hadoop/hadoop-2.5.1/
>
> export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
>
>
> i.e. I am using embedded spark 2.0 created during build process.
>
> After starting zeppelin for first time I tried to run 'Zeppelin Tutorial'
> notebook and got following exception in 'Load data into table' paragraph
>
>
> import org.apache.commons.io.IOUtils
> import java.net.URL
> import java.nio.charset.Charset
> bankText: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at
> parallelize at :30
> defined class Bank
> java.lang.UnsatisfiedLinkError: Native Library /tmp/mapr-xactly-
> libMapRClient.0.6.2-SNAPSHOT.so already loaded in another classloader
> at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1907)
> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
> at java.lang.Runtime.load0(Runtime.java:809)
> at java.lang.System.load(System.java:1086)
> at com.mapr.fs.shim.LibraryLoader.load(LibraryLoader.java:29)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at com.mapr.fs.ShimLoader.loadNativeLibrary(ShimLoader.java:335)
> at com.mapr.fs.ShimLoader.load(ShimLoader.java:210)
> at com.mapr.fs.MapRFileSystem.(MapRFileSystem.java:80)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(
> Configuration.java:1857)
> at org.apache.hadoop.conf.Configuration.getClassByName(
> Configuration.java:1822)
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1916)
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(
> FileSystem.java:2609)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2622)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2661)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2643)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:404)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
> at org.apache.hadoop.hive.ql.session.SessionState.start(
> SessionState.java:505)
> at org.apache.spark.sql.hive.client.HiveClientImpl.(
> HiveClientImpl.scala:171)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(
> IsolatedClientLoader.scala:258)
> at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(
> HiveUtils.scala:359)
> at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(
> HiveUtils.scala:263)
> at org.apache.spark.sql.hive.HiveSharedState.metadataHive$
> lzycompute(HiveSharedState.scala:39)
> at org.apache.spark.sql.hive.HiveSharedState.metadataHive(
> HiveSharedState.scala:38)
> at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(
> HiveSharedState.scala:46)
> at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(
> HiveSharedState.scala:45)
> at org.apache.spark.sql.hive.HiveSessionState.catalog$
> lzycompute(HiveSessionState.scala:50)
> at org.apache.spark.sql.hive.HiveSessionState.catalog(
> HiveSessionState.scala:48)
> at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<
> init>(HiveSessionState.scala:63)
> at org.apache.spark.sql.hive.HiveSessionState.analyzer$
> lzycompute(HiveSessionState.scala:63)
> at org.apache.spark.sql.hive.HiveSessionState.analyzer(
> HiveSessionState.scala:62)
> at org.apache.spark.sql.execution.QueryExecution.
> assertAnalyzed(QueryExecution.scala:49)
> at org.apache.spark.sql.Dataset.(Dataset.scala:161)
> at org.apache.spark.sql.Dataset.(Dataset.scala:167)
> at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
> at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:441)
> at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:395)
> at org.apache.spark.sql.SQLImplicits.rddToDatasetHolder(
> SQLImplicits.scala:163)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:36)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:48)
> at 

Tutorial error - zeppelin 0.6.2 built with spark 2.0 and mapr

2016-09-26 Thread Nirav Patel
Hi,

I built the Zeppelin 0.6 branch with Spark 2.0 using the following mvn command:

mvn clean package -Pbuild-distr -Pmapr41 -Pyarn -Pspark-2.0 -Pscala-2.11
-DskipTests

The build was successful.
I only have the following set in zeppelin-conf.sh:

export HADOOP_HOME=/opt/mapr/hadoop/hadoop-2.5.1/

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop


i.e., I am using the embedded Spark 2.0 created during the build process.

After starting Zeppelin for the first time, I tried to run the 'Zeppelin Tutorial'
notebook and got the following exception in the 'Load data into table' paragraph:


import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset
bankText: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at
parallelize at :30
defined class Bank
java.lang.UnsatisfiedLinkError: Native Library /tmp/
mapr-xactly-libMapRClient.0.6.2-SNAPSHOT.so already loaded in another
classloader
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1907)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
at java.lang.Runtime.load0(Runtime.java:809)
at java.lang.System.load(System.java:1086)
at com.mapr.fs.shim.LibraryLoader.load(LibraryLoader.java:29)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.mapr.fs.ShimLoader.loadNativeLibrary(ShimLoader.java:335)
at com.mapr.fs.ShimLoader.load(ShimLoader.java:210)
at com.mapr.fs.MapRFileSystem.(MapRFileSystem.java:80)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at
org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1857)
at
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1822)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1916)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2609)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2622)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2661)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2643)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:505)
at
org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:171)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at
org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at
org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
at
org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
at
org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at
org.apache.spark.sql.hive.HiveSessionState$$anon$1.(HiveSessionState.scala:63)
at
org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at
org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset.(Dataset.scala:161)
at org.apache.spark.sql.Dataset.(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:441)
at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:395)
at
org.apache.spark.sql.SQLImplicits.rddToDatasetHolder(SQLImplicits.scala:163)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:36)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:48)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:50)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:52)
at $iwC$$iwC$$iwC$$iwC.(:54)
at $iwC$$iwC$$iwC.(:56)
at $iwC$$iwC.(:58)
at $iwC.(:60)
at (:62)
at .(:66)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at 

Re: Slow Shuffle Operation on Empty Batch

2016-09-26 Thread Cody Koeninger
Do you have a minimal example of how to reproduce the problem, that
doesn't depend on Cassandra?

On Mon, Sep 26, 2016 at 4:10 PM, Erwan ALLAIN  wrote:
> Hi
>
> I'm working with
> - Kafka 0.8.2
> - Spark Streaming (2.0) direct input stream.
> - cassandra 3.0
>
> My batch interval is 1s.
>
> When I use some map, filter even saveToCassandra functions, the processing
> time is around 50ms on empty batches
>  => This is fine.
>
> As soon as I use some reduceByKey, the processing time is increasing rapidly
> between 3 and 4s for 3 calls of reduceByKey on empty batches.
> => Not Good
>
> I've found a workaround by using a foreachRDD on DStream and check if rdd is
> empty before executing the reduceByKey but I find this quite ugly.
>
> Do I need to check if RDD is empty on all shuffle operation ?
>
> Thanks for your lights




Re: spark-submit failing but job running from scala ide

2016-09-26 Thread Marco Mistroni
Hi Vr,
 your code works fine for me, running on Windows 10 vs Spark 1.6.1.
I'm guessing your Spark installation could be busted?
That would explain why it works in your IDE, as you are just importing jars
into your project.

The java.io.IOException: Failed to connect error is misleading; I have seen
similar errors for two or three completely different use cases.

I'd suggest you either try to
- move down to Spark 1.4.0 or 1.5.2 (there are subtle differences between
these old versions and Spark 1.6.1)
- or reinstall Spark 1.6.1 and start from running spark-examples via
spark-submit
- or run spark-shell and enter your SimpleApp line by line, to see if you can
get better debugging info

hth
 marco.



On Mon, Sep 26, 2016 at 5:22 PM, vr spark  wrote:

> Hi Jacek/All,
>
>  I restarted my terminal and then i try spark-submit and again getting
> those errors. How do i see how many "runtimes" are running and how to have
> only one? some how my spark 1.6 and spark 2.0 are conflicting. how to fix
> it?
>
> i installed spark 1.6 earlier using this steps http://genomegeek.
> blogspot.com/2014/11/how-to-install-apache-spark-on-mac-os-x.html
> i installed spark 2.0 using these steps http://blog.weetech.co/
> 2015/08/light-learning-apache-spark.html
>
> Here is the for run-example
>
> m-C02KL0B1FFT4:bin vr$ ./run-example SparkPi
> Using Spark's default log4j profile: org/apache/spark/log4j-
> defaults.properties
> 16/09/26 09:11:00 INFO SparkContext: Running Spark version 2.0.0
> 16/09/26 09:11:00 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 16/09/26 09:11:00 INFO SecurityManager: Changing view acls to: vr
> 16/09/26 09:11:00 INFO SecurityManager: Changing modify acls to: vr
> 16/09/26 09:11:00 INFO SecurityManager: Changing view acls groups to:
> 16/09/26 09:11:00 INFO SecurityManager: Changing modify acls groups to:
> 16/09/26 09:11:00 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users  with view permissions: Set(vr); groups
> with view permissions: Set(); users  with modify permissions: Set(vr);
> groups with modify permissions: Set()
> 16/09/26 09:11:01 INFO Utils: Successfully started service 'sparkDriver'
> on port 59323.
> 16/09/26 09:11:01 INFO SparkEnv: Registering MapOutputTracker
> 16/09/26 09:11:01 INFO SparkEnv: Registering BlockManagerMaster
> 16/09/26 09:11:01 INFO DiskBlockManager: Created local directory at
> /private/var/folders/23/ycbtxh8s551gzlsgj8q647d88gsjgb
> /T/blockmgr-d0d6dfea-2c97-4337-8e7d-0bbcb141f4c9
> 16/09/26 09:11:01 INFO MemoryStore: MemoryStore started with capacity
> 366.3 MB
> 16/09/26 09:11:01 INFO SparkEnv: Registering OutputCommitCoordinator
> 16/09/26 09:11:01 WARN Utils: Service 'SparkUI' could not bind on port
> 4040. Attempting port 4041.
> 16/09/26 09:11:01 INFO Utils: Successfully started service 'SparkUI' on
> port 4041.
> 16/09/26 09:11:01 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
> http://192.168.1.3:4041
> 16/09/26 09:11:01 INFO SparkContext: Added JAR file:/Users/vr/Downloads/
> spark-2.0.0/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar at
> spark://192.168.1.3:59323/jars/scopt_2.11-3.3.0.jar with timestamp
> 1474906261472
> 16/09/26 09:11:01 INFO SparkContext: Added JAR file:/Users/vr/Downloads/
> spark-2.0.0/examples/target/scala-2.11/jars/spark-examples_2.11-2.0.0.jar
> at spark://192.168.1.3:59323/jars/spark-examples_2.11-2.0.0.jar with
> timestamp 1474906261473
> 16/09/26 09:11:01 INFO Executor: Starting executor ID driver on host
> localhost
> 16/09/26 09:11:01 INFO Utils: Successfully started service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59324.
> 16/09/26 09:11:01 INFO NettyBlockTransferService: Server created on
> 192.168.1.3:59324
> 16/09/26 09:11:01 INFO BlockManagerMaster: Registering BlockManager
> BlockManagerId(driver, 192.168.1.3, 59324)
> 16/09/26 09:11:01 INFO BlockManagerMasterEndpoint: Registering block
> manager 192.168.1.3:59324 with 366.3 MB RAM, BlockManagerId(driver,
> 192.168.1.3, 59324)
> 16/09/26 09:11:01 INFO BlockManagerMaster: Registered BlockManager
> BlockManagerId(driver, 192.168.1.3, 59324)
> 16/09/26 09:11:01 WARN SparkContext: Use an existing SparkContext, some
> configuration may not take effect.
> 16/09/26 09:11:01 INFO SharedState: Warehouse path is
> 'file:/Users/vr/Downloads/spark-2.0.0/bin/spark-warehouse'.
> 16/09/26 09:11:01 INFO SparkContext: Starting job: reduce at
> SparkPi.scala:38
> 16/09/26 09:11:02 INFO DAGScheduler: Got job 0 (reduce at
> SparkPi.scala:38) with 2 output partitions
> 16/09/26 09:11:02 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at
> SparkPi.scala:38)
> 16/09/26 09:11:02 INFO DAGScheduler: Parents of final stage: List()
> 16/09/26 09:11:02 INFO DAGScheduler: Missing parents: List()
> 16/09/26 09:11:02 INFO DAGScheduler: Submitting ResultStage 0
> (MapPartitionsRDD[1] at map at SparkPi.scala:34), 

Slow Shuffle Operation on Empty Batch

2016-09-26 Thread Erwan ALLAIN
Hi

I'm working with
- Kafka 0.8.2
- Spark Streaming (2.0) direct input stream.
- cassandra 3.0

My batch interval is 1s.

When I use map, filter, or even saveToCassandra functions, the processing
time is around 50 ms on empty batches.
 => This is fine.

As soon as I use a reduceByKey, the processing time increases rapidly to
between 3 and 4 s for 3 calls of reduceByKey on empty batches.
=> Not good.

I've found a workaround: using foreachRDD on the DStream and checking whether
the RDD is empty before executing the reduceByKey (roughly as in the sketch
below), but I find this quite ugly.
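
(Sketch of that workaround, assuming stream is a DStream[(String, Int)]; the
keyspace and table names are placeholders.)

import com.datastax.spark.connector._   // for saveToCassandra

stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {                  // cheap check: skip the shuffle on empty batches
    rdd.reduceByKey(_ + _)               // existing reduce logic
       .saveToCassandra("my_keyspace", "my_table")
  }
}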

Do I need to check whether the RDD is empty for every shuffle operation?

Thanks for your insights


Re: using SparkILoop.run

2016-09-26 Thread Vadim Semenov
Add "-Dspark.master=local[*]" to the VM properties of your test run.

On Mon, Sep 26, 2016 at 2:25 PM, Mohit Jaggi  wrote:

> I want to use the following API  SparkILoop.run(...). I am writing a test
> case as that passes some scala code to spark interpreter and receives
> result as string.
>
> I couldn't figure out how to pass the right settings into the run()
> method. I get an error about "master' not being set.
>
> object SparkILoop {
>
>   /**
>* Creates an interpreter loop with default settings and feeds
>* the given code to it as input.
>*/
>   def run(code: String, sets: Settings = new Settings): String = {
> import java.io.{ BufferedReader, StringReader, OutputStreamWriter }
>
> stringFromStream { ostream =>
>   Console.withOut(ostream) {
> val input = new BufferedReader(new StringReader(code))
> val output = new JPrintWriter(new OutputStreamWriter(ostream), true)
> val repl = new SparkILoop(input, output)
>
> if (sets.classpath.isDefault) {
>   sets.classpath.value = sys.props("java.class.path")
> }
> repl process sets
>   }
> }
>   }
>   def run(lines: List[String]): String = run(lines.map(_ + "\n").mkString)
> }
>
>


Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-26 Thread Koert Kuipers
oh i forgot: in step 1 you will have to modify spark's pom.xml to include the
cloudera repo so it can find the cloudera artifacts

anyhow we found this process to be pretty easy and we stopped using the spark
versions bundled with the distros

On Mon, Sep 26, 2016 at 3:57 PM, Koert Kuipers  wrote:

> it is also easy to launch many different spark versions on yarn by simply
> having them installed side-by-side.
>
> 1) build spark for your cdh version. for example for cdh 5 i do:
> $ git checkout v2.0.0
> $ dev/make-distribution.sh --name cdh5.4-hive --tgz -Phadoop-2.6
> -Dhadoop.version=2.6.0-cdh5.4.4 -Pyarn -Phive -Psparkr
>
> 2) scp to and untar the spark tar.gz on the server you want to launch from
>
> 3) modify the new spark's conf/spark-env.sh so it has this:
> export HADOOP_CONF_DIR=/etc/hadoop/conf
>
> 4) modify the new spark's conf/spark-defaults.conf so it has this:
> spark.master yarn
>
> 5) now launch your application with the bin/spark-submit script from the
> new spark distro
>
>
> On Mon, Sep 26, 2016 at 11:48 AM, Rex X  wrote:
>
>> Yes, I have a cloudera cluster with Yarn. Any more details on how to work
>> out with uber jar?
>>
>> Thank you.
>>
>>
>> On Sun, Sep 18, 2016 at 2:13 PM, Felix Cheung 
>> wrote:
>>
>>> Well, uber jar works in YARN, but not with standalone ;)
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Sep 18, 2016 at 12:44 PM -0700, "Chris Fregly" >> > wrote:
>>>
>>> you'll see errors like this...
>>>
>>> "java.lang.RuntimeException: java.io.InvalidClassException:
>>> org.apache.spark.rpc.netty.RequestMessage; local class incompatible:
>>> stream classdesc serialVersionUID = -2221986757032131007, local class
>>> serialVersionUID = -5447855329526097695"
>>>
>>> ...when mixing versions of spark.
>>>
>>> i'm actually seeing this right now while testing across Spark 1.6.1 and
>>> Spark 2.0.1 for my all-in-one, hybrid cloud/on-premise Spark + Zeppelin +
>>> Kafka + Kubernetes + Docker + One-Click Spark ML Model Production
>>> Deployments initiative documented here:
>>>
>>> https://github.com/fluxcapacitor/pipeline/wiki/Kubernetes-Do
>>> cker-Spark-ML
>>>
>>> and check out my upcoming meetup on this effort either in-person or
>>> online:
>>>
>>> http://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/e
>>> vents/233978839/
>>>
>>> we're throwing in some GPU/CUDA just to sweeten the offering!  :)
>>>
>>> On Sat, Sep 10, 2016 at 2:57 PM, Holden Karau 
>>> wrote:
>>>
 I don't think a 2.0 uber jar will play nicely on a 1.5 standalone
 cluster.


 On Saturday, September 10, 2016, Felix Cheung <
 felixcheun...@hotmail.com> wrote:

> You should be able to get it to work with 2.0 as uber jar.
>
> What type cluster you are running on? YARN? And what distribution?
>
>
>
>
>
> On Sun, Sep 4, 2016 at 8:48 PM -0700, "Holden Karau" <
> hol...@pigscanfly.ca> wrote:
>
> You really shouldn't mix different versions of Spark between the
> master and worker nodes, if your going to upgrade - upgrade all of them.
> Otherwise you may get very confusing failures.
>
> On Monday, September 5, 2016, Rex X  wrote:
>
>> Wish to use the Pivot Table feature of data frame which is available
>> since Spark 1.6. But the spark of current cluster is version 1.5. Can we
>> install Spark 2.0 on the master node to work around this?
>>
>> Thanks!
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>

 --
 Cell : 425-233-8271
 Twitter: https://twitter.com/holdenkarau


>>>
>>>
>>> --
>>> *Chris Fregly*
>>> Research Scientist @ *PipelineIO* 
>>> *Advanced Spark and TensorFlow Meetup*
>>> 
>>> *San Francisco* | *Chicago* | *Washington DC*
>>>
>>>
>>>
>>>
>>>
>>
>


Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-26 Thread Koert Kuipers
it is also easy to launch many different spark versions on yarn by simply
having them installed side-by-side.

1) build spark for your cdh version. for example for cdh 5 i do:
$ git checkout v2.0.0
$ dev/make-distribution.sh --name cdh5.4-hive --tgz -Phadoop-2.6
-Dhadoop.version=2.6.0-cdh5.4.4 -Pyarn -Phive -Psparkr

2) scp to and untar the spark tar.gz on the server you want to launch from

3) modify the new spark's conf/spark-env.sh so it has this:
export HADOOP_CONF_DIR=/etc/hadoop/conf

4) modify the new spark's conf/spark-defaults.conf so it has this:
spark.master yarn

5) now launch your application with the bin/spark-submit script from the
new spark distro


On Mon, Sep 26, 2016 at 11:48 AM, Rex X  wrote:

> Yes, I have a cloudera cluster with Yarn. Any more details on how to work
> out with uber jar?
>
> Thank you.
>
>
> On Sun, Sep 18, 2016 at 2:13 PM, Felix Cheung 
> wrote:
>
>> Well, uber jar works in YARN, but not with standalone ;)
>>
>>
>>
>>
>>
>> On Sun, Sep 18, 2016 at 12:44 PM -0700, "Chris Fregly" 
>> wrote:
>>
>> you'll see errors like this...
>>
>> "java.lang.RuntimeException: java.io.InvalidClassException:
>> org.apache.spark.rpc.netty.RequestMessage; local class incompatible:
>> stream classdesc serialVersionUID = -2221986757032131007, local class
>> serialVersionUID = -5447855329526097695"
>>
>> ...when mixing versions of spark.
>>
>> i'm actually seeing this right now while testing across Spark 1.6.1 and
>> Spark 2.0.1 for my all-in-one, hybrid cloud/on-premise Spark + Zeppelin +
>> Kafka + Kubernetes + Docker + One-Click Spark ML Model Production
>> Deployments initiative documented here:
>>
>> https://github.com/fluxcapacitor/pipeline/wiki/Kubernetes-Docker-Spark-ML
>>
>> and check out my upcoming meetup on this effort either in-person or
>> online:
>>
>> http://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/
>> events/233978839/
>>
>> we're throwing in some GPU/CUDA just to sweeten the offering!  :)
>>
>> On Sat, Sep 10, 2016 at 2:57 PM, Holden Karau 
>> wrote:
>>
>>> I don't think a 2.0 uber jar will play nicely on a 1.5 standalone
>>> cluster.
>>>
>>>
>>> On Saturday, September 10, 2016, Felix Cheung 
>>> wrote:
>>>
 You should be able to get it to work with 2.0 as uber jar.

 What type cluster you are running on? YARN? And what distribution?





 On Sun, Sep 4, 2016 at 8:48 PM -0700, "Holden Karau" <
 hol...@pigscanfly.ca> wrote:

 You really shouldn't mix different versions of Spark between the master
 and worker nodes, if your going to upgrade - upgrade all of them. Otherwise
 you may get very confusing failures.

 On Monday, September 5, 2016, Rex X  wrote:

> Wish to use the Pivot Table feature of data frame which is available
> since Spark 1.6. But the spark of current cluster is version 1.5. Can we
> install Spark 2.0 on the master node to work around this?
>
> Thanks!
>


 --
 Cell : 425-233-8271
 Twitter: https://twitter.com/holdenkarau


>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>>
>>
>>
>> --
>> *Chris Fregly*
>> Research Scientist @ *PipelineIO* 
>> *Advanced Spark and TensorFlow Meetup*
>> 
>> *San Francisco* | *Chicago* | *Washington DC*
>>
>>
>>
>>
>>
>


Re: udf forces usage of Row for complex types?

2016-09-26 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-17668

On Mon, Sep 26, 2016 at 3:40 PM, Koert Kuipers  wrote:

> ok will create jira
>
> On Mon, Sep 26, 2016 at 3:27 PM, Michael Armbrust 
> wrote:
>
>> I agree this should work.  We just haven't finished killing the old
>> reflection based conversion logic now that we have more powerful/efficient
>> encoders.  Please open a JIRA.
>>
>> On Sun, Sep 25, 2016 at 2:41 PM, Koert Kuipers  wrote:
>>
>>> after having gotten used to have case classes represent complex
>>> structures in Datasets, i am surprised to find out that when i work in
>>> DataFrames with udfs no such magic exists, and i have to fall back to
>>> manipulating Row objects, which is error prone and somewhat ugly.
>>>
>>> for example:
>>> case class Person(name: String, age: Int)
>>>
>>> val df = Seq((Person("john", 33), 5), (Person("mike", 30),
>>> 6)).toDF("person", "id")
>>> val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age =
>>> p.age + 1) }).apply(col("person")))
>>> df1.printSchema
>>> df1.show
>>>
>>> leads to:
>>> java.lang.ClassCastException: org.apache.spark.sql.catalyst.
>>> expressions.GenericRowWithSchema cannot be cast to Person
>>>
>>
>>
>


Re: udf forces usage of Row for complex types?

2016-09-26 Thread Koert Kuipers
ok will create jira

On Mon, Sep 26, 2016 at 3:27 PM, Michael Armbrust 
wrote:

> I agree this should work.  We just haven't finished killing the old
> reflection based conversion logic now that we have more powerful/efficient
> encoders.  Please open a JIRA.
>
> On Sun, Sep 25, 2016 at 2:41 PM, Koert Kuipers  wrote:
>
>> after having gotten used to have case classes represent complex
>> structures in Datasets, i am surprised to find out that when i work in
>> DataFrames with udfs no such magic exists, and i have to fall back to
>> manipulating Row objects, which is error prone and somewhat ugly.
>>
>> for example:
>> case class Person(name: String, age: Int)
>>
>> val df = Seq((Person("john", 33), 5), (Person("mike", 30),
>> 6)).toDF("person", "id")
>> val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age = p.age
>> + 1) }).apply(col("person")))
>> df1.printSchema
>> df1.show
>>
>> leads to:
>> java.lang.ClassCastException: org.apache.spark.sql.catalyst.
>> expressions.GenericRowWithSchema cannot be cast to Person
>>
>
>


Re: Spark 2.0 Structured Streaming: sc.parallelize in foreach sink cause Task not serializable error

2016-09-26 Thread Michael Armbrust
The code in ForeachWriter runs on the executors, which means that you are
not allowed to use the SparkContext.  This is probably why you are seeing
that exception.
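
For reference, a sketch of a ForeachWriter that writes each row without
touching the SparkContext, using the connector's CassandraConnector directly.
The keyspace, table, and column names here are placeholders; a prepared
statement and a per-partition session would be the next refinement:

import org.apache.spark.sql.{ForeachWriter, Row}
import com.datastax.spark.connector.cql.CassandraConnector

// CassandraConnector is serializable, so it can be built on the driver from
// the SparkConf and shipped inside the writer to the executors.
val connector = CassandraConnector(sc.getConf)

val writer = new ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = true
  override def process(row: Row): Unit = {
    val id    = row.getAs[String]("id")
    val count = row.getAs[Long]("count")
    connector.withSessionDo { session =>
      session.execute(s"INSERT INTO my_keyspace.my_table (id, cnt) VALUES ('$id', $count)")
    }
  }
  override def close(errorOrNull: Throwable): Unit = ()
}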

On Sun, Sep 25, 2016 at 3:20 PM, Jianshi  wrote:

> Dear all:
>
> I am trying out the new released feature of structured streaming in Spark
> 2.0. I use the Structured Streaming to perform windowing by event time. I
> can print out the result in the console.  I would like to write the result
> to  Cassandra database through the foreach sink option. I am trying to use
> the spark-cassandra-connector to save the result. The connector saves rdd
> to
> Cassandra by calling rdd.saveToCassandra(), and this works fine if I
> execute
> the commands in spark-shell. For example:
> import com.datastax.spark.connector._
> val col = sc.parallelize(Seq(("of", 1200), ("the", "863")))
> col.saveToCassandra(keyspace, table)
>
> However, when I use the sc.parallelize inside foreach sink, it raise an
> error. The input file is Json messages with each row like the following:
> {"id": text, "time":timestamp,"hr": int}
>
> Here is my code:
>
> object StructStream {
>   def main(args: Array[String]) {
> val conf = new SparkConf(true).set("spark.cassandra.connection.host",
> "172.31.0.174")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val spark =
> SparkSession.builder.appName("StructuredAverage").getOrCreate()
> import spark.implicits._
>
> val userSchema = new StructType().add("id", "string").add("hr",
> "integer").add("time","timestamp")
> val jsonDF =
> spark.readStream.schema(userSchema).json("hdfs://ec2-
> 52-45-70-95.compute-1.amazonaws.com:9000/test3/")
> val line_count = jsonDF.groupBy(window($"time","2 minutes","1
> minutes"),
> $"id").count().orderBy("window")
>
> import org.apache.spark.sql.ForeachWriter
>
> val writer = new ForeachWriter[org.apache.spark.sql.Row] {
>   override def open(partitionId: Long, version: Long) = true
>   override def process(value: org.apache.spark.sql.Row) = {
> val toRemove = "[]".toSet
> val v_str = value.toString().filterNot(toRemove).split(",")
> val v_df =
> sc.parallelize(Seq(Stick(v_str(2),v_str(3).toInt,v_str(1),v_str(0
> v_df.saveToCassandra("playground","sstest")
> println(v_str(0),v_str(1),v_str(2),v_str(3))}
>   override def close(errorOrNull: Throwable) = ()
> }
>
> val query =
> line_count.writeStream.outputMode("complete").foreach(writer).start()
>
> query.awaitTermination()
>
>   }
>
> }
>
> case class Stick(aid: String, bct:Int, cend: String, dst: String)
>
> *
> The error message looks like this:*
>
> Error:
> org.apache.spark.SparkException: Task not serializable
> at
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(
> ClosureCleaner.scala:298)
> at
> org.apache.spark.util.ClosureCleaner$.org$apache$
> spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
> at org.apache.spark.util.ClosureCleaner$.clean(
> ClosureCleaner.scala:108)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:882)
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:881)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:151)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
> at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:881)
> at
> org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.
> apply$mcV$sp(Dataset.scala:2117)
> at
> org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.
> apply(Dataset.scala:2117)
> at
> org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.
> apply(Dataset.scala:2117)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(
> SQLExecution.scala:57)
> at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.
> scala:2532)
> at org.apache.spark.sql.Dataset.foreachPartition(Dataset.
> scala:2116)
> at
> org.apache.spark.sql.execution.streaming.ForeachSink.addBatch(
> ForeachSink.scala:69)
> at
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$
> spark$sql$execution$streaming$StreamExecution$$runBatch(
> StreamExecution.scala:375)
> at
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$
> apache$spark$sql$execution$streaming$StreamExecution$$
> runBatches$1.apply$mcZ$sp(StreamExecution.scala:194)
> at
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.
> execute(TriggerExecutor.scala:43)
> at
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$
> spark$sql$execution$streaming$StreamExecution$$runBatches(
> 

Re: udf forces usage of Row for complex types?

2016-09-26 Thread Michael Armbrust
I agree this should work.  We just haven't finished killing the old
reflection based conversion logic now that we have more powerful/efficient
encoders.  Please open a JIRA.
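
In the meantime, a sketch of a workaround for the example below: do the
transformation through the Dataset API, where the case-class encoders do
apply, instead of a udf over Row (assumes import spark.implicits._ is in
scope):

// Encoder-based map instead of a udf on Row.
val df1 = df.as[(Person, Int)]
  .map { case (p, id) => (p.copy(age = p.age + 1), id) }
  .toDF("person", "id")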

On Sun, Sep 25, 2016 at 2:41 PM, Koert Kuipers  wrote:

> after having gotten used to have case classes represent complex structures
> in Datasets, i am surprised to find out that when i work in DataFrames with
> udfs no such magic exists, and i have to fall back to manipulating Row
> objects, which is error prone and somewhat ugly.
>
> for example:
> case class Person(name: String, age: Int)
>
> val df = Seq((Person("john", 33), 5), (Person("mike", 30),
> 6)).toDF("person", "id")
> val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age = p.age
> + 1) }).apply(col("person")))
> df1.printSchema
> df1.show
>
> leads to:
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
> cannot be cast to Person
>


Native libraries using only one core in standalone spark cluster

2016-09-26 Thread guangweiyu
Hi,

I'm trying to run a Spark job that uses multiple CPU cores per Spark
executor. Specifically, it runs the gemm matrix-multiply routine from each
partition on a large matrix that cannot be distributed.

For test purposes, I have a machine with 8 cores running standalone Spark. I
started a Spark context, setting "spark.task.cpus" to "8"; then I generated
an RDD with 1 partition only, so there will be one executor using all cores.

The job is coded in Java, with the JNI wrapper provided by fommil
(netlib-java) and the underlying BLAS implementation from OpenBLAS. The
machine I'm running on is a local desktop with an Intel(R) Core(TM) i7-4770K
CPU @ 3.50GHz (8 cores).
When I run the test using local Spark as "local[8]", I can see the routine
complete in about 200 ms, and CPU utilization is near 100% for all cores.
This is nearly identical performance to running the same code without Spark,
straight from Java.

When I run the test attached to the standalone Spark by setting the master as
"spark://:7077", the same code takes about 12 seconds, and monitoring the
CPU shows that only one thread is used at a time. This is also very close to
the performance I get if I run the routine in Java with only one core.

I do not see any warning about a failure to load the native library, and if I
collect a map of System.getenv(), I see that all the environment variables
seem to be correct (OPENBLAS_NUM_THREADS=8, LD_LIBRARY_PATH includes the
wrapper, etc.).

I also tried replacing OpenBLAS with MKL, with MKL_NUM_THREADS=8 and
MKL_DYNAMIC=false, but I got exactly the same behaviour: local Spark uses
all cores, but standalone Spark would not.

I tried a lot of different settings on the native library's side, but it
seems weird that local Spark is okay while the standalone Spark is not.

Any help is greatly appreciated!

Guang



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Native-libraries-using-only-one-core-in-standalone-spark-cluster-tp27795.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




using SparkILoop.run

2016-09-26 Thread Mohit Jaggi
I want to use the following API, SparkILoop.run(...). I am writing a test
case that passes some Scala code to the Spark interpreter and receives the
result as a string.

I couldn't figure out how to pass the right settings into the run() method.
I get an error about 'master' not being set.

object SparkILoop {

  /**
   * Creates an interpreter loop with default settings and feeds
   * the given code to it as input.
   */
  def run(code: String, sets: Settings = new Settings): String = {
    import java.io.{ BufferedReader, StringReader, OutputStreamWriter }

    stringFromStream { ostream =>
      Console.withOut(ostream) {
        val input = new BufferedReader(new StringReader(code))
        val output = new JPrintWriter(new OutputStreamWriter(ostream), true)
        val repl = new SparkILoop(input, output)

        if (sets.classpath.isDefault) {
          sets.classpath.value = sys.props("java.class.path")
        }
        repl process sets
      }
    }
  }

  def run(lines: List[String]): String = run(lines.map(_ + "\n").mkString)
}


Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-26 Thread Piotr Smoliński
In YARN you submit the whole application. This way, unless the distribution
provider does strange classpath "optimisations", you may just submit a Spark 2
application alongside Spark 1.5 or 1.6.

It is YARN's responsibility to deliver the application files and the Spark
assembly to the workers. What's more, you only have to install Spark on the
node from which you are going to start it.

Procedure tested with Hortonworks.

HTH,
Piotr


On Mon, Sep 26, 2016 at 5:48 PM, Rex X  wrote:

> Yes, I have a cloudera cluster with Yarn. Any more details on how to work
> out with uber jar?
>
> Thank you.
>
>
> On Sun, Sep 18, 2016 at 2:13 PM, Felix Cheung 
> wrote:
>
>> Well, uber jar works in YARN, but not with standalone ;)
>>
>>
>>
>>
>>
>> On Sun, Sep 18, 2016 at 12:44 PM -0700, "Chris Fregly" 
>> wrote:
>>
>> you'll see errors like this...
>>
>> "java.lang.RuntimeException: java.io.InvalidClassException:
>> org.apache.spark.rpc.netty.RequestMessage; local class incompatible:
>> stream classdesc serialVersionUID = -2221986757032131007, local class
>> serialVersionUID = -5447855329526097695"
>>
>> ...when mixing versions of spark.
>>
>> i'm actually seeing this right now while testing across Spark 1.6.1 and
>> Spark 2.0.1 for my all-in-one, hybrid cloud/on-premise Spark + Zeppelin +
>> Kafka + Kubernetes + Docker + One-Click Spark ML Model Production
>> Deployments initiative documented here:
>>
>> https://github.com/fluxcapacitor/pipeline/wiki/Kubernetes-Docker-Spark-ML
>>
>> and check out my upcoming meetup on this effort either in-person or
>> online:
>>
>> http://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/
>> events/233978839/
>>
>> we're throwing in some GPU/CUDA just to sweeten the offering!  :)
>>
>> On Sat, Sep 10, 2016 at 2:57 PM, Holden Karau 
>> wrote:
>>
>>> I don't think a 2.0 uber jar will play nicely on a 1.5 standalone
>>> cluster.
>>>
>>>
>>> On Saturday, September 10, 2016, Felix Cheung 
>>> wrote:
>>>
 You should be able to get it to work with 2.0 as uber jar.

 What type cluster you are running on? YARN? And what distribution?





 On Sun, Sep 4, 2016 at 8:48 PM -0700, "Holden Karau" <
 hol...@pigscanfly.ca> wrote:

 You really shouldn't mix different versions of Spark between the master
 and worker nodes, if your going to upgrade - upgrade all of them. Otherwise
 you may get very confusing failures.

 On Monday, September 5, 2016, Rex X  wrote:

> Wish to use the Pivot Table feature of data frame which is available
> since Spark 1.6. But the spark of current cluster is version 1.5. Can we
> install Spark 2.0 on the master node to work around this?
>
> Thanks!
>


 --
 Cell : 425-233-8271
 Twitter: https://twitter.com/holdenkarau


>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>>
>>
>>
>> --
>> *Chris Fregly*
>> Research Scientist @ *PipelineIO* 
>> *Advanced Spark and TensorFlow Meetup*
>> 
>> *San Francisco* | *Chicago* | *Washington DC*
>>
>>
>>
>>
>>
>


Non-linear regression of exponential form in Spark

2016-09-26 Thread Cooper
Is it possible to perform exponential regression in Apache Spark?
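
One common route, sketched below under stated assumptions: linearize
y = a * exp(b * x) as ln(y) = ln(a) + b * x and fit it with spark.ml's
LinearRegression. df is assumed to be a DataFrame with numeric columns x and
y, spark a SparkSession, and spark.implicits._ imported.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.functions.log

// Build the feature vector and the log-transformed label.
val assembled = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("features")
  .transform(df.withColumn("logy", log($"y")))

val model = new LinearRegression()
  .setLabelCol("logy")
  .setFeaturesCol("features")
  .fit(assembled)

val b = model.coefficients(0)      // exponent b
val a = math.exp(model.intercept)  // a = exp(intercept)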



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Non-linear-regression-of-exponential-form-in-Spark-tp27794.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Pyspark ML - Unable to finish cross validation

2016-09-26 Thread Simone
Hello,

I am using PySpark to train a logistic regression model using cross-validation
with ML. My dataset is, for testing purposes, very small: no more than 50
records for training.
On the other hand, my "feature" column is very wide, i.e., 1500+ columns.

I am running on YARN using 3 executors, with 4 GB and 4 cores each. I am using
cache to store the DataFrames.

Unfortunately, my process does not finish; it hangs while doing cross-validation.

Any clues?

Thanks guys

Simone

Re: spark-submit failing but job running from scala ide

2016-09-26 Thread vr spark
Hi Jacek/All,

 I restarted my terminal and then tried spark-submit, and I am again getting
those errors. How do I see how many "runtimes" are running, and how do I have
only one? Somehow my Spark 1.6 and Spark 2.0 are conflicting. How do I fix
it?

I installed Spark 1.6 earlier using these steps:
http://genomegeek.blogspot.com/2014/11/how-to-install-apache-spark-on-mac-os-x.html
I installed Spark 2.0 using these steps:
http://blog.weetech.co/2015/08/light-learning-apache-spark.html

Here is the output for run-example:

m-C02KL0B1FFT4:bin vr$ ./run-example SparkPi
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
16/09/26 09:11:00 INFO SparkContext: Running Spark version 2.0.0
16/09/26 09:11:00 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
16/09/26 09:11:00 INFO SecurityManager: Changing view acls to: vr
16/09/26 09:11:00 INFO SecurityManager: Changing modify acls to: vr
16/09/26 09:11:00 INFO SecurityManager: Changing view acls groups to:
16/09/26 09:11:00 INFO SecurityManager: Changing modify acls groups to:
16/09/26 09:11:00 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users  with view permissions: Set(vr); groups
with view permissions: Set(); users  with modify permissions: Set(vr);
groups with modify permissions: Set()
16/09/26 09:11:01 INFO Utils: Successfully started service 'sparkDriver' on
port 59323.
16/09/26 09:11:01 INFO SparkEnv: Registering MapOutputTracker
16/09/26 09:11:01 INFO SparkEnv: Registering BlockManagerMaster
16/09/26 09:11:01 INFO DiskBlockManager: Created local directory at
/private/var/folders/23/ycbtxh8s551gzlsgj8q647d88gsjgb/T/blockmgr-d0d6dfea-2c97-4337-8e7d-0bbcb141f4c9
16/09/26 09:11:01 INFO MemoryStore: MemoryStore started with capacity 366.3
MB
16/09/26 09:11:01 INFO SparkEnv: Registering OutputCommitCoordinator
16/09/26 09:11:01 WARN Utils: Service 'SparkUI' could not bind on port
4040. Attempting port 4041.
16/09/26 09:11:01 INFO Utils: Successfully started service 'SparkUI' on
port 4041.
16/09/26 09:11:01 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
http://192.168.1.3:4041
16/09/26 09:11:01 INFO SparkContext: Added JAR
file:/Users/vr/Downloads/spark-2.0.0/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar
at spark://192.168.1.3:59323/jars/scopt_2.11-3.3.0.jar with timestamp
1474906261472
16/09/26 09:11:01 INFO SparkContext: Added JAR
file:/Users/vr/Downloads/spark-2.0.0/examples/target/scala-2.11/jars/spark-examples_2.11-2.0.0.jar
at spark://192.168.1.3:59323/jars/spark-examples_2.11-2.0.0.jar with
timestamp 1474906261473
16/09/26 09:11:01 INFO Executor: Starting executor ID driver on host
localhost
16/09/26 09:11:01 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 59324.
16/09/26 09:11:01 INFO NettyBlockTransferService: Server created on
192.168.1.3:59324
16/09/26 09:11:01 INFO BlockManagerMaster: Registering BlockManager
BlockManagerId(driver, 192.168.1.3, 59324)
16/09/26 09:11:01 INFO BlockManagerMasterEndpoint: Registering block
manager 192.168.1.3:59324 with 366.3 MB RAM, BlockManagerId(driver,
192.168.1.3, 59324)
16/09/26 09:11:01 INFO BlockManagerMaster: Registered BlockManager
BlockManagerId(driver, 192.168.1.3, 59324)
16/09/26 09:11:01 WARN SparkContext: Use an existing SparkContext, some
configuration may not take effect.
16/09/26 09:11:01 INFO SharedState: Warehouse path is
'file:/Users/vr/Downloads/spark-2.0.0/bin/spark-warehouse'.
16/09/26 09:11:01 INFO SparkContext: Starting job: reduce at
SparkPi.scala:38
16/09/26 09:11:02 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38)
with 2 output partitions
16/09/26 09:11:02 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at
SparkPi.scala:38)
16/09/26 09:11:02 INFO DAGScheduler: Parents of final stage: List()
16/09/26 09:11:02 INFO DAGScheduler: Missing parents: List()
16/09/26 09:11:02 INFO DAGScheduler: Submitting ResultStage 0
(MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing
parents
16/09/26 09:11:02 INFO MemoryStore: Block broadcast_0 stored as values in
memory (estimated size 1832.0 B, free 366.3 MB)
16/09/26 09:11:02 INFO MemoryStore: Block broadcast_0_piece0 stored as
bytes in memory (estimated size 1169.0 B, free 366.3 MB)
16/09/26 09:11:02 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
on 192.168.1.3:59324 (size: 1169.0 B, free: 366.3 MB)
16/09/26 09:11:02 INFO SparkContext: Created broadcast 0 from broadcast at
DAGScheduler.scala:1012
16/09/26 09:11:02 INFO DAGScheduler: Submitting 2 missing tasks from
ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34)
16/09/26 09:11:02 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/09/26 09:11:02 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
0, localhost, partition 0, PROCESS_LOCAL, 5474 bytes)
16/09/26 09:11:02 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
1, localhost, partition 1, 

Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-26 Thread Rex X
Yes, I have a Cloudera cluster with YARN. Any more details on how to make it
work with an uber jar?

Thank you.


On Sun, Sep 18, 2016 at 2:13 PM, Felix Cheung 
wrote:

> Well, uber jar works in YARN, but not with standalone ;)
>
>
>
>
>
> On Sun, Sep 18, 2016 at 12:44 PM -0700, "Chris Fregly" 
> wrote:
>
> you'll see errors like this...
>
> "java.lang.RuntimeException: java.io.InvalidClassException:
> org.apache.spark.rpc.netty.RequestMessage; local class incompatible:
> stream classdesc serialVersionUID = -2221986757032131007, local class
> serialVersionUID = -5447855329526097695"
>
> ...when mixing versions of spark.
>
> i'm actually seeing this right now while testing across Spark 1.6.1 and
> Spark 2.0.1 for my all-in-one, hybrid cloud/on-premise Spark + Zeppelin +
> Kafka + Kubernetes + Docker + One-Click Spark ML Model Production
> Deployments initiative documented here:
>
> https://github.com/fluxcapacitor/pipeline/wiki/Kubernetes-Docker-Spark-ML
>
> and check out my upcoming meetup on this effort either in-person or
> online:
>
> http://www.meetup.com/Advanced-Spark-and-TensorFlow-
> Meetup/events/233978839/
>
> we're throwing in some GPU/CUDA just to sweeten the offering!  :)
>
> On Sat, Sep 10, 2016 at 2:57 PM, Holden Karau 
> wrote:
>
>> I don't think a 2.0 uber jar will play nicely on a 1.5 standalone
>> cluster.
>>
>>
>> On Saturday, September 10, 2016, Felix Cheung 
>> wrote:
>>
>>> You should be able to get it to work with 2.0 as uber jar.
>>>
>>> What type cluster you are running on? YARN? And what distribution?
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Sep 4, 2016 at 8:48 PM -0700, "Holden Karau" <
>>> hol...@pigscanfly.ca> wrote:
>>>
>>> You really shouldn't mix different versions of Spark between the master
>>> and worker nodes, if your going to upgrade - upgrade all of them. Otherwise
>>> you may get very confusing failures.
>>>
>>> On Monday, September 5, 2016, Rex X  wrote:
>>>
 Wish to use the Pivot Table feature of data frame which is available
 since Spark 1.6. But the spark of current cluster is version 1.5. Can we
 install Spark 2.0 on the master node to work around this?

 Thanks!

>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>>
>
>
> --
> *Chris Fregly*
> Research Scientist @ *PipelineIO* 
> *Advanced Spark and TensorFlow Meetup*
> 
> *San Francisco* | *Chicago* | *Washington DC*
>
>
>
>
>


Re: Running jobs against remote cluster from scala eclipse ide

2016-09-26 Thread Jacek Laskowski
Hi,

Remove

.setMaster("spark://spark-437-1-5963003:7077").set("spark.driver.host", "11.104.29.106")

and start over.
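
That is, create the context without hard-coding the master or the driver host,
and supply the master at launch time instead (a sketch; app.jar stands for your
assembled jar):

// In code: only the app name.
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)

// At launch time, e.g.:
//   spark-submit --master spark://spark-437-1-5963003:7077 --class RatingsCounter app.jar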

Can you also run the following command to check out Spark Standalone:

run-example --master spark://spark-437-1-5963003:7077 SparkPi

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Mon, Sep 26, 2016 at 5:34 PM, vr spark  wrote:
> Hi,
> I use scala IDE for eclipse. I usually run job against my local spark
> installed on my mac and then export the jars and copy it to spark cluster of
> my company and run spark submit on it.
> This works fine.
>
> But i want to run the jobs from scala ide directly using the spark cluster
> of my company.
> the spark master url of my company cluster is
> spark://spark-437-1-5963003:7077.
> one of the worker nodes of that cluster is 11.104.29.106
>
> I tried this option,  but getting error
>
>   val conf = new SparkConf().setAppName("Simple
> Application").setMaster("spark://spark-437-1-5963003:7077").
> set("spark.driver.host","11.104.29.106")
>
> please let me know.
>
> 16/09/25 08:51:51 INFO SparkContext: Running Spark version 2.0.0
>
> 16/09/25 08:51:51 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> 16/09/25 08:51:52 INFO SecurityManager: Changing view acls to: vr
>
> 16/09/25 08:51:52 INFO SecurityManager: Changing modify acls to: vr
>
> 16/09/25 08:51:52 INFO SecurityManager: Changing view acls groups to:
>
> 16/09/25 08:51:52 INFO SecurityManager: Changing modify acls groups to:
>
> 16/09/25 08:51:52 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users  with view permissions: Set(vr); groups
> with view permissions: Set(); users  with modify permissions: Set(vr);
> groups with modify permissions: Set()
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
>
> 16/09/25 08:51:52 ERROR SparkContext: Error initializing SparkContext.
>
> java.net.BindException: Can't assign requested address: Service
> 'sparkDriver' failed after 16 retries! Consider explicitly setting the
> appropriate port for the service 'sparkDriver' (for example spark.ui.port
> for SparkUI) to an available port or increasing spark.port.maxRetries.
>
> at sun.nio.ch.Net.bind0(Native Method)
>
> at sun.nio.ch.Net.bind(Net.java:433)
>
>
>
>
> full class code
>
> object RatingsCounter {
>
>   /** Our main function where the action happens */
>
>   def main(args: Array[String]) {
>
> Logger.getLogger("org").setLevel(Level.INFO)
>
> val conf = new SparkConf().setAppName("Simple
> Application").setMaster("spark://spark-437-1-5963003:7077").
> set("spark.driver.host","11.104.29.106")
>
>
>
>val sc = new SparkContext(conf)
>
> val lines = sc.textFile("u.data")
>
> val ratings = lines.map(x => x.toString().split("\t")(2))
>
> val results = ratings.countByValue()
>
> val sortedResults = results.toSeq.sortBy(_._1)
>
> sortedResults.foreach(println)
>
>   }
>
> }
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Running jobs against remote cluster from scala eclipse ide

2016-09-26 Thread vr spark
Hi,
I use the Scala IDE for Eclipse. I usually run jobs against my local Spark
installation on my Mac, then export the jars, copy them to my company's Spark
cluster, and run spark-submit there.
This works fine.

But I want to run the jobs from the Scala IDE directly against my company's
Spark cluster.
The Spark master URL of my company's cluster is
spark://spark-437-1-5963003:7077.
One of the worker nodes of that cluster is 11.104.29.106.

I tried this option, but I am getting the error below:

  val conf = new SparkConf().setAppName("Simple Application").setMaster(
"spark://spark-437-1-5963003:7077"). set("spark.driver.host","11.104.29.106"
)

Please let me know.

16/09/25 08:51:51 INFO SparkContext: Running Spark version 2.0.0

16/09/25 08:51:51 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

16/09/25 08:51:52 INFO SecurityManager: Changing view acls to: vr

16/09/25 08:51:52 INFO SecurityManager: Changing modify acls to: vr

16/09/25 08:51:52 INFO SecurityManager: Changing view acls groups to:

16/09/25 08:51:52 INFO SecurityManager: Changing modify acls groups to:

16/09/25 08:51:52 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users  with view permissions: Set(vr); groups
with view permissions: Set(); users  with modify permissions: Set(vr);
groups with modify permissions: Set()

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 WARN Utils: Service 'sparkDriver' could not bind on port
0. Attempting port 1.

16/09/25 08:51:52 ERROR SparkContext: Error initializing SparkContext.

java.net.BindException: Can't assign requested address: Service
'sparkDriver' failed after 16 retries! Consider explicitly setting the
appropriate port for the service 'sparkDriver' (for example spark.ui.port
for SparkUI) to an available port or increasing spark.port.maxRetries.

at sun.nio.ch.Net.bind0(Native Method)

at sun.nio.ch.Net.bind(Net.java:433)




*full class code*

object RatingsCounter {

  /** Our main function where the action happens */

  def main(args: Array[String]) {

Logger.getLogger("org").setLevel(Level.INFO)

val conf = new SparkConf().setAppName("Simple Application").
setMaster("spark://spark-437-1-5963003:7077"). set("spark.driver.host",
"11.104.29.106")



   val sc = new SparkContext(conf)

val lines = sc.textFile("u.data")

val ratings = lines.map(x => x.toString().split("\t")(2))

val results = ratings.countByValue()

val sortedResults = results.toSeq.sortBy(_._1)

sortedResults.foreach(println)

  }

}
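
The BindException above usually means spark.driver.host points at an address the
local machine cannot bind to: it must be an address of the host actually running
the driver (here, the machine running the IDE), not a worker node, and it must be
reachable from the cluster. A minimal sketch, assuming the IDE machine's address
is 192.168.1.50 (an illustrative placeholder):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("spark://spark-437-1-5963003:7077")
  // Address of the machine running this driver (the IDE host), not a worker;
  // 192.168.1.50 is an illustrative placeholder.
  .set("spark.driver.host", "192.168.1.50")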


unsubscribe

2016-09-26 Thread Karthikeyan Vasuki Balasubramaniam
unsubscribe

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: how to find NaN values of each row of spark dataframe to decide whether the row is dropped or not

2016-09-26 Thread Peyman Mohajerian
Also take a look at this API:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
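
For reference, a minimal Scala sketch of that approach (column names follow the
SQL example quoted below; the threshold and fill value are illustrative — PySpark
exposes the same operations as df.dropna, df.fillna and functions.isnan):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, isnan, when}

val spark = SparkSession.builder().appName("nan-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative data: three numeric columns that may contain NaN.
val df = Seq(
  (1.0, 2.0, Double.NaN),
  (Double.NaN, Double.NaN, Double.NaN)
).toDF("field1", "field2", "field3")

// Count NaN/null values per row, then keep or fill based on a threshold.
val cols = Seq("field1", "field2", "field3")
val nanCount = cols.map(c => when(col(c).isNull || isnan(col(c)), 1).otherwise(0)).reduce(_ + _)
val threshold = 2 // illustrative: drop a row once it has this many NaNs

val kept = df.withColumn("nan_count", nanCount).filter(col("nan_count") < threshold)
val filled = kept.na.fill(0.0, cols).drop("nan_count") // illustrative fill value

// DataFrameNaFunctions can also drop directly by the number of non-null values:
val dropped = df.na.drop(minNonNulls = cols.size - threshold + 1)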

On Mon, Sep 26, 2016 at 1:09 AM, Bedrytski Aliaksandr 
wrote:

> Hi Muhammet,
>
> python also supports sql queries http://spark.apache.org/docs/latest/sql-
> programming-guide.html#running-sql-queries-programmatically
>
> Regards,
> --
>   Bedrytski Aliaksandr
>   sp...@bedryt.ski
>
>
>
> On Mon, Sep 26, 2016, at 10:01, muhammet pakyürek wrote:
>
>
>
>
> but my request is related to Python, because I have designed a preprocessing
>  step for data which looks for rows containing NaN values. If the number of NaNs
> is above the threshold, the row is deleted; otherwise it is filled with a
> predicted value. Therefore I need a Python version of this process
>
>
> --
>
> *From:* Bedrytski Aliaksandr 
> *Sent:* Monday, September 26, 2016 7:53 AM
> *To:* muhammet pakyürek
> *Cc:* user@spark.apache.org
> *Subject:* Re: how to find NaN values of each row of spark dataframe to
> decide whether the row is dropped or not
>
> Hi Muhammet,
>
> have you tried to use sql queries?
>
> spark.sql("""
> SELECT
> field1,
> field2,
> field3
>FROM table1
>WHERE
> field1 != 'Nan' AND
> field2 != 'Nan' AND
> field3 != 'Nan'
> """)
>
>
> This query filters rows containing Nan for a table with 3 columns.
>
> Regards,
> --
>   Bedrytski Aliaksandr
>   sp...@bedryt.ski
>
>
>
> On Mon, Sep 26, 2016, at 09:30, muhammet pakyürek wrote:
>
>
> Is there any way to do this directly? If not, is there any way to do this
> indirectly using another data structure of Spark?
>
>
>
>


Re: SparkLauncher not receiving events

2016-09-26 Thread Mariano Semelman
Solved.
tl;dr: I was using port 6066 instead of 7077.


I got confused because of this message in the log when I submitted to the
legacy port:

[info] - org.apache.spark.launcher.app.ActivitiesSortingAggregateJob -
16/09/26 11:43:27 WARN RestSubmissionClient: Unable to connect to server
spark://ponyo:7077.
[info] - org.apache.spark.launcher.app.ActivitiesSortingAggregateJob -
Warning: Master endpoint spark://ponyo:7077 was not a REST server. Falling
back to legacy submission gateway instead.

That's why I started using 6066 (the REST port) some time ago.

But it seems that, for some reason unknown to me, SparkLauncher listeners
don't work with the REST port; none of the logs (i.e. client, master,
worker, driver) show anything at all.

Well, I hope I didn't make anyone spend too much time on this, and that this
saves some other soul a few days.
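
For reference, a minimal sketch of wiring a listener through SparkLauncher
against the legacy submission port, as described above (the jar path, main class
and master host are illustrative placeholders):

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LaunchWithListener {
  def main(args: Array[String]): Unit = {
    // SPARK_HOME must be set in the environment, or passed via .setSparkHome(...).
    val handle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")  // illustrative
      .setMainClass("com.example.MyJob")   // illustrative
      .setMaster("spark://ponyo:7077")     // legacy port, not 6066 (REST gateway)
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit =
          println(s"state changed: ${h.getState}")
        override def infoChanged(h: SparkAppHandle): Unit =
          println(s"app id: ${h.getAppId}")
      })

    // Keep the launching JVM alive until the application reaches a final state.
    while (!handle.getState.isFinal) Thread.sleep(1000)
  }
}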
Mariano


On 26 September 2016 at 11:37, Mariano Semelman <
mariano.semel...@despegar.com> wrote:

> Hello,
>
> I'm having problems receiving events from the submitted app. The app
> successfully submits, but the listener I'm passing to SparkLauncher is not
> receiving events.
>
> Spark Version: 1.6.1 (both client app and master)
>
> here are the relevant snippets I'm using in my code:
> https://gist.github.com/msemelman/d9d2b54ce0c01af8952bc298058dc0d5
>
> BTW the logger is correctly configured if you ask for that. I even
> debugged the spark code and put some breakpoints
> in org.apache.spark.launcher.LauncherServer.ServerConnection#handle to
> check if any message were coming but nothing.
>
> Here it is the verbose output of the submission:
> https://gist.github.com/msemelman/2367fe4899886e2f2e38c87edfe4e9a9
> And what the master shows:
> https://www.dropbox.com/s/afmhnn3ra9wndca/Screenshot%20Master.png?dl=1
>
>
> Any help would be much appreciated, even if you told me where to debug,
> I'm a bit lost here.
>
> Thanks in advance
>


SparkLauncher not receiving events

2016-09-26 Thread Mariano Semelman
Hello,

I'm having problems receiving events from the submitted app. The app
successfully submits, but the listener I'm passing to SparkLauncher is not
receiving events.

Spark Version: 1.6.1 (both client app and master)

here are the relevant snippets I'm using in my code:
https://gist.github.com/msemelman/d9d2b54ce0c01af8952bc298058dc0d5

BTW, the logger is correctly configured, in case you were wondering. I even debugged
the Spark code and put some breakpoints
in org.apache.spark.launcher.LauncherServer.ServerConnection#handle to
check if any messages were coming, but nothing.

Here it is the verbose output of the submission:
https://gist.github.com/msemelman/2367fe4899886e2f2e38c87edfe4e9a9
And what the master shows:
https://www.dropbox.com/s/afmhnn3ra9wndca/Screenshot%20Master.png?dl=1


Any help would be much appreciated, even just a pointer on where to debug; I'm
a bit lost here.

Thanks in advance


Please unsubscribe me from this mailing list

2016-09-26 Thread Hogancamp, Aaron
Please unsubscribe 
aaron.t.hoganc...@leidos.com from this 
mailing list.

Thanks,

Aaron Hogancamp
Data Scientist
(615) 431-3229 (desk)
(615) 617-7160 (mobile)



Re: how to decide which part of process use spark dataframe and pandas dataframe?

2016-09-26 Thread Peyman Mohajerian
A simple way to do that is to collect the data on the driver when you need to
use Python pandas.

On Monday, September 26, 2016, muhammet pakyürek  wrote:

>
>
> is there a clear guide to decide the above?
>


Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-26 Thread Peter Figliozzi
Thanks again Piotr.  It's good to know there are a number of options.  Once
again I'm glad I put all my workers on the same ethernet switch, as
unanticipated shuffling isn't so bad.
Sincerely,
Pete

On Mon, Sep 26, 2016 at 8:35 AM, Piotr Smoliński <
piotr.smolinski...@gmail.com> wrote:

> Best, you should write to HDFS or when you test the product with no HDFS
> available just create a shared
> filesystem (windows shares, nfs, etc.) where the data will be written.
>
> You'll still end up with many files, but this time there will be only one
> directory tree.
>
> You may reduce the number of files by:
> * combining partitions on the same executor with coalesce call
> * repartitioning the RDD (DataFrame or DataSet depending on the API you
> use)
>
> The latter one is useful when you write the data to a partitioned
> structure. Note that repartitioning
> is explicit shuffle.
>
> If you want to have only single file you need to repartition the whole RDD
> to single partition.
> Depending on the result data size it may be something that you want or do
> not want to do ;-)
>
> Regards,
> Piotr
>
>
>
> On Mon, Sep 26, 2016 at 2:30 PM, Peter Figliozzi  > wrote:
>
>> Thank you Piotr, that's what happened.  In fact, there are about 100
>> files on each worker node in a directory corresponding to the write.
>>
>> Any way to tone that down a bit (maybe 1 file per worker)?  Or, write a
>> single file somewhere?
>>
>>
>> On Mon, Sep 26, 2016 at 12:44 AM, Piotr Smoliński <
>> piotr.smolinski...@gmail.com> wrote:
>>
>>> Hi Peter,
>>>
>>> The blank file _SUCCESS indicates properly finished output operation.
>>>
>>> What is the topology of your application?
>>> I presume, you write to local filesystem and have more than one worker
>>> machine.
>>> In such case Spark will write the result files for each partition (in
>>> the worker which
>>> holds it) and complete operation writing the _SUCCESS in the driver node.
>>>
>>> Cheers,
>>> Piotr
>>>
>>>
>>> On Mon, Sep 26, 2016 at 4:56 AM, Peter Figliozzi <
>>> pete.figlio...@gmail.com> wrote:
>>>
 Both

 df.write.csv("/path/to/foo")

 and

 df.write.format("com.databricks.spark.csv").save("/path/to/foo")

 results in a *blank* file called "_SUCCESS" under /path/to/foo.

 My df has stuff in it.. tried this with both my real df, and a quick df
 constructed from literals.

 Why isn't it writing anything?

 Thanks,

 Pete

>>>
>>>
>>
>


Re: udf forces usage of Row for complex types?

2016-09-26 Thread Koert Kuipers
Case classes are serializable by default (they extend java Serializable
trait)

I am not using RDD or Dataset because I need to transform one column out of
200 or so.

Dataset has the mechanisms to convert rows to case classes as needed (and
make sure it's consistent with the schema). Why would this code not be used
to make udfs a lot nicer?

On Sep 26, 2016 1:16 AM, "Bedrytski Aliaksandr"  wrote:

> Hi Koert,
>
> these case classes you are talking about should be serializable to be
> efficient (like Kryo or just plain Java serialization).
>
> DataFrame is not simply a collection of Rows (which are serializeable by
> default), it also contains a schema with different type for each column.
> This way any columnar data may be represented without creating custom case
> classes each time.
>
> If you want to manipulate a collection of case classes, why not use good
> old RDDs? (Or DataSets if you are using Spark 2.0)
> If you want to use sql against that collection, you will need to explain
> to your application how to read it as a table (by transforming it to a
> DataFrame)
>
> Regards
> --
>   Bedrytski Aliaksandr
>   sp...@bedryt.ski
>
>
>
> On Sun, Sep 25, 2016, at 23:41, Koert Kuipers wrote:
>
> after having gotten used to have case classes represent complex structures
> in Datasets, i am surprised to find out that when i work in DataFrames with
> udfs no such magic exists, and i have to fall back to manipulating Row
> objects, which is error prone and somewhat ugly.
> for example:
> case class Person(name: String, age: Int)
>
> val df = Seq((Person("john", 33), 5), (Person("mike", 30),
> 6)).toDF("person", "id")
> val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age = p.age
> + 1) }).apply(col("person")))
> df1.printSchema
> df1.show
> leads to:
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
> cannot be cast to Person
>
>
>


Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-26 Thread Piotr Smoliński
Best, you should write to HDFS; or, when you test the product with no HDFS
available, just create a shared
filesystem (Windows shares, NFS, etc.) where the data will be written.

You'll still end up with many files, but this time there will be only one
directory tree.

You may reduce the number of files by:
* combining partitions on the same executor with coalesce call
* repartitioning the RDD (DataFrame or DataSet depending on the API you use)

The latter one is useful when you write the data to a partitioned
structure. Note that repartitioning
is an explicit shuffle.

If you want to have only a single file you need to repartition the whole RDD
to a single partition.
Depending on the result data size it may be something that you want or do
not want to do ;-)
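
A minimal sketch of both options, with illustrative data and output paths:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-write-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value") // illustrative data

// Fewer part files: coalesce merges existing partitions without a full shuffle.
df.coalesce(2).write.csv("/path/to/foo_coalesced")

// Exactly one part file (plus _SUCCESS): repartition(1) is an explicit full shuffle.
df.repartition(1).write.csv("/path/to/foo_single")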

Regards,
Piotr



On Mon, Sep 26, 2016 at 2:30 PM, Peter Figliozzi 
wrote:

> Thank you Piotr, that's what happened.  In fact, there are about 100 files
> on each worker node in a directory corresponding to the write.
>
> Any way to tone that down a bit (maybe 1 file per worker)?  Or, write a
> single file somewhere?
>
>
> On Mon, Sep 26, 2016 at 12:44 AM, Piotr Smoliński <
> piotr.smolinski...@gmail.com> wrote:
>
>> Hi Peter,
>>
>> The blank file _SUCCESS indicates properly finished output operation.
>>
>> What is the topology of your application?
>> I presume, you write to local filesystem and have more than one worker
>> machine.
>> In such case Spark will write the result files for each partition (in the
>> worker which
>> holds it) and complete operation writing the _SUCCESS in the driver node.
>>
>> Cheers,
>> Piotr
>>
>>
>> On Mon, Sep 26, 2016 at 4:56 AM, Peter Figliozzi <
>> pete.figlio...@gmail.com> wrote:
>>
>>> Both
>>>
>>> df.write.csv("/path/to/foo")
>>>
>>> and
>>>
>>> df.write.format("com.databricks.spark.csv").save("/path/to/foo")
>>>
>>> results in a *blank* file called "_SUCCESS" under /path/to/foo.
>>>
>>> My df has stuff in it.. tried this with both my real df, and a quick df
>>> constructed from literals.
>>>
>>> Why isn't it writing anything?
>>>
>>> Thanks,
>>>
>>> Pete
>>>
>>
>>
>


Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-26 Thread Peter Figliozzi
Thank you Piotr, that's what happened.  In fact, there are about 100 files
on each worker node in a directory corresponding to the write.

Any way to tone that down a bit (maybe 1 file per worker)?  Or, write a
single file somewhere?


On Mon, Sep 26, 2016 at 12:44 AM, Piotr Smoliński <
piotr.smolinski...@gmail.com> wrote:

> Hi Peter,
>
> The blank file _SUCCESS indicates properly finished output operation.
>
> What is the topology of your application?
> I presume, you write to local filesystem and have more than one worker
> machine.
> In such case Spark will write the result files for each partition (in the
> worker which
> holds it) and complete operation writing the _SUCCESS in the driver node.
>
> Cheers,
> Piotr
>
>
> On Mon, Sep 26, 2016 at 4:56 AM, Peter Figliozzi  > wrote:
>
>> Both
>>
>> df.write.csv("/path/to/foo")
>>
>> and
>>
>> df.write.format("com.databricks.spark.csv").save("/path/to/foo")
>>
>> results in a *blank* file called "_SUCCESS" under /path/to/foo.
>>
>> My df has stuff in it.. tried this with both my real df, and a quick df
>> constructed from literals.
>>
>> Why isn't it writing anything?
>>
>> Thanks,
>>
>> Pete
>>
>
>


Re: Can Spark Streaming 2.0 work with Kafka 0.10?

2016-09-26 Thread Cody Koeninger
Either artifact should work with 0.10 brokers.  The 0.10 integration has
more features but is still marked experimental.
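
For reference, hedged sbt coordinates for the two artifacts discussed here
(versions follow the thread; adjust to the Spark release actually in use):

libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"
// or the stable 0.8 integration, which also works against 0.10 brokers:
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"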

On Sep 26, 2016 3:41 AM, "Haopu Wang"  wrote:

> Hi, in the official integration guide, it says "Spark Streaming 2.0.0 is
> compatible with Kafka 0.8.2.1."
>
>
>
> However, in maven repository, I can get "spark-streaming-kafka-0-10_2.11"
> which depends on Kafka 0.10.0.0
>
> Is this artifact stable enough? Thank you!
>
>
>
>
>


RE: udf forces usage of Row for complex types?

2016-09-26 Thread ming.he
It should be UserDefinedType.

You can refer to 
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala

From: Koert Kuipers [mailto:ko...@tresata.com]
Sent: Monday, September 26, 2016 5:42 AM
To: user@spark.apache.org
Subject: udf forces usage of Row for complex types?

After having gotten used to having case classes represent complex structures in
Datasets, I am surprised to find out that when I work in DataFrames with udfs
no such magic exists, and I have to fall back to manipulating Row objects,
which is error-prone and somewhat ugly.
for example:
case class Person(name: String, age: Int)

val df = Seq((Person("john", 33), 5), (Person("mike", 30), 6)).toDF("person", 
"id")
val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age = p.age + 1) 
}).apply(col("person")))
df1.printSchema
df1.show
leads to:
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
to Person
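
For reference, a minimal sketch of the Row-based fallback described above,
passing an explicit return type to udf (the schema mirrors the Person case class;
the local master is an illustrative assumption):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("udf-struct-demo").master("local[*]").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)
val df = Seq((Person("john", 33), 5), (Person("mike", 30), 6)).toDF("person", "id")

// A struct column arrives in a UDF as a Row; returning a Row requires an explicit DataType.
val personType = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))

val bumpAge = udf((r: Row) => Row(r.getString(0), r.getInt(1) + 1), personType)

val df1 = df.withColumn("person", bumpAge(col("person")))
df1.printSchema()
df1.show()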


Re: Subscribe

2016-09-26 Thread Amit Sela
Please Subscribe via the mailing list as described here:
http://beam.incubator.apache.org/use/mailing-lists/

On Mon, Sep 26, 2016, 12:11 Lakshmi Rajagopalan  wrote:

>
>


Subscribe

2016-09-26 Thread Lakshmi Rajagopalan



Can Spark Streaming 2.0 work with Kafka 0.10?

2016-09-26 Thread Haopu Wang
Hi, in the official integration guide, it says "Spark Streaming 2.0.0 is
compatible with Kafka 0.8.2.1."

 

However, in maven repository, I can get
"spark-streaming-kafka-0-10_2.11" which depends on Kafka 0.10.0.0

Is this artifact stable enough? Thank you!

 

 



Re: MLib Documentation Update Needed

2016-09-26 Thread Sean Owen
Yes I think that footnote could be a lot more prominent, or pulled up
right under the table.

I also think it would be fine to present the {0,1} formulation. It's
actually more recognizable, I think, for log-loss in that form. It's
probably less recognizable for hinge loss, but, consistency is more
important. There's just an extra (2y-1) term, at worst.

The loss here is per instance, and implicitly summed over all
instances. I think that is probably not confusing for the reader; if
they're reading this at all to double-check just what formulation is
being used, I think they'd know that. But, it's worth a note.

The loss is summed in the case of log-loss, not multiplied (if that's
what you're saying).

Those are decent improvements, feel free to open a pull request / JIRA.
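
For concreteness, the equivalence between the two label conventions for the
log-loss — a sketch using the guide's margin notation $\mathbf{w}^T \mathbf{x}$:

With $y \in \{-1, +1\}$:
$$L(\mathbf{w}; \mathbf{x}, y) = \log\left(1 + \exp(-y\, \mathbf{w}^T \mathbf{x})\right)$$

With labels $y' \in \{0, 1\}$ as stored in spark.mllib, substituting $y = 2y' - 1$:
$$L(\mathbf{w}; \mathbf{x}, y') = \log\left(1 + \exp\left(-(2y' - 1)\, \mathbf{w}^T \mathbf{x}\right)\right)
= -\left[\, y' \log \sigma(\mathbf{w}^T \mathbf{x}) + (1 - y') \log\left(1 - \sigma(\mathbf{w}^T \mathbf{x})\right) \right],
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$

This is the per-instance loss, implicitly summed over instances as noted above.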


On Mon, Sep 26, 2016 at 6:22 AM, Tobi Bosede  wrote:
> The loss function here for logistic regression is confusing. It seems to
> imply that spark uses only -1 and 1 class labels. However it uses 0,1 as the
> very inconspicuous note quoted below (under Classification) says. We need to
> make this point more visible to avoid confusion.
>
> Better yet, we should replace the loss function listed with that for 0, 1 no
> matter how mathematically inconvenient, since that is what is actually
> implemented in Spark.
>
> More problematic, the loss function (even in this "convenient" form) is
> actually incorrect. This is because it is missing either a summation (sigma)
> in the log or product (pi) outside the log, as the loss for logistic is the
> log likelihood. So there are multiple problems with the documentation.
> Please advise on steps to fix for all version documentation or if there are
> already some in place.
>
> "Note that, in the mathematical formulation in this guide, a binary label
> y is denoted as either +1 (positive) or −1 (negative), which is convenient
> for the formulation. However, the negative label is represented by 0 in
> spark.mllib instead of −1, to be consistent with multiclass labeling."

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: how to find NaN values of each row of spark dataframe to decide whether the row is dropped or not

2016-09-26 Thread Bedrytski Aliaksandr
Hi Muhammet,

Python also supports SQL queries:
http://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-queries-programmatically

Regards,
--
  Bedrytski Aliaksandr
  sp...@bedryt.ski



On Mon, Sep 26, 2016, at 10:01, muhammet pakyürek wrote:
>
>
>
> but my request is related to Python, because I have designed a preprocessing
> step for data which looks for rows containing NaN values. If the number of
> NaNs is above the threshold, the row is deleted; otherwise it is filled with a
> predicted value. Therefore I need a Python version of this process
>
>
>
> *From:* Bedrytski Aliaksandr  *Sent:* Monday,
> September 26, 2016 7:53 AM *To:* muhammet pakyürek *Cc:*
> user@spark.apache.org *Subject:* Re: how to find NaN values of each
> row of spark dataframe to decide whether the row is dropped or not
>
> Hi Muhammet,
>
> have you tried to use sql queries?
>
>> spark.sql("""
>> SELECT
>> field1,
>> field2,
>> field3
>>FROM table1
>>WHERE
>> field1 != 'Nan' AND
>> field2 != 'Nan' AND
>> field3 != 'Nan'
>> """)
>
> This query filters rows containing Nan for a table with 3 columns.
>
> Regards,
> --
>   Bedrytski Aliaksandr
>   sp...@bedryt.ski
>
>
>
> On Mon, Sep 26, 2016, at 09:30, muhammet pakyürek wrote:
>>
>> Is there any way to do this directly? If not, is there any way to do
>> this indirectly using another data structure of Spark?
>>
>


how to decide which part of process use spark dataframe and pandas dataframe?

2016-09-26 Thread muhammet pakyürek


Is there a clear guide to decide the above?


Re: Extract timestamp from Kafka message

2016-09-26 Thread Alonso Isidoro Roman
Hum, I think you have to embed the timestamp within the message...

Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman
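
If switching to the separate spark-streaming-kafka-0-10 integration (which needs
0.10+ brokers and is still marked experimental) is an option, each ConsumerRecord
exposes the timestamp directly, so it does not have to be embedded in the payload.
A hedged Scala sketch, with broker address, group id and topic name as
illustrative placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val conf = new SparkConf().setAppName("kafka-ts-demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",   // illustrative
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "ts-demo")                   // illustrative

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("mytopic"), kafkaParams))

// ConsumerRecord carries the record timestamp for Kafka >= 0.10.
stream.map(record => (record.timestamp(), record.value())).print()

ssc.start()
ssc.awaitTermination()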


2016-09-26 0:59 GMT+02:00 Kevin Tran :

> Hi Everyone,
> Does anyone know how could we extract timestamp from Kafka message in
> Spark streaming ?
>
> JavaPairInputDStream messagesDStream =
> KafkaUtils.createDirectStream(
>ssc,
>String.class,
>String.class,
>StringDecoder.class,
>StringDecoder.class,
>kafkaParams,
>topics
>);
>
>
> Thanks,
> Kevin.
>
>
>
>
>
>
>


Re: how to find NaN values of each row of spark dataframe to decide whether the row is dropped or not

2016-09-26 Thread Bedrytski Aliaksandr
Hi Muhammet,

have you tried to use sql queries?

> spark.sql("""
> SELECT
> field1,
> field2,
> field3
>FROM table1
>WHERE
> field1 != 'Nan' AND
> field2 != 'Nan' AND
> field3 != 'Nan'
> """)

This query filters rows containing Nan for a table with 3 columns.

Regards,
--
  Bedrytski Aliaksandr
  sp...@bedryt.ski



On Mon, Sep 26, 2016, at 09:30, muhammet pakyürek wrote:
>
> Is there any way to do this directly? If not, is there any way to do
> this indirectly using another data structure of Spark?
>


how to find NaN values of each row of spark dataframe to decide whether the row is dropped or not

2016-09-26 Thread muhammet pakyürek

Is there any way to do this directly? If not, is there any way to do this
indirectly using another data structure of Spark?