RE: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-12 Thread Amit Singh Hora
This property already exists.

-Original Message-
From: "ashesh_28 [via Apache Spark User List]" 

Sent: ‎4/‎13/‎2016 11:02 AM
To: "Amit Singh Hora" 
Subject: Re: Unable to Access files in Hadoop HA enabled from using Spark

Try adding the following property to hdfs-site.xml (hdpha being the HA nameservice ID used in the question):

<property>
   <name>dfs.client.failover.proxy.provider.hdpha</name>
   <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-Access-files-in-Hadoop-HA-enabled-from-using-Spark-tp26768p26770.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-12 Thread ashesh_28
Try adding the following property to hdfs-site.xml (hdpha being the HA nameservice ID used in the question):

<property>
   <name>dfs.client.failover.proxy.provider.hdpha</name>
   <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-Access-files-in-Hadoop-HA-enabled-from-using-Spark-tp26768p26769.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Unable to Access files in Hadoop HA enabled from using Spark

2016-04-12 Thread Amit Singh Hora
I am trying to access a directory in HDFS from my Spark code on my local
machine. Hadoop is HA enabled.

val conf = new SparkConf().setAppName("LDA Sample").setMaster("local[2]")
val sc = new SparkContext(conf)
val distFile = sc.textFile("hdfs://hdpha/mini_newsgroups/")
println(distFile.count())
but I am getting this error:

java.net.UnknownHostException: hdpha
hdpha does not resolve to a particular machine; it is the logical name I have chosen
for my HA Hadoop nameservice. I have already copied all the Hadoop configuration files to my
local machine and set the HADOOP_CONF_DIR environment variable, but still no success.

Any suggestion would be of great help.

Note: Hadoop HA itself is working properly, as I have tried uploading a file to
HDFS and it works.
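
For reference, a minimal sketch of supplying the HA client settings to the SparkContext programmatically instead of relying only on HADOOP_CONF_DIR; the NameNode host names and ports below are placeholders and must match the cluster's hdfs-site.xml:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("LDA Sample").setMaster("local[2]")
val sc = new SparkContext(conf)

// HA client settings for the "hdpha" nameservice; hostnames/ports are placeholders.
val hc = sc.hadoopConfiguration
hc.set("fs.defaultFS", "hdfs://hdpha")
hc.set("dfs.nameservices", "hdpha")
hc.set("dfs.ha.namenodes.hdpha", "nn1,nn2")
hc.set("dfs.namenode.rpc-address.hdpha.nn1", "namenode1.example.com:8020")
hc.set("dfs.namenode.rpc-address.hdpha.nn2", "namenode2.example.com:8020")
hc.set("dfs.client.failover.proxy.provider.hdpha",
  "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

val distFile = sc.textFile("hdfs://hdpha/mini_newsgroups/")
println(distFile.count())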



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-Access-files-in-Hadoop-HA-enabled-from-using-Spark-tp26768.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [ASK]: Dataframe number of column limit in Spark 1.5.2

2016-04-12 Thread mdkhajaasmath
I am also looking for the same information. In my case I need to create 190
columns.

Sent from my iPhone

> On Apr 12, 2016, at 9:49 PM, Divya Gehlot  wrote:
> 
> Hi,
> I would like to know whether the Spark DataFrame API has a limit on the
> number of columns that can be created?
> 
> Thanks,
> Divya 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




[ASK]: Dataframe number of column limit in Spark 1.5.2

2016-04-12 Thread Divya Gehlot
Hi,
I would like to know whether the Spark DataFrame API has a limit on the
number of columns that can be created?

Thanks,
Divya
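
For context, a minimal sketch of building a wide DataFrame (190 columns here) programmatically with the Spark 1.5-era API, assuming an existing SparkContext sc and SQLContext sqlContext; the column names and values are made up:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// A schema with 190 numeric columns named c0..c189 (names are illustrative).
val schema = StructType((0 until 190).map(i => StructField(s"c$i", DoubleType, nullable = false)))

// A tiny RDD of rows matching that schema.
val rows = sc.parallelize(Seq.fill(3)(Row.fromSeq(Seq.fill(190)(0.0))))

val wideDf = sqlContext.createDataFrame(rows, schema)
println(wideDf.columns.length)  // 190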


Re: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Michael Armbrust
You don't need multiple contexts to do this:
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
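
For reference, a minimal sketch of reading Hive-managed data and an external RDBMS from the same HiveContext in Spark 1.6; the JDBC URL, driver class, table names, and join key are placeholders:

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Hive-managed table (placeholder name).
val hiveDf = sqlContext.table("some_hive_table")

// External database through the JDBC data source (URL/table/driver are placeholders).
val jdbcDf = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
  .option("dbtable", "SOME_SCHEMA.SOME_TABLE")
  .option("driver", "oracle.jdbc.OracleDriver")
  .load()

hiveDf.join(jdbcDf, "id").show()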

On Tue, Apr 12, 2016 at 4:05 PM, Michael Segel 
wrote:

> Reading from multiple sources within the same application?
>
> How would you connect to Hive for some data and then reach out to lets say
> Oracle or DB2 for some other data that you may want but isn’t available on
> your cluster?
>
>
> On Apr 12, 2016, at 10:52 AM, Michael Armbrust 
> wrote:
>
> You can, but I'm not sure why you would want to.  If you want to isolate
> different users just use hiveContext.newSession().
>
> On Tue, Apr 12, 2016 at 1:48 AM, Natu Lauchande 
> wrote:
>
>> Hi,
>>
>> Is it possible to have both a sqlContext and a hiveContext in the same
>> application ?
>>
>> If yes would there be any performance pernalties of doing so.
>>
>> Regards,
>> Natu
>>
>
>
>


RE: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-12 Thread Sun, Rui
Spark configurations specified at the command line for spark-submit should be 
passed to the JVM inside Julia process. You can refer to 
https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L267
 and 
https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L295
Generally:
spark-submit JVM -> JuliaRunner -> env var like “JULIA_SUBMIT_ARGS” -> julia process -> new JVM with SparkContext
Julia can pick up the env var, set the system properties or directly fill
the configurations into a SparkConf, and then create a SparkContext.
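
For reference, a minimal sketch of that last step on the Julia-launched JVM side, assuming a hypothetical JULIA_SUBMIT_ARGS variable of the form "--master yarn --deploy-mode client"; the parsing is deliberately simplistic:

import org.apache.spark.{SparkConf, SparkContext}

// Pick up the args exported by the hypothetical JuliaRunner and fold them into SparkConf.
val submitArgs = sys.env.getOrElse("JULIA_SUBMIT_ARGS", "")
val conf = new SparkConf().setAppName("julia-driver")

submitArgs.split("\\s+").sliding(2, 2).foreach {
  case Array("--master", value)      => conf.setMaster(value)
  case Array("--deploy-mode", value) => conf.set("spark.submit.deployMode", value)
  case _                             => // ignore anything this sketch does not handle
}

val sc = new SparkContext(conf)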

Yes, you are right, `spark-submit` creates new Python/R process that connects 
back to that same JVM and creates SparkContext in it.
Refer to 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L47
 and
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/RRunner.scala#L65


From: Andrei [mailto:faithlessfri...@gmail.com]
Sent: Wednesday, April 13, 2016 4:32 AM
To: Sun, Rui 
Cc: user 
Subject: Re: How does spark-submit handle Python scripts (and how to repeat it)?

One part is passing the command line options, like “--master”, from the JVM 
launched by spark-submit to the JVM where SparkContext resides

Since I have full control over both - JVM and Julia parts - I can pass whatever 
options to both. But what exactly should be passed? Currently pipeline looks 
like this:

spark-submit JVM -> JuliaRunner -> julia process -> new JVM with SparkContext

 I want to make the last JVM's SparkContext to understand that it should run on 
YARN. Obviously, I can't pass `--master yarn` option to JVM itself. Instead, I 
can pass system property "spark.master" = "yarn-client", but this results in an 
error:

Retrying connect to server: 0.0.0.0/0.0.0.0:8032


So it's definitely not enough. I tried to set manually all system properties 
that `spark-submit` adds to the JVM (including "spark-submit=true", 
"spark.submit.deployMode=client", etc.), but it didn't help too. Source code is 
always good, but for a stranger like me it's a little bit hard to grasp control 
flow in SparkSubmit class.


For pySpark & SparkR, when running scripts in client deployment modes 
(standalone client and yarn client), the JVM is the same (py4j/RBackend running 
as a thread in the JVM launched by spark-submit)

Can you elaborate on this? Does it mean that `spark-submit` creates new 
Python/R process that connects back to that same JVM and creates SparkContext 
in it?


On Tue, Apr 12, 2016 at 2:04 PM, Sun, Rui 
> wrote:
There is much deployment preparation work handling different deployment modes 
for pyspark and SparkR in SparkSubmit. It is difficult to summarize it briefly, 
you had better refer to the source code.

Supporting running Julia scripts in SparkSubmit is more than implementing a 
‘JuliaRunner’. One part is passing the command line options, like “--master”, 
from the JVM launched by spark-submit to the JVM where SparkContext resides, in 
the case that the two JVMs are not the same. For pySpark & SparkR, when running 
scripts in client deployment modes (standalone client and yarn client), the JVM 
is the same (py4j/RBackend running as a thread in the JVM launched by 
spark-submit) , so no need to pass the command line options around. However, in 
your case, Julia interpreter launches an in-process JVM for SparkContext, which 
is a separate JVM from the one launched by spark-submit. So you need a way, 
typically an environment variable, like “SPARKR_SUBMIT_ARGS” for 
SparkR or “PYSPARK_SUBMIT_ARGS” for pyspark, to pass command line args to the 
in-process JVM in the Julia interpreter so that SparkConf can pick the options.

From: Andrei 
[mailto:faithlessfri...@gmail.com]
Sent: Tuesday, April 12, 2016 3:48 AM
To: user >
Subject: How does spark-submit handle Python scripts (and how to repeat it)?

I'm working on a wrapper [1] around Spark for the Julia programming language 
[2] similar to PySpark. I've got it working with Spark Standalone server by 
creating local JVM and setting master programmatically. However, this approach 
doesn't work with YARN (and probably Mesos), which require running via 
`spark-submit`.

In `SparkSubmit` class I see that for Python a special class `PythonRunner` is 
launched, so I tried to do similar `JuliaRunner`, which essentially does the 
following:

val pb = new ProcessBuilder(Seq("julia", juliaScript))
val process = pb.start()
process.waitFor()

where `juliaScript` itself creates new JVM and `SparkContext` inside it WITHOUT 
setting master URL. I then tried to launch this class using

spark-submit --master yarn \
 

Re: How to estimate the size of dataframe using pyspark?

2016-04-12 Thread Buntu Dev
Thanks Davies, I've shared the code snippet and the dataset. Please let me
know if you need any other information.

On Mon, Apr 11, 2016 at 10:44 AM, Davies Liu  wrote:

> That's weird, DataFrame.count() should not require lots of memory on
> driver, could you provide a way to reproduce it (could generate fake
> dataset)?
>
> On Sat, Apr 9, 2016 at 4:33 PM, Buntu Dev  wrote:
> > I've allocated about 4g for the driver. For the count stage, I notice the
> > Shuffle Write to be 13.9 GB.
> >
> > On Sat, Apr 9, 2016 at 11:43 AM, Ndjido Ardo BAR 
> wrote:
> >>
> >> What's the size of your driver?
> >> On Sat, 9 Apr 2016 at 20:33, Buntu Dev  wrote:
> >>>
> >>> Actually, df.show() works displaying 20 rows but df.count() is the one
> >>> which is causing the driver to run out of memory. There are just 3 INT
> >>> columns.
> >>>
> >>> Any idea what could be the reason?
> >>>
> >>> On Sat, Apr 9, 2016 at 10:47 AM,  wrote:
> 
>  You seem to have a lot of column :-) !
>  df.count() displays the size of your data frame.
>  df.columns.size() the number of columns.
> 
>  Finally, I suggest you check the size of your drive and customize it
>  accordingly.
> 
>  Cheers,
> 
>  Ardo
> 
>  Sent from my iPhone
> 
>  > On 09 Apr 2016, at 19:37, bdev  wrote:
>  >
>  > I keep running out of memory on the driver when I attempt to do
>  > df.show().
>  > Can anyone let me know how to estimate the size of the dataframe?
>  >
>  > Thanks!
>  >
>  >
>  >
>  > --
>  > View this message in context:
>  >
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-estimate-the-size-of-dataframe-using-pyspark-tp26729.html
>  > Sent from the Apache Spark User List mailing list archive at
>  > Nabble.com.
>  >
>  >
> -
>  > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>  > For additional commands, e-mail: user-h...@spark.apache.org
>  >
> >>>
> >>>
> >
>
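
For reference, a rough sketch of one way to approximate a DataFrame's in-memory size without collecting it all on the driver: sample a small fraction, measure the sample with SizeEstimator, and scale up. The helper name and fraction are arbitrary, and the result is only an order-of-magnitude estimate:

import org.apache.spark.sql.DataFrame
import org.apache.spark.util.SizeEstimator

// Sample some rows, measure them on the driver, and extrapolate by the sampled fraction.
def approxSizeInBytes(df: DataFrame, fraction: Double = 0.001): Long = {
  val sample = df.sample(withReplacement = false, fraction).collect()
  if (sample.isEmpty) 0L
  else (SizeEstimator.estimate(sample) / fraction).toLong
}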


RE: ML Random Forest Classifier

2016-04-12 Thread Ashic Mahtab
It looks like the issue is around impurity stats. After converting an rf model 
to old, and back to new (without disk storage or anything), and specifying the 
same num of features, same categorical features map, etc., 
DecisionTreeClassifier::predictRaw throws a null pointer exception here:
override protected def predictRaw(features: Vector): Vector = {
  Vectors.dense(rootNode.predictImpl(features).impurityStats.stats.clone())
}
It appears impurityStats is always null (even though impurity does have a 
value). Any known workarounds? It's looking like I'll have to revert to using 
mllib instead :(
-Ashic.
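
For reference, a sketch of one possible workaround while predictRaw is broken for converted models: compute the per-class vote counts directly from the old mllib model's trees. The helper name is made up, and it reproduces only vote counting, not impurity-based class probabilities:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.tree.model.RandomForestModel

// Tally one vote per tree; index i of the result is the number of trees predicting class i.
def rawVotes(oldModel: RandomForestModel, features: Vector, numClasses: Int): Vector = {
  val counts = new Array[Double](numClasses)
  oldModel.trees.foreach { tree =>
    counts(tree.predict(features).toInt) += 1.0
  }
  Vectors.dense(counts)
}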
From: as...@live.com
To: ja...@gluru.co
CC: user@spark.apache.org
Subject: RE: ML Random Forest Classifier
Date: Wed, 13 Apr 2016 02:20:53 +0100




I managed to get to the map using MetadataUtils (it's a private ml package). 
There are still some issues, around feature names, etc. Trying to pin them down.

From: as...@live.com
To: ja...@gluru.co
CC: user@spark.apache.org
Subject: RE: ML Random Forest Classifier
Date: Wed, 13 Apr 2016 00:50:31 +0100




Hi James,Following on from the previous email, is there a way to get the 
categoricalFeatures of a Spark ML Random Forest? Essentially something I can 
pass to
RandomForestClassificationModel.fromOld(oldModel, parent, categoricalFeatures, 
numClasses, numFeatures)
I could construct it by hand, but I was hoping for a more automated way of 
getting the map. Since the trained model already knows about the value, perhaps 
it's possible to grab it for storage?
Thanks,Ashic.

From: as...@live.com
To: ja...@gluru.co
CC: user@spark.apache.org
Subject: RE: ML Random Forest Classifier
Date: Mon, 11 Apr 2016 23:21:53 +0100




Thanks, James. That looks promising. 

Date: Mon, 11 Apr 2016 10:41:07 +0100
Subject: Re: ML Random Forest Classifier
From: ja...@gluru.co
To: as...@live.com
CC: user@spark.apache.org

To add a bit more detail perhaps something like this might work:








package org.apache.spark.ml

import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.mllib.tree.model.{ RandomForestModel => OldRandomForestModel }
import org.apache.spark.ml.classification.RandomForestClassifier

object RandomForestModelConverter {

  def fromOld(oldModel: OldRandomForestModel, parent: RandomForestClassifier = null,
      categoricalFeatures: Map[Int, Int], numClasses: Int,
      numFeatures: Int = -1): RandomForestClassificationModel = {
    RandomForestClassificationModel.fromOld(oldModel, parent, categoricalFeatures,
      numClasses, numFeatures)
  }

  def toOld(newModel: RandomForestClassificationModel): OldRandomForestModel = {
    newModel.toOld
  }
}

Regards,
James 
On 11 April 2016 at 10:36, James Hammerton  wrote:
There are methods for converting the dataframe based random forest models to 
the old RDD based models and vice versa. Perhaps using these will help given 
that the old models can be saved and loaded?
In order to use them however you will need to write code in the 
org.apache.spark.ml package.
I've not actually tried doing this myself but it looks as if it might work.
Regards,
James








On 11 April 2016 at 10:29, Ashic Mahtab  wrote:



Hello,I'm trying to save a pipeline with a random forest classifier. If I try 
to save the pipeline, it complains that the classifier is not Writable, and 
indeed the classifier itself doesn't have a write function. There's a pull 
request that's been merged that enables this for Spark 2.0 (any dates around 
when that'll release?). I am, however, using the Spark Cassandra Connector 
which doesn't seem to be able to create a CqlContext with spark 2.0 snapshot 
builds. Seeing that ML Lib's random forest classifier supports storing and 
loading models, is there a way to create a Spark ML pipeline in Spark 1.6 with 
a random forest classifier that'll allow me to store and load the model? The 
model takes significant amount of time to train, and I really don't want to 
have to train it every time my application launches.
Thanks,Ashic. 





  

RE: ML Random Forest Classifier

2016-04-12 Thread Ashic Mahtab
I managed to get to the map using MetadataUtils (it's a private ml package). 
There are still some issues, around feature names, etc. Trying to pin them down.

From: as...@live.com
To: ja...@gluru.co
CC: user@spark.apache.org
Subject: RE: ML Random Forest Classifier
Date: Wed, 13 Apr 2016 00:50:31 +0100




Hi James,Following on from the previous email, is there a way to get the 
categoricalFeatures of a Spark ML Random Forest? Essentially something I can 
pass to
RandomForestClassificationModel.fromOld(oldModel, parent, categoricalFeatures, 
numClasses, numFeatures)
I could construct it by hand, but I was hoping for a more automated way of 
getting the map. Since the trained model already knows about the value, perhaps 
it's possible to grab it for storage?
Thanks,Ashic.

From: as...@live.com
To: ja...@gluru.co
CC: user@spark.apache.org
Subject: RE: ML Random Forest Classifier
Date: Mon, 11 Apr 2016 23:21:53 +0100




Thanks, James. That looks promising. 

Date: Mon, 11 Apr 2016 10:41:07 +0100
Subject: Re: ML Random Forest Classifier
From: ja...@gluru.co
To: as...@live.com
CC: user@spark.apache.org

To add a bit more detail perhaps something like this might work:








package org.apache.spark.ml

import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.mllib.tree.model.{ RandomForestModel => OldRandomForestModel }
import org.apache.spark.ml.classification.RandomForestClassifier

object RandomForestModelConverter {

  def fromOld(oldModel: OldRandomForestModel, parent: RandomForestClassifier = null,
      categoricalFeatures: Map[Int, Int], numClasses: Int,
      numFeatures: Int = -1): RandomForestClassificationModel = {
    RandomForestClassificationModel.fromOld(oldModel, parent, categoricalFeatures,
      numClasses, numFeatures)
  }

  def toOld(newModel: RandomForestClassificationModel): OldRandomForestModel = {
    newModel.toOld
  }
}

Regards,
James 
On 11 April 2016 at 10:36, James Hammerton  wrote:
There are methods for converting the dataframe based random forest models to 
the old RDD based models and vice versa. Perhaps using these will help given 
that the old models can be saved and loaded?
In order to use them however you will need to write code in the 
org.apache.spark.ml package.
I've not actually tried doing this myself but it looks as if it might work.
Regards,
James








On 11 April 2016 at 10:29, Ashic Mahtab  wrote:



Hello,I'm trying to save a pipeline with a random forest classifier. If I try 
to save the pipeline, it complains that the classifier is not Writable, and 
indeed the classifier itself doesn't have a write function. There's a pull 
request that's been merged that enables this for Spark 2.0 (any dates around 
when that'll release?). I am, however, using the Spark Cassandra Connector 
which doesn't seem to be able to create a CqlContext with spark 2.0 snapshot 
builds. Seeing that ML Lib's random forest classifier supports storing and 
loading models, is there a way to create a Spark ML pipeline in Spark 1.6 with 
a random forest classifier that'll allow me to store and load the model? The 
model takes significant amount of time to train, and I really don't want to 
have to train it every time my application launches.
Thanks,Ashic. 




  

Re: Spark 1.6.1 packages on S3 corrupt?

2016-04-12 Thread Nicholas Chammas
Yes, this is a known issue. The core devs are already aware of it. [CC dev]

FWIW, I believe the Spark 1.6.1 / Hadoop 2.6 package on S3 is not corrupt.
It may be the only 1.6.1 package that is not corrupt, though. :/

Nick


On Tue, Apr 12, 2016 at 9:00 PM Augustus Hong 
wrote:

> Hi all,
>
> I'm trying to launch a cluster with the spark-ec2 script but seeing the
> error below.  Are the packages on S3 corrupted / not in the correct format?
>
> Initializing spark
>
> --2016-04-13 00:25:39--
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop1.tgz
>
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.11.67
>
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.11.67|:80...
> connected.
>
> HTTP request sent, awaiting response... 200 OK
>
> Length: 277258240 (264M) [application/x-compressed]
>
> Saving to: ‘spark-1.6.1-bin-hadoop1.tgz’
>
> 100%[==>]
> 277,258,240 37.6MB/s   in 9.2s
>
> 2016-04-13 00:25:49 (28.8 MB/s) - ‘spark-1.6.1-bin-hadoop1.tgz’ saved
> [277258240/277258240]
>
> Unpacking Spark
>
>
> gzip: stdin: not in gzip format
>
> tar: Child returned status 1
>
> tar: Error is not recoverable: exiting now
>
> mv: missing destination file operand after `spark'
>
> Try `mv --help' for more information.
>
>
>
>
>
>
> --
> [image: Branch] 
> Augustus Hong
> Software Engineer
>
>


Spark 1.6.1 packages on S3 corrupt?

2016-04-12 Thread Augustus Hong
Hi all,

I'm trying to launch a cluster with the spark-ec2 script but seeing the
error below.  Are the packages on S3 corrupted / not in the correct format?

Initializing spark

--2016-04-13 00:25:39--
http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop1.tgz

Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.11.67

Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.11.67|:80...
connected.

HTTP request sent, awaiting response... 200 OK

Length: 277258240 (264M) [application/x-compressed]

Saving to: ‘spark-1.6.1-bin-hadoop1.tgz’

100%[==>]
277,258,240 37.6MB/s   in 9.2s

2016-04-13 00:25:49 (28.8 MB/s) - ‘spark-1.6.1-bin-hadoop1.tgz’ saved
[277258240/277258240]

Unpacking Spark


gzip: stdin: not in gzip format

tar: Child returned status 1

tar: Error is not recoverable: exiting now

mv: missing destination file operand after `spark'

Try `mv --help' for more information.






-- 
[image: Branch] 
Augustus Hong
Software Engineer


RE: ML Random Forest Classifier

2016-04-12 Thread Ashic Mahtab
Hi James,

Following on from the previous email, is there a way to get the categoricalFeatures of a
Spark ML Random Forest? Essentially something I can pass to

RandomForestClassificationModel.fromOld(oldModel, parent, categoricalFeatures, numClasses, numFeatures)

I could construct it by hand, but I was hoping for a more automated way of getting the map.
Since the trained model already knows about the value, perhaps it's possible to grab it for storage?

Thanks,
Ashic.

From: as...@live.com
To: ja...@gluru.co
CC: user@spark.apache.org
Subject: RE: ML Random Forest Classifier
Date: Mon, 11 Apr 2016 23:21:53 +0100




Thanks, James. That looks promising. 

Date: Mon, 11 Apr 2016 10:41:07 +0100
Subject: Re: ML Random Forest Classifier
From: ja...@gluru.co
To: as...@live.com
CC: user@spark.apache.org

To add a bit more detail perhaps something like this might work:








package org.apache.spark.ml

import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.mllib.tree.model.{ RandomForestModel => OldRandomForestModel }
import org.apache.spark.ml.classification.RandomForestClassifier

object RandomForestModelConverter {

  def fromOld(oldModel: OldRandomForestModel, parent: RandomForestClassifier = null,
      categoricalFeatures: Map[Int, Int], numClasses: Int,
      numFeatures: Int = -1): RandomForestClassificationModel = {
    RandomForestClassificationModel.fromOld(oldModel, parent, categoricalFeatures,
      numClasses, numFeatures)
  }

  def toOld(newModel: RandomForestClassificationModel): OldRandomForestModel = {
    newModel.toOld
  }
}

Regards,
James 
On 11 April 2016 at 10:36, James Hammerton  wrote:
There are methods for converting the dataframe based random forest models to 
the old RDD based models and vice versa. Perhaps using these will help given 
that the old models can be saved and loaded?
In order to use them however you will need to write code in the 
org.apache.spark.ml package.
I've not actually tried doing this myself but it looks as if it might work.
Regards,
James








On 11 April 2016 at 10:29, Ashic Mahtab  wrote:



Hello,I'm trying to save a pipeline with a random forest classifier. If I try 
to save the pipeline, it complains that the classifier is not Writable, and 
indeed the classifier itself doesn't have a write function. There's a pull 
request that's been merged that enables this for Spark 2.0 (any dates around 
when that'll release?). I am, however, using the Spark Cassandra Connector 
which doesn't seem to be able to create a CqlContext with spark 2.0 snapshot 
builds. Seeing that ML Lib's random forest classifier supports storing and 
loading models, is there a way to create a Spark ML pipeline in Spark 1.6 with 
a random forest classifier that'll allow me to store and load the model? The 
model takes significant amount of time to train, and I really don't want to 
have to train it every time my application launches.
Thanks,Ashic. 




  

Spark acessing secured HDFS

2016-04-12 Thread vijikarthi
Hello,

I am trying to understand Spark support to access secure HDFS cluster.

My plan is to deploy "Spark on Mesos" which will access a secure HDFS
cluster running elsewhere in the network. I am trying to understand how much
support exists as of now?

My understanding is Spark as of now supports accessing secured Hadoop
cluster only through "Spark on YARN" deployment option where the principal
and keytab can be passed through spark-submit options.

However, spark "local" mode also accepts keytab and principal to support
apps like spark sql.

I have installed Spark (1.6.1) on a machine where Hadoop client is not
installed but copied core-site and hdfs-site to /etc/hadoop/conf and
configured HADOOP_CONF_DIR="/etc/hadoop/conf/" property in spark-env. I have
tested and confirmed "org.apache.spark.examples.HdfsTest" class can access
insecure Hadoop cluster. When I test the same code/configuration with a
secure Hadoop cluster, I am getting a "Can't get Master Kerberos principal
for use as renewer" error message.

I have pasted complete debug log output below. Please let me know if I am
missing any configuration that is causing this issue?

Regards
Vijay

/bin/spark-submit --deploy-mode client --master local --driver-memory=512M
--driver-cores=0.5 --executor-memory 512M --total-executor-cores=1
--principal hdf...@foo.com --keytab
/downloads/spark-1.6.1-bin-hadoop2.6/hdfs.keytab -v --class
org.apache.spark.examples.HdfsTest  lib/spark-examples-1.6.1-hadoop2.6.0.jar 
/sink/names

Using properties file:
/downloads/spark-1.6.1-bin-hadoop2.6/conf/spark-defaults.conf
Adding default property:
spark.executor.uri=http://shinyfeather.com/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
Parsed arguments:
  master  local
  deployMode  client
  executorMemory  512M
  executorCores   null
  totalExecutorCores  1
  propertiesFile 
/downloads/spark-1.6.1-bin-hadoop2.6/conf/spark-defaults.conf
  driverMemory512M
  driverCores 0.5
  driverExtraClassPathnull
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise   false
  queue   null
  numExecutorsnull
  files   null
  pyFiles null
  archivesnull
  mainClass   org.apache.spark.examples.HdfsTest
  primaryResource
file:/downloads/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar
  nameorg.apache.spark.examples.HdfsTest
  childArgs   [/sink/names]
  jarsnull
  packagesnull
  packagesExclusions  null
  repositoriesnull
  verbose true

Spark properties used, including those specified through
 --conf and those from the properties file
/downloads/spark-1.6.1-bin-hadoop2.6/conf/spark-defaults.conf:
  spark.driver.memory -> 512M
  spark.executor.uri ->
http://shinyfeather.com/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz


16/04/12 22:54:06 DEBUG MutableMetricsFactory: field
org.apache.hadoop.metrics2.lib.MutableRate
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with
annotation @org.apache.hadoop.metrics2.annotation.Metric(value=[Rate of
successful kerberos logins and latency (milliseconds)], about=,
valueName=Time, type=DEFAULT, always=false, sampleName=Ops)
16/04/12 22:54:07 DEBUG MutableMetricsFactory: field
org.apache.hadoop.metrics2.lib.MutableRate
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with
annotation @org.apache.hadoop.metrics2.annotation.Metric(value=[Rate of
failed kerberos logins and latency (milliseconds)], about=, valueName=Time,
type=DEFAULT, always=false, sampleName=Ops)
16/04/12 22:54:07 DEBUG MutableMetricsFactory: field
org.apache.hadoop.metrics2.lib.MutableRate
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with
annotation @org.apache.hadoop.metrics2.annotation.Metric(value=[GetGroups],
about=, valueName=Time, type=DEFAULT, always=false, sampleName=Ops)
16/04/12 22:54:07 DEBUG MetricsSystemImpl: UgiMetrics, User and group
related metrics
16/04/12 22:54:07 DEBUG Groups:  Creating new Groups object
16/04/12 22:54:07 DEBUG NativeCodeLoader: Trying to load the custom-built
native-hadoop library...
16/04/12 22:54:07 DEBUG NativeCodeLoader: Failed to load native-hadoop with
error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
16/04/12 22:54:07 DEBUG NativeCodeLoader:
java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
16/04/12 22:54:07 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
16/04/12 22:54:07 DEBUG PerformanceAdvisory: Falling back to shell based
16/04/12 22:54:07 DEBUG JniBasedUnixGroupsMappingWithFallback: Group mapping

how to write pyspark interface to scala code?

2016-04-12 Thread AlexG
I have Scala Spark code for computing a matrix factorization. I'd like to
make it possible to use this code from PySpark, so users can pass in a
python RDD and receive back one without knowing or caring that Scala code is
being called.

Please point me to an example of code (e.g. somewhere in the Spark codebase,
if it's clean enough) from which I can learn how to do this.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-write-pyspark-interface-to-scala-code-tp26765.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark + Secure HDFS Cluster

2016-04-12 Thread Vijay Srinivasaraghavan
Hello,
I am trying to understand Spark support to access secure HDFS cluster.
My plan is to deploy Spark on Mesos which will access a secure HDFS cluster 
running elsewhere in the network. I am trying to understand how much support exists as of now?
My understanding is Spark as of now supports accessing secured Hadoop cluster 
only through "Spark on YARN" deployment option where the principal and keytab 
can be passed through spark-submit options.
However, spark "local" mode also accepts keytab and principal to support apps 
like spark sql.
I have installed Spark (1.6.1) on a machine where Hadoop client is not 
installed but copied core-site and hdfs-site to /etc/hadoop/conf and configured 
HADOOP_CONF_DIR="/etc/hadoop/conf/" property in spark-env. I have tested and 
confirmed "org.apache.spark.examples.HdfsTest" class can access insecure Hadoop 
cluster. When I test the same code/configuration with a secure Hadoop cluster, 
I am getting "Can't get Master Kerberos principal for use as renewer" error 
message. 
I have pasted complete debug log output below. Please let me know if I am 
missing any configuration that is causing this issue?

Regards
Vijay
/bin/spark-submit --deploy-mode client --master local --driver-memory=512M 
--driver-cores=0.5 --executor-memory 512M --total-executor-cores=1 --principal 
hdf...@foo.com --keytab /downloads/spark-1.6.1-bin-hadoop2.6/hdfs.keytab -v 
--class org.apache.spark.examples.HdfsTest  
lib/spark-examples-1.6.1-hadoop2.6.0.jar  /sink/names
Using properties file: /downloads/spark-1.6.1-bin-hadoop2.6/conf/spark-defaults.conf
Adding default property: spark.executor.uri=http://shinyfeather.com/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
Parsed arguments:
  master                  local
  deployMode              client
  executorMemory          512M
  executorCores           null
  totalExecutorCores      1
  propertiesFile          /downloads/spark-1.6.1-bin-hadoop2.6/conf/spark-defaults.conf
  driverMemory            512M
  driverCores             0.5
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               org.apache.spark.examples.HdfsTest
  primaryResource         file:/downloads/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar
  name                    org.apache.spark.examples.HdfsTest
  childArgs               [/sink/names]
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true

Spark properties used, including those specified through --conf and those from
the properties file /downloads/spark-1.6.1-bin-hadoop2.6/conf/spark-defaults.conf:
  spark.driver.memory -> 512M
  spark.executor.uri -> http://shinyfeather.com/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz

16/04/12 22:54:06 DEBUG MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2.annotation.Metric(value=[Rate of successful kerberos logins and latency (milliseconds)], about=, valueName=Time, type=DEFAULT, always=false, sampleName=Ops)
16/04/12 22:54:07 DEBUG MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with annotation @org.apache.hadoop.metrics2.annotation.Metric(value=[Rate of failed kerberos logins and latency (milliseconds)], about=, valueName=Time, type=DEFAULT, always=false, sampleName=Ops)
16/04/12 22:54:07 DEBUG MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with annotation @org.apache.hadoop.metrics2.annotation.Metric(value=[GetGroups], about=, valueName=Time, type=DEFAULT, always=false, sampleName=Ops)
16/04/12 22:54:07 DEBUG MetricsSystemImpl: UgiMetrics, User and group related metrics
16/04/12 22:54:07 DEBUG Groups:  Creating new Groups object
16/04/12 22:54:07 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...
16/04/12 22:54:07 DEBUG NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
16/04/12 22:54:07 DEBUG NativeCodeLoader: java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
16/04/12 22:54:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/12 22:54:07 DEBUG PerformanceAdvisory: Falling back to shell based
16/04/12 22:54:07 DEBUG JniBasedUnixGroupsMappingWithFallback: Group mapping

Fwd: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Michael Segel

Sorry for duplicate(s), I forgot to switch my email address. 

> Begin forwarded message:
> 
> From: Michael Segel 
> Subject: Re: Can i have a hive context and sql context in the same app ?
> Date: April 12, 2016 at 4:05:26 PM MST
> To: Michael Armbrust 
> Cc: Natu Lauchande , "user@spark.apache.org" 
> 
> 
> Reading from multiple sources within the same application? 
> 
> How would you connect to Hive for some data and then reach out to lets say 
> Oracle or DB2 for some other data that you may want but isn’t available on 
> your cluster? 
> 
> 
>> On Apr 12, 2016, at 10:52 AM, Michael Armbrust > > wrote:
>> 
>> You can, but I'm not sure why you would want to.  If you want to isolate 
>> different users just use hiveContext.newSession().
>> 
>> On Tue, Apr 12, 2016 at 1:48 AM, Natu Lauchande > > wrote:
>> Hi,
>> 
>> Is it possible to have both a sqlContext and a hiveContext in the same 
>> application ?
>> 
>> If yes would there be any performance pernalties of doing so.
>> 
>> Regards,
>> Natu
>> 
> 



Silly question...

2016-04-12 Thread Michael Segel
Hi, 
This is probably a silly question on my part… 

I’m looking at the latest (spark 1.6.1 release) and would like to do a build w 
Hive and JDBC support. 

From the documentation, I see two things that make me scratch my head.

1) Scala 2.11 
"Spark does not yet support its JDBC component for Scala 2.11.”

So if we want to use JDBC, don’t use Scala 2.11.x (in this case its 2.11.8)

2) Hive Support
"To enable Hive integration for Spark SQL along with its JDBC server and CLI, 
add the -Phive and -Phive-thriftserver profiles to your existing build options. 
By default Spark will build with Hive 0.13.1 bindings.”

So if we’re looking at a later release of Hive, let's say 1.1.x, do we still use the
-Phive and -Phive-thriftserver profiles? Is there anything else we should consider? 

Just asking because I’ve noticed that this part of the documentation hasn’t 
changed much over the past releases. 

Thanks in Advance, 

-Mike



Re: Old hostname pops up while running Spark app

2016-04-12 Thread Bibudh Lahiri
Hi Ted,

  Thanks for your prompt reply.

  I am afraid clearing the DNS cache did not help. I did the following

  sudo /etc/init.d/dnsmasq restart

 on the two nodes I am using, as I did not have nscd, but still getting the
same error. I am launching the master from 172.26.49.156, whose old name
was IMPETUS-1466, launching one worker from each of 172.26.49.156 and
172.26.49.55,
and launching the app through ./bin/pyspark from 172.26.49.55. I am sending
the detailed stack trace.

Exception in user code:
Traceback (most recent call last):
  File
"/home/impadmin/bibudh/healthcare/code/cloudera_challenge/analyze_anomaly_with_spark.py",
line 121, in anom_with_lr
pat_proc = pycsv.csvToDataFrame(sqlContext, plaintext_rdd, sep = ",")
  File
"/tmp/spark-0fe22b7c-da8a-4971-8fcf-20b43829504b/userFiles-d9a3c3ae-20d4-4476-8026-a225dd746dc4/pyspark_csv.py",
line 53, in csvToDataFrame
column_types = evaluateType(rdd_sql, parseDate)
  File
"/tmp/spark-0fe22b7c-da8a-4971-8fcf-20b43829504b/userFiles-d9a3c3ae-20d4-4476-8026-a225dd746dc4/pyspark_csv.py",
line 179, in evaluateType
return rdd_sql.map(getRowType).reduce(reduceTypes)
  File
"/home/impadmin/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py",
line 797, in reduce
vals = self.mapPartitions(func).collect()
  File
"/home/impadmin/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py",
line 771, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File
"/home/impadmin/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 813, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File
"/home/impadmin/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/utils.py",
line 45, in deco
return f(*a, **kw)
  File
"/home/impadmin/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py",
line 308, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage
2.0 (TID 11, IMPETUS-1466): java.lang.IllegalArgumentException:
java.net.UnknownHostException: IMPETUS-1466
at
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
at
org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:231)
at
org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:139)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:510)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:453)
at
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:136)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166)
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653)
at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389)
at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
at
org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at
org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: 

RE: DStream how many RDD's are created by batch

2016-04-12 Thread David Newberger
Hi Natu,

I believe you are correct one RDD would be created for each file.

Cheers,

David

From: Natu Lauchande [mailto:nlaucha...@gmail.com]
Sent: Tuesday, April 12, 2016 1:48 PM
To: David Newberger
Cc: user@spark.apache.org
Subject: Re: DStream how many RDD's are created by batch

Hi David,
Thanks for you answer.
I have a follow up question :
I am using textFileStream and listening on an S3 bucket for new files to
process. Files are created every 5 minutes and my batch interval is 2 minutes.

Does it mean that each file will be for one RDD ?

Thanks,
Natu

On Tue, Apr 12, 2016 at 7:46 PM, David Newberger 
> wrote:
Hi,

Time is usually the criteria if I’m understanding your question. An RDD is 
created for each batch interval. If your interval is 500ms then an RDD would be 
created every 500ms. If it’s 2 seconds then an RDD is created every 2 seconds.

Cheers,

David

From: Natu Lauchande [mailto:nlaucha...@gmail.com]
Sent: Tuesday, April 12, 2016 7:09 AM
To: user@spark.apache.org
Subject: DStream how many RDD's are created by batch

Hi,
What's the criteria for the number of RDDs created for each micro-batch
iteration?

Thanks,
Natu
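
For reference, a minimal sketch of the setup being discussed, with a 2-minute batch interval; the bucket path and app name are placeholders. Each batch produces one batch RDD built from whatever new files the stream detected during that interval (possibly none):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("s3-file-stream")
val ssc = new StreamingContext(conf, Seconds(120))

// Monitor the directory; files arriving every ~5 minutes mean some 2-minute batches are empty.
val lines = ssc.textFileStream("s3n://my-bucket/incoming/")
lines.foreachRDD { (rdd, time) =>
  println(s"Batch at $time has ${rdd.partitions.length} partitions")
}

ssc.start()
ssc.awaitTermination()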



Re: Old hostname pops up while running Spark app

2016-04-12 Thread Ted Yu
FYI

https://documentation.cpanel.net/display/CKB/How+To+Clear+Your+DNS+Cache#HowToClearYourDNSCache-MacOS
®10.10
https://www.whatsmydns.net/flush-dns.html#linux

On Tue, Apr 12, 2016 at 2:44 PM, Bibudh Lahiri 
wrote:

> Hi,
>
> I am trying to run a piece of code with logistic regression on
> PySpark. I’ve run it successfully on my laptop, and I have run it
> previously on a standalone cluster mode, but the name of the server on
> which I am running it was changed in between (the old name was
> "IMPETUS-1466") by the admin. Now, when I am trying to run, it is
> throwing the following error:
>
> File
> "/home/impadmin/Nikunj/spark-1.6.0/python/lib/pyspark.zip/pyspark/sql/utils.py",
> line 53, in deco
>
> raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
>
> pyspark.sql.utils.IllegalArgumentException:
> u'java.net.UnknownHostException: IMPETUS-1466.
>
>I have changed a few configuration files, and /etc/hosts, and
> regenerated the SSH keys, updated the files .ssh/known_hosts and 
> .ssh/authorized_keys,
> but still this is not getting resolved. Can someone please point out where
> this name is being picked up from?
>
> --
> Bibudh Lahiri
> Data Scientist, Impetus Technolgoies
> 5300 Stevens Creek Blvd
> San Jose, CA 95129
> http://knowthynumbers.blogspot.com/
>
>


Old hostname pops up while running Spark app

2016-04-12 Thread Bibudh Lahiri
Hi,

I am trying to run a piece of code with logistic regression on
PySpark. I’ve run it successfully on my laptop, and I have run it
previously on a standalone cluster mode, but the name of the server on
which I am running it was changed in between (the old name was
"IMPETUS-1466") by the admin. Now, when I am trying to run, it is throwing
the following error:

File
"/home/impadmin/Nikunj/spark-1.6.0/python/lib/pyspark.zip/pyspark/sql/utils.py",
line 53, in deco

raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)

pyspark.sql.utils.IllegalArgumentException:
u'java.net.UnknownHostException: IMPETUS-1466.

   I have changed a few configuration files, and /etc/hosts, and
regenerated the SSH keys, updated the files .ssh/known_hosts and
.ssh/authorized_keys,
but still this is not getting resolved. Can someone please point out where
this name is being picked up from?

-- 
Bibudh Lahiri
Data Scientist, Impetus Technolgoies
5300 Stevens Creek Blvd
San Jose, CA 95129
http://knowthynumbers.blogspot.com/


S3n performance (@AaronDavidson)

2016-04-12 Thread Martin Eden
Hi everyone,

Running on EMR 4.3 with Spark 1.6.0 and the provided S3N native driver I
manage to process approx 1TB of strings inside gzipped parquet in about 50
mins on a 20 node cluster (8 cores, 60Gb ram). That's about 17MBytes/sec
per node.

This seems sub optimal.

The processing is very basic, simple fields extraction from the strings and
a groupBy.

Watching Aaron's talk from the Spark EU Summit:
https://youtu.be/GzG9RTRTFck?t=863

it seems I am hitting the same issues with suboptimal S3 throughput he
mentions there.

I tried different numbers of files for the input data set (more smaller
files vs less larger files) combined with various settings
for fs.s3n.block.size thinking that might help if each mapper streams
larger chunks. It didn't! It actually seems that many small files gives
better performance than less larger ones (of course with oversubscribed
number of tasks/threads).

Similarly to what Aaron is mentioning with oversubscribed tasks/threads we
also become CPU bound (reach 100% cpu utilisation).


Has anyone seen a similar behaviour? How can we optimise this?

Are the improvements mentioned in Aaron's talk now part of S3n or S3a
driver or are they just available under DataBricksCloud? How can we benefit
from those improvements?

Thanks,
Martin

P.S. Have not tried S3a.


Re: JavaRDD with custom class?

2016-04-12 Thread Ted Yu
You can find various examples involving Serializable Java POJO
e.g.
./examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java

Please pastebin some details on 'Task not serializable error'

Thanks

On Tue, Apr 12, 2016 at 12:44 PM, Daniel Valdivia 
wrote:

> Hi,
>
> I'm moving some code from Scala to Java and I just hit a wall where I'm
> trying to move an RDD with a custom data structure to java, but I'm not
> being able to do so:
>
> Scala Code:
>
> case class IncodentDoc(system_id: String, category: String, terms:
> Seq[String])
> var incTup = inc_filtered.map(record => {
>  //some logic
>   TermDoc(sys_id, category, termsSeq)
> })
>
> On Java I'm trying:
>
> class TermDoc implements Serializable  {
> public String system_id;
> public String category;
> public String[] terms;
>
> public TermDoc(String system_id, String category, String[] terms) {
> this.system_id = system_id;
> this.category = category;
> this.terms = terms;
> }
> }
>
> JavaRDD<TermDoc> incTup = inc_filtered.map(record -> {
> //some code
> return new TermDoc(sys_id, category, termsArr);
> });
>
>
> When I run my code, I get hit with a Task not serializable error, what am
> I missing so I can use custom classes inside the RDD just like in scala?
>
> Cheers
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-12 Thread Andrei
>
> One part is passing the command line options, like “--master”, from the
> JVM launched by spark-submit to the JVM where SparkContext resides


Since I have full control over both - JVM and Julia parts - I can pass
whatever options to both. But what exactly should be passed? Currently
pipeline looks like this:

spark-submit JVM -> JuliaRunner -> julia process -> new JVM with
SparkContext


 I want to make the last JVM's SparkContext to understand that it should
run on YARN. Obviously, I can't pass `--master yarn` option to JVM itself.
Instead, I can pass system property "spark.master" = "yarn-client", but
this results in an error:

Retrying connect to server: 0.0.0.0/0.0.0.0:8032



So it's definitely not enough. I tried to set manually all system
properties that `spark-submit` adds to the JVM (including
"spark-submit=true", "spark.submit.deployMode=client", etc.), but it didn't
help too. Source code is always good, but for a stranger like me it's a
little bit hard to grasp control flow in SparkSubmit class.


For pySpark & SparkR, when running scripts in client deployment modes
> (standalone client and yarn client), the JVM is the same (py4j/RBackend
> running as a thread in the JVM launched by spark-submit)


Can you elaborate on this? Does it mean that `spark-submit` creates new
Python/R process that connects back to that same JVM and creates
SparkContext in it?


On Tue, Apr 12, 2016 at 2:04 PM, Sun, Rui  wrote:

> There is much deployment preparation work handling different deployment
> modes for pyspark and SparkR in SparkSubmit. It is difficult to summarize
> it briefly, you had better refer to the source code.
>
>
>
> Supporting running Julia scripts in SparkSubmit is more than implementing
> a ‘JuliaRunner’. One part is passing the command line options, like
> “--master”, from the JVM launched by spark-submit to the JVM where
> SparkContext resides, in the case that the two JVMs are not the same. For
> pySpark & SparkR, when running scripts in client deployment modes
> (standalone client and yarn client), the JVM is the same (py4j/RBackend
> running as a thread in the JVM launched by spark-submit) , so no need to
> pass the command line options around. However, in your case, Julia
> interpreter launches an in-process JVM for SparkContext, which is a
> separate JVM from the one launched by spark-submit. So you need a way,
> typically an environment variable, like “SPARKR_SUBMIT_ARGS”
> for SparkR or “PYSPARK_SUBMIT_ARGS” for pyspark, to pass command line args
> to the in-process JVM in the Julia interpreter so that SparkConf can pick
> the options.
>
>
>
> *From:* Andrei [mailto:faithlessfri...@gmail.com]
> *Sent:* Tuesday, April 12, 2016 3:48 AM
> *To:* user 
> *Subject:* How does spark-submit handle Python scripts (and how to repeat
> it)?
>
>
>
> I'm working on a wrapper [1] around Spark for the Julia programming
> language [2] similar to PySpark. I've got it working with Spark Standalone
> server by creating local JVM and setting master programmatically. However,
> this approach doesn't work with YARN (and probably Mesos), which require
> running via `spark-submit`.
>
>
>
> In `SparkSubmit` class I see that for Python a special class
> `PythonRunner` is launched, so I tried to do similar `JuliaRunner`, which
> essentially does the following:
>
>
>
> val pb = new ProcessBuilder(Seq("julia", juliaScript))
>
> val process = pb.start()
>
> process.waitFor()
>
>
>
> where `juliaScript` itself creates new JVM and `SparkContext` inside it
> WITHOUT setting master URL. I then tried to launch this class using
>
>
>
> spark-submit --master yarn \
>
>   --class o.a.s.a.j.JuliaRunner \
>
>   project.jar my_script.jl
>
>
>
> I expected that `spark-submit` would set environment variables or
> something that SparkContext would then read and connect to appropriate
> master. This didn't happen, however, and process failed while trying to
> instantiate `SparkContext`, saying that master is not specified.
>
>
>
> So what am I missing? How can use `spark-submit` to run driver in a
> non-JVM language?
>
>
>
>
>
> [1]: https://github.com/dfdx/Sparta.jl
>
> [2]: http://julialang.org/
>


An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe

2016-04-12 Thread AlexModestov
I get an error while I form a dataframe from the parquet file:

Py4JJavaError: An error occurred while calling
z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Exception while getting task result:
org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1
locations. Most recent failure cause:



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/An-error-occurred-while-calling-z-org-apache-spark-sql-execution-EvaluatePython-takeAndServe-tp26764.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: build spark 1.6 against cdh5.7 with hadoop 2.6.0 hbase 1.2: Failure

2016-04-12 Thread freedafeng
Agh, typo: I was supposed to use cdh5.7.0. I reran the command with the fix, but
still get the same error.

build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0 -DskipTests
clean package




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/build-spark-1-6-against-cdh5-7-with-hadoop-2-6-0-hbase-1-2-Failure-tp26762p26763.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



build spark 1.6 against cdh5.7 with hadoop 2.6.0 hbase 1.2: Failure

2016-04-12 Thread freedafeng
jdk: 1.8.0_77
scala: 2.10.4
mvn: 3.3.9.

Slightly changed the pom.xml:
$ diff pom.xml pom.original 
130c130
< 2.6.0-cdh5.7.0-SNAPSHOT
---
> 2.2.0
133c133
< 1.2.0-cdh5.7.0-SNAPSHOT
---
> 0.98.7-hadoop2


command: build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.6.0
-DskipTests clean package

error: 
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @
spark-core_2.10 ---
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
spark-core_2.10 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 21 resources
[INFO] Copying 3 resources
[INFO] 
[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
spark-core_2.10 ---
[INFO] Using zinc server for incremental compilation
[info] Compiling 486 Scala sources and 76 Java sources to
/home/jfeng/workspace/spark-1.6.0/core/target/scala-2.10/classes...
[error]
/home/jfeng/workspace/spark-1.6.0/core/src/main/scala/org/apache/spark/TestUtils.scala:22:
object StandardCharsets is not a member of package java.nio.charset
[error] import java.nio.charset.StandardCharsets
[error]^
[error]
/home/jfeng/workspace/spark-1.6.0/core/src/main/scala/org/apache/spark/TestUtils.scala:23:
object file is not a member of package java.nio
[error] import java.nio.file.Paths
[error] ^
[error]
/home/jfeng/workspace/spark-1.6.0/core/src/main/scala/org/apache/spark/TestUtils.scala:80:
not found: value StandardCharsets
[error]   ByteStreams.copy(new
ByteArrayInputStream(v.getBytes(StandardCharsets.UTF_8)), jarStream)
[error]^
[error]
/home/jfeng/workspace/spark-1.6.0/core/src/main/scala/org/apache/spark/TestUtils.scala:95:
not found: value Paths
[error]   val jarEntry = new
JarEntry(Paths.get(directoryPrefix.getOrElse(""), file.getName).toString)
[error]   ^
[error]
/home/jfeng/workspace/spark-1.6.0/core/src/main/scala/org/apache/spark/launcher/LauncherBackend.scala:43:
value getLoopbackAddress is not a member of object java.net.InetAddress
[error]   val s = new Socket(InetAddress.getLoopbackAddress(), port.get)
[error] 

Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/build-spark-1-6-against-cdh5-7-with-hadoop-2-6-0-hbase-1-2-Failure-tp26762.html



JavaRDD with custom class?

2016-04-12 Thread Daniel Valdivia
Hi,

I'm moving some code from Scala to Java and I just hit a wall where I'm trying 
to move an RDD with a custom data structure to Java, but I'm not able to 
do so:

Scala Code:

case class TermDoc(system_id: String, category: String, terms: Seq[String])
var incTup = inc_filtered.map(record => {
 //some logic
  TermDoc(sys_id, category, termsSeq)
})

On Java I'm trying:

class TermDoc implements Serializable  {
public String system_id;
public String category;
public String[] terms;

public TermDoc(String system_id, String category, String[] terms) {
this.system_id = system_id;
this.category = category;
this.terms = terms;
}
}

JavaRDD<TermDoc> incTup = inc_filtered.map(record -> {
//some code
return new TermDoc(sys_id, category, termsArr);
});


When I run my code, I get hit with a Task not serializable error, what am I 
missing so I can use custom classes inside the RDD just like in scala?

Cheers
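
For comparison, a minimal self-contained Scala sketch of the pattern being ported
(not the poster's code; names and data are illustrative). A common cause of "Task
not serializable" in a Java port is the lambda or the custom class keeping a
reference to a non-serializable enclosing object, which a top-level class avoids:

import org.apache.spark.{SparkConf, SparkContext}

// defined at the top level, so no hidden reference to an enclosing instance is captured
case class TermDoc(systemId: String, category: String, terms: Seq[String])

object TermDocExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("termdoc").setMaster("local[2]"))
    val incFiltered = sc.parallelize(Seq(("sys1", "cat1", Seq("foo", "bar"))))
    // the closure only uses its own arguments, so nothing non-serializable is dragged in
    val incTup = incFiltered.map { case (sysId, category, terms) => TermDoc(sysId, category, terms) }
    println(incTup.count())
    sc.stop()
  }
}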



Re: Aggregator support in DataFrame

2016-04-12 Thread Koert Kuipers
still not sure how to use this with a DataFrame, assuming i cannot convert
it to a specific Dataset with .as (because i got lots of columns, or
because at compile time these types are simply not known).

i cannot specify the columns these operate on. i can resort to Row
transformations, like this:

scala> val x = List(("a", 1), ("a", 2), ("b", 5)).toDF("k", "v")
scala> x.groupBy("k").agg(typed.sum{ row: Row => row.int(1) }).show

note that it currently fails because of a known bug (SPARK-13363
), but ignoring that it's
somewhat ugly that i have to resort to Row transformations. instead we
should allow something like this for any Aggregator (so add an apply method that
takes in Column* to indicate what columns to operate on):

scala> x.groupBy("k").agg(typed.sum(col("v")))

any hints on what i need to do to make this happen? i have been going
through Aggregator, AggregateFunction, AggregateExpression,
TypedAggregateExpression and friends trying to get a sense but haven't had
much luck so far.



On Tue, Apr 12, 2016 at 1:50 PM, Michael Armbrust 
wrote:

> Did you see these?
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/scala/typed.scala#L70
>
> On Tue, Apr 12, 2016 at 9:46 AM, Koert Kuipers  wrote:
>
>> i dont really see how Aggregator can be useful for DataFrame unless you
>> can specify what columns it works on. Having to code Aggregators to always
>> use Row and then extract the values yourself breaks the abstraction and
>> makes it not much better than UserDefinedAggregateFunction (well... maybe
>> still better because i have encoders so i can use kryo).
>>
>> On Mon, Apr 11, 2016 at 10:53 PM, Koert Kuipers 
>> wrote:
>>
>>> saw that, dont think it solves it. i basically want to add some children
>>> to the expression i guess, to indicate what i am operating on? not sure if
>>> even makes sense
>>>
>>> On Mon, Apr 11, 2016 at 8:04 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 I'll note this interface has changed recently:
 https://github.com/apache/spark/commit/520dde48d0d52de1710a3275fdd5355dd69d

 I'm not sure that solves your problem though...

 On Mon, Apr 11, 2016 at 4:45 PM, Koert Kuipers 
 wrote:

> i like the Aggregator a lot
> (org.apache.spark.sql.expressions.Aggregator), but i find the way to use 
> it
> somewhat confusing. I am supposed to simply call aggregator.toColumn, but
> that doesn't allow me to specify which fields it operates on in a 
> DataFrame.
>
> i would basically like to do something like
> dataFrame
>   .groupBy("k")
>   .agg(
> myAggregator.on("v1", "v2").toColumn,
> myOtherAggregator.on("v3", "v4").toColumn
>   )
>


>>>
>>
>


Re: DStream how many RDD's are created by batch

2016-04-12 Thread Natu Lauchande
Hi David,

Thanks for your answer.

I have a follow-up question:

I am using textFileStream and listening on an S3 bucket for new files to
process. Files are created every 5 minutes and my batch interval is 2
minutes.

Does it mean that each file will be for one RDD ?

Thanks,
Natu

On Tue, Apr 12, 2016 at 7:46 PM, David Newberger <
david.newber...@wandcorp.com> wrote:

> Hi,
>
>
>
> Time is usually the criteria if I’m understanding your question. An RDD is
> created for each batch interval. If your interval is 500ms then an RDD
> would be created every 500ms. If it’s 2 seconds then an RDD is created
> every 2 seconds.
>
>
>
> Cheers,
>
>
>
> *David*
>
>
>
> *From:* Natu Lauchande [mailto:nlaucha...@gmail.com]
> *Sent:* Tuesday, April 12, 2016 7:09 AM
> *To:* user@spark.apache.org
> *Subject:* DStream how many RDD's are created by batch
>
>
>
> Hi,
>
> What's the criteria for the number of RDD's created for each micro bath
> iteration  ?
>
>
>
> Thanks,
>
> Natu
>


Re: [ML] Training with bias

2016-04-12 Thread Daniel Siegmann
Yes, that's what I was looking for. Thanks.

On Tue, Apr 12, 2016 at 9:28 AM, Nick Pentreath 
wrote:

> Are you referring to fitting the intercept term? You can use
> lr.setFitIntercept (though it is true by default):
>
> scala> lr.explainParam(lr.fitIntercept)
> res27: String = fitIntercept: whether to fit an intercept term (default:
> true)
>
> On Mon, 11 Apr 2016 at 21:59 Daniel Siegmann 
> wrote:
>
>> I'm trying to understand how I can add a bias when training in Spark. I
>> have only a vague familiarity with this subject, so I hope this question
>> will be clear enough.
>>
>> Using liblinear a bias can be set - if it's >= 0, there will be an
>> additional weight appended in the model, and predicting with that model
>> will automatically append a feature for the bias.
>>
>> Is there anything similar in Spark, such as for logistic regression? The
>> closest thing I can find is MLUtils.appendBias, but this seems to
>> require manual work on both the training and scoring side. I was hoping for
>> something that would just be part of the model.
>>
>>
>> ~Daniel Siegmann
>>
>
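
For concreteness, a minimal spark-shell sketch of the setFitIntercept setting
mentioned above (Spark 1.6 spark.ml; sqlContext is assumed to be the shell's,
and the tiny training set is made up):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.linalg.Vectors

val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0)),
  (1.0, Vectors.dense(0.1, 1.2)),
  (0.0, Vectors.dense(2.1, 0.9))
)).toDF("label", "features")

val lr = new LogisticRegression()
  .setFitIntercept(true)   // the intercept is the bias term; true is already the default
  .setMaxIter(10)

val model = lr.fit(training)
println(s"intercept (bias): ${model.intercept}")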


Re: Aggregator support in DataFrame

2016-04-12 Thread Michael Armbrust
Did you see these?

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/scala/typed.scala#L70

On Tue, Apr 12, 2016 at 9:46 AM, Koert Kuipers  wrote:

> i dont really see how Aggregator can be useful for DataFrame unless you
> can specify what columns it works on. Having to code Aggregators to always
> use Row and then extract the values yourself breaks the abstraction and
> makes it not much better than UserDefinedAggregateFunction (well... maybe
> still better because i have encoders so i can use kryo).
>
> On Mon, Apr 11, 2016 at 10:53 PM, Koert Kuipers  wrote:
>
>> saw that, dont think it solves it. i basically want to add some children
>> to the expression i guess, to indicate what i am operating on? not sure if
>> even makes sense
>>
>> On Mon, Apr 11, 2016 at 8:04 PM, Michael Armbrust > > wrote:
>>
>>> I'll note this interface has changed recently:
>>> https://github.com/apache/spark/commit/520dde48d0d52de1710a3275fdd5355dd69d
>>>
>>> I'm not sure that solves your problem though...
>>>
>>> On Mon, Apr 11, 2016 at 4:45 PM, Koert Kuipers 
>>> wrote:
>>>
 i like the Aggregator a lot
 (org.apache.spark.sql.expressions.Aggregator), but i find the way to use it
 somewhat confusing. I am supposed to simply call aggregator.toColumn, but
 that doesn't allow me to specify which fields it operates on in a 
 DataFrame.

 i would basically like to do something like
 dataFrame
   .groupBy("k")
   .agg(
 myAggregator.on("v1", "v2").toColumn,
 myOtherAggregator.on("v3", "v4").toColumn
   )

>>>
>>>
>>
>


Re: ordering over structs

2016-04-12 Thread Michael Armbrust
Does the data actually fit in memory?  Check the web UI.  If it doesn't,
caching is not going to help you.

On Tue, Apr 12, 2016 at 9:00 AM, Imran Akbar  wrote:

> thanks Michael,
>
> That worked.
> But what's puzzling is if I take the exact same code and run it off a temp
> table created from parquet, vs. a cached table - it runs much slower.  5-10
> seconds uncached vs. 47-60 seconds cached.
>
> Any ideas why?
>
> Here's my code snippet:
> df = data.select("customer_id", struct('dt', 'product').alias("vs"))\
>   .groupBy("customer_id")\
>   .agg(min("vs").alias("final"))\
>   .select("customer_id", "final.dt", "final.product")
> df.head()
>
> My log from the non-cached run:
> http://pastebin.com/F88sSv1B
>
> Log from the cached run:
> http://pastebin.com/Pmmfea3d
>
> thanks,
> imran
>
> On Fri, Apr 8, 2016 at 12:33 PM, Michael Armbrust 
> wrote:
>
>> You need to use the struct function
>> 
>> (which creates an actual struct), you are trying to use the struct datatype
>> (which just represents the schema of a struct).
>>
>> On Thu, Apr 7, 2016 at 3:48 PM, Imran Akbar  wrote:
>>
>>> thanks Michael,
>>>
>>>
>>> I'm trying to implement the code in pyspark like so (where my dataframe
>>> has 3 columns - customer_id, dt, and product):
>>>
>>> st = StructType().add("dt", DateType(), True).add("product",
>>> StringType(), True)
>>>
>>> top = data.select("customer_id", st.alias('vs'))
>>>   .groupBy("customer_id")
>>>   .agg(max("dt").alias("vs"))
>>>   .select("customer_id", "vs.dt", "vs.product")
>>>
>>> But I get an error saying:
>>>
>>> AttributeError: 'StructType' object has no attribute 'alias'
>>>
>>> Can I do this without aliasing the struct?  Or am I doing something
>>> incorrectly?
>>>
>>>
>>> regards,
>>>
>>> imran
>>>
>>> On Wed, Apr 6, 2016 at 4:16 PM, Michael Armbrust >> > wrote:
>>>
 Ordering for a struct goes in order of the fields.  So the max struct
> is the one with the highest TotalValue (and then the highest category
> if there are multiple entries with the same hour and total value).
>
> Is this due to "InterpretedOrdering" in StructType?
>

 That is one implementation, but the code generated ordering also
 follows the same contract.



>  4)  Is it faster doing it this way than doing a join or window
> function in Spark SQL?
>
> Way faster.  This is a very efficient way to calculate argmax.
>
> Can you explain how this is way faster than window function? I can
> understand join doesn't make sense in this case. But to calculate the
> grouping max, you just have to shuffle the data by grouping keys. You 
> maybe
> can do a combiner on the mapper side before shuffling, but that is it. Do
> you mean windowing function in Spark SQL won't do any map side combiner,
> even it is for max?
>

 Windowing can't do partial aggregation and will have to collect all the
 data for a group so that it can be sorted before applying the function.  In
 contrast a max aggregation will do partial aggregation (map side combining)
 and can be calculated in a streaming fashion.

 Also, aggregation is more common and thus has seen more optimization
 beyond the theoretical limits described above.


>>>
>>
>
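
For reference, a minimal spark-shell sketch (Scala) of the argmax-via-struct
pattern discussed in the thread above; the data and column names are made up,
and sqlContext is assumed to be the spark-shell one:

import org.apache.spark.sql.functions._
import sqlContext.implicits._

val data = Seq(
  ("c1", "2016-01-01", "apples"),
  ("c1", "2016-03-01", "pears"),
  ("c2", "2016-02-01", "plums")
).toDF("customer_id", "dt", "product")

// min/max over a struct orders by its fields in sequence, so this picks the row
// with the smallest dt per customer and carries the matching product along
val first = data
  .select($"customer_id", struct($"dt", $"product").alias("vs"))
  .groupBy("customer_id")
  .agg(min("vs").alias("final"))
  .select($"customer_id", $"final.dt", $"final.product")

first.show()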


RE: DStream how many RDD's are created by batch

2016-04-12 Thread David Newberger
Hi,

Time is usually the criteria if I’m understanding your question. An RDD is 
created for each batch interval. If your interval is 500ms then an RDD would be 
created every 500ms. If it’s 2 seconds then an RDD is created every 2 seconds.
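
As a concrete illustration (a hedged sketch, not from the original thread; the
local master and the socket source/port are placeholders), each DStream yields
exactly one RDD per batch interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("BatchIntervalExample").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2))       // one RDD per DStream every 2 seconds

val lines = ssc.socketTextStream("localhost", 9999)    // placeholder test source
lines.foreachRDD { (rdd, time) =>
  println(s"batch $time -> one RDD with ${rdd.getNumPartitions} partitions")
}

ssc.start()
ssc.awaitTermination()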

Cheers,

David

From: Natu Lauchande [mailto:nlaucha...@gmail.com]
Sent: Tuesday, April 12, 2016 7:09 AM
To: user@spark.apache.org
Subject: DStream how many RDD's are created by batch

Hi,
What's the criteria for the number of RDD's created for each micro bath 
iteration  ?

Thanks,
Natu


Re: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Michael Armbrust
You can, but I'm not sure why you would want to.  If you want to isolate
different users just use hiveContext.newSession().
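
A minimal sketch of the session-isolation option mentioned above (spark-shell
style; sc is assumed to exist and Hive support is assumed to be on the classpath):

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
val sessionA = hc.newSession()   // separate SQL conf, temp tables and UDFs, same SparkContext
val sessionB = hc.newSession()

sessionA.sql("SET spark.sql.shuffle.partitions=8")     // scoped to sessionA only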

On Tue, Apr 12, 2016 at 1:48 AM, Natu Lauchande 
wrote:

> Hi,
>
> Is it possible to have both a sqlContext and a hiveContext in the same
> application ?
>
> If yes would there be any performance pernalties of doing so.
>
> Regards,
> Natu
>


Creating a New Cassandra Table From a DataFrame Schema

2016-04-12 Thread Prateek .
Hi,

I am trying to create new Cassandra table by inferring schema from JSON:

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md

I am not able to access the createCassandraTable function on a DataFrame:

import com.datastax.spark.connector._


df.createCassandraTable(
"test",
"renamed",
partitionKeyColumns = Some(Seq("user")),
clusteringKeyColumns = Some(Seq("newcolumnname")))

The doc says:
// Add spark connector specific methods to DataFrame

How can I achieve this?

Thanks
Prateek
"DISCLAIMER: This message is proprietary to Aricent and is intended solely for 
the use of the individual to whom it is addressed. It may contain privileged or 
confidential information and should not be circulated or used for any purpose 
other than for what it is intended. If you have received this message in error, 
please notify the originator immediately. If you are not the intended 
recipient, you are notified that you are strictly prohibited from using, 
copying, altering, or disclosing the contents of this message. Aricent accepts 
no responsibility for loss or damage arising from the use of the information 
transmitted by this email including damage from virus."


Re: Monitoring S3 Bucket with Spark Streaming

2016-04-12 Thread Benjamin Kim
All,

I have more of a general Scala JSON question.

I have setup a notification on the S3 source bucket that triggers a Lambda 
function to unzip the new file placed there. Then, it saves the unzipped CSV 
file into another destination bucket where a notification is sent to a SQS 
topic. The contents of the message body are JSON, with the top level being the 
“Records” collection, within which are 1 or more “Record” objects. I would like 
to know how to iterate through the “Records” retrieving each “Record” to 
extract the bucket value and the key value. I would then use this information 
to download the file into a DataFrame via spark-csv. Does anyone have any 
experience doing this?

I took a quick stab at it, but I know it's not right.

def functionToCreateContext(): StreamingContext = {
val ssc = new StreamingContext(sc, Seconds(30))   // new context
val records = ssc.receiverStream(new SQSReceiver("amg-events")
.credentials(accessKey, secretKey)
.at(Regions.US_EAST_1)
.withTimeout(2))

records.foreach(record => {
val bucket = record['s3']['bucket']['name']
val key = record['s3']['object']['key']
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("s3://" + bucket + "/" + key)
//save to hbase
})

ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
ssc
}
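
For the JSON-extraction step only, here is a hedged sketch (not the poster's code)
using json4s, which ships with Spark; the field names follow the standard S3 event
notification layout, and everything around the receiver is left out as an assumption:

import org.json4s._
import org.json4s.jackson.JsonMethods._

case class S3Entry(bucket: String, key: String)

def extractEntries(messageBody: String): List[S3Entry] = {
  implicit val formats = DefaultFormats
  // "Records" is an array; each element carries s3.bucket.name and s3.object.key
  val records = parse(messageBody) \ "Records"
  records.children.map { rec =>
    S3Entry(
      (rec \ "s3" \ "bucket" \ "name").extract[String],
      (rec \ "s3" \ "object" \ "key").extract[String])
  }
}

// each entry could then be fed to spark-csv much like the snippet above:
// sqlContext.read.format("com.databricks.spark.csv").load(s"s3://${entry.bucket}/${entry.key}")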

Thanks,
Ben

> On Apr 9, 2016, at 6:12 PM, Benjamin Kim  wrote:
> 
> Ah, I spoke too soon.
> 
> I thought the SQS part was going to be a spark package. It looks like it has 
> be compiled into a jar for use. Am I right? Can someone help with this? I 
> tried to compile it using SBT, but I’m stuck with a SonatypeKeys not found 
> error.
> 
> If there’s an easier alternative, please let me know.
> 
> Thanks,
> Ben
> 
> 
>> On Apr 9, 2016, at 2:49 PM, Benjamin Kim > > wrote:
>> 
>> This was easy!
>> 
>> I just created a notification on a source S3 bucket to kick off a Lambda 
>> function that would decompress the dropped file and save it to another S3 
>> bucket. In return, this S3 bucket has a notification to send a SNS message 
>> to me via email. I can just as easily setup SQS to be the endpoint of this 
>> notification. This would then convey to a listening Spark Streaming job the 
>> file information to download. I like this!
>> 
>> Cheers,
>> Ben 
>> 
>>> On Apr 9, 2016, at 9:54 AM, Benjamin Kim >> > wrote:
>>> 
>>> This is awesome! I have someplace to start from.
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
 On Apr 9, 2016, at 9:45 AM, programminggee...@gmail.com 
  wrote:
 
 Someone please correct me if I am wrong as I am still rather green to 
 spark, however it appears that through the S3 notification mechanism 
 described below, you can publish events to SQS and use SQS as a streaming 
 source into spark. The project at 
 https://github.com/imapi/spark-sqs-receiver 
  appears to provide libraries 
 for doing this.
 
 Hope this helps.
 
 Sent from my iPhone
 
 On Apr 9, 2016, at 9:55 AM, Benjamin Kim > wrote:
 
> Nezih,
> 
> This looks like a good alternative to having the Spark Streaming job 
> check for new files on its own. Do you know if there is a way to have the 
> Spark Streaming job get notified with the new file information and act 
> upon it? This can reduce the overhead and cost of polling S3. Plus, I can 
> use this to notify and kick off Lambda to process new data files and make 
> them ready for Spark Streaming to consume. This will also use 
> notifications to trigger. I just need to have all incoming folders 
> configured for notifications for Lambda and all outgoing folders for 
> Spark Streaming. This sounds like a better setup than we have now.
> 
> Thanks,
> Ben
> 
>> On Apr 9, 2016, at 12:25 AM, Nezih Yigitbasi > > wrote:
>> 
>> While it is doable in Spark, S3 also supports notifications: 
>> http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html 
>> 
>> 
>> 
>> On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande > > wrote:
>> Hi Benjamin,
>> 
>> I have done it . The critical configuration items are the ones below :
>> 
>>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl", 
>> 

Re: Aggregator support in DataFrame

2016-04-12 Thread Koert Kuipers
i dont really see how Aggregator can be useful for DataFrame unless you can
specify what columns it works on. Having to code Aggregators to always use
Row and then extract the values yourself breaks the abstraction and makes
it not much better than UserDefinedAggregateFunction (well... maybe still
better because i have encoders so i can use kryo).

On Mon, Apr 11, 2016 at 10:53 PM, Koert Kuipers  wrote:

> saw that, dont think it solves it. i basically want to add some children
> to the expression i guess, to indicate what i am operating on? not sure if
> even makes sense
>
> On Mon, Apr 11, 2016 at 8:04 PM, Michael Armbrust 
> wrote:
>
>> I'll note this interface has changed recently:
>> https://github.com/apache/spark/commit/520dde48d0d52de1710a3275fdd5355dd69d
>>
>> I'm not sure that solves your problem though...
>>
>> On Mon, Apr 11, 2016 at 4:45 PM, Koert Kuipers  wrote:
>>
>>> i like the Aggregator a lot
>>> (org.apache.spark.sql.expressions.Aggregator), but i find the way to use it
>>> somewhat confusing. I am supposed to simply call aggregator.toColumn, but
>>> that doesn't allow me to specify which fields it operates on in a DataFrame.
>>>
>>> i would basically like to do something like
>>> dataFrame
>>>   .groupBy("k")
>>>   .agg(
>>> myAggregator.on("v1", "v2").toColumn,
>>> myOtherAggregator.on("v3", "v4").toColumn
>>>   )
>>>
>>
>>
>


How do i get a spark instance to use my log4j properties

2016-04-12 Thread Steve Lewis
Ok I am stymied. I have tried everything I can think of to get spark to use
my own version of

log4j.properties

In the launcher code - I launch a local instance from a Java application

I say -Dlog4j.configuration=conf/log4j.properties

where conf/log4j.properties is relative to user.dir - no luck

Spark always starts saying

Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties

I have a directory conf with my log4j.properties there but it seems to be
ignored

I use maven and am VERY RELUCTANT to edit the spark jars

I know this point has been discussed here before but I do not see a clean
answer
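
One thing worth trying (a hedged sketch, not a verified fix): log4j.configuration
generally expects a URL or a classpath resource name, so a bare relative path is
easy to get wrong. When the local SparkContext is launched from application code,
the property can be set with an explicit file: URL before anything touches Spark's
logging; the path below is a placeholder:

// convert the relative path to a file:/... URL before the SparkContext is created
System.setProperty(
  "log4j.configuration",
  new java.io.File("conf/log4j.properties").toURI.toString)

val conf = new org.apache.spark.SparkConf().setAppName("log4j-test").setMaster("local[2]")
val sc = new org.apache.spark.SparkContext(conf)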


Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread kevllino
Update: 
- I managed to log in to the cluster
- I want to use copy-dir to deploy my Python files to all nodes. I read that I
need to copy them to /ephemeral/hdfs, but I don't know how to move them from
my local machine into HDFS on the cluster.

Thanks in advance,
Kevin 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Run-a-self-contained-Spark-app-on-a-Spark-standalone-cluster-tp26753p26761.html



Re: History Server Refresh?

2016-04-12 Thread Miles Crawford
It is completed apps that are not showing up. I'm fine with incomplete apps
not appearing.

On Tue, Apr 12, 2016 at 6:43 AM, Steve Loughran 
wrote:

>
> On 12 Apr 2016, at 00:21, Miles Crawford  wrote:
>
> Hey there. I have my spark applications set up to write their event logs
> into S3 - this is super useful for ephemeral clusters, I can have
> persistent history even though my hosts go away.
>
> A history server is set up to view this s3 location, and that works fine
> too - at least on startup.
>
> The problem is that the history server doesn't seem to notice new logs
> arriving into the S3 bucket.  Any idea how I can get it to scan the folder
> for new files?
>
> Thanks,
> -miles
>
>
> s3 isn't a real filesystem, and apps writing to it don't have any data
> written until one of
>  -the output stream is close()'d. This happens at the end of the app
>  -the file is set up to be partitioned and a partition size is crossed
>
> Until either of those conditions are met, the history server isn't going
> to see anything.
>
> If you are going to use s3 as the dest, and you want to see incomplete
> apps, then you'll need to configure the spark job to have smaller partition
> size (64? 128? MB).
>
> If it's completed apps that aren't being seen by the HS, then that's a
> bug, though if its against s3 only, likely to be something related to
> directory listings
>


Re: [spark] build/sbt gen-idea error

2016-04-12 Thread Sean Owen
We just removed the gen-idea plugin.
Just import the Maven project into IDEA or Eclipse.

On Tue, Apr 12, 2016 at 4:52 PM, ImMr.K <875061...@qq.com> wrote:
> But how to import spark repo into idea or eclipse?
>
>
>
> -- Original Message --
> From: Ted Yu 
> Sent: April 12, 2016, 23:38
> To: ImMr.K <875061...@qq.com>
> Cc: user 
> Subject: Re: build/sbt gen-idea error
>
> gen-idea doesn't seem to be a valid command:
>
> [warn] Ignoring load failure: no project loaded.
> [error] Not a valid command: gen-idea
> [error] gen-idea
>




Re: ordering over structs

2016-04-12 Thread Imran Akbar
thanks Michael,

That worked.
But what's puzzling is if I take the exact same code and run it off a temp
table created from parquet, vs. a cached table - it runs much slower.  5-10
seconds uncached vs. 47-60 seconds cached.

Any ideas why?

Here's my code snippet:
df = data.select("customer_id", struct('dt', 'product').alias("vs"))\
  .groupBy("customer_id")\
  .agg(min("vs").alias("final"))\
  .select("customer_id", "final.dt", "final.product")
df.head()

My log from the non-cached run:
http://pastebin.com/F88sSv1B

Log from the cached run:
http://pastebin.com/Pmmfea3d

thanks,
imran

On Fri, Apr 8, 2016 at 12:33 PM, Michael Armbrust 
wrote:

> You need to use the struct function
> 
> (which creates an actual struct), you are trying to use the struct datatype
> (which just represents the schema of a struct).
>
> On Thu, Apr 7, 2016 at 3:48 PM, Imran Akbar  wrote:
>
>> thanks Michael,
>>
>>
>> I'm trying to implement the code in pyspark like so (where my dataframe
>> has 3 columns - customer_id, dt, and product):
>>
>> st = StructType().add("dt", DateType(), True).add("product",
>> StringType(), True)
>>
>> top = data.select("customer_id", st.alias('vs'))
>>   .groupBy("customer_id")
>>   .agg(max("dt").alias("vs"))
>>   .select("customer_id", "vs.dt", "vs.product")
>>
>> But I get an error saying:
>>
>> AttributeError: 'StructType' object has no attribute 'alias'
>>
>> Can I do this without aliasing the struct?  Or am I doing something
>> incorrectly?
>>
>>
>> regards,
>>
>> imran
>>
>> On Wed, Apr 6, 2016 at 4:16 PM, Michael Armbrust 
>> wrote:
>>
>>> Ordering for a struct goes in order of the fields.  So the max struct is
 the one with the highest TotalValue (and then the highest category
   if there are multiple entries with the same hour and total value).

 Is this due to "InterpretedOrdering" in StructType?

>>>
>>> That is one implementation, but the code generated ordering also follows
>>> the same contract.
>>>
>>>
>>>
  4)  Is it faster doing it this way than doing a join or window
 function in Spark SQL?

 Way faster.  This is a very efficient way to calculate argmax.

 Can you explain how this is way faster than window function? I can
 understand join doesn't make sense in this case. But to calculate the
 grouping max, you just have to shuffle the data by grouping keys. You maybe
 can do a combiner on the mapper side before shuffling, but that is it. Do
 you mean windowing function in Spark SQL won't do any map side combiner,
 even it is for max?

>>>
>>> Windowing can't do partial aggregation and will have to collect all the
>>> data for a group so that it can be sorted before applying the function.  In
>>> contrast a max aggregation will do partial aggregation (map side combining)
>>> and can be calculated in a streaming fashion.
>>>
>>> Also, aggregation is more common and thus has seen more optimization
>>> beyond the theoretical limits described above.
>>>
>>>
>>
>


Re: [spark] build/sbt gen-idea error

2016-04-12 Thread ImMr.K
But how to import spark repo into idea or eclipse?




-- Original Message --
From: Ted Yu 
Sent: April 12, 2016, 23:38
To: ImMr.K <875061...@qq.com>
Cc: user 
Subject: Re: build/sbt gen-idea error



gen-idea doesn't seem to be a valid command:
[warn] Ignoring load failure: no project loaded.
[error] Not a valid command: gen-idea
[error] gen-idea




On Tue, Apr 12, 2016 at 8:28 AM, ImMr.K <875061...@qq.com> wrote:
Hi,
I have cloned spark and ,
cd spark
build/sbt gen-idea


got the following output:




Using /usr/java/jre1.7.0_09 as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
[info] Loading project definition from /home/king/github/spark/project/project
[info] Loading project definition from 
/home/king/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
[warn] Multiple resolvers having different access mechanism configured with 
same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate project 
resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
[info] Loading project definition from /home/king/github/spark/project
org.apache.maven.model.building.ModelBuildingException: 1 problem was 
encountered while building the effective model for 
org.apache.spark:spark-parent_2.11:2.0.0-SNAPSHOT
[FATAL] Non-resolvable parent POM: Could not transfer artifact 
org.apache:apache:pom:14 from/to central ( 
http://repo.maven.apache.org/maven2): Error transferring file: Connection timed 
out from  
http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom and 
'parent.relativePath' points at wrong local POM @ line 22, column 11


at 
org.apache.maven.model.building.DefaultModelProblemCollector.newModelBuildingException(DefaultModelProblemCollector.java:195)
at 
org.apache.maven.model.building.DefaultModelBuilder.readParentExternally(DefaultModelBuilder.java:841)
at 
org.apache.maven.model.building.DefaultModelBuilder.readParent(DefaultModelBuilder.java:664)
at 
org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:310)
at 
org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:232)
at 
com.typesafe.sbt.pom.MvnPomResolver.loadEffectivePom(MavenPomResolver.scala:61)
at com.typesafe.sbt.pom.package$.loadEffectivePom(package.scala:41)
at 
com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:128)
at 
com.typesafe.sbt.pom.MavenProjectHelper$.makeReactorProject(MavenProjectHelper.scala:49)
at com.typesafe.sbt.pom.PomBuild$class.projectDefinitions(PomBuild.scala:28)
at SparkBuild$.projectDefinitions(SparkBuild.scala:347)
at sbt.Load$.sbt$Load$$projectsFromBuild(Load.scala:506)
at sbt.Load$$anonfun$27.apply(Load.scala:446)
at sbt.Load$$anonfun$27.apply(Load.scala:446)
at scala.collection.immutable.Stream.flatMap(Stream.scala:442)
at sbt.Load$.loadUnit(Load.scala:446)
at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
at 
sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:91)
at 
sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:90)
at sbt.BuildLoader.apply(BuildLoader.scala:140)
at sbt.Load$.loadAll(Load.scala:344)
at sbt.Load$.loadURI(Load.scala:299)
at sbt.Load$.load(Load.scala:295)
at sbt.Load$.load(Load.scala:286)
at sbt.Load$.apply(Load.scala:140)
at sbt.Load$.defaultLoad(Load.scala:36)
at sbt.BuiltinCommands$.liftedTree1$1(Main.scala:492)
at sbt.BuiltinCommands$.doLoadProject(Main.scala:492)
at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:61)
at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:61)
at sbt.Command$.process(Command.scala:93)
at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:96)
at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:96)
at sbt.State$$anon$1.process(State.scala:184)
at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:96)
at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:96)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.MainLoop$.next(MainLoop.scala:96)
at sbt.MainLoop$.run(MainLoop.scala:89)
at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:68)
at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:63)
at sbt.Using.apply(Using.scala:24)
at sbt.MainLoop$.runWithNewLog(MainLoop.scala:63)
at sbt.MainLoop$.runAndClearLast(MainLoop.scala:46)
at sbt.MainLoop$.runLoggedLoop(MainLoop.scala:30)
at sbt.MainLoop$.runLogged(MainLoop.scala:22)
at 

Re: Re:[spark] build/sbt gen-idea error

2016-04-12 Thread Marco Mistroni
Have you tried the sbt-eclipse plugin? Then you can run sbt eclipse and have your
Spark project directly in Eclipse.
Please Google it and you should be able to find your way.
If not, ping me and I'll send you the plugin (I'm replying from my phone).
HTH
On 12 Apr 2016 4:53 pm, "ImMr.K" <875061...@qq.com> wrote:

But how to import spark repo into idea or eclipse?



-- Original Message --
*From:* Ted Yu 
*Sent:* April 12, 2016, 23:38
*To:* ImMr.K <875061...@qq.com>
*Cc:* user 
*Subject:* Re: build/sbt gen-idea error

gen-idea doesn't seem to be a valid command:

[warn] Ignoring load failure: no project loaded.
[error] Not a valid command: gen-idea
[error] gen-idea

On Tue, Apr 12, 2016 at 8:28 AM, ImMr.K <875061...@qq.com> wrote:

> Hi,
> I have cloned spark and ,
> cd spark
> build/sbt gen-idea
>
> got the following output:
>
>
> Using /usr/java/jre1.7.0_09 as default JAVA_HOME.
> Note, this will be overridden by -java-home if it is set.
> [info] Loading project definition from
> /home/king/github/spark/project/project
> [info] Loading project definition from
> /home/king/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
> [warn] Multiple resolvers having different access mechanism configured
> with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate
> project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
> [info] Loading project definition from /home/king/github/spark/project
> org.apache.maven.model.building.ModelBuildingException: 1 problem was
> encountered while building the effective model for
> org.apache.spark:spark-parent_2.11:2.0.0-SNAPSHOT
> [FATAL] Non-resolvable parent POM: Could not transfer artifact
> org.apache:apache:pom:14 from/to central (
> http://repo.maven.apache.org/maven2): Error transferring file: Connection
> timed out from
> http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom
> and 'parent.relativePath' points at wrong local POM @ line 22, column 11
>
> at
> org.apache.maven.model.building.DefaultModelProblemCollector.newModelBuildingException(DefaultModelProblemCollector.java:195)
> at
> org.apache.maven.model.building.DefaultModelBuilder.readParentExternally(DefaultModelBuilder.java:841)
> at
> org.apache.maven.model.building.DefaultModelBuilder.readParent(DefaultModelBuilder.java:664)
> at
> org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:310)
> at
> org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:232)
> at
> com.typesafe.sbt.pom.MvnPomResolver.loadEffectivePom(MavenPomResolver.scala:61)
> at com.typesafe.sbt.pom.package$.loadEffectivePom(package.scala:41)
> at
> com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:128)
> at
> com.typesafe.sbt.pom.MavenProjectHelper$.makeReactorProject(MavenProjectHelper.scala:49)
> at
> com.typesafe.sbt.pom.PomBuild$class.projectDefinitions(PomBuild.scala:28)
> at SparkBuild$.projectDefinitions(SparkBuild.scala:347)
> at sbt.Load$.sbt$Load$$projectsFromBuild(Load.scala:506)
> at sbt.Load$$anonfun$27.apply(Load.scala:446)
> at sbt.Load$$anonfun$27.apply(Load.scala:446)
> at scala.collection.immutable.Stream.flatMap(Stream.scala:442)
> at sbt.Load$.loadUnit(Load.scala:446)
> at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
> at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
> at
> sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:91)
> at
> sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:90)
> at sbt.BuildLoader.apply(BuildLoader.scala:140)
> at sbt.Load$.loadAll(Load.scala:344)
> at sbt.Load$.loadURI(Load.scala:299)
> at sbt.Load$.load(Load.scala:295)
> at sbt.Load$.load(Load.scala:286)
> at sbt.Load$.apply(Load.scala:140)
> at sbt.Load$.defaultLoad(Load.scala:36)
> at sbt.BuiltinCommands$.liftedTree1$1(Main.scala:492)
> at sbt.BuiltinCommands$.doLoadProject(Main.scala:492)
> at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
> at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
> at
> sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
> at
> sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
> at
> sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:61)
> at
> sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:61)
> at sbt.Command$.process(Command.scala:93)
> at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:96)
> at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:96)
> at sbt.State$$anon$1.process(State.scala:184)
> at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:96)
> at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:96)
> at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
> at sbt.MainLoop$.next(MainLoop.scala:96)
> 

Re: [spark] build/sbt gen-idea error

2016-04-12 Thread Ted Yu
See
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup

On Tue, Apr 12, 2016 at 8:52 AM, ImMr.K <875061...@qq.com> wrote:

> But how to import spark repo into idea or eclipse?
>
>
>
> -- Original Message --
> *From:* Ted Yu 
> *Sent:* April 12, 2016, 23:38
> *To:* ImMr.K <875061...@qq.com>
> *Cc:* user 
> *Subject:* Re: build/sbt gen-idea error
>
> gen-idea doesn't seem to be a valid command:
>
> [warn] Ignoring load failure: no project loaded.
> [error] Not a valid command: gen-idea
> [error] gen-idea
>
> On Tue, Apr 12, 2016 at 8:28 AM, ImMr.K <875061...@qq.com> wrote:
>
>> Hi,
>> I have cloned spark and ,
>> cd spark
>> build/sbt gen-idea
>>
>> got the following output:
>>
>>
>> Using /usr/java/jre1.7.0_09 as default JAVA_HOME.
>> Note, this will be overridden by -java-home if it is set.
>> [info] Loading project definition from
>> /home/king/github/spark/project/project
>> [info] Loading project definition from
>> /home/king/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
>> [warn] Multiple resolvers having different access mechanism configured
>> with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate
>> project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
>> [info] Loading project definition from /home/king/github/spark/project
>> org.apache.maven.model.building.ModelBuildingException: 1 problem was
>> encountered while building the effective model for
>> org.apache.spark:spark-parent_2.11:2.0.0-SNAPSHOT
>> [FATAL] Non-resolvable parent POM: Could not transfer artifact
>> org.apache:apache:pom:14 from/to central (
>> http://repo.maven.apache.org/maven2): Error transferring file:
>> Connection timed out from
>> http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom
>> and 'parent.relativePath' points at wrong local POM @ line 22, column 11
>>
>> at
>> org.apache.maven.model.building.DefaultModelProblemCollector.newModelBuildingException(DefaultModelProblemCollector.java:195)
>> at
>> org.apache.maven.model.building.DefaultModelBuilder.readParentExternally(DefaultModelBuilder.java:841)
>> at
>> org.apache.maven.model.building.DefaultModelBuilder.readParent(DefaultModelBuilder.java:664)
>> at
>> org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:310)
>> at
>> org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:232)
>> at
>> com.typesafe.sbt.pom.MvnPomResolver.loadEffectivePom(MavenPomResolver.scala:61)
>> at com.typesafe.sbt.pom.package$.loadEffectivePom(package.scala:41)
>> at
>> com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:128)
>> at
>> com.typesafe.sbt.pom.MavenProjectHelper$.makeReactorProject(MavenProjectHelper.scala:49)
>> at
>> com.typesafe.sbt.pom.PomBuild$class.projectDefinitions(PomBuild.scala:28)
>> at SparkBuild$.projectDefinitions(SparkBuild.scala:347)
>> at sbt.Load$.sbt$Load$$projectsFromBuild(Load.scala:506)
>> at sbt.Load$$anonfun$27.apply(Load.scala:446)
>> at sbt.Load$$anonfun$27.apply(Load.scala:446)
>> at scala.collection.immutable.Stream.flatMap(Stream.scala:442)
>> at sbt.Load$.loadUnit(Load.scala:446)
>> at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
>> at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
>> at
>> sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:91)
>> at
>> sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:90)
>> at sbt.BuildLoader.apply(BuildLoader.scala:140)
>> at sbt.Load$.loadAll(Load.scala:344)
>> at sbt.Load$.loadURI(Load.scala:299)
>> at sbt.Load$.load(Load.scala:295)
>> at sbt.Load$.load(Load.scala:286)
>> at sbt.Load$.apply(Load.scala:140)
>> at sbt.Load$.defaultLoad(Load.scala:36)
>> at sbt.BuiltinCommands$.liftedTree1$1(Main.scala:492)
>> at sbt.BuiltinCommands$.doLoadProject(Main.scala:492)
>> at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
>> at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
>> at
>> sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
>> at
>> sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
>> at
>> sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:61)
>> at
>> sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:61)
>> at sbt.Command$.process(Command.scala:93)
>> at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:96)
>> at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:96)
>> at sbt.State$$anon$1.process(State.scala:184)
>> at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:96)
>> at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:96)
>> at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
>> at sbt.MainLoop$.next(MainLoop.scala:96)

import data from s3 which region supports AWS4-HMAC-SHA256 only

2016-04-12 Thread QiuxuanZhu
Dear all,

Is it currently possible for Spark to import data from S3 regions that
support AWS4-HMAC-SHA256 only, such as Frankfurt or Beijing?

I checked the issues on GitHub. It seems that upgrading jets3t currently fails
in the unit tests.
https://github.com/apache/spark/pull/9306

So, any help?

Thanks.

-- 
A photographer who can't finish a marathon is not a good backpacker.
Next goal should be a 6,000 m peak, right? Yeah.


Re: build/sbt gen-idea error

2016-04-12 Thread Ted Yu
gen-idea doesn't seem to be a valid command:

[warn] Ignoring load failure: no project loaded.
[error] Not a valid command: gen-idea
[error] gen-idea

On Tue, Apr 12, 2016 at 8:28 AM, ImMr.K <875061...@qq.com> wrote:

> Hi,
> I have cloned spark and ,
> cd spark
> build/sbt gen-idea
>
> got the following output:
>
>
> Using /usr/java/jre1.7.0_09 as default JAVA_HOME.
> Note, this will be overridden by -java-home if it is set.
> [info] Loading project definition from
> /home/king/github/spark/project/project
> [info] Loading project definition from
> /home/king/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
> [warn] Multiple resolvers having different access mechanism configured
> with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate
> project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
> [info] Loading project definition from /home/king/github/spark/project
> org.apache.maven.model.building.ModelBuildingException: 1 problem was
> encountered while building the effective model for
> org.apache.spark:spark-parent_2.11:2.0.0-SNAPSHOT
> [FATAL] Non-resolvable parent POM: Could not transfer artifact
> org.apache:apache:pom:14 from/to central (
> http://repo.maven.apache.org/maven2): Error transferring file: Connection
> timed out from
> http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom
> and 'parent.relativePath' points at wrong local POM @ line 22, column 11
>
> at
> org.apache.maven.model.building.DefaultModelProblemCollector.newModelBuildingException(DefaultModelProblemCollector.java:195)
> at
> org.apache.maven.model.building.DefaultModelBuilder.readParentExternally(DefaultModelBuilder.java:841)
> at
> org.apache.maven.model.building.DefaultModelBuilder.readParent(DefaultModelBuilder.java:664)
> at
> org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:310)
> at
> org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:232)
> at
> com.typesafe.sbt.pom.MvnPomResolver.loadEffectivePom(MavenPomResolver.scala:61)
> at com.typesafe.sbt.pom.package$.loadEffectivePom(package.scala:41)
> at
> com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:128)
> at
> com.typesafe.sbt.pom.MavenProjectHelper$.makeReactorProject(MavenProjectHelper.scala:49)
> at
> com.typesafe.sbt.pom.PomBuild$class.projectDefinitions(PomBuild.scala:28)
> at SparkBuild$.projectDefinitions(SparkBuild.scala:347)
> at sbt.Load$.sbt$Load$$projectsFromBuild(Load.scala:506)
> at sbt.Load$$anonfun$27.apply(Load.scala:446)
> at sbt.Load$$anonfun$27.apply(Load.scala:446)
> at scala.collection.immutable.Stream.flatMap(Stream.scala:442)
> at sbt.Load$.loadUnit(Load.scala:446)
> at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
> at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
> at
> sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:91)
> at
> sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:90)
> at sbt.BuildLoader.apply(BuildLoader.scala:140)
> at sbt.Load$.loadAll(Load.scala:344)
> at sbt.Load$.loadURI(Load.scala:299)
> at sbt.Load$.load(Load.scala:295)
> at sbt.Load$.load(Load.scala:286)
> at sbt.Load$.apply(Load.scala:140)
> at sbt.Load$.defaultLoad(Load.scala:36)
> at sbt.BuiltinCommands$.liftedTree1$1(Main.scala:492)
> at sbt.BuiltinCommands$.doLoadProject(Main.scala:492)
> at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
> at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
> at
> sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
> at
> sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
> at
> sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:61)
> at
> sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:61)
> at sbt.Command$.process(Command.scala:93)
> at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:96)
> at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:96)
> at sbt.State$$anon$1.process(State.scala:184)
> at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:96)
> at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:96)
> at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
> at sbt.MainLoop$.next(MainLoop.scala:96)
> at sbt.MainLoop$.run(MainLoop.scala:89)
> at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:68)
> at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:63)
> at sbt.Using.apply(Using.scala:24)
> at sbt.MainLoop$.runWithNewLog(MainLoop.scala:63)
> at sbt.MainLoop$.runAndClearLast(MainLoop.scala:46)
> at sbt.MainLoop$.runLoggedLoop(MainLoop.scala:30)
> at sbt.MainLoop$.runLogged(MainLoop.scala:22)
> at sbt.StandardMain$.runManaged(Main.scala:54)
> at sbt.xMain.run(Main.scala:29)
> at 

build/sbt gen-idea error

2016-04-12 Thread ImMr.K
Hi,
I have cloned spark, and then:
cd spark
build/sbt gen-idea


got the following output:




Using /usr/java/jre1.7.0_09 as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
[info] Loading project definition from /home/king/github/spark/project/project
[info] Loading project definition from 
/home/king/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
[warn] Multiple resolvers having different access mechanism configured with 
same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate project 
resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
[info] Loading project definition from /home/king/github/spark/project
org.apache.maven.model.building.ModelBuildingException: 1 problem was 
encountered while building the effective model for 
org.apache.spark:spark-parent_2.11:2.0.0-SNAPSHOT
[FATAL] Non-resolvable parent POM: Could not transfer artifact 
org.apache:apache:pom:14 from/to central ( 
http://repo.maven.apache.org/maven2): Error transferring file: Connection timed 
out from  
http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom and 
'parent.relativePath' points at wrong local POM @ line 22, column 11


at 
org.apache.maven.model.building.DefaultModelProblemCollector.newModelBuildingException(DefaultModelProblemCollector.java:195)
at 
org.apache.maven.model.building.DefaultModelBuilder.readParentExternally(DefaultModelBuilder.java:841)
at 
org.apache.maven.model.building.DefaultModelBuilder.readParent(DefaultModelBuilder.java:664)
at 
org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:310)
at 
org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:232)
at 
com.typesafe.sbt.pom.MvnPomResolver.loadEffectivePom(MavenPomResolver.scala:61)
at com.typesafe.sbt.pom.package$.loadEffectivePom(package.scala:41)
at 
com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:128)
at 
com.typesafe.sbt.pom.MavenProjectHelper$.makeReactorProject(MavenProjectHelper.scala:49)
at 
com.typesafe.sbt.pom.PomBuild$class.projectDefinitions(PomBuild.scala:28)
at SparkBuild$.projectDefinitions(SparkBuild.scala:347)
at sbt.Load$.sbt$Load$$projectsFromBuild(Load.scala:506)
at sbt.Load$$anonfun$27.apply(Load.scala:446)
at sbt.Load$$anonfun$27.apply(Load.scala:446)
at scala.collection.immutable.Stream.flatMap(Stream.scala:442)
at sbt.Load$.loadUnit(Load.scala:446)
at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
at 
sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:91)
at 
sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:90)
at sbt.BuildLoader.apply(BuildLoader.scala:140)
at sbt.Load$.loadAll(Load.scala:344)
at sbt.Load$.loadURI(Load.scala:299)
at sbt.Load$.load(Load.scala:295)
at sbt.Load$.load(Load.scala:286)
at sbt.Load$.apply(Load.scala:140)
at sbt.Load$.defaultLoad(Load.scala:36)
at sbt.BuiltinCommands$.liftedTree1$1(Main.scala:492)
at sbt.BuiltinCommands$.doLoadProject(Main.scala:492)
at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
at 
sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
at 
sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
at 
sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:61)
at 
sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:61)
at sbt.Command$.process(Command.scala:93)
at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:96)
at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:96)
at sbt.State$$anon$1.process(State.scala:184)
at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:96)
at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:96)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.MainLoop$.next(MainLoop.scala:96)
at sbt.MainLoop$.run(MainLoop.scala:89)
at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:68)
at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:63)
at sbt.Using.apply(Using.scala:24)
at sbt.MainLoop$.runWithNewLog(MainLoop.scala:63)
at sbt.MainLoop$.runAndClearLast(MainLoop.scala:46)
at sbt.MainLoop$.runLoggedLoop(MainLoop.scala:30)
at sbt.MainLoop$.runLogged(MainLoop.scala:22)
at sbt.StandardMain$.runManaged(Main.scala:54)
at sbt.xMain.run(Main.scala:29)
  

HiveContext in spark

2016-04-12 Thread Selvam Raman
I am not able to use the INSERT, UPDATE and DELETE commands in HiveContext.

I am using Spark 1.6.1 and Hive 1.1.0.

Please find the error below.



​scala> hc.sql("delete from  trans_detail where counter=1");
16/04/12 14:58:45 INFO ParseDriver: Parsing command: delete from
 trans_detail where counter=1
16/04/12 14:58:45 INFO ParseDriver: Parse Completed
16/04/12 14:58:45 INFO ParseDriver: Parsing command: delete from
 trans_detail where counter=1
16/04/12 14:58:45 INFO ParseDriver: Parse Completed
16/04/12 14:58:45 INFO BlockManagerInfo: Removed broadcast_2_piece0 on
localhost:60409 in memory (size: 46.9 KB, free: 536.7 MB)
16/04/12 14:58:46 INFO ContextCleaner: Cleaned accumulator 3
16/04/12 14:58:46 INFO BlockManagerInfo: Removed broadcast_4_piece0 on
localhost:60409 in memory (size: 3.6 KB, free: 536.7 MB)
org.apache.spark.sql.AnalysisException:
Unsupported language features in query: delete from  trans_detail where
counter=1
TOK_DELETE_FROM 1, 0,11, 13
  TOK_TABNAME 1, 5,5, 13
trans_detail 1, 5,5, 13
  TOK_WHERE 1, 7,11, 39
= 1, 9,11, 39
  TOK_TABLE_OR_COL 1, 9,9, 32
counter 1, 9,9, 32
  1 1, 11,11, 40

scala.NotImplementedError: No parse rules for TOK_DELETE_FROM:
 TOK_DELETE_FROM 1, 0,11, 13
  TOK_TABNAME 1, 5,5, 13
trans_detail 1, 5,5, 13
  TOK_WHERE 1, 7,11, 39
= 1, 9,11, 39
  TOK_TABLE_OR_COL 1, 9,9, 32
counter 1, 9,9, 32
  1 1, 11,11, 40

org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:1217)
​



-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Starting Spark Job Remotely with Function Call

2016-04-12 Thread Prateek .
Hi,

To start a Spark job remotely, we are using:


1)  spark-submit

2)  Spark Job Server


For quick testing of my application in the IDE, I call a function that creates the 
SparkContext, sets all the Spark configuration and the master, and executes the flow, 
and I can see the output in my driver.
I was wondering what happens if we don't use spark-submit / Spark Job Server and 
instead just call the function that executes the job.

Will that have any implications in a production environment? Am I missing some 
important points?


Thank You,
Prateek

"DISCLAIMER: This message is proprietary to Aricent and is intended solely for 
the use of the individual to whom it is addressed. It may contain privileged or 
confidential information and should not be circulated or used for any purpose 
other than for what it is intended. If you have received this message in error, 
please notify the originator immediately. If you are not the intended 
recipient, you are notified that you are strictly prohibited from using, 
copying, altering, or disclosing the contents of this message. Aricent accepts 
no responsibility for loss or damage arising from the use of the information 
transmitted by this email including damage from virus."


Importing hive thrift server

2016-04-12 Thread ram kumar
Hi,

In spark-shell, we start the Hive Thrift server by importing:
import org.apache.spark.sql.hive.thriftserver._

Is there a package for importing it from pyspark?

Thanks


Performance of algorithm on top of Pregel non linearly depends on the number of iterations

2016-04-12 Thread lamerman
I try to run a greedy graph partitioning algorithm on top of Pregel. 

My algorithm is pretty easy

val InitialMsg = 
val NoGroup = -1

def mergeMsg(msg1: Int, msg2: Int): Int = {
msg1 min msg2
}

def vprog(vertexId: VertexId, value: Long, message: Int): Long = {
if (message == InitialMsg) {
value
} else {
message
}
}

def sendMsg(triplet: EdgeTriplet[Long, Boolean]): Iterator[(VertexId, Int)]
= {
if (triplet.srcAttr == NoGroup) {
Iterator.empty
} else if (triplet.dstAttr != NoGroup) {
Iterator.empty
} else { 
Iterator((triplet.dstId, triplet.srcAttr.toInt))
}
}

Before running it I just assign groups (1,2,3...n) to random vertices of a
big graph. The rest remains with no group (the value is -1).

It does the actual partitioning, but the problem is that every next job is
slower than the previous one. In the beginning I understand why it may get slower:
the number of active vertices grows up to some point. But after that point the
runtime of every job should decrease as the number of active vertices
decreases until it reaches zero.

But in reality it only grows. I can run 80 iterations on the same graph in
1 minute, while 150 iterations require 17 minutes, and I expect that 300 will
require hours. As I understand the algorithm, the complexity of each successive
job should not increase, but the numbers say it does.

The Spark version is 1.6.0 and 1.6.1 for this behaviour.

What is wrong?
If you need more info let me know.

Thank you



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Performance-of-algorithm-on-top-of-Pregel-non-linearly-depends-on-the-number-of-iterations-tp26760.html



Choosing an Algorithm in Spark MLib

2016-04-12 Thread Joe San


I'm working on a problem where I have some data sets about power
generating units. Each of these units has been activated to run in the
past, and during activation some units ran into issues. I now have all
this data and would like to come up with some sort of ranking for these
generating units. The criteria for ranking would be pretty simple to start
with. They are:

   1. Maximum number of times a particular generating unit was activated
   2. How many times the generating unit ran into problems during
   activation

Later on I would expand on this ranking algorithm by adding more criteria.
I will be using Apache Spark MLIB library and I can already see that there
are quite a few algorithms already in place.

http://spark.apache.org/docs/latest/mllib-guide.html

I'm just not sure which algorithm would fit my purpose. Any suggestions?


Re: how to deploy new code with checkpointing

2016-04-12 Thread Cody Koeninger
- Checkpointing alone isn't enough to get exactly-once semantics.
Events will be replayed in case of failure.  You must have idempotent
output operations.

- Another way to handle upgrades is to just start a second app with
the new code, then stop the old one once everything's caught up.
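
A minimal sketch of the manual-offset approach with the direct Kafka stream
(the broker, topic, and where you store offsets/results are placeholders for your own setup):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val ssc = new StreamingContext(
  new SparkConf().setAppName("offset-demo").setMaster("local[2]"), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder broker
val topics = Set("events")                                       // placeholder topic

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // 1) write the results idempotently, e.g. upsert keyed on a natural id
  // 2) then store `offsets` yourself (DB / ZooKeeper / HDFS), ideally in the same
  //    transaction, so a restarted or upgraded job resumes where it left off
}

ssc.start()
ssc.awaitTermination()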

On Tue, Apr 12, 2016 at 1:15 AM, Soumitra Siddharth Johri
 wrote:
> I think before doing a code update you would like to gracefully shutdown
> your streaming job and checkpoint the processed offsets ( and any state that
> you maintain ) in database or Hdfs.
> When you start the job up it should read this checkpoint file , build the
> necessary state and begin processing from the last offset processed.
>
> Another approach would be to checkpoint the processed offsets in the
> streaming job whenever you read from Kafka . Then before reading the next
> batch of offsets instead of relying on spark checkpoint for offsets, read
> from the last processed offset that you saved.
>
> Regards
> Soumitra
>
> On Apr 11, 2016, at 8:31 PM, Siva Gudavalli  wrote:
>
> Okie. That makes sense.
>
> Any recommendations on how to manage changes to my spark streaming app and
> achieving fault tolerance at the same time
>
> On Mon, Apr 11, 2016 at 8:16 PM, Shixiong(Ryan) Zhu
>  wrote:
>>
>> You cannot. Streaming doesn't support it because code changes will break
>> Java serialization.
>>
>> On Mon, Apr 11, 2016 at 4:30 PM, Siva Gudavalli 
>> wrote:
>>>
>>> hello,
>>>
>>> i am writing a spark streaming application to read data from kafka. I am
>>> using no receiver approach and enabled checkpointing to make sure I am not
>>> reading messages again in case of failure. (exactly once semantics)
>>>
>>> i have a quick question how checkpointing needs to be configured to
>>> handle code changes in my spark streaming app.
>>>
>>> can you please suggest. hope the question makes sense.
>>>
>>> thank you
>>>
>>> regards
>>> shiv
>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: History Server Refresh?

2016-04-12 Thread Steve Loughran

On 12 Apr 2016, at 00:21, Miles Crawford 
> wrote:

Hey there. I have my spark applications set up to write their event logs into 
S3 - this is super useful for ephemeral clusters, I can have persistent history 
even though my hosts go away.

A history server is set up to view this s3 location, and that works fine too - 
at least on startup.

The problem is that the history server doesn't seem to notice new logs arriving 
into the S3 bucket.  Any idea how I can get it to scan the folder for new files?

Thanks,
-miles

s3 isn't a real filesystem, and apps writing to it don't have any data written
until one of:
 - the output stream is close()'d. This happens at the end of the app
 - the file is set up to be partitioned and a partition size is crossed

Until either of those conditions is met, the history server isn't going to see
anything.

If you are going to use s3 as the dest, and you want to see incomplete apps, 
then you'll need to configure the spark job to have smaller partition size (64? 
128? MB).
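
(For reference, the rescan frequency itself is controlled by
spark.history.fs.update.interval - 10s by default - e.g. in the history server's config:

spark.history.fs.update.interval  10s

but, per the above, a shorter interval won't help until the data actually lands in S3.)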

If it's completed apps that aren't being seen by the HS, then that's a bug, 
though if its against s3 only, likely to be something related to directory 
listings


Re: [ML] Training with bias

2016-04-12 Thread Nick Pentreath
Are you referring to fitting the intercept term? You can use
lr.setFitIntercept (though it is true by default):

scala> lr.explainParam(lr.fitIntercept)
res27: String = fitIntercept: whether to fit an intercept term (default:
true)
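
For instance, a minimal sketch with the ml API (assuming a DataFrame named training
with "label" and "features" columns):

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setFitIntercept(true)        // explicit here, though true is already the default
val model = lr.fit(training)
println(model.intercept)        // the fitted bias/intercept term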

On Mon, 11 Apr 2016 at 21:59 Daniel Siegmann 
wrote:

> I'm trying to understand how I can add a bias when training in Spark. I
> have only a vague familiarity with this subject, so I hope this question
> will be clear enough.
>
> Using liblinear a bias can be set - if it's >= 0, there will be an
> additional weight appended in the model, and predicting with that model
> will automatically append a feature for the bias.
>
> Is there anything similar in Spark, such as for logistic regression? The
> closest thing I can find is MLUtils.appendBias, but this seems to require
> manual work on both the training and scoring side. I was hoping for
> something that would just be part of the model.
>
>
> ~Daniel Siegmann
>


Re: HashingTF "compatibility" across Python, Scala?

2016-04-12 Thread Nick Pentreath
I should point out that actually the "ml" version of HashingTF does call
into Java so that will be consistent across Python and Java.

It's the "mllib" version in PySpark that implements its own version using
Python's "hash" function (while Java uses Object.hashCode).
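
For example, a small sketch of the "ml" HashingTF (assuming a DataFrame wordsDF with
an array column "words"), which should hash identically from Python and Scala because
the Python side delegates to the JVM:

import org.apache.spark.ml.feature.HashingTF

val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1 << 18)          // 2^18 buckets, the default
val featurized = hashingTF.transform(wordsDF)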

On Thu, 7 Apr 2016 at 18:19 Nick Pentreath  wrote:

> You're right Sean, the implementation depends on hash code currently so
> may differ. I opened a JIRA (which duplicated this one -
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-10574
> which is the active JIRA), for using murmurhash3 which should then be
> consistent across platforms & langs (as well as more performant).
>
> It's also odd (legacy I think) that the Python version has its own
> implementation rather than calling into Java. That should also be changed
> probably.
> On Thu, 7 Apr 2016 at 17:59, Sean Owen  wrote:
>
>> Let's say I use HashingTF in my Pipeline to hash a string feature.
>> This is available in Python and Scala, but they hash strings to
>> different values since both use their respective runtime's native hash
>> implementation. This means that I create different feature vectors for
>> the same input. While I can load/store something like a
>> NaiveBayesModel across the two languages successfully, it seems like
>> the hashing part doesn't translate.
>>
>> Is that accurate, or, have I completely missed a way to get the same
>> hashing for the same input across languages?
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread Kevin Eid
Hi,

Thanks for your emails. I tried running your command but it returned "No
such file or directory".
So I definitely need to move my local .py files to the cluster. I tried to
log in (before sshing) but could not find the master:
./spark-ec2 -k key -i key.pem login weather-cluster
- then, after sshing, copy-dir is located under spark-ec2, but to replicate
my files across all nodes I need to get them into the root folder on the
Spark EC2 cluster:
./spark-ec2/copy-dir /root/spark/myfiles

I used that: http://spark.apache.org/docs/latest/ec2-scripts.html.

Do you have any suggestions about how to move those files from local to the
cluster?
Thanks in advance,
Kevin
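
(A minimal sketch of one way to do this, assuming the standard spark-ec2 layout and a
hypothetical local directory my_py_files/:

# from your machine, copy the files onto the master
scp -i key.pem -r my_py_files root@<master-public-dns>:/root/spark/my_py_files
# then, on the master, replicate them to every slave
/root/spark-ec2/copy-dir /root/spark/my_py_files

The master's public DNS can be printed with ./spark-ec2 -k key -i key.pem get-master weather-cluster.)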

On 12 April 2016 at 12:19, Sun, Rui  wrote:

> Which py file is your main file (primary py file)? Zip the other two py
> files. Leave the main py file alone. Don't copy them to S3 because it seems
> that only local primary and additional py files are supported.
>
> ./bin/spark-submit --master spark://... --py-files <zipped py files> <primary py file>
>
> -Original Message-
> From: kevllino [mailto:kevin.e...@mail.dcu.ie]
> Sent: Tuesday, April 12, 2016 5:07 PM
> To: user@spark.apache.org
> Subject: Run a self-contained Spark app on a Spark standalone cluster
>
> Hi,
>
> I need to know how to run a self-contained Spark app  (3 python files) in
> a Spark standalone cluster. Can I move the .py files to the cluster, or
> should I store them locally, on HDFS or S3? I tried the following locally
> and on S3 with a zip of my .py files as suggested  here <
> http://spark.apache.org/docs/latest/submitting-applications.html>  :
>
> ./bin/spark-submit --master
> spark://ec2-54-51-23-172.eu-west-1.compute.amazonaws.com:5080
> --py-files
> s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@mubucket
> //weather_predict.zip
>
> But get: “Error: Must specify a primary resource (JAR or Python file)”
>
> Best,
> Kevin
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Run-a-self-contained-Spark-app-on-a-Spark-standalone-cluster-tp26753.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Kevin EID
M.Sc. in Computing, Data Analytics



DStream how many RDD's are created by batch

2016-04-12 Thread Natu Lauchande
Hi,

What's the criterion for the number of RDDs created for each micro-batch
iteration?

Thanks,
Natu


RE: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Sun, Rui
  val ALLOW_MULTIPLE_CONTEXTS = booleanConf("spark.sql.allowMultipleContexts",
defaultValue = Some(true),
doc = "When set to true, creating multiple SQLContexts/HiveContexts is 
allowed." +
  "When set to false, only one SQLContext/HiveContext is allowed to be 
created " +
  "through the constructor (new SQLContexts/HiveContexts created through 
newSession " +
  "method is allowed). Please note that this conf needs to be set in Spark 
Conf. Once" +
  "a SQLContext/HiveContext has been created, changing the value of this 
conf will not" +
  "have effect.",
isPublic = true)

I don't think there are any performance penalties in doing so.
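
A minimal sketch (note that HiveContext is itself a SQLContext subclass, so you rarely
need both, but creating both works with the default conf):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("two-contexts").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)
val hiveContext = new HiveContext(sc)  // fine while spark.sql.allowMultipleContexts stays true (the default)
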
From: Natu Lauchande [mailto:nlaucha...@gmail.com]
Sent: Tuesday, April 12, 2016 4:49 PM
To: user@spark.apache.org
Subject: Can i have a hive context and sql context in the same app ?

Hi,
Is it possible to have both a sqlContext and a hiveContext in the same 
application ?
If yes, would there be any performance penalties in doing so?

Regards,
Natu


Graphical representation of Spark Decision Tree . How to do it ?

2016-04-12 Thread Eli Super
Hi Spark Users,

I need your help.

I've some output after running DecisionTree :



I work with Jupyter notebook and python 2.7

How can I create a graphical representation of the Decision Tree model?

In sklearn I can use tree.export_graphviz, and in R I can see the Decision
Tree output as well.

How can I get a tree representation from Spark in a Jupyter notebook?

My code below :

model_1 = DecisionTree.trainClassifier(joined_rdd_1, numClasses=2,
categoricalFeaturesInfo={0:2, 1:16,..},impurity='gini', maxDepth=3,
maxBins=32)

print(model_1.toDebugString())

Thanks !


RE: Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread Sun, Rui
Which py file is your main file (primary py file)? Zip the other two py files. 
Leave the main py file alone. Don't copy them to S3 because it seems that only 
local primary and additional py files are supported.

./bin/spark-submit --master spark://... --py-files <zipped py files> <primary py file>

-Original Message-
From: kevllino [mailto:kevin.e...@mail.dcu.ie] 
Sent: Tuesday, April 12, 2016 5:07 PM
To: user@spark.apache.org
Subject: Run a self-contained Spark app on a Spark standalone cluster

Hi, 

I need to know how to run a self-contained Spark app  (3 python files) in a 
Spark standalone cluster. Can I move the .py files to the cluster, or should I 
store them locally, on HDFS or S3? I tried the following locally and on S3 with 
a zip of my .py files as suggested  here 
  : 

./bin/spark-submit --master
spark://ec2-54-51-23-172.eu-west-1.compute.amazonaws.com:5080 --py-files
s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@mubucket//weather_predict.zip

But get: “Error: Must specify a primary resource (JAR or Python file)”

Best,
Kevin 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Run-a-self-contained-Spark-app-on-a-Spark-standalone-cluster-tp26753.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-12 Thread Sun, Rui
There is much deployment preparation work handling different deployment modes 
for pyspark and SparkR in SparkSubmit. It is difficult to summarize it briefly, 
you had better refer to the source code.

Supporting running Julia scripts in SparkSubmit is more than implementing a 
‘JuliaRunner’. One part is passing the command line options, like “--master”, 
from the JVM launched by spark-submit to the JVM where SparkContext resides, in 
the case that the two JVMs are not the same. For pySpark & SparkR, when running 
scripts in client deployment modes (standalone client and yarn client), the JVM 
is the same (py4j/RBackend running as a thread in the JVM launched by 
spark-submit) , so no need to pass the command line options around. However, in 
your case, Julia interpreter launches an in-process JVM for SparkContext, which 
is a separate JVM from the one launched by spark-submit. So you need a way, 
typically an environment variable, like “SPARKR_SUBMIT_ARGS” for
SparkR or “PYSPARK_SUBMIT_ARGS” for pyspark, to pass command line args to the
in-process JVM in the Julia interpreter so that SparkConf can pick up the options.
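
For illustration, a sketch of the runner side (JULIA_SUBMIT_ARGS is a hypothetical
variable name, mirroring what pyspark and SparkR do with their *_SUBMIT_ARGS variables):

import scala.collection.JavaConverters._

// inside a hypothetical JuliaRunner invoked by SparkSubmit
val juliaScript = "my_script.jl"                     // would come from the primary resource argument
val submitArgs  = Seq("--master", "yarn")            // would be forwarded from SparkSubmit's parsed options
val pb = new ProcessBuilder(Seq("julia", juliaScript).asJava)
pb.environment().put("JULIA_SUBMIT_ARGS", submitArgs.mkString(" "))
pb.inheritIO()
val exitCode = pb.start().waitFor()
// the Julia side reads JULIA_SUBMIT_ARGS and folds the options into the SparkConf
// it builds before creating the in-process JVM / SparkContext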

From: Andrei [mailto:faithlessfri...@gmail.com]
Sent: Tuesday, April 12, 2016 3:48 AM
To: user 
Subject: How does spark-submit handle Python scripts (and how to repeat it)?

I'm working on a wrapper [1] around Spark for the Julia programming language 
[2] similar to PySpark. I've got it working with Spark Standalone server by 
creating local JVM and setting master programmatically. However, this approach 
doesn't work with YARN (and probably Mesos), which require running via 
`spark-submit`.

In `SparkSubmit` class I see that for Python a special class `PythonRunner` is 
launched, so I tried to do similar `JuliaRunner`, which essentially does the 
following:

val pb = new ProcessBuilder(Seq("julia", juliaScript))
val process = pb.start()
process.waitFor()

where `juliaScript` itself creates new JVM and `SparkContext` inside it WITHOUT 
setting master URL. I then tried to launch this class using

spark-submit --master yarn \
  --class o.a.s.a.j.JuliaRunner \
  project.jar my_script.jl

I expected that `spark-submit` would set environment variables or something 
that SparkContext would then read and connect to appropriate master. This 
didn't happen, however, and process failed while trying to instantiate 
`SparkContext`, saying that master is not specified.

So what am I missing? How can use `spark-submit` to run driver in a non-JVM 
language?


[1]: https://github.com/dfdx/Sparta.jl
[2]: http://julialang.org/


Re: Need Streaming output to single HDFS File

2016-04-12 Thread Sachin Aggarwal
Hey, you can use repartition and set it to 1, as in this example:

unionDStream.foreachRDD((rdd, time) => {
  val count = rdd.count()
  println("count " + count)
  if (count > 0) {
    println("rdd partitions = " + rdd.partitions.length)
    // numFilesPerPartition is defined elsewhere; use 1 for a single output file
    val outputRDD = rdd.repartition(numFilesPerPartition)
    outputRDD.saveAsTextFile(outputDirectory + "/" + rdd.id)
  }
})

On Tue, Apr 12, 2016 at 3:59 PM, Priya Ch 
wrote:

> Hi All,
>
>   I am working with Kafka, Spark Streaming and I want to write the
> streaming output to a single file. dstream.saveAsTexFiles() is creating
> files in different folders. Is there a way to write to a single folder ? or
> else if written to different folders, how do I merge them ?
> Thanks,
> Padma Ch
>



-- 

Thanks & Regards

Sachin Aggarwal
7760502772


Need Streaming output to single HDFS File

2016-04-12 Thread Priya Ch
Hi All,

  I am working with Kafka, Spark Streaming and I want to write the
streaming output to a single file. dstream.saveAsTexFiles() is creating
files in different folders. Is there a way to write to a single folder ? or
else if written to different folders, how do I merge them ?
Thanks,
Padma Ch


Re: Running Spark on Yarn-Client/Cluster mode

2016-04-12 Thread Jon Kjær Amundsen
Hi Ashesh

You might be experiencing problems with the virtual memory allocation.
Try grepping the yarn-hadoop-nodemanager-*.log (found in
$HADOOP_INSTALL/logs) for 'virtual memory limits'
If you see a message like
-
WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Container [pid=5692,containerID=container_1460451108851_0001_02_01]
is running beyond virtual memory limits. Current usage: 248.7 MB of 1
GB physical memory used; 2.1 GB of 2.1 GB virtual memory used. Killing
container.
-
that's your problem.

You can solve it setting the yarn.nodemanager.vmem-pmem-ratio to a
higher ratio in yarn-site.xml like this:
--
 
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5</value>
    <description>Ratio between virtual memory to physical memory when
    setting memory limits for containers</description>
  </property>
--

You can also totally disable the vmem-check.
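
If you go that route, the check is controlled by yarn.nodemanager.vmem-check-enabled
in yarn-site.xml, e.g.:

  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits will be enforced for containers</description>
  </property>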

See https://issues.apache.org/jira/browse/YARN-4714 for further info

/ Jon
/ jokja
Venlig hilsen/Best regards

Jon Kjær Amundsen
Information Architect & Product Owner

Phone: +45 7023 9080
Direct: +45 8882 1331
E-mail: j...@udbudsvagten.dk
Web: www.udbudsvagten.dk
Nitivej 10 | DK - 2000 Frederiksberg


Intelligent Offentlig Samhandel
Før, under og efter udbud

Følg UdbudsVagten og markedet her Linkedin


2016-04-12 8:06 GMT+02:00 ashesh_28 :
> I have updated all my nodes in the cluster to have 4 GB of RAM, but I still
> face the same error when trying to launch spark-shell in yarn-client mode
>
> Any suggestion ?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-on-Yarn-Client-Cluster-mode-tp26691p26752.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread Alonso Isidoro Roman
I don't know how to do it with python, but scala has a plugin named
sbt-pack that creates an auto contained unix command with your code, no
need to use spark-submit. It should be out there something similar to this
tool.



Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-04-12 11:06 GMT+02:00 kevllino :

> Hi,
>
> I need to know how to run a self-contained Spark app  (3 python files) in a
> Spark standalone cluster. Can I move the .py files to the cluster, or
> should
> I store them locally, on HDFS or S3? I tried the following locally and on
> S3
> with a zip of my .py files as suggested  here
>   :
>
> ./bin/spark-submit --master
> spark://ec2-54-51-23-172.eu-west-1.compute.amazonaws.com:5080
> --py-files
> s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@mubucket
> //weather_predict.zip
>
> But get: “Error: Must specify a primary resource (JAR or Python file)”
>
> Best,
> Kevin
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Run-a-self-contained-Spark-app-on-a-Spark-standalone-cluster-tp26753.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Installation Issues - Spark 1.6.0 With Hadoop 2.6 - Pre Built On Windows 7

2016-04-12 Thread My List
Dear Experts,

Need help to get this resolved -

What am I doing wrong? Any Help greatly appreciated.

Env -
Windows 7 - 64 bit OS
Spark 1.6.0 With Hadoop 2.6 - Pre Built setup
JAVA_HOME - point to 1.7
SCALA_HOME - 2.11

I have Admin User and Standard User on Windows.

All the setups and running of spark is done using the Standard User Acc.

Have Spark setup on D
drive.- D:\Home\Prod_Inst\BigData\Spark\VER_1_6_0_W_H_2_6
Have set Hadoop_Home to point to winutils.exe (64 bit version) on D drive
- D:\Home\Prod_Inst\BigData\Spark\MySparkSetup\winutils

Standard User Acc -  w7-PC\Shaffu_Knowledge
Using Standard User account - created mkdir D:\tmp\hive
Using Standard User account - winutils.exe chmod -R 777 D:\tmp
Using Standard User account - winutils.exe ls D:\tmp and D:\tmp\hive

drwxrwxrwx 1 w7-PC\Shaffu_Knowledge w7-PC\None 0 Apr 12 2016 \tmp
drwxrwxrwx 1 w7-PC\Shaffu_Knowledge w7-PC\None 0 Apr 12 2016 \tmp\hive

Running spark-shell results in the following exception

D:\Home\Prod_Inst\BigData\Spark\VER_1_6_0_W_H_2_6>./bin/spark-shell
'.' is not recognized as an internal or external command,
operable program or batch file.

D:\Home\Prod_Inst\BigData\Spark\VER_1_6_0_W_H_2_6>.\bin\spark-shell
16/04/12 14:40:16 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
  /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java
1.7.0_25)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
16/04/12 14:40:22 WARN General: Plugin (Bundle)
"org.datanucleus.store.rdbms" is already registered. Ensure you dont have
multiple JAR versions of the same plugin in the classpath. The URL
"file:/D:/Home/Prod_Inst/BigData/Spark/VER_1_6_0_W_H_2_6/lib/
16/04/12 14:40:22 WARN General: Plugin (Bundle) "org.datanucleus" is
already registered. Ensure you dont have multiple JAR versions of the same
plugin in the classpath. The URL
"file:/D:/Home/Prod_Inst/BigData/Spark/VER_1_6_0_W_H_2_6/bin/../lib/datan
16/04/12 14:40:22 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo"
is already registered. Ensure you dont have multiple JAR versions of the
same plugin in the classpath. The URL
"file:/D:/Home/Prod_Inst/BigData/Spark/VER_1_6_0_W_H_2_6/bin/../l
16/04/12 14:40:22 WARN Connection: BoneCP specified but not present in
CLASSPATH (or one of dependencies)
16/04/12 14:40:23 WARN Connection: BoneCP specified but not present in
CLASSPATH (or one of dependencies)
16/04/12 14:40:38 WARN ObjectStore: Version information not found in
metastore. hive.metastore.schema.verification is not enabled so recording
the schema version 1.2.0
16/04/12 14:40:38 WARN ObjectStore: Failed to get database default,
returning NoSuchObjectException
16/04/12 14:40:39 WARN : Your hostname, w7-PC resolves to a
loopback/non-reachable address: fe80:0:0:0:8d4f:1fa9:cf7d:23d0%17, but we
couldn't find any external IP address!
java.lang.RuntimeException: java.lang.RuntimeException: The root scratch
dir: /tmp/hive on HDFS should be writable. Current permissions are:
rw-rw-rw-
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at
org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:194)
at
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
at
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
at
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
at
org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
at
org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
at
org.apache.spark.sql.UDFRegistration.(UDFRegistration.scala:40)
at org.apache.spark.sql.SQLContext.(SQLContext.scala:330)
at
org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
at
org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
at $iwC$$iwC.(:15)
at $iwC.(:24)
at (:26)
at .(:30)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at

Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread kevllino
Hi, 

I need to know how to run a self-contained Spark app  (3 python files) in a
Spark standalone cluster. Can I move the .py files to the cluster, or should
I store them locally, on HDFS or S3? I tried the following locally and on S3
with a zip of my .py files as suggested  here
  : 

./bin/spark-submit --master
spark://ec2-54-51-23-172.eu-west-1.compute.amazonaws.com:5080 --py-files
s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@mubucket//weather_predict.zip

But get: “Error: Must specify a primary resource (JAR or Python file)”

Best, 
Kevin 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Run-a-self-contained-Spark-app-on-a-Spark-standalone-cluster-tp26753.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Natu Lauchande
Hi,

Is it possible to have both a sqlContext and a hiveContext in the same
application?

If yes, would there be any performance penalties in doing so?

Regards,
Natu


Init/Setup worker thread

2016-04-12 Thread Perrin, Lionel
Hello,

I'm looking for a solution to use JRuby on top of Spark. It looks like the work
required is quite small. The only tricky point is that I need to make sure that
every worker thread has a ruby interpreter initialized. Basically, I need to
register a function to be called when each worker thread is created: a thread-local
variable must be set for the ruby interpreter so that ruby objects can be
deserialized.

Is there any solution to set up the worker threads before any Spark call is made
on those threads?

I've found a workaround using a Java object that wraps any ruby function, like
rdd.map( JavaFuncWrapper.new(ruby_function) ). In the Java JavaFuncWrapper, the
readObject method used during deserialization has been overridden to perform the
initialization. I must say that I'm not very confident in this workaround and
would prefer a more straightforward solution.

Thanks for your help,

Regards,

Lionel
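
One common pattern - a sketch only, not specific to this JRuby wrapper - is to
initialize lazily per thread from inside the tasks themselves, via a ThreadLocal
held in a singleton object shipped in your jar. rdd and the scriptlet below are
placeholders:

import org.jruby.embed.ScriptingContainer   // JRuby embed API, assumed on the executor classpath

object RubyEnv {
  // one interpreter per executor thread, created lazily the first time a task needs it
  private val local = new ThreadLocal[ScriptingContainer] {
    override def initialValue(): ScriptingContainer = new ScriptingContainer()
  }
  def interpreter: ScriptingContainer = local.get()
}

rdd.mapPartitions { iter =>
  val ruby = RubyEnv.interpreter                 // forces per-thread init before the first record
  iter.map(x => ruby.runScriptlet(x.toString))   // placeholder for the real ruby call
}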

-

Moody's monitors email communications through its networks for regulatory 
compliance purposes and to protect its clients, employees and business and 
where allowed to do so by applicable law. Parties communicating with Moody's 
consent to such monitoring by their use of the email communication. The 
information contained in this e-mail message, and any attachment thereto, is 
confidential and may not be disclosed without our express permission. If you 
are not the intended recipient or an employee or agent responsible for 
delivering this message to the intended recipient, you are hereby notified that 
you have received this message in error and that any review, dissemination, 
distribution or copying of this message, or any attachment thereto, in whole or 
in part, is strictly prohibited. If you have received this message in error, 
please immediately notify us by telephone, fax or e-mail and delete the message 
and all of its attachments. Thank you. Every effort is made to keep our network 
free from viruses. You should, however, review this e-mail message, as well as 
any attachment thereto, for viruses. We take no responsibility and have no 
liability for any computer virus which may be transferred via this e-mail 
message. 

-


Check if spark master/history server is running via Java

2016-04-12 Thread Mihir Monani
Hi,

How do I check whether the Spark master / history server is running on a node?
Is there a command for it?

I would like to accomplish this with Java if possible.

Thanks,
Mihir Monani
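
One option - a sketch, written in Scala here but it maps directly onto Java's
HttpURLConnection - is to probe the web UIs over HTTP. With default ports, the
standalone master serves its UI on 8080 and the history server exposes a REST
endpoint at /api/v1/applications on 18080 (the host names below are placeholders):

import java.net.{HttpURLConnection, URL}

def isUp(url: String): Boolean =
  try {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setConnectTimeout(2000)
    conn.setReadTimeout(2000)
    conn.setRequestMethod("GET")
    conn.getResponseCode == 200
  } catch {
    case _: java.io.IOException => false
  }

println("master up?  " + isUp("http://master-host:8080/"))
println("history up? " + isUp("http://history-host:18080/api/v1/applications"))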


Re: how to deploy new code with checkpointing

2016-04-12 Thread Soumitra Siddharth Johri
I think before doing a code update you would like to gracefully shutdown your 
streaming job and checkpoint the processed offsets ( and any state that you 
maintain ) in database or Hdfs.
When you start the job up it should read this checkpoint file , build the 
necessary state and begin processing from the last offset processed.

Another approach would be to checkpoint the processed offsets in the streaming 
job whenever you read from Kafka . Then before reading the next batch of 
offsets instead of relying on spark checkpoint for offsets, read from the last 
processed offset that you saved.

Regards
Soumitra

> On Apr 11, 2016, at 8:31 PM, Siva Gudavalli  wrote:
> 
> Okie. That makes sense. 
> 
> Any recommendations on how to manage changes to my spark streaming app and 
> achieving fault tolerance at the same time
> 
>> On Mon, Apr 11, 2016 at 8:16 PM, Shixiong(Ryan) Zhu 
>>  wrote:
>> You cannot. Streaming doesn't support it because code changes will break 
>> Java serialization.
>> 
>>> On Mon, Apr 11, 2016 at 4:30 PM, Siva Gudavalli  wrote:
>>> hello,
>>> 
>>> i am writing a spark streaming application to read data from kafka. I am 
>>> using no receiver approach and enabled checkpointing to make sure I am not 
>>> reading messages again in case of failure. (exactly once semantics) 
>>> 
>>> i have a quick question how checkpointing needs to be configured to handle 
>>> code changes in my spark streaming app. 
>>> 
>>> can you please suggest. hope the question makes sense.
>>> 
>>> thank you 
>>> 
>>> regards
>>> shiv
> 


Re: Running Spark on Yarn-Client/Cluster mode

2016-04-12 Thread ashesh_28
I have updated all my nodes in the cluster to have 4 GB of RAM, but I still
face the same error when trying to launch spark-shell in yarn-client mode

Any suggestion ?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-on-Yarn-Client-Cluster-mode-tp26691p26752.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org