Re: Concatenate a string to a Column of type string in DataFrame

2015-12-13 Thread Yanbo Liang
Sorry, concat was only added in 1.5.0, so it is not available in Spark 1.4.0.
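
For Spark 1.4.0, where concat is not yet in the DataFrame API, a UDF-based workaround should work. A minimal sketch (the column and DataFrame names are taken from the thread below; the UDF name is just illustrative):

import org.apache.spark.sql.functions.{col, udf}

// appendSuffix: a small UDF that appends a fixed string to a string column
val appendSuffix = udf((s: String) => s + "00:00:000")
val new_df = old_df.select(appendSuffix(col("String_Column")))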

2015-12-13 2:07 GMT+08:00 Satish :

> Hi,
> Will the below mentioned snippet work for Spark 1.4.0
>
> Thanks for your inputs
>
> Regards,
> Satish
> --
> From: Yanbo Liang 
> Sent: ‎12-‎12-‎2015 20:54
> To: satish chandra j 
> Cc: user 
> Subject: Re: Concatenate a string to a Column of type string in DataFrame
>
> Hi Satish,
>
> You can refer to the following code snippet:
> df.select(concat(col("String_Column"), lit("00:00:000")))
>
> Yanbo
>
> 2015-12-12 16:01 GMT+08:00 satish chandra j :
>
>> Hi,
>> When I am trying to update a column value in a DataFrame by incrementing a column
>> of integer data type, the code below works:
>>
>> val new_df=old_df.select(df("Int_Column")+10)
>>
>> If I implement a similar approach for appending a string to a column of
>> string datatype as below, then it does not error out but returns only
>> "null" values:
>>
>> val new_df=old_df.select(df("String_Column")+"00:00:000")
>>  OR
>> val dt ="00:00:000"
>> val new_df=old_df.select(df("String_Column")+toString(dt))
>>
>> Please suggest an approach to update a column value of datatype String.
>> Ex: if the column value is '20-10-2015', after the update it should be
>> '20-10-201500:00:000'
>>
>> Note: the transformation is such that a new DataFrame has to be created from the
>> old DataFrame
>>
>> Regards,
>> Satish Chandra
>>
>>
>>
>>
>
>


Re: Questions on Kerberos usage with YARN and JDBC

2015-12-13 Thread Mike Wright
Kerberos seems to be working otherwise ... for example, we're using it
successfully to control access to HDFS and it's linked to AD ... we're
using Ranger if that helps. I'm not a systems admin guy so this is really
not my area of expertise.


___

*Mike Wright*
Principal Architect, Software Engineering
S&P Capital IQ and SNL

434-951-7816 *p*
434-244-4466 *f*
540-470-0119 *m*

mwri...@snl.com



On Fri, Dec 11, 2015 at 4:36 PM, Todd Simmer  wrote:

> hey Mike,
>
> Are these part of an Active Directory domain? If so, are they pointed at
> the AD domain controllers that host the Kerberos server? Windows AD creates
> SRV records in DNS to help Windows clients find the Kerberos server for
> their domain. If you look, you can see whether you have a KDC record in Windows
> DNS and what it's pointing at. Can you do a
>
> kinit *username *
>
> on that host? It should tell you if it can find the KDC.
>
> Let me know if that's helpful at all.
>
> Todd
>
> On Fri, Dec 11, 2015 at 1:50 PM, Mike Wright  wrote:
>
>> As part of our implementation, we are utilizing a full "Kerberized"
>> cluster built on the Hortonworks suite. We're using Job Server as the front
>> end to initiate short-run jobs directly from our client-facing product
>> suite.
>>
>> 1) We believe we have configured the job server to start with the
>> appropriate credentials, specifying a principal and keytab. We switch to
>> YARN-CLIENT mode and can see Job Server attempt to connect to the resource
>> manager, and the result is that whatever the principal name is, it "cannot
>> impersonate root."  We have been unable to solve this.
>>
>> 2) We are primarily a Windows shop, hence our cluelessness here. That
>> said, we're using the JDBC driver version 4.2 and want to use JavaKerberos
>> authentication to connect to SQL Server. The queries performed by the job
>> are done in the driver, and hence would be running on the Job Server, which
>> we confirmed is running as the principal we have designated. However, when
>> attempting to connect with this option enabled I receive an "Unable to
>> obtain Principal Name for authentication" exception.
>>
>> Reading this:
>>
>> https://msdn.microsoft.com/en-us/library/ms378428.aspx
>>
>> We have Kerberos working on the machine and thus have krb5.conf set up
>> correctly. However, the section "Enabling the Domain Configuration File and
>> the Login Module Configuration File" seems to indicate we've missed a step
>> somewhere.
>>
>> Forgive my ignorance here ... I've been on Windows for 20 years and this
>> is all new to me.
>>
>> Thanks for any guidance you can provide.
>>
>
>


How to unpack the values of an item in an RDD so I can create an RDD with multiple items?

2015-12-13 Thread Abhishek Shivkumar
Hi,

I have an RDD of many items.

Each item has a key and its value is a list of elements.

I want to unpack the elements of each item so that I can create a new RDD
where each item is the original key paired with one single element.

I tried doing RDD.flatmap(lambda line: [ (line[0], v) for v in line[1]])

but it throws an error saying "AttributeError: 'PipelinedRDD' object has no
attribute 'flatmap"

Can someone tell me the right way to unpack the values to different items
in the new RDD?

Thank you!

With Regards,
Abhishek S


Re: How to unpack the values of an item in an RDD so I can create an RDD with multiple items?

2015-12-13 Thread Nick Pentreath
The function name is flatMap, with a capital M, matching the Scala API.
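
For reference, the equivalent operation in the Scala API looks like this (a minimal sketch, assuming an RDD of key/list pairs; the element types are just illustrative):

// rdd: RDD[(String, List[Int])]
val flattened = rdd.flatMap { case (key, values) => values.map(v => (key, v)) }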

—
Sent from Mailbox

On Sun, Dec 13, 2015 at 7:40 PM, Abhishek Shivkumar
 wrote:

> Hi,
> I have a RDD of many items.
> Each item has a key and its value is a list of elements.
> I want to unpack the elements of the item so that I can create a new RDD
> with each of its item being the original key and one single element.
> I tried doing RDD.flatmap(lambda line: [ (line[0], v) for v in line[1]])
> but it throws an error saying "AttributeError: 'PipelinedRDD' object has no
> attribute 'flatmap"
> Can someone tell me the right way to unpack the values to different items
> in the new RDD?
> Thank you!
> With Regards,
> Abhishek S

Use of rdd.zipWithUniqueId() in DStream

2015-12-13 Thread Sourav Mazumder
Hi All,

I'm trying to use the zipWithUniqueId() function of RDD via the transform
function of a DStream. It does generate unique ids, always starting from 0 and
in sequence.

However, I'm not sure whether this is reliable behavior that is always
guaranteed to generate sequence numbers starting from 0.

Can anyone confirm ?

Regards,
Sourav
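
For reference, a minimal sketch of calling zipWithUniqueId() through transform, assuming a DStream[String] named stream. Note that zipWithUniqueId() only guarantees uniqueness within each RDD: the ids follow the pattern k * numPartitions + partitionIndex, so in general they are not a consecutive sequence. zipWithIndex() is the method that produces consecutive 0-based indices, at the cost of an extra job per RDD.

// ids unique within each batch RDD, but not guaranteed to be consecutive
val indexed = stream.transform(rdd => rdd.zipWithUniqueId())

// strictly consecutive, 0-based ids per batch, at the cost of one extra job per RDD
val strictlyIndexed = stream.transform(rdd => rdd.zipWithIndex())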


How to save Multilayer Perceptron Classifier model.

2015-12-13 Thread Vadim Gribanov
Hey everyone! I'm new to Spark and Scala. I looked at the examples in the user guide
and didn't find how to save a Multilayer Perceptron Classifier model to HDFS.

Trivial:

model.save(sc, "NNModel")

Didn’t work for me.

Help me please. 



Make Spark Streaming DataFrame as SQL table

2015-12-13 Thread Karthikeyan Muthukumar
Hi,

The aim here is as follows:
- read data from Socket using Spark Streaming every N seconds
- register received data as SQL table
- there will be more data read from HDFS etc as reference data, they will
also be registered as SQL tables
- the idea is to perform arbitrary SQL queries on the combined streaming &
reference data

Please see the code snippet below. I see that data is written out to disk from
"inside" the foreachRDD loop, but the same registered SQL table's data is
empty when written "outside" of the foreachRDD loop.

Please give your opinion/suggestions on how to fix this. Any other mechanism
to achieve the above stated "aim" is also welcome.

case class Record(id: Int, status: String, source: String)

object SqlApp2 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SqlApp2").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    // Create the streaming context with a 10 second batch size
    val ssc = new StreamingContext(sc, Seconds(10))

    val lines = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)

    var alldata: DataFrame = sqlContext.emptyDataFrame
    alldata.registerTempTable("alldata")

    val data = lines.foreachRDD((rdd: RDD[String], time: Time) => {
      import sqlContext.implicits._

      // Convert RDD[String] to DataFrame
      val data = rdd.map(w => {
        val words = w.split(" ")
        Record(words(0).toInt, words(1), words(2))
      }).toDF()

      // Register as table
      data.registerTempTable("alldata")
      data.save("inside/file" + System.currentTimeMillis(), "json",
        SaveMode.ErrorIfExists)  // this data is written properly
    })

    val dataOutside = sqlContext.sql("select * from alldata")
    dataOutside.save("outside/file" + System.currentTimeMillis(), "json",
      SaveMode.ErrorIfExists)  // this data is empty; how can the SQL table registered inside the foreachRDD loop be made visible to the rest of the application?

    ssc.start()
    ssc.awaitTermination()
  }
}

Thanks & Regards
MK
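
One hedged observation on the snippet above: the "select * from alldata" outside foreachRDD runs exactly once, while the application is being wired up and before any batch has arrived, which is why it sees an empty table. A common pattern is to run the combined query inside foreachRDD so it executes once per batch. A minimal sketch (the "refdata" table is a hypothetical reference table assumed to be registered elsewhere):

lines.foreachRDD { (rdd: RDD[String], time: Time) =>
  import sqlContext.implicits._
  val batch = rdd.map { w =>
    val words = w.split(" ")
    Record(words(0).toInt, words(1), words(2))
  }.toDF()
  batch.registerTempTable("alldata")

  // query combining the current batch with reference data, executed every batch
  val joined = sqlContext.sql("select a.* from alldata a join refdata r on a.id = r.id")
  joined.save("outside/file" + time.milliseconds, "json", SaveMode.ErrorIfExists)
}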


Re: Multiple drivers, same worker

2015-12-13 Thread Ted Yu
Just got back from my trip - I couldn't access gmail from laptop.

I took a look at the stack trace. I saw a few jetty threads getting blocked
but don't have much clue yet.

Will look at the stack some more.

On Wed, Dec 9, 2015 at 1:21 PM, andresb...@gmail.com 
wrote:

> Ok, attached you can see the jstack
>
> 2015-12-09 14:22 GMT-06:00 andresb...@gmail.com :
>
>> Sadly, no.
>>
>> The only evidence I have is the master's log which shows that the Driver
>> was requested:
>>
>> 15/12/09 18:25:06 INFO Master: Driver submitted
>> org.apache.spark.deploy.worker.DriverWrapper
>> 15/12/09 18:25:06 INFO Master: Launching driver
>> driver-20151209182506-0164 on worker
>> worker-20151209181534-172.31.31.159-7077
>>
>>
>>
>> 2015-12-09 14:19 GMT-06:00 Ted Yu :
>>
>>> When this happened, did you have a chance to take jstack of the stuck
>>> driver process ?
>>>
>>> Thanks
>>>
>>> On Wed, Dec 9, 2015 at 11:38 AM, andresb...@gmail.com <
>>> andresb...@gmail.com> wrote:
>>>
 Forgot to mention that it doesn't happen every time; it's pretty random
 so far. We've had complete days when it behaves just fine and others when
 it gets crazy. We're using Spark 1.5.2.

 2015-12-09 13:33 GMT-06:00 andresb...@gmail.com :

> Hi everyone,
>
> We've been getting an issue with Spark lately where multiple drivers
> are assigned to the same worker, but resources are never assigned to them and
> they get "stuck" forever.
>
> If I log in to the worker machine I see that the driver processes
> aren't really running, and the worker's log doesn't show any error or anything
> related to the driver. The master UI does show the drivers as submitted and
> in RUNNING state.
>
>
> Not sure where else to look for clues, any ideas?
>
> --
> Andrés Blanco Morales
>



 --
 Andrés Blanco Morales

>>>
>>>
>>
>>
>> --
>> Andrés Blanco Morales
>>
>
>
>
> --
> Andrés Blanco Morales
>


Re: Inconsistent data in Cassandra

2015-12-13 Thread Gerard Maas
Hi Padma,

Have you considered reducing the dataset before writing it to Cassandra? Looks 
like this consistency problem could be avoided by cleaning the dataset of 
unnecessary records before persisting it:

val onlyMax = rddByPrimaryKey.reduceByKey { case (x, y) => Max(x, y) }
// your Max function here will need to pick the right max value from the records
// attached to the same primary key

-kr, Gerard.
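
As a concrete illustration of the idea above (a sketch only; the record type and its timestamp field are hypothetical stand-ins for the actual schema), keeping only the most recent record per primary key:

import org.apache.spark.rdd.RDD

// Hypothetical record type; "timestamp" stands in for whatever field decides which record wins
case class Event(id: String, timestamp: Long, payload: String)

def keepLatest(events: RDD[Event]): RDD[Event] =
  events
    .map(e => (e.id, e))                                               // key by primary key
    .reduceByKey((x, y) => if (x.timestamp >= y.timestamp) x else y)   // keep the most recent record per key
    .values                                                            // back to plain records, ready to persist to Cassandra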


why "cache table a as select * from b" will do shuffle,and create 2 stages.

2015-12-13 Thread ant2nebula
why "cache table a as select * from b" will do shuffle,and create 2 stages.

 

Example:

The table "ods_pay_consume" comes from "KafkaUtils.createDirectStream".

  hiveContext.sql("cache table dwd_pay_consume as select * from ods_pay_consume")

This code produces 2 stages in the DAG.

  hiveContext.sql("cache table dw_game_server_recharge as select * from dwd_pay_consume")

This code also produces 2 stages in the DAG, and according to the DAG visualization
tool it appears to recompute from the beginning, as if "cache table dwd_pay_consume"
had no effect.

 



comment on table

2015-12-13 Thread Jung
Hi,
My question is how to leave a comment on tables.
Sometimes the users, including me, create a lot of temporary and managed tables
and want to leave a short comment describing what a table means without
checking its records.
Is there a way to do this? Suggestions for an alternative are also very welcome.

Re: How to save Multilayer Perceptron Classifier model.

2015-12-13 Thread Yanbo Liang
Hi Vadim,

Spark does not support save/load for the Multilayer Perceptron model currently;
you can track the issue at SPARK-11871
(https://issues.apache.org/jira/browse/SPARK-11871).

Yanbo
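
Until SPARK-11871 is resolved, one possible stopgap (a sketch only, not an official API; the output path is a placeholder) is to persist the whole fitted model with Java serialization via the RDD object-file helpers, assuming the model class is serializable in your Spark version:

import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel

// write the fitted model out as a one-element object file (Java serialization)
sc.parallelize(Seq(model), numSlices = 1).saveAsObjectFile("hdfs:///models/NNModel")

// later, read it back
val restored = sc.objectFile[MultilayerPerceptronClassificationModel]("hdfs:///models/NNModel").first()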

2015-12-14 2:31 GMT+08:00 Vadim Gribanov :

> Hey everyone! I’m new with spark and scala. I looked at examples in user
> guide and didn’t find how to save Multilayer Perceptron Classifier model to
> HDFS.
>
> Trivial:
>
> model.save(sc, “NNModel”)
>
> Didn’t work for me.
>
> Help me please.
>
>


Re: comment on table

2015-12-13 Thread Ted Yu
Please take a look at SPARK-5196
(https://issues.apache.org/jira/browse/SPARK-5196).
Cheers

On Sun, Dec 13, 2015 at 8:18 PM, Jung  wrote:

> Hi,
> My question is how to leave a comment on the tables.
> Sometimes, the users including me create a lot of temporary and managed
> tables and want to leave a short comment to know what this table means
> without checking records.
> Is there a way to do this? or suggesting an alternative will be very
> welcome.
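
For Hive-backed (managed) tables, one workaround is to keep the description in the table properties via HiveQL. A hedged sketch (the table name and comment text are just placeholders):

// attach a short description to an existing table
hiveContext.sql("ALTER TABLE my_table SET TBLPROPERTIES ('comment' = 'daily payment aggregates, built from raw events')")

// read it back later
hiveContext.sql("SHOW TBLPROPERTIES my_table").show()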


Graphx Spark Accumulator

2015-12-13 Thread prasad223
Hi All,

I'm new to Spark and Scala, and I'm unable to create an array of integer
accumulators in Spark.

val diameterAccumulator =
sparkContext.accumulator(Array.fill(maxDegree)(0))(Array(maxDegree)[AccumulatorParam[Int]])

Can anyone give me a simple example of how to create an array of Int
accumulators?

Thanks,
Prasad
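
For what it's worth, a minimal sketch against the Spark 1.x accumulator API: define an AccumulatorParam for Array[Int] that adds arrays element-wise, then pass it explicitly when creating the accumulator (maxDegree is taken from the snippet above):

import org.apache.spark.AccumulatorParam

// element-wise addition over fixed-length Int arrays
object IntArrayParam extends AccumulatorParam[Array[Int]] {
  def zero(initial: Array[Int]): Array[Int] = Array.fill(initial.length)(0)
  def addInPlace(a: Array[Int], b: Array[Int]): Array[Int] = {
    var i = 0
    while (i < a.length) { a(i) += b(i); i += 1 }
    a
  }
}

val diameterAccumulator = sparkContext.accumulator(Array.fill(maxDegree)(0))(IntArrayParam)

// on the executors, contributions are then added element-wise, e.g. (hypothetical usage):
// diameterAccumulator += Array.fill(maxDegree)(0).updated(level, 1)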






RE: Re: Spark assembly in Maven repo?

2015-12-13 Thread Xiaoyong Zhu
Thanks! Do you mean something like this (for example, for 1.5.1 using Scala 2.10)?
https://repository.apache.org/content/repositories/releases/org/apache/spark/spark-core_2.10/1.5.1/

Xiaoyong
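
For an SDK-style setup, a hedged sketch of what depending on the published per-module artifacts (rather than an assembly) looks like in sbt; the version below is just the one mentioned in this thread:

// build.sbt: pull in only the Spark modules the application actually compiles against
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.5.1" % "provided"
)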

From: Sean Owen [mailto:so...@cloudera.com]
Sent: Saturday, December 12, 2015 12:45 AM
To: Xiaoyong Zhu 
Cc: user 
Subject: Re: Re: Spark assembly in Maven repo?

That's exactly what the various artifacts in the Maven repo are for. The API 
classes for core are in the core artifact and so on. You don't need an assembly.

On Sat, Dec 12, 2015 at 12:32 AM, Xiaoyong Zhu <xiaoy...@microsoft.com> wrote:
Yes, so our scenario is to treat the Spark assembly as an "SDK" so users can
develop Spark applications easily without downloading it themselves. In this case,
which way do you guys think might be good?

Xiaoyong

From: fightf...@163.com [mailto:fightf...@163.com]
Sent: Friday, December 11, 2015 12:08 AM
To: Mark Hamstra <m...@clearstorydata.com>
Cc: Xiaoyong Zhu <xiaoy...@microsoft.com>; Jeff Zhang <zjf...@gmail.com>; user <user@spark.apache.org>; Zhaomin Xu <z...@microsoft.com>; Joe Zhang (SDE) <gui...@microsoft.com>
Subject: Re: Re: Spark assembly in Maven repo?

Agree with you that an assembly jar is not good to publish. However, what he
really needs is to fetch an updatable Maven jar file.


fightf...@163.com

From: Mark Hamstra
Date: 2015-12-11 15:34
To: fightf...@163.com
CC: Xiaoyong Zhu; Jeff Zhang; user; Zhaomin Xu; Joe Zhang (SDE)
Subject: Re: RE: Spark assembly in Maven repo?
No, publishing a Spark assembly jar is not fine. See the doc attached to
https://issues.apache.org/jira/browse/SPARK-11157 and be aware that a likely
goal of Spark 2.0 will be the elimination of assemblies.

On Thu, Dec 10, 2015 at 11:19 PM, fightf...@163.com <fightf...@163.com> wrote:
Using Maven to download the assembly jar is fine. I would recommend deploying this
assembly jar to your local Maven repo, e.g. a Nexus repo, or more likely a
snapshot repository.


fightf...@163.com

From: Xiaoyong Zhu
Date: 2015-12-11 15:10
To: Jeff Zhang
CC: user@spark.apache.org; Zhaomin Xu; Joe Zhang (SDE)
Subject: RE: Spark assembly in Maven repo?
Sorry – I didn’t make it clear. It’s actually not a “dependency” – it’s 
actually that we are building a certain plugin for IntelliJ where we want to 
distribute this jar. But since the jar is updated frequently we don't want to 
distribute it together with our plugin but we would like to download it via 
Maven.

In this case what’s the recommended way?

Xiaoyong

From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Thursday, December 10, 2015 11:03 PM
To: Xiaoyong Zhu <xiaoy...@microsoft.com>
Cc: user@spark.apache.org
Subject: Re: Spark assembly in Maven repo?

I don't think making the assembly jar a dependency is good practice. You may run into
jar hell issues in that case.

On Fri, Dec 11, 2015 at 2:46 PM, Xiaoyong Zhu <xiaoy...@microsoft.com> wrote:
Hi Experts,

We have a project which has a dependency on the following jar:

spark-assembly-<spark version>-hadoop<hadoop version>.jar

For example:
spark-assembly-1.4.1.2.3.3.0-2983-hadoop2.7.1.2.3.3.0-2983.jar

Since this assembly might be updated in the future, I am not sure whether there is a
Maven repo that has the above Spark assembly jar. Or should we create & upload
it to Maven Central?

Thanks!

Xiaoyong




--
Best Regards

Jeff Zhang




Re: IP error on starting spark-shell on windows 7

2015-12-13 Thread Akhil Das
It's a warning, not an error. What happens when you don't specify
SPARK_LOCAL_IP at all? If it is able to bring up the spark shell, then try
*netstat -np* and see which address the driver is binding to.

Thanks
Best Regards

On Thu, Dec 10, 2015 at 9:49 AM, Stefan Karos  wrote:

> On starting spark-shell I see this just before the scala prompt:
>
> WARN : Your hostname, BloomBear-SSD resolves to a loopback/non-reachable
> address: fe80:0:0:0:0:5efe:c0a8:317%net10, but we couldn't find any
> external IP address!
>
> I get this error even when firewall is disabled.
> I also tried setting the environment variable SPARK_LOCAL_IP to the various
> choices listed below:
>
> SPARK_LOCAL_IP=localhost
> SPARK_LOCAL_IP=127.0.0.1
> SPARK_LOCAL_IP=192.168.1.88   (my local machine's IPv4 address)
> SPARK_LOCAL_IP=fe80::eda5:a1a7:be1e:13cb%14  (my local machine's IPv6
> address)
>
> I still get this annoying error! How can I resolve this?
> See below for my environment
>
> Environment
> windows 7 64 bit
> Spark 1.5.2
> Scala 2.10.6
> Python 2.7.10 (from Anaconda)
>
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> C:\ProgramData\Oracle\Java\javapath
>
> SYSTEM variables set are:
> SPARK_HOME=C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0
>
> \tmp\hive directory at root on C; drive with full permissions,
> e.g.
> >winutils ls \tmp\hive
> drwxrwxrwx 1 BloomBear-SSD\Stefan BloomBear-SSD\None 0 Dec  8 2015
> \tmp\hive
>
>


[SparkR] Any reason why saveDF's mode is append by default?

2015-12-13 Thread Jeff Zhang
It is inconsistent with the Scala API, which uses "error" by default. Is there any reason
for that? Thanks.
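
For comparison, the Scala API's DataFrameWriter defaults to SaveMode.ErrorIfExists, so a write fails if the target path already exists unless a mode is chosen explicitly. A minimal sketch (paths are placeholders):

df.write.format("parquet").save("/tmp/out")                   // fails if /tmp/out already exists (ErrorIfExists)
df.write.mode("append").format("parquet").save("/tmp/out")    // append must be requested explicitly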



-- 
Best Regards

Jeff Zhang