Dataset - withColumn and withColumnRenamed that accept Column type

2018-07-13 Thread Nirav Patel
Is there a version of withColumn or withColumnRenamed that accepts a Column
instead of a String? That way I could specify a fully qualified name when
there are duplicate column names.

I can drop a column based on a Column-typed argument, so why can't I rename
one based on the same type of argument?

The use case is: I have a DataFrame with duplicate columns at the end of a
join. Most of the time I drop the duplicate, but this time I need to rename
one of those columns, and I cannot because there is no API that accepts a
Column. I could rename it before the join, but that is not preferred.


def withColumn(colName: String, col: Column): DataFrame
Returns a new Dataset by adding a column or replacing the existing column
that has the same name.

def withColumnRenamed(existingName: String, newName: String): DataFrame
Returns a new Dataset with a column renamed.



I think there should also be this one:

def withColumnRenamed(existingName: Column, newName: Column): DataFrame
Returns a new Dataset with a column renamed.
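
In the meantime, one workaround is to keep references to the DataFrames that
went into the join and rebuild the projection, aliasing the duplicate column.
A rough sketch, assuming an active SparkSession `spark` (the table and column
names below are made up, not taken from the question above):

import org.apache.spark.sql.functions.lit

// Two DataFrames that will both contribute a "name" column to the join.
val left  = spark.range(5).toDF("id").withColumn("name", lit("left"))
val right = spark.range(5).toDF("id").withColumn("name", lit("right"))

val joined = left.join(right, left("id") === right("id"))

// Re-select, aliasing only the right-hand "name" column. The parent
// references (left("name") vs right("name")) disambiguate the duplicates,
// which is what a Column-accepting withColumnRenamed would do directly.
val renamed = joined.select(
  left("id"),
  left("name"),
  right("name").alias("name_right")
)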




Spark on Mesos: Spark issuing hundreds of SUBSCRIBE requests / second and crashing Mesos

2018-07-13 Thread Nimi W
I've come across an issue with Mesos 1.4.1 and Spark 2.2.1. We launch Spark
tasks using the MesosClusterDispatcher in cluster mode. On a couple of
occasions, we have noticed that when the Spark Driver crashes (due to various
causes - human error, network error) and is then restarted, it sometimes
issues hundreds of SUBSCRIBE requests to Mesos per second until the Mesos
Master node gets overwhelmed and crashes. It then does the same to the next
master node, over and over, until it takes down all the master nodes. Usually
the only thing that fixes it is manually stopping the driver and restarting
it.

Here is a snippet of the log of the mesos master, which just logs the
repeated SUBSCRIBE command:
https://gist.github.com/nemosupremo/28ef4acfd7ec5bdcccee9789c021a97f

Here is the output of the spark framework:
https://gist.github.com/nemosupremo/d098ef4def28ebf96c14d8f87aecd133 which
also just repeats 'Transport endpoint is not connected' over and over.

Thanks for any insights


Re: Live Streamed Code Review today at 11am Pacific

2018-07-13 Thread Holden Karau
This afternoon @ 3pm pacific I'll be looking at review tooling for Spark &
Beam https://www.youtube.com/watch?v=ff8_jbzC8JI.

Next week's regular Friday code review (this time July 20th @ 9:30am pacific)
will once again probably have more of an ML focus for folks interested in
watching Spark ML PRs be reviewed -
https://www.youtube.com/watch?v=aG5h99yb6XE


Next week I'll have a live coding session with more of a Beam focus if you
want to see something a bit different (but still related since Beam runs on
Spark) with a focus on Python dependency management (which is a thing we
are also exploring in Spark at the same time) -
https://www.youtube.com/watch?v=Sv0XhS2pYqA on July 19th at 2pm pacific.

P.S.

More generally, you can follow me as holdenkarau on YouTube and holdenkarau
on Twitch to be notified even when I forget to send out the emails (which is
pretty often).

This morning I did another live review session that I forgot to ping the
list about (
https://www.youtube.com/watch?v=M_lRFptcGTI=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw=31
)
and yesterday I did some live coding using PySpark and working on Sparkling
ML -
https://www.youtube.com/watch?v=kCnBDpNce9A=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw=32

On Wed, Jun 27, 2018 at 10:44 AM, Holden Karau  wrote:

> Today @ 1:30pm pacific I'll be looking at the current Spark 2.1.3 RC and
> see how we validate Spark releases -
> https://www.twitch.tv/events/VAg-5PKURQeH15UAawhBtw /
> https://www.youtube.com/watch?v=1_XLrlKS26o .
> Tomorrow @ 12:30 live PR reviews & Monday live coding -
> https://youtube.com/user/holdenkarau &
> https://www.twitch.tv/holdenkarau/events . Hopefully this can encourage
> more folks to help with RC validation & PR reviews :)
>
> On Thu, Jun 14, 2018 at 6:07 AM, Holden Karau 
> wrote:
>
>> Next week is pride in San Francisco but I'm still going to do two quick
>> sessions. One will be live coding with Apache Spark to collect ASF diversity
>> information ( https://www.youtube.com/watch?v=OirnFnsU37A /
>> https://www.twitch.tv/events/O1edDMkTRBGy0I0RCK-Afg ) on Monday at 9am
>> pacific and the other will be the regular Friday code review (
>> https://www.youtube.com/watch?v=IAWm4OLRoyY /
>> https://www.twitch.tv/events/v0qzXxnNQ_K7a8JYFsIiKQ ) also at 9am.
>>
>> On Thu, Jun 7, 2018 at 9:10 PM, Holden Karau 
>> wrote:
>>
>>> I'll be doing another one tomorrow morning at 9am pacific focused on
>>> Python + K8s support & improved JSON support -
>>> https://www.youtube.com/watch?v=Z7ZEkvNwneU &
>>> https://www.twitch.tv/events/xU90q9RGRGSOgp2LoNsf6A :)
>>>
>>> On Fri, Mar 9, 2018 at 3:54 PM, Holden Karau 
>>> wrote:
>>>
 If anyone wants to watch the recording:
 https://www.youtube.com/watch?v=lugG_2QU6YU

 I'll do one next week as well - March 16th @ 11am -
 https://www.youtube.com/watch?v=pXzVtEUjrLc

 On Fri, Mar 9, 2018 at 9:28 AM, Holden Karau 
 wrote:

> Hi folks,
>
> If you're curious about learning more about how Spark is developed, I'm
> going to experiment with doing a live code review where folks can watch and
> see how that part of our process works. I have two volunteers already for
> having their PRs looked at live, and if you have a Spark PR you're working
> on that you'd like me to livestream a review of, please ping me.
>
> The livestream will be at https://www.youtube.com/watch?v=lugG_2QU6YU.
>
> Cheers,
>
> Holden :)
> --
> Twitter: https://twitter.com/holdenkarau
>



 --
 Twitter: https://twitter.com/holdenkarau

>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Twitter: https://twitter.com/holdenkarau


[ML] Linear regression with SGD

2018-07-13 Thread sandy
Hi,

I would like to compare different implementations of linear regression (and
possibly generalised linear regression) in Spark. I was wondering why the
functions for linear regression (and GLM) with stochastic gradient descent
have been deprecated? 

I have found some old posts of people having problems with
LinearRegressionWithSGD and saying that it is slower than L-BFGS, but I am
not sure what they mean. Shouldn't SGD be better? Is there any plan to make
those functions available again in the new DataFrame-based API?

Thank you,
Sandy
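
For reference, the RDD-based SGD trainer still exists but has been deprecated
since Spark 2.0, while the DataFrame-based LinearRegression solves the problem
with weighted least squares ("normal") or L-BFGS/OWLQN rather than SGD. A
minimal sketch of the two, assuming an active SparkSession `spark` and toy
data (not from the original post):

import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.ml.linalg.{Vectors => NewVectors}
import org.apache.spark.ml.regression.LinearRegression

// Deprecated RDD-based API: plain SGD, sensitive to step size and feature scaling.
val rdd = spark.sparkContext.parallelize(Seq(
  LabeledPoint(1.0, OldVectors.dense(0.5, 1.0)),
  LabeledPoint(2.0, OldVectors.dense(1.5, 2.0))
))
val sgdModel = LinearRegressionWithSGD.train(rdd, 100, 0.01) // numIterations, stepSize

// DataFrame-based API: solver is "normal" (weighted least squares) or "l-bfgs".
val df = spark.createDataFrame(Seq(
  (1.0, NewVectors.dense(0.5, 1.0)),
  (2.0, NewVectors.dense(1.5, 2.0))
)).toDF("label", "features")
val lbfgsModel = new LinearRegression().setMaxIter(100).setSolver("l-bfgs").fit(df)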






spark rename or access columns which has special chars " ?:

2018-07-13 Thread Great Info
I have columns like below:

 root
  |-- metadata: struct (nullable = true)
  |    |-- "drop":{"dropPath":"https://dstpath.media27.ec2.st-av.net/drop?source_id: string (nullable = true)
  |    |-- "selection":{"AlllURL":"https://dstpath.media27.ec2.st-av.net/image?source_id: string (nullable = true)
  |    |-- "dstpath":"https://dstpath.media28.ec2.st-av.net/image?source_id: string (nullable = true)


Now there is a problem selecting any column, since all the column names
contain special characters.


For example, this column:

"drop":{"dropPath":"https://dstpath.media27.ec2.st-av.net/drop?source_id: string (nullable = true)

has the special characters ", :, { and . in its name.

How can I select or rename this column in Spark?

df.select('`metada."drop":{"dropPath":"https://dstpath.media27.ec2.st-av.net/drop?source_id`')
gives the error: error: unclosed character literal



Regards
Indra
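
For what it's worth, Column.getField takes the literal field name, so none of
the special characters need escaping, and alias() can then give the result a
sane name; withColumnRenamed only touches top-level columns, so it does not
help for a nested struct field like this. A sketch, assuming a DataFrame `df`
with the schema above and that the field name matches the printed schema
exactly:

import org.apache.spark.sql.functions.col

// The literal nested field name, exactly as it appears under `metadata`.
val fieldName =
  "\"drop\":{\"dropPath\":\"https://dstpath.media27.ec2.st-av.net/drop?source_id"

// getField needs no quoting or escaping; alias() renames the result.
val dropPath = df.select(col("metadata").getField(fieldName).alias("dropPath"))

// A backtick-quoted path may also work, since backticks escape the dots:
// df.select(col(s"metadata.`$fieldName`").alias("dropPath"))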


Re: spark sql data skew

2018-07-13 Thread Jean Georges Perrin
Just thinking out loud… repartition by key? create a composite key based on 
company and userid? 

How big is your dataset?

> On Jul 13, 2018, at 06:20, 崔苗  wrote:
> 
> Hi,
> when I want to count(distinct userId) by company, I run into data skew and
> the task takes too long. How can I count distinct by key on skewed data in
> Spark SQL?
> 
> thanks for any reply
> 



spark sql data skew

2018-07-13 Thread 崔苗
Hi,
when I want to count(distinct userId) by company, I run into data skew and
the task takes too long. How can I count distinct by key on skewed data in
Spark SQL?


thanks for any reply