Re: createDataFrame allows column names as second param in Python not in Scala

2015-05-03 Thread Reynold Xin
We can't drop the existing createDataFrame, since that would break API
compatibility, and the existing one also automatically infers the column
names for case classes (in which case users most likely won't be declaring
names directly). If this is really a problem, we should just create a new
function (maybe more than one, since you could argue the one for Seq should
also have that ...).
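
A minimal sketch of what such a new function could look like, assuming the
Spark 1.3 Scala API (createDataFrame on a Seq of Products, plus
DataFrame.toDF). The object name DataFrameHelpers and the method name
createNamedDataFrame are hypothetical and not part of Spark:

```scala
import scala.reflect.runtime.universe.TypeTag

import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper, NOT part of Spark: it only combines two existing calls.
object DataFrameHelpers {

  // Builds a DataFrame from a local Seq of tuples or case classes, then
  // renames the inferred columns (_1, _2, ...) to the given names.
  def createNamedDataFrame[A <: Product : TypeTag](
      sqlContext: SQLContext,
      data: Seq[A],
      columnNames: Seq[String]): DataFrame =
    sqlContext.createDataFrame(data).toDF(columnNames: _*)
}
```

Because this is a separate function rather than a change to an existing
signature, the current createDataFrame overloads would stay untouched.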








Re: createDataFrame allows column names as second param in Python not in Scala

2015-05-03 Thread Olivier Girardot
I have the perfect counterexample: some of our data scientists prototype in
Python, while the production work is done in Scala.
But I get your point; as a matter of fact, I realised that the toDF method
takes parameters a little while after posting this.
However, toDF still requires you to go from a List to an RDD, or to create a
useless DataFrame and call toDF on it, re-creating a complete data
structure. I just feel that createDataFrame(_: Seq) is not really
useful as it is, because I think there are practically no circumstances
where you'd want to create a DataFrame without column names.

I'm not suggesting that an n-th overloaded method be created, but rather
that the signature of the existing method be changed to take an optional Seq
of column names.

Regards,

Olivier.






createDataFrame allows column names as second param in Python not in Scala

2015-05-02 Thread Olivier Girardot
Hi everyone,
SQLContext.createDataFrame behaves differently in Scala and Python:

>>> l = [('Alice', 1)]
>>> sqlContext.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]

and in Scala:

scala> val data = List(("Alice", 1), ("Wonderland", 0))
scala> sqlContext.createDataFrame(data, List("name", "score"))
<console>:28: error: overloaded method value createDataFrame with
alternatives: ... cannot be applied to ...

What do you think about also allowing a Seq of column names in Scala,
for the sake of consistency?

Regards,

Olivier.


Re: createDataFrame allows column names as second param in Python not in Scala

2015-05-02 Thread Reynold Xin
Part of the reason is that it is really easy to just call toDF in Scala,
and we already have a lot of createDataFrame functions.

(You might find some of the cross-language differences confusing, but I'd
argue most real users just stick to one language, and developers or
trainers are the only ones that need to constantly switch between
languages).
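
For reference, a short sketch of the toDF route, assuming a Spark 1.3
spark-shell session in which sqlContext is already defined. The data and
column names are taken from the original question:

```scala
// Assumes a spark-shell session where `sqlContext` already exists.
val data = List(("Alice", 1), ("Wonderland", 0))

// createDataFrame infers the schema from the tuples (columns _1 and _2);
// toDF then returns a new DataFrame with the columns renamed.
val df = sqlContext.createDataFrame(data).toDF("name", "score")

// Print the resulting schema to check the column names.
df.printSchema()
```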
