DataFrame API and Ordering

2016-02-16 Thread Maciej Szymkiewicz
I am not sure if I've missed something obvious, but as far as I can tell
the DataFrame API doesn't provide clearly defined ordering rules, apart
from NaN handling. Methods like DataFrame.sort and sql.functions such as
min / max give only a general description. The discrepancy between
functions.max (min) and GroupedData.max, where the latter supports only
numeric types, makes the current situation even more confusing. With a
growing number of orderable types, I believe the documentation should
clearly define the ordering rules, including:

- NULL behavior
- collation
- behavior on complex types (structs, arrays)

While this information can be extracted from the source, it is not
easily accessible, and without an explicit specification it is not clear
whether the current behavior is contractual. It can also be confusing if
a user expects an order that depends on the current locale (as in R).
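
For example, a minimal sketch against the 1.6-era Scala API (assuming
structs are already orderable, which seems to be the case) where the
expected output order is exactly what the docs leave unspecified:

import sqlContext.implicits._
import org.apache.spark.sql.functions.col

// Where does the NULL row land when sorting on x, and is that
// placement contractual across versions and language bindings?
val df = Seq((Some(1), "b"), (None, "a"), (Some(2), "c")).toDF("x", "y")
df.sort(col("x")).show()

// Structs appear to compare field by field, but the docs don't say so.
val structs = Seq(((1, "a"), 1), ((1, "b"), 2)).toDF("s", "id")
structs.sort(col("s")).show()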

Best,
Maciej



Re: Welcoming two new committers

2016-02-16 Thread Raffael Bottoli Schemmer
Congratulations, Herman and Wenchen!

2016-02-16 20:45 GMT-02:00 Igor Costa :

> Congratulations Herman and Wenchen.
>
> On Tue, Feb 9, 2016 at 10:58 AM, Joseph Bradley 
> wrote:
>
>> Congrats & welcome!
>>
>> On Mon, Feb 8, 2016 at 12:19 PM, Ram Sriharsha 
>> wrote:
>>
>>> great job guys! congrats and welcome!
>>>
>>> On Mon, Feb 8, 2016 at 12:05 PM, Amit Chavan  wrote:
>>>
 Welcome.

 On Mon, Feb 8, 2016 at 2:50 PM, Suresh Thalamati <
 suresh.thalam...@gmail.com> wrote:

> Congratulations Herman and Wenchen!
>
> On Mon, Feb 8, 2016 at 10:59 AM, Andrew Or 
> wrote:
>
>> Welcome!
>>
>> 2016-02-08 10:55 GMT-08:00 Bhupendra Mishra <
>> bhupendra.mis...@gmail.com>:
>>
>>> Congratulations to both, and welcome to the group.
>>>
>>> On Mon, Feb 8, 2016 at 10:45 PM, Matei Zaharia <
>>> matei.zaha...@gmail.com> wrote:
>>>
 Hi all,

 The PMC has recently added two new Spark committers -- Herman van
 Hovell and Wenchen Fan. Both have been heavily involved in Spark SQL
 and Tungsten, adding new features, optimizations and APIs. Please
 join me in welcoming Herman and Wenchen.

 Matei



>>>
>>
>

>>>
>>>
>>> --
>>> Ram Sriharsha
>>> Architect, Spark and Data Science
>>> Hortonworks, 2550 Great America Way, 2nd Floor
>>> Santa Clara, CA 95054
>>> Ph: 408-510-8635
>>> email: har...@apache.org
>>>
>>>
>>>
>>
>


Re: SPARK_WORKER_MEMORY in Spark Standalone - conf.getenv vs System.getenv?

2016-02-16 Thread Igor Costa
Answering the first question:

Is there a reason to use conf to read SPARK_WORKER_MEMORY rather than
System.getenv, as for the other env vars?

You can use the properties file to change the amount. Reading
System.getenv directly can be problematic when, for example, other
things are running in the same JVM and conflict over the setting.
Defining the value in a properties file is also more convenient for
making it available to a custom UI.
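
For reference, a sketch of the two call styles being compared. The class
below is a stand-in, not Spark's SparkConf (whose getenv is
package-private and simply delegates to System.getenv, apparently so
unit tests can mock environment variables):

// Stand-in for SparkConf: the indirection changes nothing at runtime,
// which is why the two styles behave identically in practice.
class ConfLike {
  def getenv(name: String): String = System.getenv(name)
}

object EnvDemo extends App {
  val conf = new ConfLike
  val viaConf   = Option(conf.getenv("SPARK_WORKER_MEMORY")).getOrElse("1g")
  val viaSystem = Option(System.getenv("SPARK_WORKER_MEMORY")).getOrElse("1g")
  println(s"via conf: $viaConf, via System.getenv: $viaSystem")
}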

On Sat, Feb 13, 2016 at 8:38 PM, Sean Owen  wrote:

> Yes, you said it is only set in a props file, but why do you say that?
> Because the resolution of your first question is that this is not
> handled differently.
>
> On Fri, Feb 12, 2016 at 11:11 PM, Jacek Laskowski  wrote:
> > On Fri, Feb 12, 2016 at 11:08 PM, Sean Owen  wrote:
> >> I think that difference in the code is just an oversight. They
> >> actually do the same thing.
> >
> > Correct. Just meant to know the reason if there was any.
> >
> >> Why do you say this property can only be set in a file?
> >
> > I said that conf/spark-defaults.conf can *not* be used to set the
> > spark.worker.ui.port property and wondered why that is so. It'd be
> > nice to be able to set it there (rather than using workarounds like
> > SPARK_WORKER_OPTS=-Dspark.worker.ui.port=21212). I just spotted it
> > and thought I'd ask whether it needs to be cleaned up or improved.
> >
> > Jacek
>
>
>


Call wholeTextFiles to read gzip files

2016-02-16 Thread Deepak Gopalakrishnan
Hello,

I'm reading S3 files using wholeTextFiles(). My files are in gzip
format, but their names do not end with ".gz", and I cannot force them
to. Is there a way to specify the input format as gzip when using
wholeTextFiles()?
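
To illustrate (hypothetical bucket path; as I understand it, Hadoop
selects the decompression codec from the file extension, so my
extension-less gzip files come back as raw compressed bytes):

// Each value ends up being the compressed payload read as a "string",
// because no codec matches the extension-less file names.
val raw = sc.wholeTextFiles("s3n://my-bucket/logs/*")

The kind of manual workaround I'd like to avoid (a sketch):

import java.util.zip.GZIPInputStream
import scala.io.Source

// Read raw bytes with binaryFiles and decompress each file by hand.
val texts = sc.binaryFiles("s3n://my-bucket/logs/*").mapValues { pds =>
  val in = new GZIPInputStream(pds.open())
  try Source.fromInputStream(in, "UTF-8").mkString finally in.close()
}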

-- 
Regards,
*Deepak Gopalakrishnan*
*Mobile*:+918891509774
*Skype* : deepakgk87
http://myexps.blogspot.com