[jira] [Assigned] (SPARK-8211) math function: radians

2015-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-8211:
--

Assignee: Reynold Xin

> math function: radians
> --
>
> Key: SPARK-8211
> URL: https://issues.apache.org/jira/browse/SPARK-8211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Alias toRadians -> radians in FunctionRegistry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8219) math function: negative

2015-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-8219:
--

Assignee: Reynold Xin

> math function: negative
> ---
>
> Key: SPARK-8219
> URL: https://issues.apache.org/jira/browse/SPARK-8219
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This is just an alias for UnaryMinus. Only add it to FunctionRegistry, not 
> to DataFrame.






[jira] [Assigned] (SPARK-8210) math function: degrees

2015-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-8210:
--

Assignee: Reynold Xin

> math function: degrees
> --
>
> Key: SPARK-8210
> URL: https://issues.apache.org/jira/browse/SPARK-8210
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Alias toDegrees -> degrees.






[jira] [Created] (SPARK-8219) math function: negative

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8219:
--

 Summary: math function: negative
 Key: SPARK-8219
 URL: https://issues.apache.org/jira/browse/SPARK-8219
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


This is just an alias for UnaryMinus. Only add it to FunctionRegistry, not to 
DataFrame.






[jira] [Assigned] (SPARK-8216) math function: rename log -> ln

2015-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-8216:
--

Assignee: Reynold Xin

> math function: rename log -> ln
> ---
>
> Key: SPARK-8216
> URL: https://issues.apache.org/jira/browse/SPARK-8216
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Rename expression Log -> Ln.
> Also create aliased data frame functions, and update FunctionRegistry.






[jira] [Assigned] (SPARK-8205) conditional function: nvl

2015-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-8205:
--

Assignee: Reynold Xin

> conditional function: nvl
> -
>
> Key: SPARK-8205
> URL: https://issues.apache.org/jira/browse/SPARK-8205
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> nvl(T value, T default_value): T
> Returns default_value if value is null; otherwise returns value (as of Hive 0.11).
> We already have this (called Coalesce). Just need to register an alias for it 
> in FunctionRegistry.
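The Coalesce semantics that nvl aliases can be sketched in plain Python (the function name and the use of None for SQL NULL are illustrative assumptions, not Spark's implementation):

```python
def nvl(value, default_value):
    # Return default_value when value is null (None here), else value,
    # mirroring a two-argument Coalesce.
    return default_value if value is None else value
```

So nvl(None, 0) yields 0, while nvl(5, 0) passes 5 through unchanged.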






[jira] [Assigned] (SPARK-8201) conditional function: if

2015-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-8201:
--

Assignee: Reynold Xin

> conditional function: if
> 
>
> Key: SPARK-8201
> URL: https://issues.apache.org/jira/browse/SPARK-8201
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We already have an If expression. Just need to register it in 
> FunctionRegistry.






[jira] [Assigned] (SPARK-8101) Upgrade Netty to avoid memory leak according to Netty issue #3837

2015-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-8101:


Assignee: Sean Owen

> Upgrade Netty to avoid memory leak according to Netty issue #3837
> ---
>
> Key: SPARK-8101
> URL: https://issues.apache.org/jira/browse/SPARK-8101
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: SuYan
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.5.0
>
>
> There is a direct buffer leak in Netty: 4.0.23.Final does not release a 
> thread-local after a message has been sent successfully.
> See: https://github.com/netty/netty/issues/3837






[jira] [Updated] (SPARK-8126) Use temp directory under build dir for unit tests

2015-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8126:
-
Fix Version/s: 1.3.2
   Labels:   (was: backport-needed)

> Use temp directory under build dir for unit tests
> -
>
> Key: SPARK-8126
> URL: https://issues.apache.org/jira/browse/SPARK-8126
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 1.3.2, 1.4.1, 1.5.0
>
>
> Spark's unit tests leave a lot of garbage in /tmp after a run, making it hard 
> to clean things up. Let's place those files under the build dir so that 
> "mvn|sbt|git clean" can do their job.






[jira] [Resolved] (SPARK-8101) Upgrade Netty to avoid memory leak according to Netty issue #3837

2015-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8101.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6701
[https://github.com/apache/spark/pull/6701]

> Upgrade Netty to avoid memory leak according to Netty issue #3837
> ---
>
> Key: SPARK-8101
> URL: https://issues.apache.org/jira/browse/SPARK-8101
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: SuYan
>Priority: Minor
> Fix For: 1.5.0
>
>
> There is a direct buffer leak in Netty: 4.0.23.Final does not release a 
> thread-local after a message has been sent successfully.
> See: https://github.com/netty/netty/issues/3837






[jira] [Commented] (SPARK-8120) Typos in warning message in sql/types.py

2015-06-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578447#comment-14578447
 ] 

Sean Owen commented on SPARK-8120:
--

I'll probably learn something here, but does that not work? I tried a similar 
example locally and interpolation worked in this situation. Yeah there's a 
missing space here though.

> Typos in warning message in sql/types.py
> 
>
> Key: SPARK-8120
> URL: https://issues.apache.org/jira/browse/SPARK-8120
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> See 
> [https://github.com/apache/spark/blob/3ba6fc515d6ea45c281bb81f648a38523be06383/python/pyspark/sql/types.py#L1093]
> Need to fix string concat + use of %






[jira] [Created] (SPARK-8220) math function: positive

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8220:
--

 Summary: math function: positive
 Key: SPARK-8220
 URL: https://issues.apache.org/jira/browse/SPARK-8220
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


positive(INT a): INT
positive(DOUBLE a): DOUBLE

This is really just an identity function. We should create an Identity 
expression, and then have the optimizer remove the Identity expressions.







[jira] [Created] (SPARK-8221) math function: pmod

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8221:
--

 Summary: math function: pmod
 Key: SPARK-8221
 URL: https://issues.apache.org/jira/browse/SPARK-8221
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


pmod(INT a, INT b): INT
pmod(DOUBLE a, DOUBLE b): DOUBLE


Returns the positive value of a mod b.
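The positive-modulus behaviour can be sketched in Python. Note that Python's own % already follows the sign of the divisor, so the classic ((a % b) + b) % b form shown here is for clarity; the helper name is illustrative:

```python
def pmod(a, b):
    # ((a % b) + b) % b turns any remainder into one with the sign of b,
    # so a negative dividend still yields a non-negative result for b > 0.
    return ((a % b) + b) % b
```

For example, pmod(-7, 3) is 2, where a C-style remainder would give -1.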







[jira] [Created] (SPARK-8222) math function: alias power / pow

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8222:
--

 Summary: math function: alias power / pow
 Key: SPARK-8222
 URL: https://issues.apache.org/jira/browse/SPARK-8222
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin


Add power to FunctionRegistry as an alias for pow.







[jira] [Created] (SPARK-8224) math function: shiftright

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8224:
--

 Summary: math function: shiftright
 Key: SPARK-8224
 URL: https://issues.apache.org/jira/browse/SPARK-8224
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


shiftright(INT a)
shiftright(BIGINT a)

Bitwise right shift (as of Hive 1.2.0). Returns int for tinyint, smallint and 
int a. Returns bigint for bigint a.







[jira] [Created] (SPARK-8223) math function: shiftleft

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8223:
--

 Summary: math function: shiftleft
 Key: SPARK-8223
 URL: https://issues.apache.org/jira/browse/SPARK-8223
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


shiftleft(INT a)
shiftleft(BIGINT a)

Bitwise left shift (as of Hive 1.2.0). Returns int for tinyint, smallint and 
int a. Returns bigint for bigint a.
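The int-width behaviour can be sketched in Python, assuming Java's shift semantics (shift distance masked to the low 5 bits for 32-bit ints, with signed wraparound); the helper name is illustrative:

```python
def shiftleft_int(a, b):
    # Java masks the shift distance to the low 5 bits for 32-bit ints.
    r = (a << (b & 31)) & 0xFFFFFFFF
    # Reinterpret the low 32 bits as a signed int.
    return r - 0x100000000 if r >= 0x80000000 else r
```

So shiftleft_int(1, 31) wraps to the most negative int, and a shift distance of 32 is treated as 0.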







[jira] [Created] (SPARK-8225) math function: alias sign / signum

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8225:
--

 Summary: math function: alias sign / signum
 Key: SPARK-8225
 URL: https://issues.apache.org/jira/browse/SPARK-8225
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin


Alias them in FunctionRegistry.







[jira] [Created] (SPARK-8227) math function: unhex

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8227:
--

 Summary: math function: unhex
 Key: SPARK-8227
 URL: https://issues.apache.org/jira/browse/SPARK-8227
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


unhex(STRING a): BINARY

Inverse of hex. Interprets each pair of characters as a hexadecimal number and 
converts to the byte representation of the number.
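The described pairing of hex characters into bytes maps directly onto Python's bytes.fromhex (wrapper name illustrative):

```python
def unhex(s):
    # Each pair of hex characters becomes one byte of the result.
    return bytes.fromhex(s)
```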







[jira] [Created] (SPARK-8226) math function: shiftrightunsigned

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8226:
--

 Summary: math function: shiftrightunsigned
 Key: SPARK-8226
 URL: https://issues.apache.org/jira/browse/SPARK-8226
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a) 

Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, 
smallint and int a. Returns bigint for bigint a.
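Python has no >>> operator, but the unsigned right shift on a 32-bit int can be sketched by masking first (helper name illustrative; Java masks the shift distance to 5 bits for ints):

```python
def shiftrightunsigned_int(a, b):
    # Reinterpret a as an unsigned 32-bit value, then shift; the result
    # is always non-negative, matching Java's >>> on ints.
    return (a & 0xFFFFFFFF) >> (b & 31)
```

For example, shiftrightunsigned_int(-1, 28) is 15, whereas a signed shift of -1 would stay -1.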







[jira] [Created] (SPARK-8228) conditional function: isnull

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8228:
--

 Summary: conditional function: isnull
 Key: SPARK-8228
 URL: https://issues.apache.org/jira/browse/SPARK-8228
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin


Just need to register it in FunctionRegistry.







[jira] [Created] (SPARK-8229) conditional function: isnotnull

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8229:
--

 Summary: conditional function: isnotnull
 Key: SPARK-8229
 URL: https://issues.apache.org/jira/browse/SPARK-8229
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin


Just need to register it in the FunctionRegistry.







[jira] [Commented] (SPARK-8173) A class which holds all the constants should be present

2015-06-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578454#comment-14578454
 ] 

Sean Owen commented on SPARK-8173:
--

While I think you might be able to suggest some targeted cases in which 
constants should be refactored into one place, I don't think most of them are 
worth moving, and certainly not into one monolithic class. I'm going to close 
this unless you have more targeted suggestions.

> A class which holds all the constants should be present
> ---
>
> Key: SPARK-8173
> URL: https://issues.apache.org/jira/browse/SPARK-8173
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
> Environment: software
>Reporter: sahitya pavurala
>Priority: Minor
>
> A class which holds all the constants should be present, instead of 
> hardcoding them everywhere (similar to MRConstants.java in MapReduce).
> All parameter names should be referenced from that class when used.






[jira] [Created] (SPARK-8231) complex function: array_contains

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8231:
--

 Summary: complex function: array_contains
 Key: SPARK-8231
 URL: https://issues.apache.org/jira/browse/SPARK-8231
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


array_contains(Array, value)

Returns TRUE if the array contains value.







[jira] [Created] (SPARK-8230) complex function: size

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8230:
--

 Summary: complex function: size
 Key: SPARK-8230
 URL: https://issues.apache.org/jira/browse/SPARK-8230
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


size(Map): int

size(Array): int

Returns the number of elements in the map or array.







[jira] [Commented] (SPARK-4809) Improve Guava shading in Spark

2015-06-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578457#comment-14578457
 ] 

Sean Owen commented on SPARK-4809:
--

[~pyrolistical] What do you mean specifically? Are you aware of the reasons 
some Guava classes can't be shaded? Without that, yeah, you wouldn't understand 
this.

> Improve Guava shading in Spark
> --
>
> Key: SPARK-4809
> URL: https://issues.apache.org/jira/browse/SPARK-4809
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.3.0
>
>
> As part of SPARK-2848, we started shading Guava to help with projects that 
> want to use Spark but use an incompatible version of Guava.
> The approach used there is a little sub-optimal, though. It makes it tricky, 
> especially, to run unit tests in your project when those need to use 
> spark-core APIs.
> We should make the shading more transparent so that it's easier to use 
> spark-core, with or without an explicit Guava dependency.






[jira] [Created] (SPARK-8232) complex function: sort_array

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8232:
--

 Summary: complex function: sort_array
 Key: SPARK-8232
 URL: https://issues.apache.org/jira/browse/SPARK-8232
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


sort_array(Array)

Sorts the input array in ascending order according to the natural ordering of 
the array elements and returns it.







[jira] [Created] (SPARK-8233) misc function: hash

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8233:
--

 Summary: misc function: hash
 Key: SPARK-8233
 URL: https://issues.apache.org/jira/browse/SPARK-8233
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


hash(a1[, a2...]): int

Returns a hash value of the arguments. See Hive's implementation.







[jira] [Created] (SPARK-8235) misc function: sha1 / sha

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8235:
--

 Summary: misc function: sha1 / sha
 Key: SPARK-8235
 URL: https://issues.apache.org/jira/browse/SPARK-8235
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


sha1(string/binary): string
sha(string/binary): string


Calculates the SHA-1 digest for string or binary and returns the value as a hex 
string (as of Hive 1.3.0). Example: sha1('ABC') = 
'3c01bdbb26f358bab27f267924aa2c9a03fcfdb8'.
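The example above can be reproduced with Python's hashlib (wrapper name illustrative):

```python
import hashlib

def sha1_hex(data):
    # SHA-1 digest of the input bytes, rendered as a lowercase hex string.
    return hashlib.sha1(data).hexdigest()
```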







[jira] [Created] (SPARK-8234) misc function: md5

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8234:
--

 Summary: misc function: md5
 Key: SPARK-8234
 URL: https://issues.apache.org/jira/browse/SPARK-8234
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


md5(string/binary): string

Calculates an MD5 128-bit checksum for the string or binary (as of Hive 1.3.0). 
The value is returned as a string of 32 hex digits, or NULL if the argument was 
NULL. Example: md5('ABC') = '902fbdd2b1df0c4f70b4a5d23525e932'.
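The example above can be reproduced with Python's hashlib (wrapper name illustrative):

```python
import hashlib

def md5_hex(data):
    # MD5 128-bit checksum rendered as a string of 32 hex digits.
    return hashlib.md5(data).hexdigest()
```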







[jira] [Created] (SPARK-8236) misc function: crc32

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8236:
--

 Summary: misc function: crc32
 Key: SPARK-8236
 URL: https://issues.apache.org/jira/browse/SPARK-8236
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


crc32(string/binary): bigint

Computes a cyclic redundancy check value for string or binary argument and 
returns bigint value (as of Hive 1.3.0). Example: crc32('ABC') = 2743272264.
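The example above can be reproduced with Python's zlib (wrapper name illustrative):

```python
import zlib

def crc32_value(data):
    # zlib.crc32 already returns an unsigned 32-bit value in Python 3,
    # matching the bigint-range result in the example.
    return zlib.crc32(data)
```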







[jira] [Created] (SPARK-8237) misc function: sha2

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8237:
--

 Summary: misc function: sha2
 Key: SPARK-8237
 URL: https://issues.apache.org/jira/browse/SPARK-8237
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


sha2(string/binary, int): string

Calculates the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and 
SHA-512) (as of Hive 1.3.0). The first argument is the string or binary to be 
hashed. The second argument indicates the desired bit length of the result, 
which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to 
256). SHA-224 is supported starting from Java 8. If either argument is NULL or 
the hash length is not one of the permitted values, the return value is NULL. 
Example: sha2('ABC', 256) = 
'b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78'.
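The bit-length dispatch described above can be sketched with Python's hashlib (wrapper name illustrative; None stands in for SQL NULL):

```python
import hashlib

def sha2_hex(data, bit_length):
    # 0 is an alias for 256; unsupported lengths yield None (SQL NULL).
    algos = {0: "sha256", 224: "sha224", 256: "sha256",
             384: "sha384", 512: "sha512"}
    name = algos.get(bit_length)
    return hashlib.new(name, data).hexdigest() if name else None
```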







[jira] [Created] (SPARK-8238) string function: ascii

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8238:
--

 Summary: string function: ascii
 Key: SPARK-8238
 URL: https://issues.apache.org/jira/browse/SPARK-8238
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


ascii(string str): int

Returns the numeric value of the first character of str.
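In Python the same semantics fall out of ord on the first character (helper name illustrative):

```python
def ascii_value(s):
    # Numeric code point of the first character of s.
    return ord(s[0])
```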







[jira] [Created] (SPARK-8239) string function: base64

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8239:
--

 Summary: string function: base64
 Key: SPARK-8239
 URL: https://issues.apache.org/jira/browse/SPARK-8239
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


base64(binary bin): string

Converts the argument from binary to a base 64 string.
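The binary-to-text conversion maps onto Python's base64 module (wrapper name illustrative):

```python
import base64

def to_base64(bin_data):
    # Standard base64 encoding of the input bytes, returned as text.
    return base64.b64encode(bin_data).decode("ascii")
```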






[jira] [Created] (SPARK-8240) string function: concat

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8240:
--

 Summary: string function: concat
 Key: SPARK-8240
 URL: https://issues.apache.org/jira/browse/SPARK-8240
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


concat(string|binary A, string|binary B...): string / binary

Returns the string or bytes resulting from concatenating the strings or bytes 
passed in as parameters in order. For example, concat('foo', 'bar') results in 
'foobar'. Note that this function can take any number of input strings.







[jira] [Created] (SPARK-8241) string function: concat_ws

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8241:
--

 Summary: string function: concat_ws
 Key: SPARK-8241
 URL: https://issues.apache.org/jira/browse/SPARK-8241
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


concat_ws(string SEP, string A, string B...): string

concat_ws(string SEP, array): string







[jira] [Created] (SPARK-8242) string function: decode

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8242:
--

 Summary: string function: decode
 Key: SPARK-8242
 URL: https://issues.apache.org/jira/browse/SPARK-8242
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


decode(binary bin, string charset): string

Decodes the first argument into a String using the provided character set (one 
of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If 
either argument is null, the result will also be null. (As of Hive 0.12.0.)
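The null-propagating decode can be sketched in Python, with None standing in for SQL NULL (helper name illustrative):

```python
def decode_binary(bin_data, charset):
    # NULL (None) propagates; otherwise decode the bytes with the charset.
    if bin_data is None or charset is None:
        return None
    return bin_data.decode(charset)
```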







[jira] [Created] (SPARK-8243) string function: encode

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8243:
--

 Summary: string function: encode
 Key: SPARK-8243
 URL: https://issues.apache.org/jira/browse/SPARK-8243
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


encode(string src, string charset): binary

Encodes the first argument into a BINARY using the provided character set (one 
of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If 
either argument is null, the result will also be null. (As of Hive 0.12.0.)








[jira] [Created] (SPARK-8245) string function: format_number

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8245:
--

 Summary: string function: format_number
 Key: SPARK-8245
 URL: https://issues.apache.org/jira/browse/SPARK-8245
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


format_number(number x, int d): string

Formats the number X to a format like '#,###,###.##', rounded to D decimal 
places, and returns the result as a string. If D is 0, the result has no 
decimal point or fractional part. (As of Hive 0.10.0; bug with float types 
fixed in Hive 0.14.0, decimal type support added in Hive 0.14.0)
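The '#,###,###.##' pattern corresponds to Python's comma-grouping format spec (helper name illustrative; exact rounding of ties may differ from Hive):

```python
def format_number(x, d):
    # Thousands separators with d decimal places; d == 0 drops the
    # decimal point and fractional part entirely.
    return f"{x:,.{d}f}"
```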







[jira] [Created] (SPARK-8244) string function: find_in_set

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8244:
--

 Summary: string function: find_in_set
 Key: SPARK-8244
 URL: https://issues.apache.org/jira/browse/SPARK-8244
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Priority: Minor


find_in_set(string str, string strList): int

Returns the first occurrence of str in strList where strList is a 
comma-delimited string. Returns null if either argument is null. Returns 0 if 
the first argument contains any commas. For example, find_in_set('ab', 
'abc,b,ab,c,def') returns 3.
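All three cases (null propagation, the embedded-comma guard, and the 1-based index) can be sketched in Python, with None standing in for SQL NULL (helper name illustrative):

```python
def find_in_set(s, str_list):
    # NULL propagates; a comma in the needle short-circuits to 0.
    if s is None or str_list is None:
        return None
    if "," in s:
        return 0
    items = str_list.split(",")
    # 1-based position of the first exact match, 0 when absent.
    return items.index(s) + 1 if s in items else 0
```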







[jira] [Created] (SPARK-8246) string function: get_json_object

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8246:
--

 Summary: string function: get_json_object
 Key: SPARK-8246
 URL: https://issues.apache.org/jira/browse/SPARK-8246
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


get_json_object(string json_string, string path): string

This is actually fairly complicated. Take a look at 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF






[jira] [Updated] (SPARK-8246) string function: get_json_object

2015-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8246:
---
Description: 
get_json_object(string json_string, string path): string

This is actually fairly complicated. Take a look at 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


Only add this to SQL, not DataFrame.

  was:
get_json_object(string json_string, string path): string

This is actually fairly complicated. Take a look at 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


> string function: get_json_object
> 
>
> Key: SPARK-8246
> URL: https://issues.apache.org/jira/browse/SPARK-8246
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> get_json_object(string json_string, string path): string
> This is actually fairly complicated. Take a look at 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> Only add this to SQL, not DataFrame.






[jira] [Updated] (SPARK-8244) string function: find_in_set

2015-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8244:
---
Description: 
find_in_set(string str, string strList): int

Returns the first occurrence of str in strList where strList is a 
comma-delimited string. Returns null if either argument is null. Returns 0 if 
the first argument contains any commas. For example, find_in_set('ab', 
'abc,b,ab,c,def') returns 3.

Only add this to SQL, not DataFrame.

  was:
find_in_set(string str, string strList): int

Returns the first occurrence of str in strList, where strList is a 
comma-delimited string. Returns null if either argument is null. Returns 0 if 
the first argument contains any commas. For example, find_in_set('ab', 
'abc,b,ab,c,def') returns 3.



> string function: find_in_set
> 
>
> Key: SPARK-8244
> URL: https://issues.apache.org/jira/browse/SPARK-8244
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Minor
>
> find_in_set(string str, string strList): int
> Returns the first occurrence of str in strList, where strList is a 
> comma-delimited string. Returns null if either argument is null. Returns 0 if 
> the first argument contains any commas. For example, find_in_set('ab', 
> 'abc,b,ab,c,def') returns 3.
> Only add this to SQL, not DataFrame.






[jira] [Created] (SPARK-8247) string function: instr

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8247:
--

 Summary: string function: instr
 Key: SPARK-8247
 URL: https://issues.apache.org/jira/browse/SPARK-8247
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


instr(string str, string substr): int 

Returns the position of the first occurrence of substr in str. Returns null if 
either of the arguments is null, and returns 0 if substr cannot be found in 
str. Note that this is not zero-based: the first character in str has index 1.
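The 1-based lookup described above can be modeled in Python (a sketch of the documented semantics, not Spark's code; `None` stands in for SQL NULL):

```python
def instr(s, substr):
    """1-based position of the first occurrence of substr in s; 0 if absent."""
    if s is None or substr is None:
        return None
    return s.find(substr) + 1  # str.find is 0-based and returns -1 on a miss
```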







[jira] [Created] (SPARK-8248) string function: length

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8248:
--

 Summary: string function: length
 Key: SPARK-8248
 URL: https://issues.apache.org/jira/browse/SPARK-8248
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


length(string A): int

Returns the length of the string.







[jira] [Created] (SPARK-8250) string function: alias lower/lcase

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8250:
--

 Summary: string function: alias lower/lcase
 Key: SPARK-8250
 URL: https://issues.apache.org/jira/browse/SPARK-8250
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin


Alias lower/lcase in FunctionRegistry.







[jira] [Created] (SPARK-8249) string function: locate

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8249:
--

 Summary: string function: locate
 Key: SPARK-8249
 URL: https://issues.apache.org/jira/browse/SPARK-8249
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


locate(string substr, string str[, int pos]): int

Returns the position of the first occurrence of substr in str after position 
pos.
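A minimal Python sketch of the described behavior (a reference model only; the default `pos=1`, matching Hive, is an assumption since the ticket leaves the default unstated):

```python
def locate(substr, s, pos=1):
    """1-based position of the first occurrence of substr in s at or after pos."""
    if substr is None or s is None:
        return None
    return s.find(substr, pos - 1) + 1  # convert the 1-based pos to 0-based
```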








[jira] [Created] (SPARK-8252) string function: lpad

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8252:
--

 Summary: string function: lpad
 Key: SPARK-8252
 URL: https://issues.apache.org/jira/browse/SPARK-8252
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


lpad(string str, int len, string pad): string

Returns str, left-padded with pad to a length of len.
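A small Python sketch of the padding rule (illustrative only; the truncate-when-longer case follows Hive's documented lpad behavior and is an assumption, since the ticket does not spell it out):

```python
def lpad(s, length, pad):
    """Left-pad s with pad (repeated as needed) to exactly `length` chars.
    Following Hive, a string longer than `length` is truncated."""
    if len(s) >= length:
        return s[:length]
    return (pad * length)[: length - len(s)] + s
```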







[jira] [Created] (SPARK-8253) string function: ltrim

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8253:
--

 Summary: string function: ltrim
 Key: SPARK-8253
 URL: https://issues.apache.org/jira/browse/SPARK-8253
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


ltrim(string A): string

Returns the string resulting from trimming spaces from the beginning (left-hand 
side) of A. For example, ltrim(' foobar ') results in 'foobar '.








[jira] [Created] (SPARK-8251) string function: alias upper / ucase

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8251:
--

 Summary: string function: alias upper / ucase
 Key: SPARK-8251
 URL: https://issues.apache.org/jira/browse/SPARK-8251
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin


Alias upper / ucase in FunctionRegistry.







[jira] [Created] (SPARK-8254) string function: printf

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8254:
--

 Summary: string function: printf
 Key: SPARK-8254
 URL: https://issues.apache.org/jira/browse/SPARK-8254
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


printf(String format, Obj... args): string

Returns the input formatted according to printf-style format strings.


We need to come up with a name for this in DataFrame -- maybe formatString.






[jira] [Created] (SPARK-8255) string function: regexp_extract

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8255:
--

 Summary: string function: regexp_extract
 Key: SPARK-8255
 URL: https://issues.apache.org/jira/browse/SPARK-8255
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


regexp_extract(string subject, string pattern, int index): string

Returns the string extracted using the pattern. For example, 
regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'. Note that some 
care is necessary in using predefined character classes: using '\s' as the 
second argument will match the letter s; '\\s' is necessary to match 
whitespace, etc. The 'index' parameter is the Java regex Matcher group() method 
index. See docs/api/java/util/regex/Matcher.html for more information on the 
'index' or Java regex group() method.
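The group-extraction behavior can be sketched with Python's `re` module (a hedged stand-in: the real function uses Java regexes, whose syntax differs from Python's in some corners, though not for simple patterns like the example):

```python
import re

def regexp_extract(subject, pattern, index):
    """Return group `index` of the first match of pattern in subject, else ''."""
    m = re.search(pattern, subject)
    return m.group(index) if m else ""
```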







[jira] [Created] (SPARK-8257) string function: repeat

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8257:
--

 Summary: string function: repeat
 Key: SPARK-8257
 URL: https://issues.apache.org/jira/browse/SPARK-8257
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


repeat(string str, int n): string

Repeats str n times.







[jira] [Created] (SPARK-8256) string function: regexp_replace

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8256:
--

 Summary: string function: regexp_replace
 Key: SPARK-8256
 URL: https://issues.apache.org/jira/browse/SPARK-8256
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT): 
string

Returns the string resulting from replacing all substrings in INITIAL_STRING 
that match the java regular expression syntax defined in PATTERN with instances 
of REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 
'fb'. Note that some care is necessary in using predefined character classes: 
using '\s' as the second argument will match the letter s; '\\s' is necessary 
to match whitespace, etc.
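The replace-all-matches behavior maps directly onto Python's `re.sub` (an illustrative equivalent; the real function uses Java regex syntax, which differs from Python's in some corners):

```python
import re

def regexp_replace(initial, pattern, replacement):
    """Replace every match of pattern in initial with replacement."""
    return re.sub(pattern, replacement, initial)
```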









[jira] [Created] (SPARK-8259) string function: rpad

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8259:
--

 Summary: string function: rpad
 Key: SPARK-8259
 URL: https://issues.apache.org/jira/browse/SPARK-8259
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


rpad(string str, int len, string pad): string

Returns str, right-padded with pad to a length of len.







[jira] [Created] (SPARK-8258) string function: reverse

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8258:
--

 Summary: string function: reverse
 Key: SPARK-8258
 URL: https://issues.apache.org/jira/browse/SPARK-8258
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


reverse(string A): string

Returns the reversed string.







[jira] [Created] (SPARK-8260) string function: rtrim

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8260:
--

 Summary: string function: rtrim
 Key: SPARK-8260
 URL: https://issues.apache.org/jira/browse/SPARK-8260
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


rtrim(string A): string

Returns the string resulting from trimming spaces from the end (right-hand 
side) of A. For example, rtrim(' foobar ') results in ' foobar'.







[jira] [Created] (SPARK-8262) string function: split

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8262:
--

 Summary: string function: split
 Key: SPARK-8262
 URL: https://issues.apache.org/jira/browse/SPARK-8262
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


split(string str, string pat): array[string]

Splits str around pat (pat is a regular expression).









[jira] [Created] (SPARK-8261) string function: space

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8261:
--

 Summary: string function: space
 Key: SPARK-8261
 URL: https://issues.apache.org/jira/browse/SPARK-8261
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


space(int n): string

Returns a string of n spaces.







[jira] [Created] (SPARK-8263) string function: substr/substring should also support binary type

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8263:
--

 Summary: string function: substr/substring should also support 
binary type
 Key: SPARK-8263
 URL: https://issues.apache.org/jira/browse/SPARK-8263
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Priority: Minor


See Hive's: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF






[jira] [Created] (SPARK-8265) Add LinearDataGenerator to pyspark.mllib.utils

2015-06-09 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-8265:
--

 Summary: Add LinearDataGenerator to pyspark.mllib.utils
 Key: SPARK-8265
 URL: https://issues.apache.org/jira/browse/SPARK-8265
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Priority: Minor


This is useful for testing various linear models in PySpark.






[jira] [Created] (SPARK-8264) string function: substring_index

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8264:
--

 Summary: string function: substring_index
 Key: SPARK-8264
 URL: https://issues.apache.org/jira/browse/SPARK-8264
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


substring_index(string A, string delim, int count): string



Returns the substring from string A before count occurrences of the delimiter 
delim (as of Hive 1.3.0). If count is positive, everything to the left of the 
final delimiter (counting from the left) is returned. If count is negative, 
everything to the right of the final delimiter (counting from the right) is 
returned. Substring_index performs a case-sensitive match when searching for 
delim. Example: substring_index('www.apache.org', '.', 2) = 'www.apache'.
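The counting rule above can be sketched in Python (a reference model of the documented behavior, not Spark's code; returning '' for count == 0 follows MySQL's SUBSTRING_INDEX and is an assumption, as the ticket leaves that case unstated):

```python
def substring_index(s, delim, count):
    """Substring of s before `count` occurrences of delim;
    a negative count counts delimiters from the right."""
    if count == 0:
        return ""
    parts = s.split(delim)
    return delim.join(parts[:count]) if count > 0 else delim.join(parts[count:])
```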







[jira] [Created] (SPARK-8266) string function: translate

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8266:
--

 Summary: string function: translate
 Key: SPARK-8266
 URL: https://issues.apache.org/jira/browse/SPARK-8266
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Priority: Minor


translate(string|char|varchar input, string|char|varchar from, 
string|char|varchar to): string

Translates the input string by replacing the characters present in the from 
string with the corresponding characters in the to string. This is similar to 
the translate function in PostgreSQL. If any of the parameters to this UDF are 
NULL, the result is NULL as well.
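A hedged Python sketch of the character mapping (illustrative only; dropping `from` characters that have no `to` counterpart mirrors PostgreSQL's translate and is an assumption beyond the ticket text):

```python
def translate(inp, frm, to):
    """Map each character of frm to the same-position character of to;
    characters of frm beyond len(to) are deleted from the input."""
    if inp is None or frm is None or to is None:
        return None
    table = {ord(f): (to[i] if i < len(to) else None) for i, f in enumerate(frm)}
    return inp.translate(table)
```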






[jira] [Created] (SPARK-8267) string function: trim

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8267:
--

 Summary: string function: trim
 Key: SPARK-8267
 URL: https://issues.apache.org/jira/browse/SPARK-8267
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


trim(string A): string

Returns the string resulting from trimming spaces from both ends of A. For 
example, trim(' foobar ') results in 'foobar'.







[jira] [Created] (SPARK-8269) string function: initcap

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8269:
--

 Summary: string function: initcap
 Key: SPARK-8269
 URL: https://issues.apache.org/jira/browse/SPARK-8269
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


initcap(string A): string

Returns the string with the first letter of each word in uppercase and all 
other letters in lowercase. Words are delimited by whitespace.






[jira] [Created] (SPARK-8268) string function: unbase64

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8268:
--

 Summary: string function: unbase64
 Key: SPARK-8268
 URL: https://issues.apache.org/jira/browse/SPARK-8268
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


unbase64(string str): binary

Converts the argument from a base 64 string to BINARY.






[jira] [Created] (SPARK-8270) string function: levenshtein

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8270:
--

 Summary: string function: levenshtein
 Key: SPARK-8270
 URL: https://issues.apache.org/jira/browse/SPARK-8270
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


levenshtein(string A, string B): int

Returns the Levenshtein distance between two strings (as of Hive 1.2.0). For 
example, levenshtein('kitten', 'sitting') results in 3.
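The distance can be computed with the classic dynamic-programming recurrence, sketched here in Python as a reference model (not Spark's implementation):

```python
def levenshtein(a, b):
    """Edit distance between a and b, computed one DP row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```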







[jira] [Created] (SPARK-8271) string function: soundex

2015-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8271:
--

 Summary: string function: soundex
 Key: SPARK-8271
 URL: https://issues.apache.org/jira/browse/SPARK-8271
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


soundex(string A): string

Returns soundex code of the string. For example, soundex('Miller') results in 
M460.
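American Soundex can be sketched compactly in Python (a hedged reference model of the standard algorithm, not Spark's code; it assumes a non-empty alphabetic input):

```python
def soundex(word):
    """American Soundex: first letter, then up to three digits, zero-padded."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
    code = {c: d for letters, d in groups.items() for c in letters}
    word = word.lower()
    out, prev = word[0].upper(), code.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":          # h and w are transparent to adjacency
            continue
        d = code.get(ch, "")    # vowels break a run of equal digits
        if d and d != prev:
            out += d
        prev = d
    return (out + "000")[:4]
```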







[jira] [Commented] (SPARK-8159) Improve expression coverage

2015-06-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578488#comment-14578488
 ] 

Sean Owen commented on SPARK-8159:
--

I suppose it's not a big deal, but do there really need to be _hundreds_ of 
JIRAs to track each function? Is there no logical grouping of these that forms 
tasks?

> Improve expression coverage
> ---
>
> Key: SPARK-8159
> URL: https://issues.apache.org/jira/browse/SPARK-8159
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket to track new expressions we are adding to 
> SQL/DataFrame.
> For each new expression, we should:
> 1. Add a new Expression implementation in 
> org.apache.spark.sql.catalyst.expressions
> 2. If applicable, implement the code generated version (by implementing 
> genCode).
> 3. Add comprehensive unit tests (for all the data types the expressions 
> support).
> 4. If applicable, add a new function for DataFrame in 
> org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for 
> Python.
> For date/time functions, put them in expressions/datetime.scala, and create a 
> DateTimeFunctionSuite.scala for testing.






[jira] [Commented] (SPARK-8159) Improve expression coverage

2015-06-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578490#comment-14578490
 ] 

Reynold Xin commented on SPARK-8159:


It's easier to parallelize the work this way. Also it's unclear which 
expressions are small enough to be grouped together.

> Improve expression coverage
> ---
>
> Key: SPARK-8159
> URL: https://issues.apache.org/jira/browse/SPARK-8159
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket to track new expressions we are adding to 
> SQL/DataFrame.
> For each new expression, we should:
> 1. Add a new Expression implementation in 
> org.apache.spark.sql.catalyst.expressions
> 2. If applicable, implement the code generated version (by implementing 
> genCode).
> 3. Add comprehensive unit tests (for all the data types the expressions 
> support).
> 4. If applicable, add a new function for DataFrame in 
> org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for 
> Python.
> For date/time functions, put them in expressions/datetime.scala, and create a 
> DateTimeFunctionSuite.scala for testing.






[jira] [Created] (SPARK-8272) BigDecimal in parquet not working

2015-06-09 Thread Bipin Roshan Nag (JIRA)
Bipin Roshan Nag created SPARK-8272:
---

 Summary: BigDecimal in parquet not working
 Key: SPARK-8272
 URL: https://issues.apache.org/jira/browse/SPARK-8272
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.3.1
 Environment: Ubuntu 14.0 LTS
Reporter: Bipin Roshan Nag


When trying to save a DataFrame to a Parquet file, I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 
311, localhost): java.lang.ClassCastException: scala.runtime.BoxedUnit cannot 
be cast to org.apache.spark.sql.types.Decimal
at 
org.apache.spark.sql.parquet.RowWriteSupport.writePrimitive(ParquetTableSupport.scala:220)
at 
org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:192)
at 
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
at 
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
at 
parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at 
org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:671)
at 
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
at 
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

I cannot save the dataframe. Please help.






[jira] [Commented] (SPARK-8159) Improve expression coverage

2015-06-09 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578502#comment-14578502
 ] 

Cheng Hao commented on SPARK-8159:
--

Agreed, it would be easier to track the progress.

> Improve expression coverage
> ---
>
> Key: SPARK-8159
> URL: https://issues.apache.org/jira/browse/SPARK-8159
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket to track new expressions we are adding to 
> SQL/DataFrame.
> For each new expression, we should:
> 1. Add a new Expression implementation in 
> org.apache.spark.sql.catalyst.expressions
> 2. If applicable, implement the code generated version (by implementing 
> genCode).
> 3. Add comprehensive unit tests (for all the data types the expressions 
> support).
> 4. If applicable, add a new function for DataFrame in 
> org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for 
> Python.
> For date/time functions, put them in expressions/datetime.scala, and create a 
> DateTimeFunctionSuite.scala for testing.






[jira] [Commented] (SPARK-8159) Improve expression coverage

2015-06-09 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578503#comment-14578503
 ] 

Cheng Hao commented on SPARK-8159:
--

One more question: is it possible to assign the task to myself when starting 
the work? That way we can avoid duplicated effort.

> Improve expression coverage
> ---
>
> Key: SPARK-8159
> URL: https://issues.apache.org/jira/browse/SPARK-8159
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket to track new expressions we are adding to 
> SQL/DataFrame.
> For each new expression, we should:
> 1. Add a new Expression implementation in 
> org.apache.spark.sql.catalyst.expressions
> 2. If applicable, implement the code generated version (by implementing 
> genCode).
> 3. Add comprehensive unit tests (for all the data types the expressions 
> support).
> 4. If applicable, add a new function for DataFrame in 
> org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for 
> Python.
> For date/time functions, put them in expressions/datetime.scala, and create a 
> DateTimeFunctionSuite.scala for testing.






[jira] [Commented] (SPARK-8102) Big performance difference when joining 3 tables in different order

2015-06-09 Thread Hao Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578504#comment-14578504
 ] 

Hao Ren commented on SPARK-8102:


For the first query, {{CartesianProduct}} is necessary. 
But for the third query, even without {{CartesianProduct}}, it is much slower 
and has much more {{shuffle write}} than the second one.
Check out the attached images for more details.

> Big performance difference when joining 3 tables in different order
> ---
>
> Key: SPARK-8102
> URL: https://issues.apache.org/jira/browse/SPARK-8102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: spark in local mode
>Reporter: Hao Ren
> Attachments: query2job.png, query3job.png
>
>
> Given 3 tables loaded from CSV files: 
> ( tables name => size)
> *click_meter_site_grouped* =>10 687 455 bytes
> *t_zipcode* => 2 738 954 bytes
> *t_category* => 2 182 bytes
> When joining the 3 tables, I notice a large performance difference if they 
> are joined in different order.
> Here are the SQL queries to compare:
> {code}
> -- snippet 1
> SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
> FROM t_category c, t_zipcode z, click_meter_site_grouped g
> WHERE c.refCategoryID = g.category AND z.regionCode = g.region
> {code}
> {code}
> -- snippet 2
> SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
> FROM t_category c, click_meter_site_grouped g, t_zipcode z
> WHERE c.refCategoryID = g.category AND z.regionCode = g.region
> {code}
> As you see, the largest table *click_meter_site_grouped* is the last table in 
> FROM clause in the first snippet,  and it is in the middle of table list in 
> second one.
> Snippet 2 runs three times faster than Snippet 1.
> (8 seconds vs. 24 seconds)
> As the data is just sampled from a large data set, if we test it on the 
> original data set, it will normally result in a performance issue.
> After checking the log, we found something strange in snippet 1's log:
> 15/06/04 15:32:03 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:04 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:04 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954

[jira] [Commented] (SPARK-8159) Improve expression coverage

2015-06-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578507#comment-14578507
 ] 

Reynold Xin commented on SPARK-8159:


The protocol we have been using is to leave a comment on the JIRA ticket.


> Improve expression coverage
> ---
>
> Key: SPARK-8159
> URL: https://issues.apache.org/jira/browse/SPARK-8159
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket to track new expressions we are adding to 
> SQL/DataFrame.
> For each new expression, we should:
> 1. Add a new Expression implementation in 
> org.apache.spark.sql.catalyst.expressions
> 2. If applicable, implement the code generated version (by implementing 
> genCode).
> 3. Add comprehensive unit tests (for all the data types the expressions 
> support).
> 4. If applicable, add a new function for DataFrame in 
> org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for 
> Python.
> For date/time functions, put them in expressions/datetime.scala, and create a 
> DateTimeFunctionSuite.scala for testing.






[jira] [Updated] (SPARK-8159) Improve SQL/DataFrame expression coverage

2015-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8159:
---
Summary: Improve SQL/DataFrame expression coverage  (was: Improve 
expression coverage)

> Improve SQL/DataFrame expression coverage
> -
>
> Key: SPARK-8159
> URL: https://issues.apache.org/jira/browse/SPARK-8159
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket to track new expressions we are adding to 
> SQL/DataFrame.
> For each new expression, we should:
> 1. Add a new Expression implementation in 
> org.apache.spark.sql.catalyst.expressions
> 2. If applicable, implement the code generated version (by implementing 
> genCode).
> 3. Add comprehensive unit tests (for all the data types the expressions 
> support).
> 4. If applicable, add a new function for DataFrame in 
> org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for 
> Python.
> For date/time functions, put them in expressions/datetime.scala, and create a 
> DateTimeFunctionSuite.scala for testing.






[jira] [Commented] (SPARK-8159) Improve SQL/DataFrame expression coverage

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578508#comment-14578508
 ] 

Adrian Wang commented on SPARK-8159:


Are we missing xpath functions?

> Improve SQL/DataFrame expression coverage
> -
>
> Key: SPARK-8159
> URL: https://issues.apache.org/jira/browse/SPARK-8159
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket to track new expressions we are adding to 
> SQL/DataFrame.
> For each new expression, we should:
> 1. Add a new Expression implementation in 
> org.apache.spark.sql.catalyst.expressions
> 2. If applicable, implement the code generated version (by implementing 
> genCode).
> 3. Add comprehensive unit tests (for all the data types the expressions 
> support).
> 4. If applicable, add a new function for DataFrame in 
> org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for 
> Python.
> For date/time functions, put them in expressions/datetime.scala, and create a 
> DateTimeFunctionSuite.scala for testing.






[jira] [Commented] (SPARK-8159) Improve SQL/DataFrame expression coverage

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578509#comment-14578509
 ] 

Adrian Wang commented on SPARK-8159:


Are we missing xpath functions?

> Improve SQL/DataFrame expression coverage
> -
>
> Key: SPARK-8159
> URL: https://issues.apache.org/jira/browse/SPARK-8159
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket to track new expressions we are adding to 
> SQL/DataFrame.
> For each new expression, we should:
> 1. Add a new Expression implementation in 
> org.apache.spark.sql.catalyst.expressions
> 2. If applicable, implement the code generated version (by implementing 
> genCode).
> 3. Add comprehensive unit tests (for all the data types the expressions 
> support).
> 4. If applicable, add a new function for DataFrame in 
> org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for 
> Python.
> For date/time functions, put them in expressions/datetime.scala, and create a 
> DateTimeFunctionSuite.scala for testing.






[jira] [Issue Comment Deleted] (SPARK-8159) Improve SQL/DataFrame expression coverage

2015-06-09 Thread Adrian Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Wang updated SPARK-8159:
---
Comment: was deleted

(was: Are we missing xpath functions?)

> Improve SQL/DataFrame expression coverage
> -
>
> Key: SPARK-8159
> URL: https://issues.apache.org/jira/browse/SPARK-8159
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket to track new expressions we are adding to 
> SQL/DataFrame.
> For each new expression, we should:
> 1. Add a new Expression implementation in 
> org.apache.spark.sql.catalyst.expressions
> 2. If applicable, implement the code generated version (by implementing 
> genCode).
> 3. Add comprehensive unit tests (for all the data types the expressions 
> support).
> 4. If applicable, add a new function for DataFrame in 
> org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for 
> Python.
> For date/time functions, put them in expressions/datetime.scala, and create a 
> DateTimeFunctionSuite.scala for testing.






[jira] [Commented] (SPARK-8174) date/time function: unix_timestamp

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578513#comment-14578513
 ] 

Adrian Wang commented on SPARK-8174:


I'll deal with this.

> date/time function: unix_timestamp
> --
>
> Key: SPARK-8174
> URL: https://issues.apache.org/jira/browse/SPARK-8174
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> 3 variants:
> {code}
> unix_timestamp(): long
> Gets current Unix timestamp in seconds.
> unix_timestamp(string|date): long
> Converts a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in 
> seconds), using the default timezone and the default locale; returns 0 on 
> failure: unix_timestamp('2009-03-20 11:30:01') = 1237573801
> unix_timestamp(string date, string pattern): long
> Converts a time string with the given pattern (see 
> [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) 
> to a Unix timestamp (in seconds); returns 0 on failure: 
> unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.
> {code}
> See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
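The three variants described above can be sketched in plain Python (this is illustrative only, not Spark's implementation; it uses `strptime` patterns rather than the SimpleDateFormat patterns the real function takes):

```python
# Plain-Python sketch of the three unix_timestamp variants described above.
import time
from datetime import datetime

def unix_timestamp(s=None, pattern="%Y-%m-%d %H:%M:%S"):
    if s is None:
        return int(time.time())  # variant 1: current Unix timestamp in seconds
    try:
        # variants 2 and 3: parse in the default (local) timezone
        return int(time.mktime(datetime.strptime(s, pattern).timetuple()))
    except ValueError:
        return 0  # like the Hive UDF, return 0 on parse failure

print(unix_timestamp("not a date"))  # 0
```

Note that the result for a valid string depends on the system timezone, so the ticket's example value (1237573801) only holds in US Pacific time.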






[jira] [Commented] (SPARK-8181) date/time function: hour

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578522#comment-14578522
 ] 

Adrian Wang commented on SPARK-8181:


I'll deal with this.

> date/time function: hour
> 
>
> Key: SPARK-8181
> URL: https://issues.apache.org/jira/browse/SPARK-8181
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> hour(string|date|timestamp): int
> Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, 
> hour('12:58:59') = 12.
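A minimal Python sketch of the semantics above, accepting either of the two string forms shown in the examples (not Spark code):

```python
# Illustrative hour(): works on 'yyyy-MM-dd HH:mm:ss' or bare 'HH:mm:ss' strings.
from datetime import datetime

def hour(s):
    for fmt in ("%Y-%m-%d %H:%M:%S", "%H:%M:%S"):
        try:
            return datetime.strptime(s, fmt).hour
        except ValueError:
            continue
    raise ValueError("unparseable time string: %r" % s)

print(hour("2009-07-30 12:58:59"), hour("12:58:59"))  # 12 12
```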






[jira] [Commented] (SPARK-8178) date/time function: quarter

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578519#comment-14578519
 ] 

Adrian Wang commented on SPARK-8178:


I'll deal with this.

> date/time function: quarter
> ---
>
> Key: SPARK-8178
> URL: https://issues.apache.org/jira/browse/SPARK-8178
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> quarter(string|date|timestamp): int
> Returns the quarter of the year for a date, timestamp, or string in the range 
> 1 to 4. Example: quarter('2015-04-08') = 2.
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
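The mapping described above (months 1-3 to quarter 1, ..., months 10-12 to quarter 4) reduces to one line of arithmetic; a plain-Python sketch:

```python
# Quarter of the year as described above: (month - 1) // 3 + 1.
from datetime import date

def quarter(s):
    d = date.fromisoformat(s[:10])  # accept a date or timestamp string
    return (d.month - 1) // 3 + 1

print(quarter("2015-04-08"))  # 2, the ticket's example
```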






[jira] [Commented] (SPARK-8176) date/time function: to_date

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578516#comment-14578516
 ] 

Adrian Wang commented on SPARK-8176:


I'll deal with this.

> date/time function: to_date
> ---
>
> Key: SPARK-8176
> URL: https://issues.apache.org/jira/browse/SPARK-8176
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> to_date(date|timestamp): date
> to_date(string): string
> Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = 
> "1970-01-01".
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
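A sketch of the string variant's semantics in plain Python (illustrative only; the real function also handles date and timestamp inputs):

```python
# to_date(): extract the date part of a 'yyyy-MM-dd HH:mm:ss' timestamp string.
from datetime import datetime

def to_date(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date().isoformat()

print(to_date("1970-01-01 00:00:00"))  # 1970-01-01
```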






[jira] [Commented] (SPARK-8180) date/time function: day / dayofmonth

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578521#comment-14578521
 ] 

Adrian Wang commented on SPARK-8180:


I'll deal with this.

> date/time function: day / dayofmonth 
> -
>
> Key: SPARK-8180
> URL: https://issues.apache.org/jira/browse/SPARK-8180
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> day(string|date|timestamp): int
> dayofmonth(string|date|timestamp): int
> Returns the day part of a date or a timestamp string: day("1970-11-01 
> 00:00:00") = 1, day("1970-11-01") = 1.
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF






[jira] [Commented] (SPARK-8179) date/time function: month

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578520#comment-14578520
 ] 

Adrian Wang commented on SPARK-8179:


I'll deal with this.

> date/time function: month
> -
>
> Key: SPARK-8179
> URL: https://issues.apache.org/jira/browse/SPARK-8179
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> month(string|date|timestamp): int
> Returns the month part of a date or a timestamp string: month("1970-11-01 
> 00:00:00") = 11, month("1970-11-01") = 11.
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF






[jira] [Commented] (SPARK-8177) date/time function: year

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578518#comment-14578518
 ] 

Adrian Wang commented on SPARK-8177:


I'll deal with this.

> date/time function: year
> 
>
> Key: SPARK-8177
> URL: https://issues.apache.org/jira/browse/SPARK-8177
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> year(string|date|timestamp): int
> Returns the year part of a date or a timestamp string: year("1970-01-01 
> 00:00:00") = 1970, year("1970-01-01") = 1970.
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF






[jira] [Commented] (SPARK-8175) date/time function: from_unixtime

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578515#comment-14578515
 ] 

Adrian Wang commented on SPARK-8175:


I'll deal with this.

> date/time function: from_unixtime
> -
>
> Key: SPARK-8175
> URL: https://issues.apache.org/jira/browse/SPARK-8175
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> from_unixtime(bigint unixtime[, string format]): string
> Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a 
> string representing the timestamp of that moment in the current system time 
> zone, in the format "1970-01-01 00:00:00".
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
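The conversion described above can be sketched in plain Python (illustrative only; the format argument here is a `strftime` pattern, not the SimpleDateFormat pattern the real function takes):

```python
# from_unixtime() sketch: epoch seconds -> string in the system time zone.
from datetime import datetime

def from_unixtime(unixtime, fmt="%Y-%m-%d %H:%M:%S"):
    return datetime.fromtimestamp(unixtime).strftime(fmt)

print(from_unixtime(0))  # the epoch, rendered in the current system time zone
```

Because the result is rendered in the local time zone, the same epoch value prints differently on machines in different zones.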






[jira] [Commented] (SPARK-8183) date/time function: second

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578524#comment-14578524
 ] 

Adrian Wang commented on SPARK-8183:


I'll deal with this.

> date/time function: second
> --
>
> Key: SPARK-8183
> URL: https://issues.apache.org/jira/browse/SPARK-8183
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> second(string|date|timestamp): int
> Returns the second of the timestamp.
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF






[jira] [Commented] (SPARK-8184) date/time function: weekofyear

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578525#comment-14578525
 ] 

Adrian Wang commented on SPARK-8184:


I'll deal with this.

> date/time function: weekofyear
> --
>
> Key: SPARK-8184
> URL: https://issues.apache.org/jira/browse/SPARK-8184
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> weekofyear(string|date|timestamp): int
> Returns the week number of a timestamp string: weekofyear("1970-11-01 
> 00:00:00") = 44, weekofyear("1970-11-01") = 44.






[jira] [Commented] (SPARK-8182) date/time function: minute

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578523#comment-14578523
 ] 

Adrian Wang commented on SPARK-8182:


I'll deal with this.

> date/time function: minute
> --
>
> Key: SPARK-8182
> URL: https://issues.apache.org/jira/browse/SPARK-8182
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> minute(string|date|timestamp): int
> Returns the minute of the timestamp.
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF






[jira] [Commented] (SPARK-8265) Add LinearDataGenerator to pyspark.mllib.utils

2015-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578529#comment-14578529
 ] 

Apache Spark commented on SPARK-8265:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/6715

> Add LinearDataGenerator to pyspark.mllib.utils
> --
>
> Key: SPARK-8265
> URL: https://issues.apache.org/jira/browse/SPARK-8265
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> This is useful for testing various linear models in PySpark.






[jira] [Assigned] (SPARK-8265) Add LinearDataGenerator to pyspark.mllib.utils

2015-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8265:
---

Assignee: (was: Apache Spark)

> Add LinearDataGenerator to pyspark.mllib.utils
> --
>
> Key: SPARK-8265
> URL: https://issues.apache.org/jira/browse/SPARK-8265
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> This is useful for testing various linear models in PySpark.






[jira] [Commented] (SPARK-8191) date/time function: to_utc_timestamp

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578534#comment-14578534
 ] 

Adrian Wang commented on SPARK-8191:


I'll deal with this.

> date/time function: to_utc_timestamp
> 
>
> Key: SPARK-8191
> URL: https://issues.apache.org/jira/browse/SPARK-8191
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> to_utc_timestamp(timestamp, string timezone): timestamp
> Assumes given timestamp is in given timezone and converts to UTC (as of Hive 
> 0.8.0). For example, to_utc_timestamp('1970-01-01 00:00:00','PST') returns 
> 1970-01-01 08:00:00.
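A sketch of these semantics using Python's `zoneinfo` (3.9+). Note that `zoneinfo` takes IANA zone names, so 'America/Los_Angeles' stands in here for the 'PST' abbreviation used in the Hive example:

```python
# to_utc_timestamp() sketch: reinterpret the naive timestamp in the given
# zone, then convert to UTC.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_utc_timestamp(ts, tz):
    local = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=ZoneInfo(tz))
    return local.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

print(to_utc_timestamp("1970-01-01 00:00:00", "America/Los_Angeles"))
# 1970-01-01 08:00:00
```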






[jira] [Commented] (SPARK-8187) date/time function: date_sub

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578532#comment-14578532
 ] 

Adrian Wang commented on SPARK-8187:


I'll deal with this.

> date/time function: date_sub
> 
>
> Key: SPARK-8187
> URL: https://issues.apache.org/jira/browse/SPARK-8187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> date_sub(string startdate, int days): string
> date_sub(date startdate, int days): date
> Subtracts a number of days from startdate: date_sub('2008-12-31', 1) = 
> '2008-12-30'.
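The date variant reduces to a `timedelta` subtraction; a plain-Python sketch (not Spark code):

```python
# date_sub() sketch: subtract a number of days from a 'yyyy-MM-dd' date string.
from datetime import date, timedelta

def date_sub(startdate, days):
    return (date.fromisoformat(startdate) - timedelta(days=days)).isoformat()

print(date_sub("2008-12-31", 1))  # 2008-12-30
```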






[jira] [Comment Edited] (SPARK-8102) Big performance difference when joining 3 tables in different order

2015-06-09 Thread Hao Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578504#comment-14578504
 ] 

Hao Ren edited comment on SPARK-8102 at 6/9/15 8:06 AM:


For the first query, {{CartesianProduct}} is necessary.
What I am trying to understand is that, for the third query, even without 
{{CartesianProduct}}, it is much slower and has much more {{shuffle write}} 
than the second one.
Please check out the attached images for more details.


was (Author: invkrh):
For the first query, {{CartesianProduct}} is necessary. 
But for the third query, even without {{CartesianProduct}}, it is much slower 
and has much more {{shuffle write}} than the second one.
Check out the attached images for more details.

> Big performance difference when joining 3 tables in different order
> ---
>
> Key: SPARK-8102
> URL: https://issues.apache.org/jira/browse/SPARK-8102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: spark in local mode
>Reporter: Hao Ren
> Attachments: query2job.png, query3job.png
>
>
> Given 3 tables loaded from CSV files 
> (table name => size):
> *click_meter_site_grouped* =>10 687 455 bytes
> *t_zipcode* => 2 738 954 bytes
> *t_category* => 2 182 bytes
> When joining the 3 tables, I notice a large performance difference if they 
> are joined in different order.
> Here are the SQL queries to compare:
> {code}
> -- snippet 1
> SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
> FROM t_category c, t_zipcode z, click_meter_site_grouped g
> WHERE c.refCategoryID = g.category AND z.regionCode = g.region
> {code}
> {code}
> -- snippet 2
> SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
> FROM t_category c, click_meter_site_grouped g, t_zipcode z
> WHERE c.refCategoryID = g.category AND z.regionCode = g.region
> {code}
> As you can see, the largest table, *click_meter_site_grouped*, is the last table 
> in the FROM clause in the first snippet, and in the middle of the table list in 
> the second one.
> Snippet 2 runs three times faster than Snippet 1
> (8 seconds vs. 24 seconds).
> As the data here is just a sample of a larger data set, testing on the 
> original data set would normally turn this into a real performance issue.
> After checking the logs, we found something strange in snippet 1's log:
> 15/06/04 15:32:03 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:04 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:04 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:06 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:06 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:06 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:08 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:08 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:08 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/

[jira] [Commented] (SPARK-8192) date/time function: current_date

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578535#comment-14578535
 ] 

Adrian Wang commented on SPARK-8192:


I'll deal with this.

> date/time function: current_date
> 
>
> Key: SPARK-8192
> URL: https://issues.apache.org/jira/browse/SPARK-8192
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> current_date(): date
> Returns the current date at the start of query evaluation (as of Hive 1.2.0). 
> All calls of current_date within the same query return the same value.
> We should just replace this with a date literal in the optimizer.






[jira] [Commented] (SPARK-8186) date/time function: date_add

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578531#comment-14578531
 ] 

Adrian Wang commented on SPARK-8186:


I'll deal with this.

> date/time function: date_add
> 
>
> Key: SPARK-8186
> URL: https://issues.apache.org/jira/browse/SPARK-8186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> date_add(string startdate, int days): string
> date_add(date startdate, int days): date
> Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'.
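As with date_sub, the date variant is a one-line `timedelta` addition; a plain-Python sketch (not Spark code):

```python
# date_add() sketch: add a number of days to a 'yyyy-MM-dd' date string.
from datetime import date, timedelta

def date_add(startdate, days):
    return (date.fromisoformat(startdate) + timedelta(days=days)).isoformat()

print(date_add("2008-12-31", 1))  # 2009-01-01
```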






[jira] [Commented] (SPARK-8185) date/time function: datediff

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578530#comment-14578530
 ] 

Adrian Wang commented on SPARK-8185:


I'll deal with this.

> date/time function: datediff
> 
>
> Key: SPARK-8185
> URL: https://issues.apache.org/jira/browse/SPARK-8185
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> datediff(date enddate, date startdate): int
> Returns the number of days from startdate to enddate: datediff('2009-03-01', 
> '2009-02-27') = 2.
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
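The semantics above (whole days from startdate to enddate) can be sketched in plain Python:

```python
# datediff() sketch: number of days from startdate to enddate.
from datetime import date

def datediff(enddate, startdate):
    return (date.fromisoformat(enddate) - date.fromisoformat(startdate)).days

print(datediff("2009-03-01", "2009-02-27"))  # 2
```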






[jira] [Assigned] (SPARK-8265) Add LinearDataGenerator to pyspark.mllib.utils

2015-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8265:
---

Assignee: Apache Spark

> Add LinearDataGenerator to pyspark.mllib.utils
> --
>
> Key: SPARK-8265
> URL: https://issues.apache.org/jira/browse/SPARK-8265
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Manoj Kumar
>Assignee: Apache Spark
>Priority: Minor
>
> This is useful for testing various linear models in PySpark.






[jira] [Commented] (SPARK-8188) date/time function: from_utc_timestamp

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578533#comment-14578533
 ] 

Adrian Wang commented on SPARK-8188:


I'll deal with this.

> date/time function: from_utc_timestamp
> --
>
> Key: SPARK-8188
> URL: https://issues.apache.org/jira/browse/SPARK-8188
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> from_utc_timestamp(timestamp, string timezone): timestamp
> Assumes given timestamp is UTC and converts to given timezone (as of Hive 
> 0.8.0). For example, from_utc_timestamp('1970-01-01 08:00:00','PST') returns 
> 1970-01-01 00:00:00.
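A sketch of these semantics with Python's `zoneinfo` (3.9+), the mirror of to_utc_timestamp. As `zoneinfo` takes IANA zone names, 'America/Los_Angeles' stands in for the 'PST' abbreviation in the Hive example:

```python
# from_utc_timestamp() sketch: treat the naive timestamp as UTC, then
# convert to the given zone.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def from_utc_timestamp(ts, tz):
    utc = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return utc.astimezone(ZoneInfo(tz)).strftime("%Y-%m-%d %H:%M:%S")

print(from_utc_timestamp("1970-01-01 08:00:00", "America/Los_Angeles"))
# 1970-01-01 00:00:00
```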






[jira] [Commented] (SPARK-8195) date/time function: last_day

2015-06-09 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578537#comment-14578537
 ] 

Adrian Wang commented on SPARK-8195:


I'll deal with this.

> date/time function: last_day
> 
>
> Key: SPARK-8195
> URL: https://issues.apache.org/jira/browse/SPARK-8195
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> last_day(string date): string
> last_day(date date): date
> Returns the last day of the month which the date belongs to (as of Hive 
> 1.1.0). date is a string in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. 
> The time part of date is ignored.
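The semantics above can be sketched in plain Python using `calendar.monthrange` to find the month's length (illustrative only, not Spark code):

```python
# last_day() sketch: last calendar day of the month containing the date.
import calendar
from datetime import date

def last_day(s):
    d = date.fromisoformat(s[:10])  # time part, if present, is ignored
    return d.replace(day=calendar.monthrange(d.year, d.month)[1]).isoformat()

print(last_day("2015-02-10"))  # 2015-02-28
```

`monthrange` accounts for leap years, so February of a leap year yields day 29.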





