[jira] [Resolved] (SPARK-7909) spark-ec2 and associated tools not py3 ready

2015-07-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-7909.
---
Resolution: Fixed

 spark-ec2 and associated tools not py3 ready
 

 Key: SPARK-7909
 URL: https://issues.apache.org/jira/browse/SPARK-7909
 Project: Spark
  Issue Type: Improvement
  Components: EC2
 Environment: ec2 python3
Reporter: Matthew Goodman
Priority: Blocker

 At present there is no combination of tools that supports Python 3 on both 
 the launching computer and the running cluster. There are a couple of 
 problems involved:
  - There is no prebuilt Spark binary with Python 3 support.
  - spark-ec2/spark/init.sh contains inline Python-3-unfriendly print statements.
  - Config files for cluster processes don't seem to reach all nodes in a 
 working format.
 I have fixes for some of this, but debugging the config and the running 
 context remains elusive to me.
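
A minimal sketch (hypothetical lines, not the actual contents of init.sh) of the kind of Python-2-only print statement described above and its portable form:

{code}
from __future__ import print_function  # the same source then runs on Python 2.6+ and 3.x

# print "Unpacking Spark"        # Python-2-only syntax: a SyntaxError under Python 3
print("Unpacking Spark")         # portable replacement
{code}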



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6289) PySpark doesn't maintain SQL date Types

2015-07-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-6289.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7301
[https://github.com/apache/spark/pull/7301]

 PySpark doesn't maintain SQL date Types
 ---

 Key: SPARK-6289
 URL: https://issues.apache.org/jira/browse/SPARK-6289
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.2.1
Reporter: Michael Nazario
Assignee: Davies Liu
 Fix For: 1.5.0


 For the DateType, Spark SQL requires a datetime.date in Python. However, if 
 you collect a row based on that type, you'll end up with a returned value 
 of type datetime.datetime.
 I have tried to reproduce this using the pyspark shell, but have been unable 
 to. This is definitely a problem coming from Pyrolite though:
 https://github.com/irmen/Pyrolite/
 Pyrolite is used for datetime and date serialization, but it appears to map 
 dates to datetime objects rather than to date objects.
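
A hedged reproduction sketch of the reported behaviour (assuming an existing SparkContext/SQLContext, as in the pyspark shell):

{code}
import datetime
from pyspark.sql import Row

df = sqlContext.createDataFrame([Row(d=datetime.date(2015, 7, 9))])
value = df.collect()[0].d
print(type(value))   # expected <type 'datetime.date'>; reported as <type 'datetime.datetime'>
{code}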



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7902) SQL UDF doesn't support UDT in PySpark

2015-07-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-7902.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7301
[https://github.com/apache/spark/pull/7301]

 SQL UDF doesn't support UDT in PySpark
 --

 Key: SPARK-7902
 URL: https://issues.apache.org/jira/browse/SPARK-7902
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Davies Liu
Priority: Critical
 Fix For: 1.5.0


 We don't convert Python SQL internal types to Python types in SQL UDF 
 execution. This causes problems if the input arguments contain UDTs or the 
 return type is a UDT. Right now, the raw SQL types are passed into the Python 
 UDF and the return value is not converted to Python SQL types.
 This is the code (from [~rams]) to produce this bug. (Actually, it triggers 
 another bug first right now.)
 {code}
 from pyspark.mllib.linalg import SparseVector
 from pyspark.sql.functions import udf
 from pyspark.sql.types import IntegerType
 df = sqlContext.createDataFrame([(SparseVector(2, {0: 0.0}),)], ["features"])
 sz = udf(lambda s: s.size, IntegerType())
 df.select(sz(df.features).alias("sz")).collect()
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7909) spark-ec2 and associated tools not py3 ready

2015-07-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-7909:
--
Target Version/s: 1.5.0
Priority: Blocker  (was: Major)

 spark-ec2 and associated tools not py3 ready
 

 Key: SPARK-7909
 URL: https://issues.apache.org/jira/browse/SPARK-7909
 Project: Spark
  Issue Type: Improvement
  Components: EC2
 Environment: ec2 python3
Reporter: Matthew Goodman
Priority: Blocker

 At present there is no combination of tools that supports Python 3 on both 
 the launching computer and the running cluster. There are a couple of 
 problems involved:
  - There is no prebuilt Spark binary with Python 3 support.
  - spark-ec2/spark/init.sh contains inline Python-3-unfriendly print statements.
  - Config files for cluster processes don't seem to reach all nodes in a 
 working format.
 I have fixes for some of this, but debugging the config and the running 
 context remains elusive to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4315) PySpark pickling of pyspark.sql.Row objects is extremely inefficient

2015-07-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619611#comment-14619611
 ] 

Davies Liu commented on SPARK-4315:
---

This is fixed by https://github.com/apache/spark/pull/5445

 PySpark pickling of pyspark.sql.Row objects is extremely inefficient
 

 Key: SPARK-4315
 URL: https://issues.apache.org/jira/browse/SPARK-4315
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: Ubuntu, Python 2.7, Spark 1.1.0
Reporter: Adam Davison

 Working with an RDD of pyspark.sql.Row objects, created by reading a file 
 with SQLContext in a local PySpark context.
 Operations on the RDD, such as: data.groupBy(lambda x: x.field_name) are 
 extremely slow (more than 10x slower than an equivalent Scala/Spark 
 implementation). Obviously I expected it to be somewhat slower, but I did a 
 bit of digging given the difference was so huge.
 Luckily it's fairly easy to add profiling to the Python workers. I see that 
 the vast majority of time is spent in:
 spark-1.1.0-bin-cdh4/python/pyspark/sql.py:757(_restore_object)
 It seems that this line attempts to accelerate pickling of Rows with the use 
 of a cache. Some debugging reveals that this cache becomes quite big (100s of 
 entries). Disabling the cache by adding:
 return _create_cls(dataType)(obj)
 as the first line of _restore_object made my query run 5x faster, implying 
 that the caching is not providing the desired acceleration...
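
A hedged sketch of the workaround described above (pyspark/sql.py internals as of Spark 1.1; `_create_cls` and the original caching body are only referenced, not reproduced):

{code}
def _restore_object(dataType, obj):
    # Workaround from the report: bypass the class cache entirely and rebuild
    # the Row class every time, which the reporter measured as ~5x faster here.
    return _create_cls(dataType)(obj)
    # ... the original cache lookup / population logic would follow ...
{code}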



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4315) PySpark pickling of pyspark.sql.Row objects is extremely inefficient

2015-07-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-4315:
-

Assignee: Davies Liu

 PySpark pickling of pyspark.sql.Row objects is extremely inefficient
 

 Key: SPARK-4315
 URL: https://issues.apache.org/jira/browse/SPARK-4315
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: Ubuntu, Python 2.7, Spark 1.1.0
Reporter: Adam Davison
Assignee: Davies Liu

 Working with an RDD of pyspark.sql.Row objects, created by reading a file 
 with SQLContext in a local PySpark context.
 Operations on the RDD, such as: data.groupBy(lambda x: x.field_name) are 
 extremely slow (more than 10x slower than an equivalent Scala/Spark 
 implementation). Obviously I expected it to be somewhat slower, but I did a 
 bit of digging given the difference was so huge.
 Luckily it's fairly easy to add profiling to the Python workers. I see that 
 the vast majority of time is spent in:
 spark-1.1.0-bin-cdh4/python/pyspark/sql.py:757(_restore_object)
 It seems that this line attempts to accelerate pickling of Rows with the use 
 of a cache. Some debugging reveals that this cache becomes quite big (100s of 
 entries). Disabling the cache by adding:
 return _create_cls(dataType)(obj)
 as the first line of _restore_object made my query run 5x faster, implying 
 that the caching is not providing the desired acceleration...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4315) PySpark pickling of pyspark.sql.Row objects is extremely inefficient

2015-07-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-4315.
---
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.2

 PySpark pickling of pyspark.sql.Row objects is extremely inefficient
 

 Key: SPARK-4315
 URL: https://issues.apache.org/jira/browse/SPARK-4315
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: Ubuntu, Python 2.7, Spark 1.1.0
Reporter: Adam Davison
Assignee: Davies Liu
 Fix For: 1.3.2, 1.4.0


 Working with an RDD of pyspark.sql.Row objects, created by reading a file 
 with SQLContext in a local PySpark context.
 Operations on the RDD, such as: data.groupBy(lambda x: x.field_name) are 
 extremely slow (more than 10x slower than an equivalent Scala/Spark 
 implementation). Obviously I expected it to be somewhat slower, but I did a 
 bit of digging given the difference was so huge.
 Luckily it's fairly easy to add profiling to the Python workers. I see that 
 the vast majority of time is spent in:
 spark-1.1.0-bin-cdh4/python/pyspark/sql.py:757(_restore_object)
 It seems that this line attempts to accelerate pickling of Rows with the use 
 of a cache. Some debugging reveals that this cache becomes quite big (100s of 
 entries). Disabling the cache by adding:
 return _create_cls(dataType)(obj)
 as the first line of _restore_object made my query run 5x faster, implying 
 that the caching is not providing the desired acceleration...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5092) Selecting from a nested structure with SparkSQL should return a nested structure

2015-07-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619632#comment-14619632
 ] 

Davies Liu commented on SPARK-5092:
---

cc [~marmbrus]

 Selecting from a nested structure with SparkSQL should return a nested 
 structure
 

 Key: SPARK-5092
 URL: https://issues.apache.org/jira/browse/SPARK-5092
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Brad Willard
Priority: Minor
  Labels: pyspark, spark, sql

 When running a SparkSQL query like this (at least on a JSON dataset):
 select
rid,
meta_data.name
 from
a_table
 the rows returned lose the nested structure. I receive a row like
 Row(rid='123', name='delete')
 instead of
 Row(rid='123', meta_data=Row(name='data')).
 I personally think this is confusing, especially when programmatically 
 building and executing queries and then parsing the result to find your data 
 in a new structure. I can understand how the nesting is less desirable in some 
 situations, but you could get around that by supporting 'as'. If you wanted to 
 skip the nested structure, simply write:
 select
rid,
meta_data.name as name
 from
a_table
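
A hedged PySpark illustration of the behaviour described above (the data and identifiers are made up for the example):

{code}
df = sqlContext.jsonRDD(sc.parallelize(['{"rid": "123", "meta_data": {"name": "data"}}']))
df.registerTempTable("a_table")

sqlContext.sql("select rid, meta_data.name from a_table").collect()
# observed:  [Row(rid=u'123', name=u'data')]                  -- the nesting is flattened
# requested: [Row(rid=u'123', meta_data=Row(name=u'data'))]
{code}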



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8931) Fallback to interpret mode if failed to compile in codegen

2015-07-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-8931:
--
Description: And we should not fallback during testing.

 Fallback to interpret mode if failed to compile in codegen
 --

 Key: SPARK-8931
 URL: https://issues.apache.org/jira/browse/SPARK-8931
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical

 And we should not fallback during testing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7507) pyspark.sql.types.StructType and Row should implement __iter__()

2015-07-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-7507.
-
Resolution: Won't Fix

 pyspark.sql.types.StructType and Row should implement __iter__()
 

 Key: SPARK-7507
 URL: https://issues.apache.org/jira/browse/SPARK-7507
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Reporter: Nicholas Chammas
Priority: Minor

 {{StructType}} looks an awful lot like a Python dictionary.
 However, it doesn't implement {{\_\_iter\_\_()}}, so doing a quick conversion 
 like this doesn't work:
 {code}
 >>> df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
 >>> df.schema
 StructType(List(StructField(name,StringType,true)))
 >>> dict(df.schema)
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 TypeError: 'StructType' object is not iterable
 {code}
 This would be super helpful for doing any custom schema manipulations without 
 having to go through the whole {{.json() -> json.loads() -> manipulate() -> 
 json.dumps() -> .fromJson()}} charade.
 Same goes for {{Row}}, which offers an 
 [{{asDict()}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.Row.asDict]
  method but doesn't support the more Pythonic {{dict(Row)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7507) pyspark.sql.types.StructType and Row should implement __iter__()

2015-07-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619609#comment-14619609
 ] 

Davies Liu commented on SPARK-7507:
---

`Row` is similar to a namedtuple: you can iterate over it and get each column 
of it, but dict() requires key-value pairs.

I'd like to close this as `Won't Fix`.
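
A short illustration of the point above (hedged; the field ordering of keyword-constructed Rows may vary by version):

{code}
from pyspark.sql import Row

r = Row(name="El Magnifico", age=1)
list(r)       # iterating a Row yields its field values, like a namedtuple
r.asDict()    # {'age': 1, 'name': 'El Magnifico'}
# dict(r) would need (key, value) pairs, which plain iteration over a Row does not give.
{code}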

 pyspark.sql.types.StructType and Row should implement __iter__()
 

 Key: SPARK-7507
 URL: https://issues.apache.org/jira/browse/SPARK-7507
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Reporter: Nicholas Chammas
Priority: Minor

 {{StructType}} looks an awful lot like a Python dictionary.
 However, it doesn't implement {{\_\_iter\_\_()}}, so doing a quick conversion 
 like this doesn't work:
 {code}
 >>> df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
 >>> df.schema
 StructType(List(StructField(name,StringType,true)))
 >>> dict(df.schema)
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 TypeError: 'StructType' object is not iterable
 {code}
 This would be super helpful for doing any custom schema manipulations without 
 having to go through the whole {{.json() -> json.loads() -> manipulate() -> 
 json.dumps() -> .fromJson()}} charade.
 Same goes for {{Row}}, which offers an 
 [{{asDict()}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.Row.asDict]
  method but doesn't support the more Pythonic {{dict(Row)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8450) PySpark write.parquet raises Unsupported datatype DecimalType()

2015-07-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8450.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7131
[https://github.com/apache/spark/pull/7131]

 PySpark write.parquet raises Unsupported datatype DecimalType()
 ---

 Key: SPARK-8450
 URL: https://issues.apache.org/jira/browse/SPARK-8450
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
 Environment: Spark 1.4.0 on Debian
Reporter: Peter Hoffmann
 Fix For: 1.5.0


 I'm getting an exception when I try to save a DataFrame with a DecimalType as 
 a Parquet file.
 Minimal Example:
 {code}
 from decimal import Decimal
 from pyspark.sql import SQLContext
 from pyspark.sql.types import *
 sqlContext = SQLContext(sc)
 schema = StructType([
 StructField('id', LongType()),
 StructField('value', DecimalType())])
 rdd = sc.parallelize([[1, Decimal(0.5)],[2, Decimal(2.9)]])
 df = sqlContext.createDataFrame(rdd, schema)
 df.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite')
 {code}
 Stack Trace
 {code}
 ---
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-19-a77dac8de5f3> in <module>()
 ----> 1 sr.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 
 'overwrite')
 /home/spark/spark-1.4.0-bin-hadoop2.6/python/pyspark/sql/readwriter.pyc in 
 parquet(self, path, mode)
 367 :param mode: one of `append`, `overwrite`, `error`, `ignore` 
 (default: error)
 368 
 --> 369 return self._jwrite.mode(mode).parquet(path)
 370 
 371 @since(1.4)
 /home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
  in __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 --> 538 self.target_id, self.name)
 539 
 540 for temp_arg in temp_args:
 /home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o361.parquet.
 : org.apache.spark.SparkException: Job aborted.
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:138)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.run(commands.scala:114)
   at 
 org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
   at 
 org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
   at 
 org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
   at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
   at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
   at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:939)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:939)
   at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:332)
   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
   at 
 org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:281)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
   at py4j.Gateway.invoke(Gateway.java:259)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.GatewayConnection.run(GatewayConnection.java:207)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 

[jira] [Commented] (SPARK-8408) Python OR operator is not considered while creating a column of boolean type

2015-07-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619622#comment-14619622
 ] 

Davies Liu commented on SPARK-8408:
---

In Python, we cannot override `or`, `and` and `not`, so we should use `|`, `&` and `~` 
for them. We now throw an exception if you try to use `and`/`or` with columns; see 
https://github.com/apache/spark/pull/6961
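
A hedged example of the column-operator form that does work (using the `person_df` from the issue description below):

{code}
# Parenthesize each comparison: | and & bind more tightly than ==.
person_df.filter((person_df.age == 1) | (person_df.age == 2)).collect()
# [Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]
{code}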

 Python OR operator is not considered while creating a column of boolean type
 

 Key: SPARK-8408
 URL: https://issues.apache.org/jira/browse/SPARK-8408
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
 Environment: OSX Apache Spark 1.4.0
Reporter: Felix Maximilian Möller
Priority: Minor
 Fix For: 1.4.1

 Attachments: bug_report.ipynb.json


 h3. Given
 {code}
 d = [{'name': 'Alice', 'age': 1},{'name': 'Bob', 'age': 2}]
 person_df = sqlContext.createDataFrame(d)
 {code}
 h3. When
 {code}
 person_df.filter(person_df.age==1 or person_df.age==2).collect()
 {code}
 h3. Expected
 [Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]
 h3. Actual
 [Row(age=1, name=u'Alice')]
 h3. While
 {code}
 person_df.filter("age = 1 or age = 2").collect()
 {code}
 yields the correct result:
 [Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8408) Python OR operator is not considered while creating a column of boolean type

2015-07-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8408.
---
   Resolution: Fixed
 Assignee: Davies Liu
Fix Version/s: 1.4.1

 Python OR operator is not considered while creating a column of boolean type
 

 Key: SPARK-8408
 URL: https://issues.apache.org/jira/browse/SPARK-8408
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
 Environment: OSX Apache Spark 1.4.0
Reporter: Felix Maximilian Möller
Assignee: Davies Liu
Priority: Minor
 Fix For: 1.4.1

 Attachments: bug_report.ipynb.json


 h3. Given
 {code}
 d = [{'name': 'Alice', 'age': 1},{'name': 'Bob', 'age': 2}]
 person_df = sqlContext.createDataFrame(d)
 {code}
 h3. When
 {code}
 person_df.filter(person_df.age==1 or person_df.age==2).collect()
 {code}
 h3. Expected
 [Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]
 h3. Actual
 [Row(age=1, name=u'Alice')]
 h3. While
 {code}
 person_df.filter("age = 1 or age = 2").collect()
 {code}
 yields the correct result:
 [Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8931) Fallback to interpret mode if failed to compile in codegen

2015-07-08 Thread Davies Liu (JIRA)
Davies Liu created SPARK-8931:
-

 Summary: Fallback to interpret mode if failed to compile in codegen
 Key: SPARK-8931
 URL: https://issues.apache.org/jira/browse/SPARK-8931
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7190) UTF8String backed by binary data

2015-07-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-7190.
---
Resolution: Fixed

 UTF8String backed by binary data
 

 Key: SPARK-7190
 URL: https://issues.apache.org/jira/browse/SPARK-7190
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Davies Liu

 Just a pointer to some memory address, so we don't need to copy the data into 
 a byte array.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7815) Enable UTF8String to work against memory address directly

2015-07-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-7815.
---
Resolution: Fixed

 Enable UTF8String to work against memory address directly
 -

 Key: SPARK-7815
 URL: https://issues.apache.org/jira/browse/SPARK-7815
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu

 So we can avoid an extra copy of data into byte array.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6573) Convert inbound NaN values as null

2015-07-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-6573:
-

Assignee: Davies Liu

 Convert inbound NaN values as null
 --

 Key: SPARK-6573
 URL: https://issues.apache.org/jira/browse/SPARK-6573
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.0
Reporter: Fabian Boehnlein
Assignee: Davies Liu

 In pandas it is common to use numpy.nan as the null value, for missing data 
 or whatever.
 http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
 http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
 http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna
 createDataFrame, however, only works with None as the null value, parsing it as 
 None in the RDD.
 I suggest adding support for np.nan values in pandas DataFrames.
 Current stack trace when calling createDataFrame on a DataFrame with object-type 
 columns containing np.nan values (which are floats):
 {code}
 TypeError Traceback (most recent call last)
 <ipython-input-38-34f0263f0bf4> in <module>()
 ----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)
 /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in 
 createDataFrame(self, data, schema, samplingRatio)
 339 schema = self._inferSchema(data.map(lambda r: 
 row_cls(*r)), samplingRatio)
 340 
 --> 341 return self.applySchema(data, schema)
 342 
 343 def registerDataFrameAsTable(self, rdd, tableName):
 /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in 
 applySchema(self, rdd, schema)
 246 
 247 for row in rows:
 --> 248 _verify_type(row, schema)
 249 
 250 # convert python objects to sql data
 /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in 
 _verify_type(obj, dataType)
1064  "length of fields (%d)" % (len(obj), 
 len(dataType.fields)))
1065 for v, f in zip(obj, dataType.fields):
 -> 1066 _verify_type(v, f.dataType)
1067 
1068 _cached_cls = weakref.WeakValueDictionary()
 /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in 
 _verify_type(obj, dataType)
1048 if type(obj) not in _acceptable_types[_type]:
1049 raise TypeError("%s can not accept object in type %s"
 -> 1050 % (dataType, type(obj)))
1051 
1052 if isinstance(dataType, ArrayType):
 TypeError: StringType can not accept object in type <type 'float'>{code}
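
A hedged workaround sketch (not the eventual fix): replace NaN with None on the pandas side before handing the frame to createDataFrame.

{code}
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"name": ["Alice", np.nan], "age": [1, 2]})
df_clean = df_.where(pd.notnull(df_), None)   # NaN -> None, which PySpark accepts as null
sqldf = sqlCtx.createDataFrame(df_clean)      # sqlCtx as in the stack trace above
{code}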



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8804) order of UTF8String is wrong if there is any non-ascii character in it

2015-07-07 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8804.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

  order of UTF8String is wrong if there is any non-ascii character in it
 ---

 Key: SPARK-8804
 URL: https://issues.apache.org/jira/browse/SPARK-8804
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker
 Fix For: 1.4.1, 1.5.0


 We compare UTF8Strings byte by byte, but bytes in the JVM are signed; they should 
 be compared as unsigned.
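
A small illustration (in Python for brevity; the actual comparison lives in UTF8String on the JVM) of why signed byte comparison gives the wrong order:

{code}
def as_signed(b):
    # how a JVM byte (signed, two's complement) sees the value
    return b - 256 if b >= 0x80 else b

a, b = 0x41, 0xC3   # 'A' vs the lead byte of a two-byte UTF-8 character such as 'é'
print(a < b)                          # True  -- correct unsigned ordering
print(as_signed(a) < as_signed(b))    # False -- wrong ordering with signed bytes
{code}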



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8844) head/collect is broken in SparkR

2015-07-06 Thread Davies Liu (JIRA)
Davies Liu created SPARK-8844:
-

 Summary: head/collect is broken in SparkR 
 Key: SPARK-8844
 URL: https://issues.apache.org/jira/browse/SPARK-8844
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.5.0
Reporter: Davies Liu
Priority: Blocker


{code}
> t = tables(sqlContext)
> showDF(T)
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘showDF’ for signature 
‘logical’
> showDF(t)
+-+---+
|tableName|isTemporary|
+-+---+
+-+---+
 15/07/06 09:59:10 WARN Executor: Told to re-register on heartbeat



> head(t)
Error in readTypedObject(con, type) :
  Unsupported type for deserialization

> collect(t)
Error in readTypedObject(con, type) :
  Unsupported type for deserialization
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-06 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615293#comment-14615293
 ] 

Davies Liu commented on SPARK-8646:
---

To be clear, PySpark does NOT depend on pandas. In dataframe.py, it works with 
pandas DataFrames only when you have pandas installed.

[~juliet] example/pi.py should run fine on YARN (it does not need pandas at 
all). Is it possible that `outofstock/data_transform.py` depends on 
`pandas.algos` (pandas.algos is used by a closure from the driver), and you 
uploaded the wrong log file?


 PySpark does not run on YARN
 

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
 spark1.4-SPARK_HOME-set.log


 Running PySpark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.| 
 https://issues.apache.org/jira/browse/SPARK-6869 ]
 This is not a binary-compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8745) Remove GenerateProjection

2015-07-06 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-8745:
--
Summary: Remove GenerateProjection  (was: Remove GenerateMutableProjection)

 Remove GenerateProjection
 -

 Key: SPARK-8745
 URL: https://issues.apache.org/jira/browse/SPARK-8745
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu

 Based on discussion offline with [~marmbrus], we should remove 
 GenerateMutableProjection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8745) Remove GenerateProjection

2015-07-06 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-8745:
--
Description: Based on discussion offline with [~marmbrus], we should remove 
GenerateProjection.  (was: Based on discussion offline with [~marmbrus], we 
should remove GenerateMutableProjection.)

 Remove GenerateProjection
 -

 Key: SPARK-8745
 URL: https://issues.apache.org/jira/browse/SPARK-8745
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu

 Based on discussion offline with [~marmbrus], we should remove 
 GenerateProjection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling

2015-07-04 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614080#comment-14614080
 ] 

Davies Liu commented on SPARK-8636:
---

[~smolav] I'm just curious: how can we sort or group by a row with NULL in 
it, if we cannot compare NULL with NULL?

 CaseKeyWhen has incorrect NULL handling
 ---

 Key: SPARK-8636
 URL: https://issues.apache.org/jira/browse/SPARK-8636
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Santiago M. Mola
  Labels: starter

 CaseKeyWhen implementation in Spark uses the following equals implementation:
 {code}
   private def equalNullSafe(l: Any, r: Any) = {
   if (l == null && r == null) {
   true
 } else if (l == null || r == null) {
   false
 } else {
   l == r
 }
   }
 {code}
 Which is not correct, since in SQL, NULL is never equal to NULL (actually, it 
 is not unequal either). In this case, a NULL value in a CASE WHEN expression 
 should never match.
 For example, you can execute this in MySQL:
 {code}
 SELECT CASE NULL WHEN NULL THEN "NULL MATCHES" ELSE "NULL DOES NOT MATCH" END 
 FROM DUAL;
 {code}
 And the result will be "NULL DOES NOT MATCH".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7401) Dot product and squared_distances should be vectorized in Vectors

2015-07-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-7401.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 5946
[https://github.com/apache/spark/pull/5946]

 Dot product and squared_distances should be vectorized in Vectors
 -

 Key: SPARK-7401
 URL: https://issues.apache.org/jira/browse/SPARK-7401
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Manoj Kumar
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8226) math function: shiftrightunsigned

2015-07-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8226.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7035
[https://github.com/apache/spark/pull/7035]

 math function: shiftrightunsigned
 -

 Key: SPARK-8226
 URL: https://issues.apache.org/jira/browse/SPARK-8226
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: zhichao-li
 Fix For: 1.5.0


 shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a)   
 Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, 
 smallint and int a. Returns bigint for bigint a.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8784) Add python API for hex/unhex

2015-07-02 Thread Davies Liu (JIRA)
Davies Liu created SPARK-8784:
-

 Summary: Add python API for hex/unhex
 Key: SPARK-8784
 URL: https://issues.apache.org/jira/browse/SPARK-8784
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-07-02 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611602#comment-14611602
 ] 

Davies Liu commented on SPARK-8632:
---

[~justin.uang] Sounds interesting, could you sending out the PR?

 Poor Python UDF performance because of RDD caching
 --

 Key: SPARK-8632
 URL: https://issues.apache.org/jira/browse/SPARK-8632
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Justin Uang

 {quote}
 We have been running into performance problems using Python UDFs with 
 DataFrames at large scale.
 From the implementation of BatchPythonEvaluation, it looks like the goal was 
 to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
 two passes over the data: one to give to the PythonRDD, then one to join the 
 Python lambda results with the original row (which may have Java objects that 
 should be passed through).
 In addition, it caches all the columns, even the ones that don't need to be 
 processed by the Python UDF. In the cases I was working with, I had a 500 
 column table, and I wanted to use a Python UDF for one column, and it ended 
 up caching all 500 columns. 
 {quote}
 http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8786) Create a wrapper for BinaryType

2015-07-02 Thread Davies Liu (JIRA)
Davies Liu created SPARK-8786:
-

 Summary: Create a wrapper for BinaryType
 Key: SPARK-8786
 URL: https://issues.apache.org/jira/browse/SPARK-8786
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu


The hashCode and equals() of Array[Byte] do not check the bytes; we should create 
a wrapper to do that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8786) Create a wrapper for BinaryType

2015-07-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-8786:
--
Description: The hashCode and equals() of Array[Byte] do not check the bytes; 
we should create a wrapper (internally) to do that.  (was: The hashCode and 
equals() of Array[Byte] do not check the bytes; we should create a wrapper to do 
that.)

 Create a wrapper for BinaryType
 ---

 Key: SPARK-8786
 URL: https://issues.apache.org/jira/browse/SPARK-8786
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu

 The hashCode and equals() of Array[Byte] do not check the bytes; we should 
 create a wrapper (internally) to do that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8223) math function: shiftleft

2015-07-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8223.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7178
[https://github.com/apache/spark/pull/7178]

 math function: shiftleft
 

 Key: SPARK-8223
 URL: https://issues.apache.org/jira/browse/SPARK-8223
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: zhichao-li
 Fix For: 1.5.0


 shiftleft(INT a)
 shiftleft(BIGINT a)
 Bitwise left shift (as of Hive 1.2.0). Returns int for tinyint, smallint and 
 int a. Returns bigint for bigint a.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8224) math function: shiftright

2015-07-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8224.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7178
[https://github.com/apache/spark/pull/7178]

 math function: shiftright
 -

 Key: SPARK-8224
 URL: https://issues.apache.org/jira/browse/SPARK-8224
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: zhichao-li
 Fix For: 1.5.0


 shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a)   
 Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, 
 smallint and int a. Returns bigint for bigint a.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8747) fix EqualNullSafe for binary type

2015-07-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8747.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7143
[https://github.com/apache/spark/pull/7143]

 fix EqualNullSafe for binary type
 -

 Key: SPARK-8747
 URL: https://issues.apache.org/jira/browse/SPARK-8747
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Priority: Minor
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7190) UTF8String backed by binary data

2015-07-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-7190:
-

Assignee: Davies Liu

 UTF8String backed by binary data
 

 Key: SPARK-7190
 URL: https://issues.apache.org/jira/browse/SPARK-7190
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Davies Liu

 Just a pointer to some memory address, so we don't need to copy the data into 
 a byte array.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8745) Remove GenerateMutableProjection

2015-07-02 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612471#comment-14612471
 ] 

Davies Liu commented on SPARK-8745:
---

I can take this one, if you have not started.

 Remove GenerateMutableProjection
 

 Key: SPARK-8745
 URL: https://issues.apache.org/jira/browse/SPARK-8745
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu

 Based on discussion offline with [~marmbrus], we should remove 
 GenerateMutableProjection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8745) Remove GenerateMutableProjection

2015-07-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-8745:
-

Assignee: Davies Liu

 Remove GenerateMutableProjection
 

 Key: SPARK-8745
 URL: https://issues.apache.org/jira/browse/SPARK-8745
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu

 Based on discussion offline with [~marmbrus], we should remove 
 GenerateMutableProjection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8804) order of UTF8String is wrong if there is any non-ascii character in it

2015-07-02 Thread Davies Liu (JIRA)
Davies Liu created SPARK-8804:
-

 Summary:  order of UTF8String is wrong if there is any non-ascii 
character in it
 Key: SPARK-8804
 URL: https://issues.apache.org/jira/browse/SPARK-8804
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker


We compare UTF8Strings byte by byte, but bytes in the JVM are signed; they should be 
compared as unsigned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8766) DataFrame Python API should work with column which has non-ascii character in it

2015-07-01 Thread Davies Liu (JIRA)
Davies Liu created SPARK-8766:
-

 Summary: DataFrame Python API should work with column which has 
non-ascii character in it
 Key: SPARK-8766
 URL: https://issues.apache.org/jira/browse/SPARK-8766
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.4.0, 1.3.1
Reporter: Davies Liu
Assignee: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8763) executing run-tests.py with Python 2.6 fails with absence of subprocess.check_output function

2015-07-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8763.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7161
[https://github.com/apache/spark/pull/7161]

 executing run-tests.py with Python 2.6 fails with absence of 
 subprocess.check_output function
 -

 Key: SPARK-8763
 URL: https://issues.apache.org/jira/browse/SPARK-8763
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.0
 Environment: Mac OS X 10.10.3 Python 2.6.9 Java 1.8.0 
Reporter: Tomohiko K.
  Labels: pyspark, testing
 Fix For: 1.5.0


 Running run-tests.py with Python 2.6 causes the following error:
 {noformat}
 Running PySpark tests. Output is in 
 python//Users/tomohiko/.jenkins/jobs/pyspark_test/workspace/python/unit-tests.log
 Will test against the following Python executables: ['python2.6', 
 'python3.4', 'pypy']
 Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 
 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
 Traceback (most recent call last):
   File "./python/run-tests.py", line 196, in <module>
 main()
   File "./python/run-tests.py", line 159, in main
 python_implementation = subprocess.check_output(
 AttributeError: 'module' object has no attribute 'check_output'
 ...
 {noformat}
 The cause of this error is the use of the subprocess.check_output function, 
 which only exists since Python 2.7.
 (ref. 
 https://docs.python.org/2.7/library/subprocess.html#subprocess.check_output)
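
A hedged sketch (the usual backport recipe, not necessarily the actual change in the fixing PR) of making check_output available on Python 2.6:

{code}
import subprocess

if not hasattr(subprocess, "check_output"):          # only needed on Python 2.6
    def _check_output(*popenargs, **kwargs):
        process = subprocess.Popen(stdout=subprocess.PIPE, *popenargs, **kwargs)
        output, _ = process.communicate()
        retcode = process.poll()
        if retcode:
            cmd = kwargs.get("args", popenargs[0])
            raise subprocess.CalledProcessError(retcode, cmd)
        return output
    subprocess.check_output = _check_output
{code}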



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8227) math function: unhex

2015-07-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8227.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7113
[https://github.com/apache/spark/pull/7113]

 math function: unhex
 

 Key: SPARK-8227
 URL: https://issues.apache.org/jira/browse/SPARK-8227
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: zhichao-li
 Fix For: 1.5.0


 unhex(STRING a): BINARY
 Inverse of hex. Interprets each pair of characters as a hexadecimal number 
 and converts to the byte representation of the number.
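
A hedged usage sketch of the corresponding DataFrame functions (assuming the Python wrappers from SPARK-8784 are available; the data is illustrative):

{code}
from pyspark.sql import functions as F

df = sqlContext.createDataFrame([("Spark",)], ["word"])
df.select(F.hex(df.word).alias("h"), F.unhex(F.hex(df.word)).alias("roundtrip")).collect()
# unhex(hex(x)) yields the original bytes back as a binary column
{code}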



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8766) DataFrame Python API should work with column which has non-ascii character in it

2015-07-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8766.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7165
[https://github.com/apache/spark/pull/7165]

 DataFrame Python API should work with column which has non-ascii character in 
 it
 

 Key: SPARK-8766
 URL: https://issues.apache.org/jira/browse/SPARK-8766
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 1.3.1, 1.4.0
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8727) Add missing python api

2015-06-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8727.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7114
[https://github.com/apache/spark/pull/7114]

 Add missing python api
 --

 Key: SPARK-8727
 URL: https://issues.apache.org/jira/browse/SPARK-8727
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Tarek Auel
 Fix For: 1.5.0


 Add the python api that is missing for
 https://issues.apache.org/jira/browse/SPARK-8248
 https://issues.apache.org/jira/browse/SPARK-8234
 https://issues.apache.org/jira/browse/SPARK-8217
 https://issues.apache.org/jira/browse/SPARK-8215
 https://issues.apache.org/jira/browse/SPARK-8212



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8535) PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name

2015-06-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8535.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7124
[https://github.com/apache/spark/pull/7124]

 PySpark : Can't create DataFrame from Pandas dataframe with no explicit 
 column name
 ---

 Key: SPARK-8535
 URL: https://issues.apache.org/jira/browse/SPARK-8535
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Christophe Bourguignat
 Fix For: 1.5.0


 Trying to create a Spark DataFrame from a pandas DataFrame with no explicit 
 column names: 
 pandasDF = pd.DataFrame([[1, 2], [5, 6]])
 sparkDF = sqlContext.createDataFrame(pandasDF)
 ***
 ----> 1 sparkDF = sqlContext.createDataFrame(pandasDF)
 /usr/local/Cellar/apache-spark/1.4.0/libexec/python/pyspark/sql/context.pyc 
 in createDataFrame(self, data, schema, samplingRatio)
 344 
 345 jrdd = 
 self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
 --> 346 df = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), 
 schema.json())
 347 return DataFrame(df, self)
 348 
 /usr/local/Cellar/apache-spark/1.4.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
  in __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 --> 538 self.target_id, self.name)
 539 
 540 for temp_arg in temp_args:
 /usr/local/Cellar/apache-spark/1.4.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o87.applySchemaToPythonRDD.
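
A hedged workaround sketch for the affected versions: give the pandas frame string column names before converting it.

{code}
import pandas as pd

pandasDF = pd.DataFrame([[1, 2], [5, 6]])
pandasDF.columns = [str(c) for c in pandasDF.columns]   # default integer names -> "0", "1"
sparkDF = sqlContext.createDataFrame(pandasDF)
{code}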



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8653) Add constraint for Children expression for data type

2015-06-30 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609579#comment-14609579
 ] 

Davies Liu commented on SPARK-8653:
---

[~rxin] With the new `ExpectsInputTypes`, we still need a way to tell how to do 
the conversion; it's ugly to do the type switch in eval() or codegen().

Maybe we could improve `AutoCastInputType` to have a method `acceptedTypes`, 
which returns a list of lists of data types, specifying which types can be cast 
into the expected types. By default, it would accept all types that can be cast 
to the expected types. 

 Add constraint for Children expression for data type
 

 Key: SPARK-8653
 URL: https://issues.apache.org/jira/browse/SPARK-8653
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao

 Currently, we have traits in Expression like `ExpectsInputTypes` and also 
 `checkInputDataTypes`, but we cannot convert the children expressions 
 automatically unless we write new rules in `HiveTypeCoercion`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8723) improve code gen for divide and remainder

2015-06-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8723.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7111
[https://github.com/apache/spark/pull/7111]

 improve code gen for divide and remainder
 -

 Key: SPARK-8723
 URL: https://issues.apache.org/jira/browse/SPARK-8723
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Priority: Minor
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8680) PropagateTypes is very slow when there are lots of columns

2015-06-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8680.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7087
[https://github.com/apache/spark/pull/7087]

 PropagateTypes is very slow when there are lots of columns
 --

 Key: SPARK-8680
 URL: https://issues.apache.org/jira/browse/SPARK-8680
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
Reporter: Davies Liu
 Fix For: 1.5.0


 The time for PropagateTypes is O(N*N), where N is the number of columns, which is 
 very slow if there are many columns (>1000).
 The easiest optimization would be to move `q.inputSet` outside of 
 transformExpressions, which gives roughly a 4x improvement for N=3000.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8680) PropagateTypes is very slow when there are lots of columns

2015-06-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-8680:
--
Assignee: Liang-Chi Hsieh

 PropagateTypes is very slow when there are lots of columns
 --

 Key: SPARK-8680
 URL: https://issues.apache.org/jira/browse/SPARK-8680
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
Reporter: Davies Liu
Assignee: Liang-Chi Hsieh
 Fix For: 1.5.0


 The time for PropagateTypes is O(N*N), where N is the number of columns, which is 
 very slow if there are many columns (>1000).
 The easiest optimization would be to move `q.inputSet` outside of 
 transformExpressions, which gives roughly a 4x improvement for N=3000.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8590) add code gen for ExtractValue

2015-06-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8590.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6982
[https://github.com/apache/spark/pull/6982]

 add code gen for ExtractValue
 -

 Key: SPARK-8590
 URL: https://issues.apache.org/jira/browse/SPARK-8590
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8236) misc function: crc32

2015-06-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8236.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7108
[https://github.com/apache/spark/pull/7108]

 misc function: crc32
 

 Key: SPARK-8236
 URL: https://issues.apache.org/jira/browse/SPARK-8236
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
 Fix For: 1.5.0


 crc32(string/binary): bigint
 Computes a cyclic redundancy check value for string or binary argument and 
 returns bigint value (as of Hive 1.3.0). Example: crc32('ABC') = 2743272264.
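 A minimal usage sketch from PySpark, assuming the wrapper lands in 
 pyspark.sql.functions for 1.5 alongside the SQL function (shell with 
 sqlContext assumed):
 {code}
 from pyspark.sql import Row
 from pyspark.sql.functions import crc32
 
 df = sqlContext.createDataFrame([Row(s='ABC')])
 # expected, per the example above: [Row(crc=2743272264)]
 df.select(crc32(df.s).alias('crc')).collect()
 {code}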



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8450) PySpark write.parquet raises Unsupported datatype DecimalType()

2015-06-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-8450:
--
Description: 
I'm getting an exception when I try to save a DataFrame with a DecimalType 
column as a parquet file.

Minimal Example:
{code}
from decimal import Decimal
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sqlContext = SQLContext(sc)
schema = StructType([
StructField('id', LongType()),
StructField('value', DecimalType())])
rdd = sc.parallelize([[1, Decimal(0.5)],[2, Decimal(2.9)]])
df = sqlContext.createDataFrame(rdd, schema)
df.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite')

{code}

Stack Trace
{code}
---
Py4JJavaError Traceback (most recent call last)
<ipython-input-19-a77dac8de5f3> in <module>()
----> 1 sr.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite')

/home/spark/spark-1.4.0-bin-hadoop2.6/python/pyspark/sql/readwriter.pyc in 
parquet(self, path, mode)
367 :param mode: one of `append`, `overwrite`, `error`, `ignore` 
(default: error)
368 
--> 369 return self._jwrite.mode(mode).parquet(path)
370 
371 @since(1.4)

/home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539 
540 for temp_arg in temp_args:

/home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(

Py4JJavaError: An error occurred while calling o361.parquet.
: org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:138)
at 
org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.run(commands.scala:114)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:939)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:939)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:332)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:281)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 158 in stage 35.0 failed 4 times, most recent failure: Lost task 158.3 in 
stage 35.0 (TID 2736, 10.2.160.14): java.lang.RuntimeException: Unsupported 
datatype DecimalType()
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:374)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:318)
at scala.Option.getOrElse(Option.scala:120)
at 
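
A hedged workaround sketch (precision/scale values are illustrative): the 1.4 
Parquet writer only handles decimals that carry an explicit, bounded precision, 
so declaring precision and scale on the field avoids the unlimited DecimalType() 
that triggers the error above.
{code}
from decimal import Decimal
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, LongType, DecimalType

sqlContext = SQLContext(sc)
schema = StructType([
    StructField('id', LongType()),
    # DecimalType(10, 2) carries an explicit precision/scale, unlike DecimalType()
    StructField('value', DecimalType(10, 2))])
rdd = sc.parallelize([[1, Decimal('0.5')], [2, Decimal('2.9')]])
df = sqlContext.createDataFrame(rdd, schema)
df.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite')
{code}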

[jira] [Resolved] (SPARK-8713) Support codegen for not thread-safe expressions

2015-06-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8713.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7101
[https://github.com/apache/spark/pull/7101]

 Support codegen for not thread-safe expressions
 ---

 Key: SPARK-8713
 URL: https://issues.apache.org/jira/browse/SPARK-8713
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.5.0


 Currently, we disable codegen if any expression is not thread-safe. We should 
 support such expressions, but disable caching of the compiled expressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8738) Generate better error message in Python for AnalysisException

2015-06-30 Thread Davies Liu (JIRA)
Davies Liu created SPARK-8738:
-

 Summary: Generate better error message in Python for 
AnalysisException 
 Key: SPARK-8738
 URL: https://issues.apache.org/jira/browse/SPARK-8738
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu
Assignee: Davies Liu


The long Java stack trace is hard to read.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8738) Generate better error message in Python for AnalysisException

2015-06-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8738.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7135
[https://github.com/apache/spark/pull/7135]

 Generate better error message in Python for AnalysisException 
 --

 Key: SPARK-8738
 URL: https://issues.apache.org/jira/browse/SPARK-8738
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.5.0


 The long Java stack trace is hard to read.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6360) For Spark 1.1 and 1.2, after any RDD transformations, calling saveAsParquetFile over a SchemaRDD with decimal or UDT column throws

2015-06-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-6360:
--
Description: 
Spark shell session for reproduction (use {{:paste}}):
{noformat}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.types.decimal._
import org.apache.spark.sql.catalyst.types._
import org.apache.hadoop.fs._

val sqlContext = new SQLContext(sc)
val fs = FileSystem.get(sc.hadoopConfiguration)

fs.delete(new Path("a.parquet"))
fs.delete(new Path("b.parquet"))

import sc._
import sqlContext._

val r1 = parallelize(1 to 10).map(i => Tuple1(Decimal(i, 10, 0))).select('_1 
cast DecimalType(10, 0))

// OK
r1.saveAsParquetFile("a.parquet")

val r2 = parallelize(1 to 10).map(i => Tuple1(Decimal(i, 10, 0))).select('_1 
cast DecimalType(10, 0))

val r3 = r2.coalesce(1)

// Error
r3.saveAsParquetFile("b.parquet")
{noformat}
Exception thrown:
{noformat}
java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to 
org.apache.spark.sql.catalyst.types.decimal.Decimal
at 
org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359)
at 
org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:328)
at 
org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:314)
at 
parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:308)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/03/17 00:04:13 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 
localhost): java.lang.ClassCastException: scala.math.BigDecimal cannot be cast 
to org.apache.spark.sql.catalyst.types.decimal.Decimal
at 
org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359)
at 
org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:328)
at 
org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:314)
at 
parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:308)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}
The query plan of {{r1}} is:
{noformat}
== Parsed Logical Plan ==
'Project [CAST('_1, DecimalType(10,0)) AS c0#60]
 LogicalRDD [_1#59], MapPartitionsRDD[71] at mapPartitions at 
ExistingRDD.scala:36

== Analyzed Logical Plan ==
Project [CAST(_1#59, DecimalType(10,0)) AS c0#60]
 LogicalRDD [_1#59], MapPartitionsRDD[71] at mapPartitions at 
ExistingRDD.scala:36

== Optimized Logical Plan ==
Project [CAST(_1#59, DecimalType(10,0)) AS c0#60]
 LogicalRDD [_1#59], MapPartitionsRDD[71] at mapPartitions at 
ExistingRDD.scala:36

== Physical Plan ==
Project [CAST(_1#59, DecimalType(10,0)) AS c0#60]
 PhysicalRDD [_1#59], MapPartitionsRDD[71] at mapPartitions at 
ExistingRDD.scala:36

Code Generation: false
== RDD ==
{noformat}
while {{r3}}'s query plan is:
{noformat}
== 

[jira] [Resolved] (SPARK-8741) Remove e and pi from DataFrame functions

2015-06-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8741.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7137
[https://github.com/apache/spark/pull/7137]

 Remove e and pi from DataFrame functions
 

 Key: SPARK-8741
 URL: https://issues.apache.org/jira/browse/SPARK-8741
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0


 It is not really useful to have DataFrame functions that return numeric 
 constants that are already available in all programming languages. We should 
 keep the expressions for SQL, but nothing else.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7902) SQL UDF doesn't support UDT in PySpark

2015-06-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-7902:
-

Assignee: Davies Liu

 SQL UDF doesn't support UDT in PySpark
 --

 Key: SPARK-7902
 URL: https://issues.apache.org/jira/browse/SPARK-7902
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Davies Liu
Priority: Critical

 We don't convert Python SQL internal types to Python types in SQL UDF 
 execution. This causes problems if the input arguments contain UDTs or the 
 return type is a UDT. Right now, the raw SQL types are passed into the Python 
 UDF and the return value is not converted to Python SQL types.
 This is the code (from [~rams]) to produce this bug. (Actually, it triggers 
 another bug first right now.)
 {code}
 from pyspark.mllib.linalg import SparseVector
 from pyspark.sql.functions import udf
 from pyspark.sql.types import IntegerType
 df = sqlContext.createDataFrame([(SparseVector(2, {0: 0.0}),)], ["features"])
 sz = udf(lambda s: s.size, IntegerType())
 df.select(sz(df.features).alias("sz")).collect()
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8235) misc function: sha1 / sha

2015-06-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8235.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6963
[https://github.com/apache/spark/pull/6963]

 misc function: sha1 / sha
 -

 Key: SPARK-8235
 URL: https://issues.apache.org/jira/browse/SPARK-8235
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
 Fix For: 1.5.0


 sha1(string/binary): string
 sha(string/binary): string
 Calculates the SHA-1 digest for string or binary and returns the value as a 
 hex string (as of Hive 1.3.0). Example: sha1('ABC') = 
 '3c01bdbb26f358bab27f267924aa2c9a03fcfdb8'.
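 A minimal usage sketch from PySpark, assuming the wrapper lands in 
 pyspark.sql.functions for 1.5 (shell with sqlContext assumed):
 {code}
 from pyspark.sql import Row
 from pyspark.sql.functions import sha1
 
 df = sqlContext.createDataFrame([Row(s='ABC')])
 # expected, per the example above:
 # [Row(digest=u'3c01bdbb26f358bab27f267924aa2c9a03fcfdb8')]
 df.select(sha1(df.s).alias('digest')).collect()
 {code}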



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8713) Support codegen for not thread-safe expressions

2015-06-29 Thread Davies Liu (JIRA)
Davies Liu created SPARK-8713:
-

 Summary: Support codegen for not thread-safe expressions
 Key: SPARK-8713
 URL: https://issues.apache.org/jira/browse/SPARK-8713
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


Currently, we disable codegen if any expression is not thread-safe. We should 
support such expressions, but disable caching of the compiled expressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7810) rdd.py _load_from_socket cannot load data from jvm socket if ipv6 is used

2015-06-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-7810.
---
   Resolution: Fixed
Fix Version/s: 1.6.0
   1.3.2
   1.4.1

Issue resolved by pull request 6338
[https://github.com/apache/spark/pull/6338]

 rdd.py _load_from_socket cannot load data from jvm socket if ipv6 is used
 ---

 Key: SPARK-7810
 URL: https://issues.apache.org/jira/browse/SPARK-7810
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.3.1
Reporter: Ai He
 Fix For: 1.4.1, 1.3.2, 1.6.0


 Method _load_from_socket in rdd.py cannot load data from the JVM socket if 
 IPv6 is used. The current method only works with IPv4; the new implementation 
 should work with both protocols.
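 A hedged sketch (not the actual rdd.py patch) of a protocol-agnostic connect: 
 iterating over socket.getaddrinfo picks whichever address family the host 
 resolves to, so the same code serves IPv4 and IPv6.
 {code}
 import socket
 
 def load_from_socket(port):
     sock = None
     # AF_UNSPEC lets getaddrinfo return IPv4 and/or IPv6 candidates
     for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC,
                                   socket.SOCK_STREAM):
         af, socktype, proto, _, sa = res
         try:
             sock = socket.socket(af, socktype, proto)
             sock.settimeout(3)
             sock.connect(sa)
         except socket.error:
             sock = None
             continue
         break
     if sock is None:
         raise Exception("could not open socket")
     return sock.makefile("rb")
 {code}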



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8579) Support arbitrary object in UnsafeRow

2015-06-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8579.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6959
[https://github.com/apache/spark/pull/6959]

 Support arbitrary object in UnsafeRow
 -

 Key: SPARK-8579
 URL: https://issues.apache.org/jira/browse/SPARK-8579
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.5.0


 It's common to run count(distinct xxx) in SQL; the intermediate data type is 
 a UDT of OpenHashSet, so it would be good to support it in UnsafeRow to reduce 
 the memory usage during aggregation.
 The same applies to DecimalType, which can appear inside the grouping key.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7810) rdd.py _load_from_socket cannot load data from jvm socket if ipv6 is used

2015-06-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-7810:
--
Fix Version/s: (was: 1.4.1)
   (was: 1.6.0)
   1.4.2
   1.5.0

 rdd.py _load_from_socket cannot load data from jvm socket if ipv6 is used
 ---

 Key: SPARK-7810
 URL: https://issues.apache.org/jira/browse/SPARK-7810
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.3.1
Reporter: Ai He
 Fix For: 1.3.2, 1.5.0, 1.4.2


 Method _load_from_socket in rdd.py cannot load data from the JVM socket if 
 IPv6 is used. The current method only works with IPv4; the new implementation 
 should work with both protocols.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5161) Parallelize Python test execution

2015-06-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-5161.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7031
[https://github.com/apache/spark/pull/7031]

 Parallelize Python test execution
 -

 Key: SPARK-5161
 URL: https://issues.apache.org/jira/browse/SPARK-5161
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 1.2.0
Reporter: Nicholas Chammas
Assignee: Josh Rosen
 Fix For: 1.5.0


 [Original discussion 
 here.|https://github.com/apache/spark/pull/3564#issuecomment-67785676]
 As of 1.2.0, Python tests take around 10-12 minutes to run. Once [SPARK-3431] 
 is complete, this will become a significant fraction of the total test time.
 There are 2 separate approaches to explore for parallelizing the execution of 
 Python unit tests:
 * Use GNU parallel to run each Python test file in parallel.
 * Use 
 [{{nose}}|http://nose.readthedocs.org/en/latest/doc_tests/test_multiprocess/multiprocess.html]
  to parallelize all Python tests in a more extensible/configurable way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8214) math function: hex

2015-06-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8214.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6976
[https://github.com/apache/spark/pull/6976]

 math function: hex
 --

 Key: SPARK-8214
 URL: https://issues.apache.org/jira/browse/SPARK-8214
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: zhichao-li
 Fix For: 1.5.0


 hex(BIGINT a): string
 hex(STRING a): string
 hex(BINARY a): string
 If the argument is an INT or binary, hex returns the number as a STRING in 
 hexadecimal format. Otherwise if the number is a STRING, it converts each 
 character into its hexadecimal representation and returns the resulting 
 STRING. (See 
 http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_hex, 
 BINARY version as of Hive 0.12.0.)
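 A minimal usage sketch from PySpark, assuming the wrapper lands in 
 pyspark.sql.functions for 1.5 (shell with sqlContext assumed):
 {code}
 import pyspark.sql.functions as F
 
 df = sqlContext.createDataFrame([(17, 'Spark')], ['n', 's'])
 # expected: the BIGINT column yields '11', the STRING column '537061726B'
 df.select(F.hex(df.n), F.hex(df.s)).collect()
 {code}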



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8610) Separate Row and InternalRow (part 2)

2015-06-28 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8610.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7003
[https://github.com/apache/spark/pull/7003]

 Separate Row and InternalRow (part 2)
 -

 Key: SPARK-8610
 URL: https://issues.apache.org/jira/browse/SPARK-8610
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.5.0


 Currently, we use GenericRow for both Row and InternalRow, which is confusing 
 because it can contain Scala types as well as Catalyst types.
 We should have separate implementations for them, to avoid some potential 
 bugs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling

2015-06-28 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604724#comment-14604724
 ] 

Davies Liu commented on SPARK-8636:
---

[~animeshbaranawal] What happens if there is a null in the grouping key? Is a 
row with null considered equal to another row with null?

 CaseKeyWhen has incorrect NULL handling
 ---

 Key: SPARK-8636
 URL: https://issues.apache.org/jira/browse/SPARK-8636
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Santiago M. Mola
  Labels: starter

 CaseKeyWhen implementation in Spark uses the following equals implementation:
 {code}
   private def equalNullSafe(l: Any, r: Any) = {
 if (l == null && r == null) {
   true
 } else if (l == null || r == null) {
   false
 } else {
   l == r
 }
   }
 {code}
 Which is not correct, since in SQL, NULL is never equal to NULL (actually, it 
 is not unequal either). In this case, a NULL value in a CASE WHEN expression 
 should never match.
 For example, you can execute this in MySQL:
 {code}
 SELECT CASE NULL WHEN NULL THEN 'NULL MATCHES' ELSE 'NULL DOES NOT MATCH' END 
 FROM DUAL;
 {code}
 And the result will be 'NULL DOES NOT MATCH'.
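 A minimal sketch of the same check against Spark SQL from a PySpark shell 
 (sqlContext assumed); with correct NULL handling the WHEN NULL branch must not 
 match, so the query should agree with the MySQL result above:
 {code}
 df = sqlContext.createDataFrame([(1,)], ['dummy'])
 df.registerTempTable('t')
 # expected with correct NULL semantics: [Row(matched=u'NULL DOES NOT MATCH')]
 sqlContext.sql(
     "SELECT CASE CAST(NULL AS INT) WHEN CAST(NULL AS INT) "
     "THEN 'NULL MATCHES' ELSE 'NULL DOES NOT MATCH' END AS matched FROM t").collect()
 {code}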



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8686) DataFrame should support `where` with expression represented by String

2015-06-28 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8686.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7063
[https://github.com/apache/spark/pull/7063]

 DataFrame should support `where` with expression represented by String
 --

 Key: SPARK-8686
 URL: https://issues.apache.org/jira/browse/SPARK-8686
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0
Reporter: Kousuke Saruta
Priority: Minor
 Fix For: 1.5.0


 DataFrame supports the `filter` function with two types of argument, `Column` 
 and `String`, but `where` only accepts `Column`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8677) Decimal divide operation throws ArithmeticException

2015-06-28 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-8677:
--
Assignee: Liang-Chi Hsieh

 Decimal divide operation throws ArithmeticException
 ---

 Key: SPARK-8677
 URL: https://issues.apache.org/jira/browse/SPARK-8677
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
 Fix For: 1.5.0


 Please refer to [BigDecimal 
 doc|http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html]:
 {quote}
 ... the rounding mode setting of a MathContext object with a precision 
 setting of 0 is not used and thus irrelevant. In the case of divide, the 
 exact quotient could have an infinitely long decimal expansion; for example, 
 1 divided by 3.
 {quote}
 Because we provide a MathContext.UNLIMITED in toBigDecimal, Decimal divide 
 operation will throw the following exception:
 {code}
 val decimal = Decimal(1.0, 10, 3) / Decimal(3.0, 10, 3)
 [info]   java.lang.ArithmeticException: Non-terminating decimal expansion; no 
 exact representable decimal result.
 [info]   at java.math.BigDecimal.divide(BigDecimal.java:1690)
 [info]   at java.math.BigDecimal.divide(BigDecimal.java:1723)
 [info]   at scala.math.BigDecimal.$div(BigDecimal.scala:256)
 [info]   at org.apache.spark.sql.types.Decimal.$div(Decimal.scala:272)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8680) PropagateTypes is very slow when there are lots of columns

2015-06-27 Thread Davies Liu (JIRA)
Davies Liu created SPARK-8680:
-

 Summary: PropagateTypes is very slow when there are lots of columns
 Key: SPARK-8680
 URL: https://issues.apache.org/jira/browse/SPARK-8680
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.4.0, 1.3.1
Reporter: Davies Liu


The time for PropagateTypes is O(N*N), where N is the number of columns, which 
is very slow if there are many columns (>1000).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8680) PropagateTypes is very slow when there are lots of columns

2015-06-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-8680:
--
Description: 
The time for PropagateTypes is O(N*N), where N is the number of columns, which 
is very slow if there are many columns (>1000).

The easiest optimization would be to move `q.inputSet` outside of 
transformExpressions, which gives roughly a 4x improvement for N=3000.

  was:The time for PropagateTypes is O(N*N), N is the number of columns, which 
is very slow if there many columns (1000)


 PropagateTypes is very slow when there are lots of columns
 --

 Key: SPARK-8680
 URL: https://issues.apache.org/jira/browse/SPARK-8680
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.1, 1.4.0
Reporter: Davies Liu

 The time for PropagateTypes is O(N*N), where N is the number of columns, which is 
 very slow if there are many columns (>1000).
 The easiest optimization would be to move `q.inputSet` outside of 
 transformExpressions, which gives roughly a 4x improvement for N=3000.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8583) Refactor python/run-tests to integrate with dev/run-test's module system

2015-06-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8583.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6967
[https://github.com/apache/spark/pull/6967]

 Refactor python/run-tests to integrate with dev/run-test's module system
 

 Key: SPARK-8583
 URL: https://issues.apache.org/jira/browse/SPARK-8583
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Project Infra, PySpark
Reporter: Josh Rosen
Assignee: Josh Rosen
 Fix For: 1.5.0


 We should refactor the {{python/run-tests}} script to be written in Python 
 and integrate with the recent {{dev/run-tests}} module system so that we can 
 more granularly skip Python tests in the pull request builder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5482) Allow individual test suites in python/run-tests

2015-06-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-5482.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6967
[https://github.com/apache/spark/pull/6967]

 Allow individual test suites in python/run-tests
 

 Key: SPARK-5482
 URL: https://issues.apache.org/jira/browse/SPARK-5482
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Katsunori Kanda
Priority: Minor
 Fix For: 1.5.0


 Add options to run individual test suites in python/run-tests. The usage is 
 as follows:
 ./python/run-tests \[core|sql|mllib|ml|streaming\]
 When no suite is selected, all test suites are run for backward compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8620) cleanup CodeGenContext

2015-06-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8620.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

 cleanup CodeGenContext
 --

 Key: SPARK-8620
 URL: https://issues.apache.org/jira/browse/SPARK-8620
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8652) PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests

2015-06-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8652.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7032
[https://github.com/apache/spark/pull/7032]

 PySpark tests sometimes forget to check return status of doctest.testmod(), 
 masking failing tests
 -

 Key: SPARK-8652
 URL: https://issues.apache.org/jira/browse/SPARK-8652
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Tests
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker
 Fix For: 1.5.0


 Several PySpark files call {{doctest.testmod()}} in order to run doctests, 
 but forget to check its return status. As a result, failures will not be 
 automatically detected by our test runner script, creating the potential for 
 bugs to slip through.
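 A minimal sketch of the pattern the fix needs in each module's __main__ block: 
 propagate doctest failures through the process exit status so the test runner 
 script can detect them.
 {code}
 import doctest
 import sys
 
 if __name__ == "__main__":
     (failure_count, test_count) = doctest.testmod()
     if failure_count:
         # a non-zero exit status is what the test runner script keys off
         sys.exit(-1)
 {code}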



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8670) Nested columns can't be referenced (but they can be selected)

2015-06-26 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603605#comment-14603605
 ] 

Davies Liu commented on SPARK-8670:
---

I think you should use `df.stats.age` or df.selectExpr("stats.age")
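
A minimal sketch of that suggestion (PySpark 1.4 shell assumed, with the df 
built in the quoted snippet below):

{code}
df.select(df.stats.age).show()
df.selectExpr("stats.age").show()
{code}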

 Nested columns can't be referenced (but they can be selected)
 -

 Key: SPARK-8670
 URL: https://issues.apache.org/jira/browse/SPARK-8670
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Nicholas Chammas

 This is strange and looks like a regression from 1.3.
 {code}
 import json
 daterz = [
   {
 'name': 'Nick',
 'stats': {
   'age': 28
 }
   },
   {
 'name': 'George',
 'stats': {
   'age': 31
 }
   }
 ]
 df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
 df.select('stats.age').show()
 df['stats.age']  # 1.4 fails on this line
 {code}
 On 1.3 this works and yields:
 {code}
 age
 28 
 31 
 Out[1]: Column<stats.age AS age#2958L>
 {code}
 On 1.4, however, this gives an error on the last line:
 {code}
 +---+
 |age|
 +---+
 | 28|
 | 31|
 +---+
 ---
 IndexErrorTraceback (most recent call last)
 <ipython-input-1-04bd990e94c6> in <module>()
  19 
  20 df.select('stats.age').show()
 ---> 21 df['stats.age']
 /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
 678 if isinstance(item, basestring):
 679 if item not in self.columns:
 --> 680 raise IndexError("no such column: %s" % item)
 681 jc = self._jdf.apply(item)
 682 return Column(jc)
 IndexError: no such column: stats.age
 {code}
 This means, among other things, that you can't join DataFrames on nested 
 columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8620) cleanup CodeGenContext

2015-06-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-8620:
--
Assignee: Wenchen Fan

 cleanup CodeGenContext
 --

 Key: SPARK-8620
 URL: https://issues.apache.org/jira/browse/SPARK-8620
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8635) improve performance of CatalystTypeConverters

2015-06-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8635.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7018
[https://github.com/apache/spark/pull/7018]

 improve performance of CatalystTypeConverters
 -

 Key: SPARK-8635
 URL: https://issues.apache.org/jira/browse/SPARK-8635
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8237) misc function: sha2

2015-06-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8237.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6934
[https://github.com/apache/spark/pull/6934]

 misc function: sha2
 ---

 Key: SPARK-8237
 URL: https://issues.apache.org/jira/browse/SPARK-8237
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
 Fix For: 1.5.0


 sha2(string/binary, int): string
 Calculates the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and 
 SHA-512) (as of Hive 1.3.0). The first argument is the string or binary to be 
 hashed. The second argument indicates the desired bit length of the result, 
 which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to 
 256). SHA-224 is supported starting from Java 8. If either argument is NULL 
 or the hash length is not one of the permitted values, the return value is 
 NULL. Example: sha2('ABC', 256) = 
 'b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78'.
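 A minimal usage sketch from PySpark, assuming the wrapper lands in 
 pyspark.sql.functions for 1.5 (shell with sqlContext assumed); the second 
 argument is the bit length described above:
 {code}
 from pyspark.sql import Row
 from pyspark.sql.functions import sha2
 
 df = sqlContext.createDataFrame([Row(s='ABC')])
 # expected, per the example above:
 # [Row(digest=u'b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78')]
 df.select(sha2(df.s, 256).alias('digest')).collect()
 {code}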



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8371) improve unit test for MaxOf and MinOf

2015-06-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8371.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6825
[https://github.com/apache/spark/pull/6825]

 improve unit test for MaxOf and MinOf
 -

 Key: SPARK-8371
 URL: https://issues.apache.org/jira/browse/SPARK-8371
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8610) Separate Row and InternalRow (part 2)

2015-06-24 Thread Davies Liu (JIRA)
Davies Liu created SPARK-8610:
-

 Summary: Separate Row and InternalRow (part 2)
 Key: SPARK-8610
 URL: https://issues.apache.org/jira/browse/SPARK-8610
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


Currently, we use GenericRow for both Row and InternalRow, which is confusing 
because it can contain Scala types as well as Catalyst types.

We should have separate implementations for them, to avoid some potential bugs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8431) Add in operator to DataFrame Column in SparkR

2015-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8431.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6941
[https://github.com/apache/spark/pull/6941]

 Add in operator to DataFrame Column in SparkR
 -

 Key: SPARK-8431
 URL: https://issues.apache.org/jira/browse/SPARK-8431
 Project: Spark
  Issue Type: New Feature
  Components: SparkR, SQL
Reporter: Yu Ishikawa
 Fix For: 1.5.0


 To filter values in a set, we should add the {{%in%}} operator to SparkR.
 {noformat}
 df$a %in% c(1, 2, 3)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8359) Spark SQL Decimal type precision loss on multiplication

2015-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8359.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6814
[https://github.com/apache/spark/pull/6814]

 Spark SQL Decimal type precision loss on multiplication
 ---

 Key: SPARK-8359
 URL: https://issues.apache.org/jira/browse/SPARK-8359
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Rene Treffer
 Fix For: 1.5.0


 It looks like the precision of decimal can not be raised beyond ~2^112 
 without causing full value truncation.
 The following code computes the power of two up to a specific point
 {code}
 import org.apache.spark.sql.types.Decimal
 val one = Decimal(1)
 val two = Decimal(2)
 def pow(n : Int) :  Decimal = if (n <= 0) { one } else { 
   val a = pow(n - 1)
   a.changePrecision(n,0)
   two.changePrecision(n,0)
   a * two
 }
 (109 to 120).foreach(n => 
 println(pow(n).toJavaBigDecimal.unscaledValue.toString))
 649037107316853453566312041152512
 1298074214633706907132624082305024
 2596148429267413814265248164610048
 5192296858534827628530496329220096
 1038459371706965525706099265844019
 2076918743413931051412198531688038
 4153837486827862102824397063376076
 8307674973655724205648794126752152
 1661534994731144841129758825350430
 3323069989462289682259517650700860
 6646139978924579364519035301401720
 1329227995784915872903807060280344
 {code}
 Beyond ~2^112 the precision is truncated even if the precision was set to n 
 and should thus handle 10^n without problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8190) ExpressionEvalHelper.checkEvaluation should also run the optimizer version

2015-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8190.
---
Resolution: Fixed

 ExpressionEvalHelper.checkEvaluation should also run the optimizer version
 --

 Key: SPARK-8190
 URL: https://issues.apache.org/jira/browse/SPARK-8190
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu

 We should remove the existing ExpressionOptimizationSuite, and update 
 checkEvaluation to also run the optimizer version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8579) Support arbitrary object in UnsafeRow

2015-06-23 Thread Davies Liu (JIRA)
Davies Liu created SPARK-8579:
-

 Summary: Support arbitrary object in UnsafeRow
 Key: SPARK-8579
 URL: https://issues.apache.org/jira/browse/SPARK-8579
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


It's common to run count(distinct xxx) in SQL; the intermediate data type is a 
UDT of OpenHashSet, so it would be good to support it in UnsafeRow to reduce the 
memory usage during aggregation.

The same applies to DecimalType, which can appear inside the grouping key.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8187) date/time function: date_sub

2015-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-8187:
--
Shepherd: Davies Liu

 date/time function: date_sub
 

 Key: SPARK-8187
 URL: https://issues.apache.org/jira/browse/SPARK-8187
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Adrian Wang

 date_sub(string startdate, int days): string
 date_sub(date startdate, int days): date
 Subtracts a number of days from startdate: date_sub('2008-12-31', 1) = 
 '2008-12-30'.
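 A minimal usage sketch from PySpark, assuming the wrapper lands in 
 pyspark.sql.functions for 1.5 (shell with sqlContext assumed):
 {code}
 from pyspark.sql import Row
 from pyspark.sql.functions import date_sub
 
 df = sqlContext.createDataFrame([Row(d='2008-12-31')])
 # expected result: 2008-12-30
 df.select(date_sub(df.d, 1).alias('prev')).collect()
 {code}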



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8186) date/time function: date_add

2015-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-8186:
--
Shepherd: Davies Liu

 date/time function: date_add
 

 Key: SPARK-8186
 URL: https://issues.apache.org/jira/browse/SPARK-8186
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Adrian Wang

 date_add(string startdate, int days): string
 date_add(date startdate, int days): date
 Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7810) rdd.py _load_from_socket cannot load data from jvm socket if ipv6 is used

2015-06-23 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598610#comment-14598610
 ] 

Davies Liu commented on SPARK-7810:
---

What does the stack trace look like? Does the host only have IPv6?

There are multiple places that don't take IPv6 into account; you can grep for 
`127.0.0.1` or `localhost` in the tree. Could you also fix them together?

 rdd.py _load_from_socket cannot load data from jvm socket if ipv6 is used
 ---

 Key: SPARK-7810
 URL: https://issues.apache.org/jira/browse/SPARK-7810
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.3.1
Reporter: Ai He

 Method _load_from_socket in rdd.py cannot load data from the JVM socket if 
 IPv6 is used. The current method only works with IPv4; the new implementation 
 should work with both protocols.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8492) Support BinaryType in UnsafeRow

2015-06-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8492.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6911
[https://github.com/apache/spark/pull/6911]

 Support BinaryType in UnsafeRow
 ---

 Key: SPARK-8492
 URL: https://issues.apache.org/jira/browse/SPARK-8492
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8307) Improve timestamp from parquet

2015-06-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8307.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6759
[https://github.com/apache/spark/pull/6759]

 Improve timestamp from parquet
 --

 Key: SPARK-8307
 URL: https://issues.apache.org/jira/browse/SPARK-8307
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.5.0


 Currently, converting a timestamp from Parquet or Hive is complicated and 
 really slow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8301) Improve UTF8String substring/startsWith/endsWith/contains performance

2015-06-20 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594926#comment-14594926
 ] 

Davies Liu commented on SPARK-8301:
---

[~rxin] Why can't I assign this JIRA to [~TarekAuel]?

 Improve UTF8String substring/startsWith/endsWith/contains performance
 -

 Key: SPARK-8301
 URL: https://issues.apache.org/jira/browse/SPARK-8301
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Priority: Critical

 Many functions in UTF8String are unnecessarily expensive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8301) Improve UTF8String substring/startsWith/endsWith/contains performance

2015-06-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8301.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

 Improve UTF8String substring/startsWith/endsWith/contains performance
 -

 Key: SPARK-8301
 URL: https://issues.apache.org/jira/browse/SPARK-8301
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Priority: Critical
 Fix For: 1.5.0


 Many functions in UTF8String are unnecessarily expensive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8422) Introduce a module abstraction inside of dev/run-tests

2015-06-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8422.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6866
[https://github.com/apache/spark/pull/6866]

 Introduce a module abstraction inside of dev/run-tests
 --

 Key: SPARK-8422
 URL: https://issues.apache.org/jira/browse/SPARK-8422
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Project Infra
Reporter: Josh Rosen
Assignee: Josh Rosen
 Fix For: 1.5.0


 At a high level, we have Spark modules / components which
 1. are affected / impacted by file changes (e.g. a module is associated with 
 a set of source files, so changes to those files change the module),
 2. contain a set of tests to run, which are triggered via Maven, SBT, or via 
 Python / R scripts.
 3. depend on other modules and have dependent modules: if we change core, 
 then every downstream test should be run.  If we change only MLlib, then we 
 can skip the SQL tests but should probably run the Python MLlib tests, etc.
 Right now, the per-module logic is spread across a few different places 
 inside of the {{dev/run-tests}} script: we have one function that describes 
 how to detect changes for all modules, another function that (implicitly) 
 deals with module dependencies, etc.
 Instead, I propose that we introduce a class for describing a module, use 
 instances of this class to build up a dependency graph, then phrase the "find 
 which tests to run" operations in terms of that graph.  I think that this 
 will be easier to understand / maintain.
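A minimal Python sketch of the proposed abstraction (class and function names here are illustrative, not the actual dev/run-tests code):

{code}
class Module(object):
    """A Spark module: the sources that define it, its tests, and its upstream deps."""
    def __init__(self, name, source_prefixes, test_goals=(), dependencies=()):
        self.name = name
        self.source_prefixes = tuple(source_prefixes)
        self.test_goals = tuple(test_goals)
        self.dependencies = tuple(dependencies)

    def contains_file(self, filename):
        return any(filename.startswith(p) for p in self.source_prefixes)


def modules_to_test(changed_files, all_modules):
    """Changed modules plus everything downstream of them in the dependency graph."""
    affected = {m for m in all_modules
                if any(m.contains_file(f) for f in changed_files)}
    grew = True
    while grew:  # propagate to dependent modules until a fixed point is reached
        grew = False
        for m in all_modules:
            if m not in affected and any(d in affected for d in m.dependencies):
                affected.add(m)
                grew = True
    return affected


core = Module("core", ["core/"], ["core/test"])
sql = Module("sql", ["sql/"], ["sql/test"], dependencies=[core])
mllib = Module("mllib", ["mllib/"], ["mllib/test"], dependencies=[core])

print(sorted(m.name for m in modules_to_test(["sql/src/Foo.scala"], [core, sql, mllib])))
# ['sql']
print(sorted(m.name for m in modules_to_test(["core/src/Bar.scala"], [core, sql, mllib])))
# ['core', 'mllib', 'sql']
{code}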



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8477) Add in operator to DataFrame Column in Python

2015-06-19 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593928#comment-14593928
 ] 

Davies Liu commented on SPARK-8477:
---

[~rxin] [~yuu.ishik...@gmail.com] We already have `inSet` to match the Scala 
API `in`, so we could close this one.
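
For reference, a minimal PySpark sketch of the existing `inSet` column method (the data is illustrative):

{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="inSetSketch")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame(
    [("Alice", 1), ("Bob", 2), ("Carol", 3)], ["name", "age"])

# Keep rows whose name is in the given set, analogous to Scala's Column.in.
df.filter(df.name.inSet("Alice", "Bob")).show()

sc.stop()
{code}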

 Add in operator to DataFrame Column in Python
 -

 Key: SPARK-8477
 URL: https://issues.apache.org/jira/browse/SPARK-8477
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, SQL
Reporter: Yu Ishikawa





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8492) Support BinaryType in UnsafeRow

2015-06-19 Thread Davies Liu (JIRA)
Davies Liu created SPARK-8492:
-

 Summary: Support BinaryType in UnsafeRow
 Key: SPARK-8492
 URL: https://issues.apache.org/jira/browse/SPARK-8492
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8477) Add in operator to DataFrame Column in Python

2015-06-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-8477:
--
Fix Version/s: 1.3.0

 Add in operator to DataFrame Column in Python
 -

 Key: SPARK-8477
 URL: https://issues.apache.org/jira/browse/SPARK-8477
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, SQL
Reporter: Yu Ishikawa
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8477) Add in operator to DataFrame Column in Python

2015-06-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8477.
---
  Resolution: Implemented
Target Version/s: 1.3.0  (was: 1.5.0)

 Add in operator to DataFrame Column in Python
 -

 Key: SPARK-8477
 URL: https://issues.apache.org/jira/browse/SPARK-8477
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, SQL
Reporter: Yu Ishikawa





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8339) Itertools islice requires an integer for the stop argument.

2015-06-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8339.
---
   Resolution: Fixed
Fix Version/s: 1.4.1
   1.5.0

Issue resolved by pull request 6794
[https://github.com/apache/spark/pull/6794]

 Itertools islice requires an integer for the stop argument.
 ---

 Key: SPARK-8339
 URL: https://issues.apache.org/jira/browse/SPARK-8339
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
 Environment: python 3
Reporter: Kevin Conor
Priority: Minor
 Fix For: 1.5.0, 1.4.1

   Original Estimate: 5m
  Remaining Estimate: 5m

 Itertools islice requires an integer for the stop argument.  The bug is in 
 serializers.py and can prevent an RDD from being written to disk.
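A minimal sketch of the underlying Python 3 pitfall (the numbers and names are illustrative, not the actual serializers.py code):

{code}
from itertools import islice

batch = 10 / 3                    # true division on Python 3 -> 3.333..., a float
try:
    list(islice(iter(range(10)), batch))
except ValueError as e:
    print(e)                      # islice() rejects a non-integer stop argument

# Integer division (or an explicit int()) keeps the stop argument an int.
print(list(islice(iter(range(10)), 10 // 3)))   # [0, 1, 2]
{code}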



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8444) Add Python example in streaming for queueStream usage

2015-06-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8444.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6884
[https://github.com/apache/spark/pull/6884]

 Add Python example in streaming for queueStream usage
 -

 Key: SPARK-8444
 URL: https://issues.apache.org/jira/browse/SPARK-8444
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Bryan Cutler
Priority: Minor
 Fix For: 1.5.0


 I noticed there was no Python equivalent for the Scala queueStream example.  This 
 will have to be slightly different because changes to the queue after the 
 stream is created are not recognized.
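A minimal sketch of what such a Python example could look like (app name, batch interval, and data are illustrative); note that the whole queue is built before the stream starts, since RDDs added afterwards are not picked up:

{code}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="QueueStreamSketch")
ssc = StreamingContext(sc, 1)          # 1-second batches

# Build the full queue of RDDs up front; changes after start() are not recognized.
rdd_queue = [sc.parallelize(range(i * 100, (i + 1) * 100)) for i in range(5)]

stream = ssc.queueStream(rdd_queue)
stream.map(lambda x: (x % 10, 1)).reduceByKey(lambda a, b: a + b).pprint()

ssc.start()
ssc.awaitTermination(10)               # run for ~10 seconds, then shut down
ssc.stop(stopSparkContext=True)
{code}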



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8461) ClassNotFoundException when code generation is enabled

2015-06-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-8461:
-

Assignee: Davies Liu

 ClassNotFoundException when code generation is enabled
 --

 Key: SPARK-8461
 URL: https://issues.apache.org/jira/browse/SPARK-8461
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Davies Liu
Priority: Blocker

 Build Spark without {{-Phive}} to make sure the isolated classloader for Hive 
 support is irrelevant, then run the following Spark shell snippet:
 {code}
 sqlContext.range(0, 2).select(lit("a") as 'a).coalesce(1).write.mode("overwrite").json("file:///tmp/foo")
 {code}
 Exception thrown:
 {noformat}
 15/06/18 15:36:30 ERROR codegen.GenerateMutableProjection: failed to compile:
   import org.apache.spark.sql.catalyst.InternalRow;
   public SpecificProjection 
 generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
 return new SpecificProjection(expr);
   }
   class SpecificProjection extends 
 org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
 private org.apache.spark.sql.catalyst.expressions.Expression[] 
 expressions = null;
 private org.apache.spark.sql.catalyst.expressions.MutableRow 
 mutableRow = null;
 public 
 SpecificProjection(org.apache.spark.sql.catalyst.expressions.Expression[] 
 expr) {
   expressions = expr;
   mutableRow = new 
 org.apache.spark.sql.catalyst.expressions.GenericMutableRow(1);
 }
 public 
 org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection 
 target(org.apache.spark.sql.catalyst.expressions.MutableRow row) {
   mutableRow = row;
   return this;
 }
 /* Provide immutable access to the last projected row. */
 public InternalRow currentValue() {
   return (InternalRow) mutableRow;
 }
 public Object apply(Object _i) {
   InternalRow i = (InternalRow) _i;
   /* expression: a */
   Object obj2 = expressions[0].eval(i);
   boolean isNull0 = obj2 == null;
   org.apache.spark.unsafe.types.UTF8String primitive1 = null;
   if (!isNull0) {
 primitive1 = (org.apache.spark.unsafe.types.UTF8String) obj2;
   }
   if(isNull0)
 mutableRow.setNullAt(0);
   else
 mutableRow.update(0, primitive1);
   return mutableRow;
 }
   }
 org.codehaus.commons.compiler.CompileException: Line 28, Column 35: Object
 at 
 org.codehaus.janino.UnitCompiler.findTypeByName(UnitCompiler.java:6897)
 at 
 org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5331)
 at 
 org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5207)
 at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5188)
 at 
 org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:185)
 at 
 org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5119)
 at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880)
 at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5159)
 at 
 org.codehaus.janino.UnitCompiler.access$16700(UnitCompiler.java:185)
 at 
 org.codehaus.janino.UnitCompiler$31.getParameterTypes2(UnitCompiler.java:8533)
 at 
 org.codehaus.janino.IClass$IInvocable.getParameterTypes(IClass.java:835)
 at org.codehaus.janino.IClass$IMethod.getDescriptor2(IClass.java:1063)
 at 
 org.codehaus.janino.IClass$IInvocable.getDescriptor(IClass.java:849)
 at org.codehaus.janino.IClass.getIMethods(IClass.java:211)
 at org.codehaus.janino.IClass.getIMethods(IClass.java:199)
 at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:409)
 at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:658)
 at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:662)
 at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:185)
 at 
 org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:350)
 at 
 org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1035)
 at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:354)
 at 
 org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:769)
 at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:532)
 at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:393)
 at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:185)
 at 
 

[jira] [Resolved] (SPARK-8207) math function: bin

2015-06-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8207.
---
Resolution: Fixed

 math function: bin
 --

 Key: SPARK-8207
 URL: https://issues.apache.org/jira/browse/SPARK-8207
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Liang-Chi Hsieh

 bin(long a): string
 Returns the number in binary format (see 
 http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_bin).
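A minimal usage sketch (run in a pyspark shell where `sqlContext` is predefined; the data is illustrative), assuming the Python wrapper `pyspark.sql.functions.bin` that accompanies this SQL function:

{code}
from pyspark.sql import functions as F

df = sqlContext.range(0, 4)                       # a single "id" column: 0..3
df.select(df.id, F.bin(df.id).alias("id_bin")).show()
# id 3 is rendered as the string '11', matching the MySQL-style bin() above.
{code}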



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8477) Add in operator to DataFrame Column in Python

2015-06-19 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593850#comment-14593850
 ] 

Davies Liu commented on SPARK-8477:
---

I think we can use the upper-case `In`, or another word (such as `within`).

 Add in operator to DataFrame Column in Python
 -

 Key: SPARK-8477
 URL: https://issues.apache.org/jira/browse/SPARK-8477
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, SQL
Reporter: Yu Ishikawa





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


