[jira] [Resolved] (SPARK-7909) spark-ec2 and associated tools not py3 ready
[ https://issues.apache.org/jira/browse/SPARK-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-7909. --- Resolution: Fixed spark-ec2 and associated tools not py3 ready Key: SPARK-7909 URL: https://issues.apache.org/jira/browse/SPARK-7909 Project: Spark Issue Type: Improvement Components: EC2 Environment: ec2 python3 Reporter: Matthew Goodman Priority: Blocker At present there is no permutation of tools that supports Python 3 on both the launching computer and the running cluster. There are a couple of problems involved: - There is no prebuilt Spark binary with Python 3 support. - spark-ec2/spark/init.sh contains inline py3-unfriendly print statements. - Config files for cluster processes don't seem to make it to all nodes in a working format. I have fixes for some of this, but the config and running-context debugging remains elusive to me.
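For reference, a minimal illustration of the py3-unfriendly pattern referred to above (the message text here is hypothetical; the actual statements live in spark-ec2/spark/init.sh):
{code}
# Python 2-only print statement, as described in the report; it is a
# SyntaxError under Python 3, so it is shown commented out:
# print "Unpacking Spark"

# The form that parses under both Python 2 and Python 3:
print("Unpacking Spark")
{code}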
[jira] [Resolved] (SPARK-6289) PySpark doesn't maintain SQL date Types
[ https://issues.apache.org/jira/browse/SPARK-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-6289. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7301 [https://github.com/apache/spark/pull/7301] PySpark doesn't maintain SQL date Types --- Key: SPARK-6289 URL: https://issues.apache.org/jira/browse/SPARK-6289 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.2.1 Reporter: Michael Nazario Assignee: Davies Liu Fix For: 1.5.0 For the DateType, Spark SQL requires a datetime.date in Python. However, if you collect a row based on that type, you'll end up with a returned value of type datetime.datetime. I have tried to reproduce this using the pyspark shell, but have been unable to. This is definitely a problem coming from Pyrolite, though: https://github.com/irmen/Pyrolite/ Pyrolite is used for datetime and date serialization, but it appears to map dates to datetime objects instead of date objects.
[jira] [Resolved] (SPARK-7902) SQL UDF doesn't support UDT in PySpark
[ https://issues.apache.org/jira/browse/SPARK-7902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-7902. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7301 [https://github.com/apache/spark/pull/7301] SQL UDF doesn't support UDT in PySpark -- Key: SPARK-7902 URL: https://issues.apache.org/jira/browse/SPARK-7902 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Davies Liu Priority: Critical Fix For: 1.5.0 We don't convert Python SQL internal types to Python types in SQL UDF execution. This causes problems if the input arguments contain UDTs or the return type is a UDT. Right now, the raw SQL types are passed into the Python UDF and the return value is not converted to Python SQL types. This is the code (from [~rams]) to reproduce the bug. (Actually, it triggers another bug first right now.)
{code}
from pyspark.mllib.linalg import SparseVector
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = sqlContext.createDataFrame([(SparseVector(2, {0: 0.0}),)], ["features"])
sz = udf(lambda s: s.size, IntegerType())
df.select(sz(df.features).alias("sz")).collect()
{code}
[jira] [Updated] (SPARK-7909) spark-ec2 and associated tools not py3 ready
[ https://issues.apache.org/jira/browse/SPARK-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-7909: -- Target Version/s: 1.5.0 Priority: Blocker (was: Major) spark-ec2 and associated tools not py3 ready Key: SPARK-7909 URL: https://issues.apache.org/jira/browse/SPARK-7909 Project: Spark Issue Type: Improvement Components: EC2 Environment: ec2 python3 Reporter: Matthew Goodman Priority: Blocker At present there is no permutation of tools that supports Python 3 on both the launching computer and the running cluster. There are a couple of problems involved: - There is no prebuilt Spark binary with Python 3 support. - spark-ec2/spark/init.sh contains inline py3-unfriendly print statements. - Config files for cluster processes don't seem to make it to all nodes in a working format. I have fixes for some of this, but the config and running-context debugging remains elusive to me.
[jira] [Commented] (SPARK-4315) PySpark pickling of pyspark.sql.Row objects is extremely inefficient
[ https://issues.apache.org/jira/browse/SPARK-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619611#comment-14619611 ] Davies Liu commented on SPARK-4315: --- This is fixed by https://github.com/apache/spark/pull/5445 PySpark pickling of pyspark.sql.Row objects is extremely inefficient Key: SPARK-4315 URL: https://issues.apache.org/jira/browse/SPARK-4315 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: Ubuntu, Python 2.7, Spark 1.1.0 Reporter: Adam Davison Working with an RDD of pyspark.sql.Row objects, created by reading a file with SQLContext in a local PySpark context. Operations on the RDD, such as data.groupBy(lambda x: x.field_name), are extremely slow (more than 10x slower than an equivalent Scala/Spark implementation). Obviously I expected it to be somewhat slower, but I did a bit of digging given that the difference was so huge. Luckily, it's fairly easy to add profiling to the Python workers. I see that the vast majority of time is spent in spark-1.1.0-bin-cdh4/python/pyspark/sql.py:757(_restore_object). It seems that this line attempts to accelerate pickling of Rows with the use of a cache. Some debugging reveals that this cache becomes quite big (hundreds of entries). Disabling the cache by adding return _create_cls(dataType)(obj) as the first line of _restore_object made my query run 5x faster, implying that the caching is not providing the desired acceleration.
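The caching pattern at issue can be sketched in plain Python; this is a hedged, minimal reconstruction (not Spark's actual implementation; the class factory is a stand-in), showing both the cached path and the reporter's bypass:
{code}
# Minimal sketch of the _restore_object caching pattern described above.
_cached_cls = {}

def _create_cls(data_type):
    # Stand-in for PySpark's dynamic Row-class factory.
    return type("Row", (tuple,), {"__doc__": str(data_type)})

def _restore_object(data_type, obj):
    key = str(data_type)
    cls = _cached_cls.get(key)       # this lookup dominated the profile
    if cls is None:
        cls = _create_cls(data_type)
        _cached_cls[key] = cls
    return cls(obj)

def _restore_object_uncached(data_type, obj):
    # The reporter's change: rebuild the class every time (5x faster for them).
    return _create_cls(data_type)(obj)
{code}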
[jira] [Assigned] (SPARK-4315) PySpark pickling of pyspark.sql.Row objects is extremely inefficient
[ https://issues.apache.org/jira/browse/SPARK-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-4315: - Assignee: Davies Liu PySpark pickling of pyspark.sql.Row objects is extremely inefficient Key: SPARK-4315 URL: https://issues.apache.org/jira/browse/SPARK-4315 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: Ubuntu, Python 2.7, Spark 1.1.0 Reporter: Adam Davison Assignee: Davies Liu Working with an RDD of pyspark.sql.Row objects, created by reading a file with SQLContext in a local PySpark context. Operations on the RDD, such as data.groupBy(lambda x: x.field_name), are extremely slow (more than 10x slower than an equivalent Scala/Spark implementation). Obviously I expected it to be somewhat slower, but I did a bit of digging given that the difference was so huge. Luckily, it's fairly easy to add profiling to the Python workers. I see that the vast majority of time is spent in spark-1.1.0-bin-cdh4/python/pyspark/sql.py:757(_restore_object). It seems that this line attempts to accelerate pickling of Rows with the use of a cache. Some debugging reveals that this cache becomes quite big (hundreds of entries). Disabling the cache by adding return _create_cls(dataType)(obj) as the first line of _restore_object made my query run 5x faster, implying that the caching is not providing the desired acceleration.
[jira] [Resolved] (SPARK-4315) PySpark pickling of pyspark.sql.Row objects is extremely inefficient
[ https://issues.apache.org/jira/browse/SPARK-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-4315. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.2 PySpark pickling of pyspark.sql.Row objects is extremely inefficient Key: SPARK-4315 URL: https://issues.apache.org/jira/browse/SPARK-4315 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: Ubuntu, Python 2.7, Spark 1.1.0 Reporter: Adam Davison Assignee: Davies Liu Fix For: 1.3.2, 1.4.0 Working with an RDD of pyspark.sql.Row objects, created by reading a file with SQLContext in a local PySpark context. Operations on the RDD, such as data.groupBy(lambda x: x.field_name), are extremely slow (more than 10x slower than an equivalent Scala/Spark implementation). Obviously I expected it to be somewhat slower, but I did a bit of digging given that the difference was so huge. Luckily, it's fairly easy to add profiling to the Python workers. I see that the vast majority of time is spent in spark-1.1.0-bin-cdh4/python/pyspark/sql.py:757(_restore_object). It seems that this line attempts to accelerate pickling of Rows with the use of a cache. Some debugging reveals that this cache becomes quite big (hundreds of entries). Disabling the cache by adding return _create_cls(dataType)(obj) as the first line of _restore_object made my query run 5x faster, implying that the caching is not providing the desired acceleration.
[jira] [Commented] (SPARK-5092) Selecting from a nested structure with SparkSQL should return a nested structure
[ https://issues.apache.org/jira/browse/SPARK-5092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619632#comment-14619632 ] Davies Liu commented on SPARK-5092: --- cc [~marmbrus] Selecting from a nested structure with SparkSQL should return a nested structure Key: SPARK-5092 URL: https://issues.apache.org/jira/browse/SPARK-5092 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brad Willard Priority: Minor Labels: pyspark, spark, sql When running a SparkSQL query like this (at least on a JSON dataset): select rid, meta_data.name from a_table the rows returned lose the nested structure. I receive a row like Row(rid='123', name='delete') instead of Row(rid='123', meta_data=Row(name='delete')). I personally think this is confusing, especially when programmatically building and executing queries and then parsing the result to find your data in a new structure. I could understand how that's less desirable in some situations, but you could get around it by supporting 'as': if you wanted to skip the nested structure, simply write select rid, meta_data.name as name from a_table.
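A hedged illustration of the behavior and the proposed workaround, assuming a sqlContext and a registered table a_table with a meta_data struct column (names taken from the report):
{code}
# Today this returns Row(rid='123', name='delete'), with the nesting lost:
nested = sqlContext.sql("select rid, meta_data.name from a_table")

# With 'as' support, flattening would become an explicit opt-in:
flat = sqlContext.sql("select rid, meta_data.name as name from a_table")
{code}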
[jira] [Updated] (SPARK-8931) Fallback to interpret mode if failed to compile in codegen
[ https://issues.apache.org/jira/browse/SPARK-8931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8931: -- Description: And we should not fall back during testing. Fallback to interpret mode if failed to compile in codegen -- Key: SPARK-8931 URL: https://issues.apache.org/jira/browse/SPARK-8931 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Priority: Critical And we should not fall back during testing.
[jira] [Closed] (SPARK-7507) pyspark.sql.types.StructType and Row should implement __iter__()
[ https://issues.apache.org/jira/browse/SPARK-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-7507. - Resolution: Won't Fix pyspark.sql.types.StructType and Row should implement __iter__() Key: SPARK-7507 URL: https://issues.apache.org/jira/browse/SPARK-7507 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Nicholas Chammas Priority: Minor {{StructType}} looks an awful lot like a Python dictionary. However, it doesn't implement {{\_\_iter\_\_()}}, so doing a quick conversion like this doesn't work:
{code}
>>> df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
>>> df.schema
StructType(List(StructField(name,StringType,true)))
>>> dict(df.schema)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'StructType' object is not iterable
{code}
This would be super helpful for doing any custom schema manipulations without having to go through the whole {{.json() -> json.loads() -> manipulate() -> json.dumps() -> .fromJson()}} charade. Same goes for {{Row}}, which offers an [{{asDict()}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.Row.asDict] method but doesn't support the more Pythonic {{dict(Row)}}.
[jira] [Commented] (SPARK-7507) pyspark.sql.types.StructType and Row should implement __iter__()
[ https://issues.apache.org/jira/browse/SPARK-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619609#comment-14619609 ] Davies Liu commented on SPARK-7507: --- For `Row`, it's similar to a namedtuple: you can iterate over it and get each column of it, but dict() requires key-value pairs. I'd like to close this as `Won't Fix`. pyspark.sql.types.StructType and Row should implement __iter__() Key: SPARK-7507 URL: https://issues.apache.org/jira/browse/SPARK-7507 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Nicholas Chammas Priority: Minor {{StructType}} looks an awful lot like a Python dictionary. However, it doesn't implement {{\_\_iter\_\_()}}, so doing a quick conversion like this doesn't work:
{code}
>>> df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
>>> df.schema
StructType(List(StructField(name,StringType,true)))
>>> dict(df.schema)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'StructType' object is not iterable
{code}
This would be super helpful for doing any custom schema manipulations without having to go through the whole {{.json() -> json.loads() -> manipulate() -> json.dumps() -> .fromJson()}} charade. Same goes for {{Row}}, which offers an [{{asDict()}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.Row.asDict] method but doesn't support the more Pythonic {{dict(Row)}}.
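A hedged sketch of the round-trip workaround mentioned in the description, assuming a DataFrame df like the one in the example:
{code}
import json
from pyspark.sql.types import StructType

schema_dict = json.loads(df.schema.json())    # schema as a plain Python dict
schema_dict["fields"][0]["nullable"] = False  # an arbitrary manipulation
new_schema = StructType.fromJson(schema_dict)
{code}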
[jira] [Resolved] (SPARK-8450) PySpark write.parquet raises Unsupported datatype DecimalType()
[ https://issues.apache.org/jira/browse/SPARK-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8450. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7131 [https://github.com/apache/spark/pull/7131] PySpark write.parquet raises Unsupported datatype DecimalType() --- Key: SPARK-8450 URL: https://issues.apache.org/jira/browse/SPARK-8450 Project: Spark Issue Type: Bug Components: PySpark, SQL Environment: Spark 1.4.0 on Debian Reporter: Peter Hoffmann Fix For: 1.5.0 I'm getting an exception when I try to save a DataFrame with a DecimalType as a parquet file. Minimal example:
{code}
from decimal import Decimal
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sqlContext = SQLContext(sc)
schema = StructType([
    StructField('id', LongType()),
    StructField('value', DecimalType())])
rdd = sc.parallelize([[1, Decimal("0.5")], [2, Decimal("2.9")]])
df = sqlContext.createDataFrame(rdd, schema)
df.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite')
{code}
Stack trace:
{code}
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-19-a77dac8de5f3> in <module>()
----> 1 sr.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite')

/home/spark/spark-1.4.0-bin-hadoop2.6/python/pyspark/sql/readwriter.pyc in parquet(self, path, mode)
    367         :param mode: one of `append`, `overwrite`, `error`, `ignore` (default: error)
    368 
--> 369         return self._jwrite.mode(mode).parquet(path)
    370 
    371     @since(1.4)

/home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539 
    540         for temp_arg in temp_args:

/home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o361.parquet.
: org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:138)
	at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.run(commands.scala:114)
	at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
	at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
	at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
	at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:939)
	at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:939)
	at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:332)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:281)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:207)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
{code}
[jira] [Commented] (SPARK-8408) Python OR operator is not considered while creating a column of boolean type
[ https://issues.apache.org/jira/browse/SPARK-8408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619622#comment-14619622 ] Davies Liu commented on SPARK-8408: --- In Python, we cannot override `or`, `and`, and `not`, so we should use `|`, `&`, and `~` for them. We now throw an exception if `and`/`or` is used with columns; see https://github.com/apache/spark/pull/6961 Python OR operator is not considered while creating a column of boolean type Key: SPARK-8408 URL: https://issues.apache.org/jira/browse/SPARK-8408 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: OSX Apache Spark 1.4.0 Reporter: Felix Maximilian Möller Priority: Minor Fix For: 1.4.1 Attachments: bug_report.ipynb.json h3. Given
{code}
d = [{'name': 'Alice', 'age': 1}, {'name': 'Bob', 'age': 2}]
person_df = sqlContext.createDataFrame(d)
{code}
h3. When
{code}
person_df.filter(person_df.age==1 or person_df.age==2).collect()
{code}
h3. Expected [Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')] h3. Actual [Row(age=1, name=u'Alice')] h3. While
{code}
person_df.filter("age = 1 or age = 2").collect()
{code}
yields the correct result: [Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]
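A minimal sketch of the correct operators, assuming the sqlContext and person_df from the report:
{code}
# `or` short-circuits on the truthiness of the first Column object, which is
# why the report sees only one row back. Use the overloaded bitwise operators
# and parenthesize each comparison:
person_df.filter((person_df.age == 1) | (person_df.age == 2)).collect()
# -> [Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]

# Likewise `&` for `and` and `~` for `not`:
person_df.filter((person_df.age == 1) & (person_df.name == 'Alice')).collect()
person_df.filter(~(person_df.age == 1)).collect()
{code}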
[jira] [Resolved] (SPARK-8408) Python OR operator is not considered while creating a column of boolean type
[ https://issues.apache.org/jira/browse/SPARK-8408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8408. --- Resolution: Fixed Assignee: Davies Liu Fix Version/s: 1.4.1 Python OR operator is not considered while creating a column of boolean type Key: SPARK-8408 URL: https://issues.apache.org/jira/browse/SPARK-8408 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: OSX Apache Spark 1.4.0 Reporter: Felix Maximilian Möller Assignee: Davies Liu Priority: Minor Fix For: 1.4.1 Attachments: bug_report.ipynb.json h3. Given
{code}
d = [{'name': 'Alice', 'age': 1}, {'name': 'Bob', 'age': 2}]
person_df = sqlContext.createDataFrame(d)
{code}
h3. When
{code}
person_df.filter(person_df.age==1 or person_df.age==2).collect()
{code}
h3. Expected [Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')] h3. Actual [Row(age=1, name=u'Alice')] h3. While
{code}
person_df.filter("age = 1 or age = 2").collect()
{code}
yields the correct result: [Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]
[jira] [Created] (SPARK-8931) Fallback to interpret mode if failed to compile in codegen
Davies Liu created SPARK-8931: - Summary: Fallback to interpret mode if failed to compile in codegen Key: SPARK-8931 URL: https://issues.apache.org/jira/browse/SPARK-8931 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Priority: Critical
[jira] [Resolved] (SPARK-7190) UTF8String backed by binary data
[ https://issues.apache.org/jira/browse/SPARK-7190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-7190. --- Resolution: Fixed UTF8String backed by binary data Key: SPARK-7190 URL: https://issues.apache.org/jira/browse/SPARK-7190 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Reynold Xin Assignee: Davies Liu Just a pointer to some memory address, so we don't need to copy the data into a byte array.
[jira] [Resolved] (SPARK-7815) Enable UTF8String to work against memory address directly
[ https://issues.apache.org/jira/browse/SPARK-7815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-7815. --- Resolution: Fixed Enable UTF8String to work against memory address directly - Key: SPARK-7815 URL: https://issues.apache.org/jira/browse/SPARK-7815 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu So we can avoid an extra copy of the data into a byte array.
[jira] [Assigned] (SPARK-6573) Convert inbound NaN values as null
[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-6573: - Assignee: Davies Liu Convert inbound NaN values as null -- Key: SPARK-6573 URL: https://issues.apache.org/jira/browse/SPARK-6573 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0 Reporter: Fabian Boehnlein Assignee: Davies Liu In pandas it is common to use numpy.nan as the null value, for missing data or whatever. http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna createDataFrame, however, only works with None as the null value, parsing it as None in the RDD. I suggest adding support for np.nan values in pandas DataFrames. Current stack trace when calling createDataFrame on a pandas DataFrame with object-type columns containing np.nan values (which are floats):
{code}
TypeError                                 Traceback (most recent call last)
<ipython-input-38-34f0263f0bf4> in <module>()
----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
    339             schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
    340 
--> 341         return self.applySchema(data, schema)
    342 
    343     def registerDataFrameAsTable(self, rdd, tableName):

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
    246 
    247         for row in rows:
--> 248             _verify_type(row, schema)
    249 
    250         # convert python objects to sql data

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1064                 "length of fields (%d)" % (len(obj), len(dataType.fields)))
   1065         for v, f in zip(obj, dataType.fields):
-> 1066             _verify_type(v, f.dataType)
   1067 
   1068 _cached_cls = weakref.WeakValueDictionary()

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1048     if type(obj) not in _acceptable_types[_type]:
   1049         raise TypeError("%s can not accept object in type %s"
-> 1050                         % (dataType, type(obj)))
   1051 
   1052     if isinstance(dataType, ArrayType):

TypeError: StringType can not accept object in type <type 'float'>
{code}
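One hedged workaround sketch for this, assuming the df_, schema, and sqlCtx from the traceback: replace NaN with None on the pandas side before conversion.
{code}
import pandas as pd

# where() keeps values where the condition holds and substitutes None
# elsewhere, so every NaN cell becomes None (dtype becomes object):
cleaned = df_.where(pd.notnull(df_), None)
sqldf = sqlCtx.createDataFrame(cleaned, schema=schema)
{code}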
[jira] [Resolved] (SPARK-8804) order of UTF8String is wrong if there is any non-ascii character in it
[ https://issues.apache.org/jira/browse/SPARK-8804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8804. --- Resolution: Fixed Fix Version/s: 1.5.0 order of UTF8String is wrong if there is any non-ascii character in it --- Key: SPARK-8804 URL: https://issues.apache.org/jira/browse/SPARK-8804 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Fix For: 1.4.1, 1.5.0 We compare UTF8Strings byte by byte, but byte in the JVM is signed; it should be compared as unsigned.
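The effect is easy to demonstrate outside the JVM; a small Python 3 sketch (illustrative strings, with JVM signedness emulated by hand):
{code}
# UTF-8 encodes non-ASCII characters with bytes >= 0x80, which a signed JVM
# byte represents as a negative number, flipping the sort order.
def jvm_signed(b):
    return b - 256 if b >= 0x80 else b

e, e_acute = "e".encode("utf-8"), "é".encode("utf-8")  # b'e' vs b'\xc3\xa9'
print(list(e_acute) > list(e))    # True: correct, unsigned comparison
print([jvm_signed(b) for b in e_acute] > [jvm_signed(b) for b in e])
                                  # False: the wrong order the bug produced
{code}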
[jira] [Created] (SPARK-8844) head/collect is broken in SparkR
Davies Liu created SPARK-8844: - Summary: head/collect is broken in SparkR Key: SPARK-8844 URL: https://issues.apache.org/jira/browse/SPARK-8844 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.5.0 Reporter: Davies Liu Priority: Blocker
{code}
> t = tables(sqlContext)
> showDF(T)
Error in (function (classes, fdef, mtable) :
  unable to find an inherited method for function ‘showDF’ for signature ‘logical’
> showDF(t)
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
+---------+-----------+
15/07/06 09:59:10 WARN Executor: Told to re-register on heartbeat
> head(t)
Error in readTypedObject(con, type) : Unsupported type for deserialization
> collect(t)
Error in readTypedObject(con, type) : Unsupported type for deserialization
{code}
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615293#comment-14615293 ] Davies Liu commented on SPARK-8646: --- To be clear, PySpark does NOT depend on pandas. In dataframe.py, it works with pandas DataFrames only when you have pandas installed. [~juliet] example/pi.py should run fine on YARN (it does not need pandas at all). Is it possible that `outofstock/data_transform.py` depends on `pandas.algos` (pandas.algos is used by a closure from the driver), and that you uploaded the wrong log file? PySpark does not run on YARN Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log Running pyspark jobs results in a "no module named pyspark" error when run in yarn-client mode in Spark 1.4. [I believe this JIRA represents the change that introduced this error.|https://issues.apache.org/jira/browse/SPARK-6869] This does not represent a binary compatible change to Spark. Scripts that worked on previous Spark versions (i.e. commands that use spark-submit) should continue to work without modification between minor versions.
[jira] [Updated] (SPARK-8745) Remove GenerateProjection
[ https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8745: -- Summary: Remove GenerateProjection (was: Remove GenerateMutableProjection) Remove GenerateProjection - Key: SPARK-8745 URL: https://issues.apache.org/jira/browse/SPARK-8745 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Based on discussion offline with [~marmbrus], we should remove GenerateMutableProjection.
[jira] [Updated] (SPARK-8745) Remove GenerateProjection
[ https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8745: -- Description: Based on discussion offline with [~marmbrus], we should remove GenerateProjection. (was: Based on discussion offline with [~marmbrus], we should remove GenerateMutableProjection.) Remove GenerateProjection - Key: SPARK-8745 URL: https://issues.apache.org/jira/browse/SPARK-8745 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Based on discussion offline with [~marmbrus], we should remove GenerateProjection.
[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling
[ https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614080#comment-14614080 ] Davies Liu commented on SPARK-8636: --- [~smolav] I'm just curious: how can we sort or group by a row with NULL in it if we cannot compare NULL with NULL? CaseKeyWhen has incorrect NULL handling --- Key: SPARK-8636 URL: https://issues.apache.org/jira/browse/SPARK-8636 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Santiago M. Mola Labels: starter CaseKeyWhen implementation in Spark uses the following equals implementation:
{code}
private def equalNullSafe(l: Any, r: Any) = {
  if (l == null && r == null) {
    true
  } else if (l == null || r == null) {
    false
  } else {
    l == r
  }
}
{code}
Which is not correct, since in SQL, NULL is never equal to NULL (actually, it is not unequal either). In this case, a NULL value in a CASE WHEN expression should never match. For example, you can execute this in MySQL:
{code}
SELECT CASE NULL WHEN NULL THEN "NULL MATCHES" ELSE "NULL DOES NOT MATCH" END FROM DUAL;
{code}
And the result will be "NULL DOES NOT MATCH".
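For reference, a tiny Python sketch of the three-valued semantics the report asks for (illustrative, not Spark code):
{code}
# In SQL's three-valued logic, NULL = NULL is not TRUE, so a NULL (None) CASE
# key should match no WHEN branch, not even a NULL one.
def case_key_matches(key, when_value):
    if key is None or when_value is None:
        return False
    return key == when_value

assert case_key_matches(None, None) is False  # unlike equalNullSafe above
assert case_key_matches(1, 1) is True
{code}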
[jira] [Resolved] (SPARK-7401) Dot product and squared_distances should be vectorized in Vectors
[ https://issues.apache.org/jira/browse/SPARK-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-7401. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 5946 [https://github.com/apache/spark/pull/5946] Dot product and squared_distances should be vectorized in Vectors - Key: SPARK-7401 URL: https://issues.apache.org/jira/browse/SPARK-7401 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar Fix For: 1.5.0
[jira] [Resolved] (SPARK-8226) math function: shiftrightunsigned
[ https://issues.apache.org/jira/browse/SPARK-8226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8226. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7035 [https://github.com/apache/spark/pull/7035] math function: shiftrightunsigned - Key: SPARK-8226 URL: https://issues.apache.org/jira/browse/SPARK-8226 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: zhichao-li Fix For: 1.5.0 shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a) Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, smallint and int a. Returns bigint for bigint a.
[jira] [Created] (SPARK-8784) Add python API for hex/unhex
Davies Liu created SPARK-8784: - Summary: Add python API for hex/unhex Key: SPARK-8784 URL: https://issues.apache.org/jira/browse/SPARK-8784 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Davies Liu Assignee: Davies Liu
[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching
[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611602#comment-14611602 ] Davies Liu commented on SPARK-8632: --- [~justin.uang] Sounds interesting, could you send out the PR? Poor Python UDF performance because of RDD caching -- Key: SPARK-8632 URL: https://issues.apache.org/jira/browse/SPARK-8632 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0 Reporter: Justin Uang {quote} We have been running into performance problems using Python UDFs with DataFrames at large scale. From the implementation of BatchPythonEvaluation, it looks like the goal was to reuse the PythonRDD code. It caches the entire child RDD so that it can do two passes over the data: one to give to the PythonRDD, then one to join the Python lambda results with the original row (which may have Java objects that should be passed through). In addition, it caches all the columns, even the ones that don't need to be processed by the Python UDF. In the cases I was working with, I had a 500-column table, and I wanted to use a Python UDF for one column, and it ended up caching all 500 columns. {quote} http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html
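A hedged mitigation sketch for the quoted behavior (all names illustrative; assumes a wide DataFrame df with a unique id column): apply the UDF to a narrow projection so the cached intermediate stays small, then join back on the key if the other columns are needed.
{code}
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

plus_one = udf(lambda x: x + 1, IntegerType())

# Project down to the columns the UDF needs before applying it, so whatever
# BatchPythonEvaluation caches holds 2 columns instead of all 500:
narrow = df.select('id', plus_one(df.x).alias('x1'))
{code}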
[jira] [Created] (SPARK-8786) Create a wrapper for BinaryType
Davies Liu created SPARK-8786: - Summary: Create a wrapper for BinaryType Key: SPARK-8786 URL: https://issues.apache.org/jira/browse/SPARK-8786 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu The hashCode and equals() of Array[Byte] do not check the bytes; we should create a wrapper to do that.
[jira] [Updated] (SPARK-8786) Create a wrapper for BinaryType
[ https://issues.apache.org/jira/browse/SPARK-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8786: -- Description: The hashCode and equals() of Array[Byte] do not check the bytes; we should create a wrapper (internally) to do that. (was: The hashCode and equals() of Array[Byte] do not check the bytes; we should create a wrapper to do that.) Create a wrapper for BinaryType --- Key: SPARK-8786 URL: https://issues.apache.org/jira/browse/SPARK-8786 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu The hashCode and equals() of Array[Byte] do not check the bytes; we should create a wrapper (internally) to do that.
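A hedged Python analogy for the wrapper idea (Java's Array[Byte] has reference-based equals/hashCode, much as a Python bytearray is unhashable and cannot key a hash aggregation; the wrapper supplies value semantics):
{code}
class BinaryWrapper(object):
    def __init__(self, data):
        self.data = data  # a bytearray or bytes

    def __eq__(self, other):
        return (isinstance(other, BinaryWrapper)
                and bytes(self.data) == bytes(other.data))

    def __hash__(self):
        return hash(bytes(self.data))

assert BinaryWrapper(bytearray(b"ab")) == BinaryWrapper(bytearray(b"ab"))
{code}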
[jira] [Resolved] (SPARK-8223) math function: shiftleft
[ https://issues.apache.org/jira/browse/SPARK-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8223. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7178 [https://github.com/apache/spark/pull/7178] math function: shiftleft Key: SPARK-8223 URL: https://issues.apache.org/jira/browse/SPARK-8223 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: zhichao-li Fix For: 1.5.0 shiftleft(INT a) shiftleft(BIGINT a) Bitwise left shift (as of Hive 1.2.0). Returns int for tinyint, smallint and int a. Returns bigint for bigint a.
[jira] [Resolved] (SPARK-8224) math function: shiftright
[ https://issues.apache.org/jira/browse/SPARK-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8224. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7178 [https://github.com/apache/spark/pull/7178] math function: shiftright - Key: SPARK-8224 URL: https://issues.apache.org/jira/browse/SPARK-8224 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: zhichao-li Fix For: 1.5.0 shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a) Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, smallint and int a. Returns bigint for bigint a.
[jira] [Resolved] (SPARK-8747) fix EqualNullSafe for binary type
[ https://issues.apache.org/jira/browse/SPARK-8747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8747. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7143 [https://github.com/apache/spark/pull/7143] fix EqualNullSafe for binary type - Key: SPARK-8747 URL: https://issues.apache.org/jira/browse/SPARK-8747 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Priority: Minor Fix For: 1.5.0
[jira] [Assigned] (SPARK-7190) UTF8String backed by binary data
[ https://issues.apache.org/jira/browse/SPARK-7190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-7190: - Assignee: Davies Liu UTF8String backed by binary data Key: SPARK-7190 URL: https://issues.apache.org/jira/browse/SPARK-7190 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Reynold Xin Assignee: Davies Liu Just a pointer to some memory address, so we don't need to copy the data into a byte array.
[jira] [Commented] (SPARK-8745) Remove GenerateMutableProjection
[ https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612471#comment-14612471 ] Davies Liu commented on SPARK-8745: --- I can take this one, if you have not started. Remove GenerateMutableProjection Key: SPARK-8745 URL: https://issues.apache.org/jira/browse/SPARK-8745 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Based on discussion offline with [~marmbrus], we should remove GenerateMutableProjection.
[jira] [Assigned] (SPARK-8745) Remove GenerateMutableProjection
[ https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-8745: - Assignee: Davies Liu Remove GenerateMutableProjection Key: SPARK-8745 URL: https://issues.apache.org/jira/browse/SPARK-8745 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Based on discussion offline with [~marmbrus], we should remove GenerateMutableProjection.
[jira] [Created] (SPARK-8804) order of UTF8String is wrong if there is any non-ascii character in it
Davies Liu created SPARK-8804: - Summary: order of UTF8String is wrong if there is any non-ascii character in it Key: SPARK-8804 URL: https://issues.apache.org/jira/browse/SPARK-8804 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker We compare UTF8Strings byte by byte, but byte in the JVM is signed; it should be compared as unsigned.
[jira] [Created] (SPARK-8766) DataFrame Python API should work with column which has non-ascii character in it
Davies Liu created SPARK-8766: - Summary: DataFrame Python API should work with column which has non-ascii character in it Key: SPARK-8766 URL: https://issues.apache.org/jira/browse/SPARK-8766 Project: Spark Issue Type: Bug Affects Versions: 1.4.0, 1.3.1 Reporter: Davies Liu Assignee: Davies Liu
[jira] [Resolved] (SPARK-8763) executing run-tests.py with Python 2.6 fails with absence of subprocess.check_output function
[ https://issues.apache.org/jira/browse/SPARK-8763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8763. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7161 [https://github.com/apache/spark/pull/7161] executing run-tests.py with Python 2.6 fails with absence of subprocess.check_output function - Key: SPARK-8763 URL: https://issues.apache.org/jira/browse/SPARK-8763 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.5.0 Environment: Mac OS X 10.10.3 Python 2.6.9 Java 1.8.0 Reporter: Tomohiko K. Labels: pyspark, testing Fix For: 1.5.0 Running run-tests.py with Python 2.6 causes the following error:
{noformat}
Running PySpark tests. Output is in python//Users/tomohiko/.jenkins/jobs/pyspark_test/workspace/python/unit-tests.log
Will test against the following Python executables: ['python2.6', 'python3.4', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Traceback (most recent call last):
  File "./python/run-tests.py", line 196, in <module>
    main()
  File "./python/run-tests.py", line 159, in main
    python_implementation = subprocess.check_output(
AttributeError: 'module' object has no attribute 'check_output'
...
{noformat}
The cause of this error is the use of the subprocess.check_output function, which only exists since Python 2.7. (ref. https://docs.python.org/2.7/library/subprocess.html#subprocess.check_output)
[jira] [Resolved] (SPARK-8227) math function: unhex
[ https://issues.apache.org/jira/browse/SPARK-8227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8227. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7113 [https://github.com/apache/spark/pull/7113] math function: unhex Key: SPARK-8227 URL: https://issues.apache.org/jira/browse/SPARK-8227 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: zhichao-li Fix For: 1.5.0 unhex(STRING a): BINARY Inverse of hex. Interprets each pair of characters as a hexadecimal number and converts to the byte representation of the number.
[jira] [Resolved] (SPARK-8766) DataFrame Python API should work with column which has non-ascii character in it
[ https://issues.apache.org/jira/browse/SPARK-8766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8766. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7165 [https://github.com/apache/spark/pull/7165] DataFrame Python API should work with column which has non-ascii character in it Key: SPARK-8766 URL: https://issues.apache.org/jira/browse/SPARK-8766 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 1.3.1, 1.4.0 Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.5.0
[jira] [Resolved] (SPARK-8727) Add missing python api
[ https://issues.apache.org/jira/browse/SPARK-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8727. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7114 [https://github.com/apache/spark/pull/7114] Add missing python api -- Key: SPARK-8727 URL: https://issues.apache.org/jira/browse/SPARK-8727 Project: Spark Issue Type: Improvement Components: SQL Reporter: Tarek Auel Fix For: 1.5.0 Add the Python API that is missing for https://issues.apache.org/jira/browse/SPARK-8248 https://issues.apache.org/jira/browse/SPARK-8234 https://issues.apache.org/jira/browse/SPARK-8217 https://issues.apache.org/jira/browse/SPARK-8215 https://issues.apache.org/jira/browse/SPARK-8212
[jira] [Resolved] (SPARK-8535) PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name
[ https://issues.apache.org/jira/browse/SPARK-8535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8535. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7124 [https://github.com/apache/spark/pull/7124] PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name --- Key: SPARK-8535 URL: https://issues.apache.org/jira/browse/SPARK-8535 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Reporter: Christophe Bourguignat Fix For: 1.5.0 Trying to create a Spark DataFrame from a pandas DataFrame with no explicit column names:

pandasDF = pd.DataFrame([[1, 2], [5, 6]])
sparkDF = sqlContext.createDataFrame(pandasDF)

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
----> 1 sparkDF = sqlContext.createDataFrame(pandasDF)

/usr/local/Cellar/apache-spark/1.4.0/libexec/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio)
    344 
    345         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
--> 346         df = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
    347         return DataFrame(df, self)
    348 

/usr/local/Cellar/apache-spark/1.4.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538             self.target_id, self.name)
    539 
    540         for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/1.4.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298             raise Py4JJavaError(
    299                 'An error occurred while calling {0}{1}{2}.\n'.
--> 300                 format(target_id, '.', name), value)
    301         else:
    302             raise Py4JError(

Py4JJavaError: An error occurred while calling o87.applySchemaToPythonRDD.
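A hedged workaround sketch, assuming the sqlContext from the report: give the pandas frame string column names (or pass an explicit schema) before converting, since the default integer column names are what the 1.4.0 conversion path trips over.
{code}
import pandas as pd

pandasDF = pd.DataFrame([[1, 2], [5, 6]])
pandasDF.columns = ['a', 'b']   # explicit string names avoid the error
sparkDF = sqlContext.createDataFrame(pandasDF)
{code}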
[jira] [Commented] (SPARK-8653) Add constraint for Children expression for data type
[ https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609579#comment-14609579 ] Davies Liu commented on SPARK-8653: --- [~rxin] With the new `ExpectsInputTypes`, we still need a way to tell how to do the conversion; it's ugly to do the type switch in eval() or codegen(). Maybe we could improve `AutoCastInputType` to have a method `acceptedTypes`, which returns a list of lists of data types, specifying which types can be cast into the expected types. By default, it would accept all types that can be cast to the expected types. Add constraint for Children expression for data type Key: SPARK-8653 URL: https://issues.apache.org/jira/browse/SPARK-8653 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Currently, we have traits in Expression like `ExpectsInputTypes` and also `checkInputDataTypes`, but they cannot convert the children expressions automatically unless we write new rules in `HiveTypeCoercion`.
[jira] [Resolved] (SPARK-8723) improve code gen for divide and remainder
[ https://issues.apache.org/jira/browse/SPARK-8723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8723. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7111 [https://github.com/apache/spark/pull/7111] improve code gen for divide and remainder - Key: SPARK-8723 URL: https://issues.apache.org/jira/browse/SPARK-8723 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Priority: Minor Fix For: 1.5.0
[jira] [Resolved] (SPARK-8680) PropagateTypes is very slow when there are lots of columns
[ https://issues.apache.org/jira/browse/SPARK-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8680. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7087 [https://github.com/apache/spark/pull/7087] PropagateTypes is very slow when there are lots of columns -- Key: SPARK-8680 URL: https://issues.apache.org/jira/browse/SPARK-8680 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.4.0 Reporter: Davies Liu Fix For: 1.5.0 The time for PropagateTypes is O(N*N), where N is the number of columns, which is very slow if there are many columns (> 1000). The easiest optimization would be to move `q.inputSet` outside of transformExpressions, which gives about a 4x improvement for N=3000.
[jira] [Updated] (SPARK-8680) PropagateTypes is very slow when there are lots of columns
[ https://issues.apache.org/jira/browse/SPARK-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8680: -- Assignee: Liang-Chi Hsieh PropagateTypes is very slow when there are lots of columns -- Key: SPARK-8680 URL: https://issues.apache.org/jira/browse/SPARK-8680 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.4.0 Reporter: Davies Liu Assignee: Liang-Chi Hsieh Fix For: 1.5.0 The time for PropagateTypes is O(N*N), where N is the number of columns, which is very slow if there are many columns (> 1000). The easiest optimization would be to move `q.inputSet` outside of transformExpressions, which gives about a 4x improvement for N=3000.
[jira] [Resolved] (SPARK-8590) add code gen for ExtractValue
[ https://issues.apache.org/jira/browse/SPARK-8590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8590. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6982 [https://github.com/apache/spark/pull/6982] add code gen for ExtractValue - Key: SPARK-8590 URL: https://issues.apache.org/jira/browse/SPARK-8590 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Fix For: 1.5.0
[jira] [Resolved] (SPARK-8236) misc function: crc32
[ https://issues.apache.org/jira/browse/SPARK-8236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8236. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7108 [https://github.com/apache/spark/pull/7108] misc function: crc32 Key: SPARK-8236 URL: https://issues.apache.org/jira/browse/SPARK-8236 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Fix For: 1.5.0 crc32(string/binary): bigint Computes a cyclic redundancy check value for string or binary argument and returns bigint value (as of Hive 1.3.0). Example: crc32('ABC') = 2743272264. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
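For illustration, a minimal PySpark sketch of the crc32 function described above, assuming the shell-provided SparkContext {{sc}} and a Spark version where {{pyspark.sql.functions.crc32}} is available (1.5+); the DataFrame is hypothetical:
{code}
from pyspark.sql import SQLContext
from pyspark.sql.functions import crc32

sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([('ABC',)], ['s'])
# Per the description above, crc32('ABC') should yield 2743272264
df.select(crc32(df.s).alias('checksum')).show()
{code}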
[jira] [Updated] (SPARK-8450) PySpark write.parquet raises Unsupported datatype DecimalType()
[ https://issues.apache.org/jira/browse/SPARK-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8450: -- Description: I'm getting an Exception when I try to save a DataFrame with a DecimalType as a parquet file Minimal Example: {code} from decimal import Decimal from pyspark.sql import SQLContext from pyspark.sql.types import * sqlContext = SQLContext(sc) schema = StructType([ StructField('id', LongType()), StructField('value', DecimalType())]) rdd = sc.parallelize([[1, Decimal(0.5)],[2, Decimal(2.9)]]) df = sqlContext.createDataFrame(rdd, schema) df.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite') {code} Stack Trace {code} --- Py4JJavaError Traceback (most recent call last) <ipython-input-19-a77dac8de5f3> in <module>() 1 sr.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite') /home/spark/spark-1.4.0-bin-hadoop2.6/python/pyspark/sql/readwriter.pyc in parquet(self, path, mode) 367 :param mode: one of `append`, `overwrite`, `error`, `ignore` (default: error) 368 --> 369 return self._jwrite.mode(mode).parquet(path) 370 371 @since(1.4) /home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args) 536 answer = self.gateway_client.send_command(command) 537 return_value = get_return_value(answer, self.gateway_client, --> 538 self.target_id, self.name) 539 540 for temp_arg in temp_args: /home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. --> 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError( Py4JJavaError: An error occurred while calling o361.parquet. : org.apache.spark.SparkException: Job aborted. 
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:138) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.run(commands.scala:114) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:939) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:939) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:332) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135) at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:281) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 158 in stage 35.0 failed 4 times, most recent failure: Lost task 158.3 in stage 35.0 (TID 2736, 10.2.160.14): java.lang.RuntimeException: Unsupported datatype DecimalType() at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:374) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:318) at scala.Option.getOrElse(Option.scala:120) at
[jira] [Resolved] (SPARK-8713) Support codegen for not thread-safe expressions
[ https://issues.apache.org/jira/browse/SPARK-8713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8713. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7101 [https://github.com/apache/spark/pull/7101] Support codegen for not thread-safe expressions --- Key: SPARK-8713 URL: https://issues.apache.org/jira/browse/SPARK-8713 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.5.0 Currently, we disable codegen if any expression is not thread safe. We should support such expressions, but disable caching of the compiled expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8738) Generate better error message in Python for AnalysisException
Davies Liu created SPARK-8738: - Summary: Generate better error message in Python for AnalysisException Key: SPARK-8738 URL: https://issues.apache.org/jira/browse/SPARK-8738 Project: Spark Issue Type: Bug Reporter: Davies Liu Assignee: Davies Liu The long Java stack trace is hard to read. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8738) Generate better error message in Python for AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8738. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7135 [https://github.com/apache/spark/pull/7135] Generate better error message in Python for AnalysisException -- Key: SPARK-8738 URL: https://issues.apache.org/jira/browse/SPARK-8738 Project: Spark Issue Type: Bug Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.5.0 The long Java stack trace is hard to read. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
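A hedged sketch of the general approach (not the exact patch): unwrap the Py4J error and re-raise a short Python exception for AnalysisException, dropping the long Java stack trace. The helper name is hypothetical:
{code}
from py4j.protocol import Py4JJavaError

def call_with_clean_errors(f, *args):
    try:
        return f(*args)
    except Py4JJavaError as e:
        s = e.java_exception.toString()
        if s.startswith('org.apache.spark.sql.AnalysisException: '):
            # Re-raise only the short message, not the Java stack trace
            raise Exception(s.split(': ', 1)[1])
        raise
{code}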
[jira] [Updated] (SPARK-6360) For Spark 1.1 and 1.2, after any RDD transformations, calling saveAsParquetFile over a SchemaRDD with decimal or UDT column throws
[ https://issues.apache.org/jira/browse/SPARK-6360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-6360: -- Description: Spark shell session for reproduction (use {{:paste}}): {noformat} import org.apache.spark.sql.SQLContext import org.apache.spark.sql.catalyst.types.decimal._ import org.apache.spark.sql.catalyst.types._ import org.apache.hadoop.fs._ val sqlContext = new SQLContext(sc) val fs = FileSystem.get(sc.hadoopConfiguration) fs.delete(new Path("a.parquet")) fs.delete(new Path("b.parquet")) import sc._ import sqlContext._ val r1 = parallelize(1 to 10).map(i => Tuple1(Decimal(i, 10, 0))).select('_1 cast DecimalType(10, 0)) // OK r1.saveAsParquetFile("a.parquet") val r2 = parallelize(1 to 10).map(i => Tuple1(Decimal(i, 10, 0))).select('_1 cast DecimalType(10, 0)) val r3 = r2.coalesce(1) // Error r3.saveAsParquetFile("b.parquet") {noformat} Exception thrown: {noformat} java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to org.apache.spark.sql.catalyst.types.decimal.Decimal at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359) at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:328) at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:314) at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:308) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 15/03/17 00:04:13 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to org.apache.spark.sql.catalyst.types.decimal.Decimal at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359) at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:328) at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:314) at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:308) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} The query plan of {{r1}} is: {noformat} == Parsed Logical Plan == 'Project [CAST('_1, DecimalType(10,0)) AS c0#60] LogicalRDD [_1#59], MapPartitionsRDD[71] at mapPartitions at ExistingRDD.scala:36 == Analyzed Logical Plan == Project [CAST(_1#59, DecimalType(10,0)) AS c0#60] LogicalRDD [_1#59], MapPartitionsRDD[71] at mapPartitions at ExistingRDD.scala:36 == Optimized Logical Plan == Project [CAST(_1#59, DecimalType(10,0)) AS c0#60] LogicalRDD [_1#59], MapPartitionsRDD[71] at mapPartitions at ExistingRDD.scala:36 == Physical Plan == Project [CAST(_1#59, DecimalType(10,0)) AS c0#60] PhysicalRDD [_1#59], MapPartitionsRDD[71] at mapPartitions at ExistingRDD.scala:36 Code Generation: false == RDD == {noformat} while {{r3}}'s query plan is: {noformat} ==
[jira] [Resolved] (SPARK-8741) Remove e and pi from DataFrame functions
[ https://issues.apache.org/jira/browse/SPARK-8741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8741. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7137 [https://github.com/apache/spark/pull/7137] Remove e and pi from DataFrame functions Key: SPARK-8741 URL: https://issues.apache.org/jira/browse/SPARK-8741 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 It is not really useful to have DataFrame functions that return numeric constants which are already available in all programming languages. We should keep the expressions for SQL, but nothing else. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7902) SQL UDF doesn't support UDT in PySpark
[ https://issues.apache.org/jira/browse/SPARK-7902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-7902: - Assignee: Davies Liu SQL UDF doesn't support UDT in PySpark -- Key: SPARK-7902 URL: https://issues.apache.org/jira/browse/SPARK-7902 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Davies Liu Priority: Critical We don't convert Python SQL internal types to Python types in SQL UDF execution. This causes problems if the input arguments contain UDTs or the return type is a UDT. Right now, the raw SQL types are passed into the Python UDF and the return value is not converted to Python SQL types. This is the code (from [~rams]) to produce this bug. (Actually, it triggers another bug first right now.) {code} from pyspark.mllib.linalg import SparseVector from pyspark.sql.functions import udf from pyspark.sql.types import IntegerType df = sqlContext.createDataFrame([(SparseVector(2, {0: 0.0}),)], ["features"]) sz = udf(lambda s: s.size, IntegerType()) df.select(sz(df.features).alias("sz")).collect() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8235) misc function: sha1 / sha
[ https://issues.apache.org/jira/browse/SPARK-8235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8235. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6963 [https://github.com/apache/spark/pull/6963] misc function: sha1 / sha - Key: SPARK-8235 URL: https://issues.apache.org/jira/browse/SPARK-8235 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Fix For: 1.5.0 sha1(string/binary): string sha(string/binary): string Calculates the SHA-1 digest for string or binary and returns the value as a hex string (as of Hive 1.3.0). Example: sha1('ABC') = '3c01bdbb26f358bab27f267924aa2c9a03fcfdb8'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
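The same function from PySpark, for illustration (assumes a shell-provided {{sqlContext}} and a Spark version where {{pyspark.sql.functions.sha1}} exists, 1.5+):
{code}
from pyspark.sql.functions import sha1

df = sqlContext.createDataFrame([('ABC',)], ['s'])
# Expected per the description: 3c01bdbb26f358bab27f267924aa2c9a03fcfdb8
df.select(sha1(df.s).alias('digest')).show(truncate=False)
{code}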
[jira] [Created] (SPARK-8713) Support codegen for not thread-safe expressions
Davies Liu created SPARK-8713: - Summary: Support codegen for not thread-safe expressions Key: SPARK-8713 URL: https://issues.apache.org/jira/browse/SPARK-8713 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Currently, we disable codegen if any expression is not thread safe. We should support such expressions, but disable caching of the compiled expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7810) rdd.py _load_from_socket cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-7810. --- Resolution: Fixed Fix Version/s: 1.6.0 1.3.2 1.4.1 Issue resolved by pull request 6338 [https://github.com/apache/spark/pull/6338] rdd.py _load_from_socket cannot load data from jvm socket if ipv6 is used --- Key: SPARK-7810 URL: https://issues.apache.org/jira/browse/SPARK-7810 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.1 Reporter: Ai He Fix For: 1.4.1, 1.3.2, 1.6.0 Method _load_from_socket in rdd.py cannot load data from the JVM socket if IPv6 is used. The current method only works with IPv4. The new modification should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
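A minimal sketch of the protocol-agnostic pattern such a fix typically uses (an assumption about the approach, not a quote of the merged patch): iterate over {{socket.getaddrinfo}} results instead of hard-coding AF_INET:
{code}
import socket

def connect_local(port):
    # Try every address family getaddrinfo returns (IPv4 and IPv6)
    err = None
    for af, socktype, proto, _canonname, sa in socket.getaddrinfo(
            "localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        try:
            sock = socket.socket(af, socktype, proto)
            sock.connect(sa)
            return sock
        except socket.error as e:
            err = e
    raise err if err is not None else socket.error("could not open socket")
{code}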
[jira] [Resolved] (SPARK-8579) Support arbitrary object in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8579. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6959 [https://github.com/apache/spark/pull/6959] Support arbitrary object in UnsafeRow - Key: SPARK-8579 URL: https://issues.apache.org/jira/browse/SPARK-8579 Project: Spark Issue Type: New Feature Components: SQL Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.5.0 It's common to run count(distinct xxx) in SQL; the data type will be a UDT of OpenHashSet, and it would be good to use UnsafeRow to reduce the memory usage during aggregation. The same applies to DecimalType, which could be used inside the grouping key. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7810) rdd.py _load_from_socket cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-7810: -- Fix Version/s: (was: 1.4.1) (was: 1.6.0) 1.4.2 1.5.0 rdd.py _load_from_socket cannot load data from jvm socket if ipv6 is used --- Key: SPARK-7810 URL: https://issues.apache.org/jira/browse/SPARK-7810 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.1 Reporter: Ai He Fix For: 1.3.2, 1.5.0, 1.4.2 Method _load_from_socket in rdd.py cannot load data from the JVM socket if IPv6 is used. The current method only works with IPv4. The new modification should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5161) Parallelize Python test execution
[ https://issues.apache.org/jira/browse/SPARK-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-5161. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7031 [https://github.com/apache/spark/pull/7031] Parallelize Python test execution - Key: SPARK-5161 URL: https://issues.apache.org/jira/browse/SPARK-5161 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 1.2.0 Reporter: Nicholas Chammas Assignee: Josh Rosen Fix For: 1.5.0 [Original discussion here.|https://github.com/apache/spark/pull/3564#issuecomment-67785676] As of 1.2.0, Python tests take around 10-12 minutes to run. Once [SPARK-3431] is complete, this will become a significant fraction of the total test time. There are 2 separate approaches to explore for parallelizing the execution of Python unit tests: * Use GNU parallel to run each Python test file in parallel. * Use [{{nose}}|http://nose.readthedocs.org/en/latest/doc_tests/test_multiprocess/multiprocess.html] to parallelize all Python tests in a more extensible/configurable way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
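A rough sketch of the parallelization idea discussed above (file names are illustrative and the real runner has more logic): run test files concurrently from a small worker pool and report each exit status:
{code}
import subprocess
from multiprocessing.pool import ThreadPool

TEST_FILES = ["pyspark/tests.py", "pyspark/sql/tests.py"]  # illustrative

def run_test(path):
    return path, subprocess.call(["python", path])

pool = ThreadPool(2)
for path, status in pool.imap_unordered(run_test, TEST_FILES):
    print("%s: %s" % (path, "OK" if status == 0 else "FAILED"))
{code}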
[jira] [Resolved] (SPARK-8214) math function: hex
[ https://issues.apache.org/jira/browse/SPARK-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8214. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6976 [https://github.com/apache/spark/pull/6976] math function: hex -- Key: SPARK-8214 URL: https://issues.apache.org/jira/browse/SPARK-8214 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: zhichao-li Fix For: 1.5.0 hex(BIGINT a): string hex(STRING a): string hex(BINARY a): string If the argument is an INT or binary, hex returns the number as a STRING in hexadecimal format. Otherwise if the number is a STRING, it converts each character into its hexadecimal representation and returns the resulting STRING. (See http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_hex, BINARY version as of Hive 0.12.0.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
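The two behaviors from the description, sketched in PySpark (assumes a shell-provided {{sqlContext}} and Spark 1.5+, where {{pyspark.sql.functions.hex}} exists):
{code}
from pyspark.sql.functions import hex as sql_hex  # aliased to avoid shadowing the builtin

df = sqlContext.createDataFrame([(17, 'ABC')], ['i', 's'])
# 17 -> '11' (number in hexadecimal), 'ABC' -> '414243' (per-character hex)
df.select(sql_hex('i'), sql_hex('s')).show()
{code}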
[jira] [Resolved] (SPARK-8610) Separate Row and InternalRow (part 2)
[ https://issues.apache.org/jira/browse/SPARK-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8610. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7003 [https://github.com/apache/spark/pull/7003] Separate Row and InternalRow (part 2) - Key: SPARK-8610 URL: https://issues.apache.org/jira/browse/SPARK-8610 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.5.0 Currently, we use GenericRow for both Row and InternalRow, which is confusing because it could contain Scala types as well as Catalyst types. We should have different implementations for them, to avoid some potential bugs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling
[ https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604724#comment-14604724 ] Davies Liu commented on SPARK-8636: --- [~animeshbaranawal] What happens if there is a null in the grouping key? Does a row with null equal another row with null? CaseKeyWhen has incorrect NULL handling --- Key: SPARK-8636 URL: https://issues.apache.org/jira/browse/SPARK-8636 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Santiago M. Mola Labels: starter The CaseKeyWhen implementation in Spark uses the following equals implementation: {code} private def equalNullSafe(l: Any, r: Any) = { if (l == null && r == null) { true } else if (l == null || r == null) { false } else { l == r } } {code} This is not correct, since in SQL, NULL is never equal to NULL (actually, it is not unequal either). In this case, a NULL value in a CASE WHEN expression should never match. For example, you can execute this in MySQL: {code} SELECT CASE NULL WHEN NULL THEN 'NULL MATCHES' ELSE 'NULL DOES NOT MATCH' END FROM DUAL; {code} And the result will be 'NULL DOES NOT MATCH'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
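The corrected matching semantics, sketched as plain Python (function name is illustrative): a NULL on either side must never match:
{code}
def case_key_matches(key, when_value):
    # SQL semantics: NULL is never equal to anything, including NULL,
    # so a NULL key or NULL branch value never matches.
    if key is None or when_value is None:
        return False
    return key == when_value

assert case_key_matches(None, None) is False  # NULL does not match NULL
assert case_key_matches(1, 1) is True
{code}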
[jira] [Resolved] (SPARK-8686) DataFrame should support `where` with expression represented by String
[ https://issues.apache.org/jira/browse/SPARK-8686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8686. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7063 [https://github.com/apache/spark/pull/7063] DataFrame should support `where` with expression represented by String -- Key: SPARK-8686 URL: https://issues.apache.org/jira/browse/SPARK-8686 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Kousuke Saruta Priority: Minor Fix For: 1.5.0 DataFrame supports the `filter` function with two types of argument, `Column` and `String`, but `where` doesn't. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
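What the change enables, sketched in PySpark (assumes a shell-provided {{sqlContext}}; the DataFrame is illustrative):
{code}
df = sqlContext.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
df.filter("age > 3").show()  # filter already accepts a SQL expression string
df.where("age > 3").show()   # with this change, where accepts the same string form
{code}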
[jira] [Updated] (SPARK-8677) Decimal divide operation throws ArithmeticException
[ https://issues.apache.org/jira/browse/SPARK-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8677: -- Assignee: Liang-Chi Hsieh Decimal divide operation throws ArithmeticException --- Key: SPARK-8677 URL: https://issues.apache.org/jira/browse/SPARK-8677 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Fix For: 1.5.0 Please refer to [BigDecimal doc|http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html]: {quote} ... the rounding mode setting of a MathContext object with a precision setting of 0 is not used and thus irrelevant. In the case of divide, the exact quotient could have an infinitely long decimal expansion; for example, 1 divided by 3. {quote} Because we provide a MathContext.UNLIMITED in toBigDecimal, Decimal divide operation will throw the following exception: {code} val decimal = Decimal(1.0, 10, 3) / Decimal(3.0, 10, 3) [info] java.lang.ArithmeticException: Non-terminating decimal expansion; no exact representable decimal result. [info] at java.math.BigDecimal.divide(BigDecimal.java:1690) [info] at java.math.BigDecimal.divide(BigDecimal.java:1723) [info] at scala.math.BigDecimal.$div(BigDecimal.scala:256) [info] at org.apache.spark.sql.types.Decimal.$div(Decimal.scala:272) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
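A rough Python analogue of the problem and the usual remedy (the precision below is illustrative, not Spark's exact choice): Java's MathContext.UNLIMITED raises for non-terminating quotients such as 1/3, while a bounded context rounds instead:
{code}
from decimal import Decimal, Context

bounded = Context(prec=38)  # illustrative bounded precision
# 1/3 has an infinite decimal expansion; under a bounded context it is
# rounded rather than raising, unlike BigDecimal with MathContext.UNLIMITED.
print(bounded.divide(Decimal(1), Decimal(3)))
{code}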
[jira] [Created] (SPARK-8680) PropagateTypes is very slow when there are lots of columns
Davies Liu created SPARK-8680: - Summary: PropagateTypes is very slow when there are lots of columns Key: SPARK-8680 URL: https://issues.apache.org/jira/browse/SPARK-8680 Project: Spark Issue Type: Bug Affects Versions: 1.4.0, 1.3.1 Reporter: Davies Liu The time for PropagateTypes is O(N*N), where N is the number of columns, which is very slow if there are many columns (>1000). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8680) PropagateTypes is very slow when there are lots of columns
[ https://issues.apache.org/jira/browse/SPARK-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8680: -- Description: The time for PropagateTypes is O(N*N), where N is the number of columns, which is very slow if there are many columns (>1000). The easiest optimization would be to put `q.inputSet` outside of transformExpressions, which gives about a 4x improvement for N=3000. was: The time for PropagateTypes is O(N*N), where N is the number of columns, which is very slow if there are many columns (>1000). PropagateTypes is very slow when there are lots of columns -- Key: SPARK-8680 URL: https://issues.apache.org/jira/browse/SPARK-8680 Project: Spark Issue Type: Bug Affects Versions: 1.3.1, 1.4.0 Reporter: Davies Liu The time for PropagateTypes is O(N*N), where N is the number of columns, which is very slow if there are many columns (>1000). The easiest optimization would be to put `q.inputSet` outside of transformExpressions, which gives about a 4x improvement for N=3000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8583) Refactor python/run-tests to integrate with dev/run-test's module system
[ https://issues.apache.org/jira/browse/SPARK-8583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8583. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6967 [https://github.com/apache/spark/pull/6967] Refactor python/run-tests to integrate with dev/run-test's module system Key: SPARK-8583 URL: https://issues.apache.org/jira/browse/SPARK-8583 Project: Spark Issue Type: Sub-task Components: Build, Project Infra, PySpark Reporter: Josh Rosen Assignee: Josh Rosen Fix For: 1.5.0 We should refactor the {{python/run-tests}} script to be written in Python and integrate with the recent {{dev/run-tests}} module system so that we can more granularly skip Python tests in the pull request builder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5482) Allow individual test suites in python/run-tests
[ https://issues.apache.org/jira/browse/SPARK-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-5482. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6967 [https://github.com/apache/spark/pull/6967] Allow individual test suites in python/run-tests Key: SPARK-5482 URL: https://issues.apache.org/jira/browse/SPARK-5482 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Katsunori Kanda Priority: Minor Fix For: 1.5.0 Add options to run individual test suites in python/run-tests. The usage is as follows: ./python/run-tests \[core|sql|mllib|ml|streaming\] When none is selected, all test suites are run for backward compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8620) cleanup CodeGenContext
[ https://issues.apache.org/jira/browse/SPARK-8620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8620. --- Resolution: Fixed Fix Version/s: 1.5.0 cleanup CodeGenContext -- Key: SPARK-8620 URL: https://issues.apache.org/jira/browse/SPARK-8620 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8652) PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests
[ https://issues.apache.org/jira/browse/SPARK-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8652. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7032 [https://github.com/apache/spark/pull/7032] PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests - Key: SPARK-8652 URL: https://issues.apache.org/jira/browse/SPARK-8652 Project: Spark Issue Type: Bug Components: PySpark, Tests Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Fix For: 1.5.0 Several PySpark files call {{doctest.testmod()}} in order to run doctests, but forget to check its return status. As a result, failures will not be automatically detected by our test runner script, creating the potential for bugs to slip through. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
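The missing check is small; a minimal sketch of the pattern the issue calls for, at the bottom of each doctest-bearing module:
{code}
import doctest
import sys

if __name__ == "__main__":
    (failure_count, test_count) = doctest.testmod()
    if failure_count:
        sys.exit(-1)  # propagate the failure so the test runner notices it
{code}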
[jira] [Commented] (SPARK-8670) Nested columns can't be referenced (but they can be selected)
[ https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603605#comment-14603605 ] Davies Liu commented on SPARK-8670: --- I think you should use `df.stats.age` or `df.selectExpr("stats.age")` Nested columns can't be referenced (but they can be selected) - Key: SPARK-8670 URL: https://issues.apache.org/jira/browse/SPARK-8670 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0 Reporter: Nicholas Chammas This is strange and looks like a regression from 1.3. {code} import json daterz = [ { 'name': 'Nick', 'stats': { 'age': 28 } }, { 'name': 'George', 'stats': { 'age': 31 } } ] df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x))) df.select('stats.age').show() df['stats.age'] # 1.4 fails on this line {code} On 1.3 this works and yields: {code} age 28 31 Out[1]: Column<stats.age AS age#2958L> {code} On 1.4, however, this gives an error on the last line: {code} +---+ |age| +---+ | 28| | 31| +---+ --- IndexError Traceback (most recent call last) <ipython-input-1-04bd990e94c6> in <module>() 19 20 df.select('stats.age').show() ---> 21 df['stats.age'] /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item) 678 if isinstance(item, basestring): 679 if item not in self.columns: --> 680 raise IndexError("no such column: %s" % item) 681 jc = self._jdf.apply(item) 682 return Column(jc) IndexError: no such column: stats.age {code} This means, among other things, that you can't join DataFrames on nested columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8620) cleanup CodeGenContext
[ https://issues.apache.org/jira/browse/SPARK-8620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8620: -- Assignee: Wenchen Fan cleanup CodeGenContext -- Key: SPARK-8620 URL: https://issues.apache.org/jira/browse/SPARK-8620 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8635) improve performance of CatalystTypeConverters
[ https://issues.apache.org/jira/browse/SPARK-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8635. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7018 [https://github.com/apache/spark/pull/7018] improve performance of CatalystTypeConverters - Key: SPARK-8635 URL: https://issues.apache.org/jira/browse/SPARK-8635 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8237) misc function: sha2
[ https://issues.apache.org/jira/browse/SPARK-8237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8237. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6934 [https://github.com/apache/spark/pull/6934] misc function: sha2 --- Key: SPARK-8237 URL: https://issues.apache.org/jira/browse/SPARK-8237 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Fix For: 1.5.0 sha2(string/binary, int): string Calculates the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512) (as of Hive 1.3.0). The first argument is the string or binary to be hashed. The second argument indicates the desired bit length of the result, which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to 256). SHA-224 is supported starting from Java 8. If either argument is NULL or the hash length is not one of the permitted values, the return value is NULL. Example: sha2('ABC', 256) = 'b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
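The same function from PySpark, for illustration (assumes a shell-provided {{sqlContext}} and Spark 1.5+, where {{pyspark.sql.functions.sha2}} exists):
{code}
from pyspark.sql.functions import sha2

df = sqlContext.createDataFrame([('ABC',)], ['s'])
# Expected per the description:
# b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78
df.select(sha2(df.s, 256).alias('digest')).show(truncate=False)
{code}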
[jira] [Resolved] (SPARK-8371) improve unit test for MaxOf and MinOf
[ https://issues.apache.org/jira/browse/SPARK-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8371. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6825 [https://github.com/apache/spark/pull/6825] improve unit test for MaxOf and MinOf - Key: SPARK-8371 URL: https://issues.apache.org/jira/browse/SPARK-8371 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8610) Separate Row and InternalRow (part 2)
Davies Liu created SPARK-8610: - Summary: Separate Row and InternalRow (part 2) Key: SPARK-8610 URL: https://issues.apache.org/jira/browse/SPARK-8610 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Currently, we use GenericRow for both Row and InternalRow, which is confusing because it could contain Scala types as well as Catalyst types. We should have different implementations for them, to avoid some potential bugs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8431) Add in operator to DataFrame Column in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8431. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6941 [https://github.com/apache/spark/pull/6941] Add in operator to DataFrame Column in SparkR - Key: SPARK-8431 URL: https://issues.apache.org/jira/browse/SPARK-8431 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Yu Ishikawa Fix For: 1.5.0 To filter values in a set, we should add the {{%in%}} operator to SparkR. {noformat} df$a %in% c(1, 2, 3) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8359) Spark SQL Decimal type precision loss on multiplication
[ https://issues.apache.org/jira/browse/SPARK-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8359. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6814 [https://github.com/apache/spark/pull/6814] Spark SQL Decimal type precision loss on multiplication --- Key: SPARK-8359 URL: https://issues.apache.org/jira/browse/SPARK-8359 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Rene Treffer Fix For: 1.5.0 It looks like the precision of decimal cannot be raised beyond ~2^112 without causing full value truncation. The following code computes the power of two up to a specific point: {code} import org.apache.spark.sql.types.Decimal val one = Decimal(1) val two = Decimal(2) def pow(n : Int) : Decimal = if (n <= 0) { one } else { val a = pow(n - 1) a.changePrecision(n,0) two.changePrecision(n,0) a * two } (109 to 120).foreach(n => println(pow(n).toJavaBigDecimal.unscaledValue.toString)) 649037107316853453566312041152512 1298074214633706907132624082305024 2596148429267413814265248164610048 5192296858534827628530496329220096 1038459371706965525706099265844019 2076918743413931051412198531688038 4153837486827862102824397063376076 8307674973655724205648794126752152 1661534994731144841129758825350430 3323069989462289682259517650700860 6646139978924579364519035301401720 1329227995784915872903807060280344 {code} Beyond ~2^112 the precision is truncated even if the precision was set to n and should thus handle 10^n without problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8190) ExpressionEvalHelper.checkEvaluation should also run the optimizer version
[ https://issues.apache.org/jira/browse/SPARK-8190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8190. --- Resolution: Fixed ExpressionEvalHelper.checkEvaluation should also run the optimizer version -- Key: SPARK-8190 URL: https://issues.apache.org/jira/browse/SPARK-8190 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu We should remove the existing ExpressionOptimizationSuite, and update checkEvaluation to also run the optimizer version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8579) Support arbitrary object in UnsafeRow
Davies Liu created SPARK-8579: - Summary: Support arbitrary object in UnsafeRow Key: SPARK-8579 URL: https://issues.apache.org/jira/browse/SPARK-8579 Project: Spark Issue Type: New Feature Components: SQL Reporter: Davies Liu Assignee: Davies Liu It's common to run count(distinct xxx) in SQL; the data type will be a UDT of OpenHashSet, and it would be good to use UnsafeRow to reduce the memory usage during aggregation. The same applies to DecimalType, which could be used inside the grouping key. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8187) date/time function: date_sub
[ https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8187: -- Shepherd: Davies Liu date/time function: date_sub Key: SPARK-8187 URL: https://issues.apache.org/jira/browse/SPARK-8187 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang date_sub(string startdate, int days): string date_sub(date startdate, int days): date Subtracts a number of days from startdate: date_sub('2008-12-31', 1) = '2008-12-30'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8186) date/time function: date_add
[ https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8186: -- Shepherd: Davies Liu date/time function: date_add Key: SPARK-8186 URL: https://issues.apache.org/jira/browse/SPARK-8186 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang date_add(string startdate, int days): string date_add(date startdate, int days): date Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7810) rdd.py _load_from_socket cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598610#comment-14598610 ] Davies Liu commented on SPARK-7810: --- What does the stack trace look like? Does the host only have IPv6? There are multiple places which do not take IPv6 into account; you can grep for `127.0.0.1` or `localhost` in the tree. Could you also fix them together? rdd.py _load_from_socket cannot load data from jvm socket if ipv6 is used --- Key: SPARK-7810 URL: https://issues.apache.org/jira/browse/SPARK-7810 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.1 Reporter: Ai He Method _load_from_socket in rdd.py cannot load data from the JVM socket if IPv6 is used. The current method only works with IPv4. The new modification should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8492) Support BinaryType in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-8492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8492. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6911 [https://github.com/apache/spark/pull/6911] Support BinaryType in UnsafeRow --- Key: SPARK-8492 URL: https://issues.apache.org/jira/browse/SPARK-8492 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8307) Improve timestamp from parquet
[ https://issues.apache.org/jira/browse/SPARK-8307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8307. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6759 [https://github.com/apache/spark/pull/6759] Improve timestamp from parquet -- Key: SPARK-8307 URL: https://issues.apache.org/jira/browse/SPARK-8307 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.5.0 Currently, converting a timestamp from Parquet or Hive is complicated and really slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8301) Improve UTF8String substring/startsWith/endsWith/contains performance
[ https://issues.apache.org/jira/browse/SPARK-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594926#comment-14594926 ] Davies Liu commented on SPARK-8301: --- [~rxin] Why can't I assign this JIRA to [~TarekAuel]? Improve UTF8String substring/startsWith/endsWith/contains performance - Key: SPARK-8301 URL: https://issues.apache.org/jira/browse/SPARK-8301 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Priority: Critical Many functions in UTF8String are unnecessarily expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8301) Improve UTF8String substring/startsWith/endsWith/contains performance
[ https://issues.apache.org/jira/browse/SPARK-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8301. --- Resolution: Fixed Fix Version/s: 1.5.0 Improve UTF8String substring/startsWith/endsWith/contains performance - Key: SPARK-8301 URL: https://issues.apache.org/jira/browse/SPARK-8301 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Priority: Critical Fix For: 1.5.0 Many functions in UTF8String are unnecessarily expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8422) Introduce a module abstraction inside of dev/run-tests
[ https://issues.apache.org/jira/browse/SPARK-8422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8422. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6866 [https://github.com/apache/spark/pull/6866] Introduce a module abstraction inside of dev/run-tests -- Key: SPARK-8422 URL: https://issues.apache.org/jira/browse/SPARK-8422 Project: Spark Issue Type: Sub-task Components: Build, Project Infra Reporter: Josh Rosen Assignee: Josh Rosen Fix For: 1.5.0 At a high level, we have Spark modules / components which 1. are affected / impacted by file changes (e.g. a module is associated with a set of source files, so changes to those files change the module), 2. contain a set of tests to run, which are triggered via Maven, SBT, or via Python / R scripts. 3. depend on other modules and have dependent modules: if we change core, then every downstream test should be run. If we change only MLlib, then we can skip the SQL tests but should probably run the Python MLlib tests, etc. Right now, the per-module logic is spread across a few different places inside of the {{dev/run-tests}} script: we have one function that describes how to detect changes for all modules, another function that (implicitly) deals with module dependencies, etc. Instead, I propose that we introduce a class for describing a module, use instances of this class to build up a dependency graph, then phrase the "find which tests to run" operation in terms of that graph. I think that this will be easier to understand / maintain. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8477) Add in operator to DataFrame Column in Python
[ https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593928#comment-14593928 ] Davies Liu commented on SPARK-8477: --- [~rxin] [~yuu.ishik...@gmail.com] We already have `inSet` to match the Scala API `in`; we could close this one. Add in operator to DataFrame Column in Python - Key: SPARK-8477 URL: https://issues.apache.org/jira/browse/SPARK-8477 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
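For reference, the existing {{inSet}} usage in PySpark, sketched with an illustrative DataFrame (assumes a shell-provided {{sqlContext}} and Spark 1.3+):
{code}
df = sqlContext.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
df[df.name.inSet("Bob", "Mike")].collect()  # [Row(name=u'Bob', age=5)]
{code}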
[jira] [Created] (SPARK-8492) Support BinaryType in UnsafeRow
Davies Liu created SPARK-8492: - Summary: Support BinaryType in UnsafeRow Key: SPARK-8492 URL: https://issues.apache.org/jira/browse/SPARK-8492 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8477) Add in operator to DataFrame Column in Python
[ https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8477: -- Fix Version/s: 1.3.0 Add in operator to DataFrame Column in Python - Key: SPARK-8477 URL: https://issues.apache.org/jira/browse/SPARK-8477 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Yu Ishikawa Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8477) Add in operator to DataFrame Column in Python
[ https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8477. --- Resolution: Implemented Target Version/s: 1.3.0 (was: 1.5.0) Add in operator to DataFrame Column in Python - Key: SPARK-8477 URL: https://issues.apache.org/jira/browse/SPARK-8477 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8339) Itertools islice requires an integer for the stop argument.
[ https://issues.apache.org/jira/browse/SPARK-8339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8339. --- Resolution: Fixed Fix Version/s: 1.4.1 1.5.0 Issue resolved by pull request 6794 [https://github.com/apache/spark/pull/6794] Itertools islice requires an integer for the stop argument. --- Key: SPARK-8339 URL: https://issues.apache.org/jira/browse/SPARK-8339 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: python 3 Reporter: Kevin Conor Priority: Minor Fix For: 1.5.0, 1.4.1 Original Estimate: 5m Remaining Estimate: 5m Itertools islice requires an integer for the stop argument. The bug is in serializers.py and can prevent an RDD from being written to disk. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
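A minimal reproduction of the Python 3 constraint and the fix (values are illustrative):
{code}
from itertools import islice

it = iter(range(10))
batch_size = 3.0  # e.g. a size computed with float division
# Python 3 raises ValueError for a non-integer stop; coercing with int() is the fix
chunk = list(islice(it, int(batch_size)))
print(chunk)  # [0, 1, 2]
{code}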
[jira] [Resolved] (SPARK-8444) Add Python example in streaming for queueStream usage
[ https://issues.apache.org/jira/browse/SPARK-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8444. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6884 [https://github.com/apache/spark/pull/6884] Add Python example in streaming for queueStream usage - Key: SPARK-8444 URL: https://issues.apache.org/jira/browse/SPARK-8444 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.4.0 Reporter: Bryan Cutler Priority: Minor Fix For: 1.5.0 I noticed there was no Python equivalent of the Scala queueStream example. This will have to be slightly different because changes in the Queue after the stream is created are not recognized. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8461) ClassNotFoundException when code generation is enabled
[ https://issues.apache.org/jira/browse/SPARK-8461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-8461: - Assignee: Davies Liu ClassNotFoundException when code generation is enabled -- Key: SPARK-8461 URL: https://issues.apache.org/jira/browse/SPARK-8461 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Davies Liu Priority: Blocker Build Spark without {{-Phive}} to make sure the isolated classloader for Hive support is irrelevant, then run the following Spark shell snippet: {code} sqlContext.range(0, 2).select(lit("a") as 'a).coalesce(1).write.mode("overwrite").json("file:///tmp/foo") {code} Exception thrown: {noformat} 15/06/18 15:36:30 ERROR codegen.GenerateMutableProjection: failed to compile: import org.apache.spark.sql.catalyst.InternalRow; public SpecificProjection generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { return new SpecificProjection(expr); } class SpecificProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { private org.apache.spark.sql.catalyst.expressions.Expression[] expressions = null; private org.apache.spark.sql.catalyst.expressions.MutableRow mutableRow = null; public SpecificProjection(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { expressions = expr; mutableRow = new org.apache.spark.sql.catalyst.expressions.GenericMutableRow(1); } public org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection target(org.apache.spark.sql.catalyst.expressions.MutableRow row) { mutableRow = row; return this; } /* Provide immutable access to the last projected row. */ public InternalRow currentValue() { return (InternalRow) mutableRow; } public Object apply(Object _i) { InternalRow i = (InternalRow) _i; /* expression: a */ Object obj2 = expressions[0].eval(i); boolean isNull0 = obj2 == null; org.apache.spark.unsafe.types.UTF8String primitive1 = null; if (!isNull0) { primitive1 = (org.apache.spark.unsafe.types.UTF8String) obj2; } if(isNull0) mutableRow.setNullAt(0); else mutableRow.update(0, primitive1); return mutableRow; } } org.codehaus.commons.compiler.CompileException: Line 28, Column 35: Object at org.codehaus.janino.UnitCompiler.findTypeByName(UnitCompiler.java:6897) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5331) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5207) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5188) at org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:185) at org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5119) at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880) at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5159) at org.codehaus.janino.UnitCompiler.access$16700(UnitCompiler.java:185) at org.codehaus.janino.UnitCompiler$31.getParameterTypes2(UnitCompiler.java:8533) at org.codehaus.janino.IClass$IInvocable.getParameterTypes(IClass.java:835) at org.codehaus.janino.IClass$IMethod.getDescriptor2(IClass.java:1063) at org.codehaus.janino.IClass$IInvocable.getDescriptor(IClass.java:849) at org.codehaus.janino.IClass.getIMethods(IClass.java:211) at org.codehaus.janino.IClass.getIMethods(IClass.java:199) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:409) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:658) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:662) at 
org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:185) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:350) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1035) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:354) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:769) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:532) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:393) at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:185) at
[jira] [Resolved] (SPARK-8207) math function: bin
[ https://issues.apache.org/jira/browse/SPARK-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8207. --- Resolution: Fixed math function: bin -- Key: SPARK-8207 URL: https://issues.apache.org/jira/browse/SPARK-8207 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Liang-Chi Hsieh bin(long a): string Returns the number in binary format (see http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_bin). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
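And from PySpark, for illustration (assumes a shell-provided {{sqlContext}} and Spark 1.5+, where {{pyspark.sql.functions.bin}} exists):
{code}
from pyspark.sql.functions import bin as sql_bin  # aliased to avoid shadowing the builtin

df = sqlContext.createDataFrame([(13,)], ['n'])
df.select(sql_bin(df.n).alias('binary')).show()  # 13 -> '1101'
{code}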
[jira] [Commented] (SPARK-8477) Add in operator to DataFrame Column in Python
[ https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593850#comment-14593850 ] Davies Liu commented on SPARK-8477: --- I think we can use the upper case `In`, or another word (such as `within`). Add in operator to DataFrame Column in Python - Key: SPARK-8477 URL: https://issues.apache.org/jira/browse/SPARK-8477 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org