[jira] [Created] (SPARK-4898) Replace cloudpickle with Dill
Josh Rosen created SPARK-4898: - Summary: Replace cloudpickle with Dill Key: SPARK-4898 URL: https://issues.apache.org/jira/browse/SPARK-4898 Project: Spark Issue Type: Bug Components: PySpark Reporter: Josh Rosen We should consider replacing our modified version of {{cloudpickle}} with [Dill|https://github.com/uqfoundation/dill], since it supports both Python 2 and 3 and might do a better job of handling certain corner-cases. I attempted to do this a few months ago but ran into cases where Dill had issues pickling objects defined in doctests, which broke our test suite: https://github.com/uqfoundation/dill/issues/50. This issue may have been resolved now; I haven't checked. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
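(Not part of the original issue; a minimal stdlib-only sketch of why PySpark needs a closure serializer at all. Variable names are mine.) The stdlib pickle serializes plain functions *by reference* (module plus qualified name), so it cannot ship a lambda or an interactively defined function to an executor; that is the gap cloudpickle fills today and Dill would fill after a switch:

```python
import pickle

# Stdlib pickle stores functions by reference, so a lambda (which has no
# importable qualified name) cannot be round-tripped. cloudpickle and Dill
# instead serialize the function body itself.
f = lambda x: x + 1
try:
    pickle.dumps(f)
    stdlib_can_pickle_lambda = True
except (pickle.PicklingError, AttributeError):
    stdlib_can_pickle_lambda = False
```

Any closure-capable serializer (cloudpickle or Dill) succeeds on the same input, which is why the two are interchangeable candidates here.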
[jira] [Updated] (SPARK-4897) Python 3 support
[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4897: -- Description: It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6. I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3 I was able to use the [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] tool to handle the basic conversion of things like {{print}} statements, etc. and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was {{cloudpickle}}: {code} [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin Type "help", "copyright", "credits" or "license" for more information. Traceback (most recent call last): File "/Users/joshrosen/Documents/Spark/python/pyspark/shell.py", line 28, in import pyspark File "/Users/joshrosen/Documents/spark/python/pyspark/__init__.py", line 41, in from pyspark.context import SparkContext File "/Users/joshrosen/Documents/spark/python/pyspark/context.py", line 26, in from pyspark import accumulators File "/Users/joshrosen/Documents/spark/python/pyspark/accumulators.py", line 97, in from pyspark.cloudpickle import CloudPickler File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 120, in class CloudPickler(pickle.Pickler): File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 122, in CloudPickler dispatch = pickle.Pickler.dispatch.copy() AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch' {code} This code looks like it will be hard to port to Python 3, so this might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] for Python 
serialization. was: It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6. I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3?expand=1 I was able to use the [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] tool to handle the basic conversion of things like {{print}} statements, etc. and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was {{cloudpickle}}: {code} [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin Type "help", "copyright", "credits" or "license" for more information. Traceback (most recent call last): File "/Users/joshrosen/Documents/Spark/python/pyspark/shell.py", line 28, in import pyspark File "/Users/joshrosen/Documents/spark/python/pyspark/__init__.py", line 41, in from pyspark.context import SparkContext File "/Users/joshrosen/Documents/spark/python/pyspark/context.py", line 26, in from pyspark import accumulators File "/Users/joshrosen/Documents/spark/python/pyspark/accumulators.py", line 97, in from pyspark.cloudpickle import CloudPickler File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 120, in class CloudPickler(pickle.Pickler): File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 122, in CloudPickler dispatch = pickle.Pickler.dispatch.copy() AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch' {code} This code looks like it will be hard difficult to port to Python 3, so this might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] for Python serialization. 
> Python 3 support > > > Key: SPARK-4897 > URL: https://issues.apache.org/jira/browse/SPARK-4897 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Josh Rosen >Priority: Minor > > It would be nice to have Python 3 support in PySpark, provided that we can do > it in a way that maintains backwards-compatibility with Python 2.6. > I started looking into porting this; my WIP work can be found at > https://github.com/JoshRosen/spark/compare/python3 > I was able to use the > [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] > tool to handle the basic conversion of things like {{print}} statements, etc. > and had to manually fix up a few imports for packages that moved / were > renamed, but the major blocker that I hit was {{cloudpickle}}: > {code} > [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ..
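The AttributeError in the traceback above can be reproduced without Spark (a stdlib-only sketch; assumes CPython 3, where the C-accelerated `_pickle` module provides the default `Pickler`):

```python
import pickle

# On CPython 3 the default pickle.Pickler is the C implementation
# (_pickle.Pickler), which does not expose the `dispatch` table that
# cloudpickle subclasses and copies. The pure-Python pickle._Pickler
# still has it -- one possible workaround, at some performance cost.
c_pickler_has_dispatch = hasattr(pickle.Pickler, "dispatch")
py_pickler_has_dispatch = hasattr(pickle._Pickler, "dispatch")
```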
[jira] [Commented] (SPARK-4886) Support cache control for each partition of a Hive partitioned table
[ https://issues.apache.org/jira/browse/SPARK-4886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253138#comment-14253138 ] guowei commented on SPARK-4886: --- use "CACHE TABLE ... AS SELECT..." > Support cache control for each partition of a Hive partitioned table > > > Key: SPARK-4886 > URL: https://issues.apache.org/jira/browse/SPARK-4886 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Xudong Zheng > > SparkSQL currently doesn't support cache control for individual partitions of a Hive > partitioned table. If we could add this feature, users could have finer > cache control over a cached table. And in many scenarios, the data is > periodically appended to a table as a new partition; with this feature, > users could easily keep a sliding window of data cached in memory.
[jira] [Updated] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688
[ https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-3619: -- Assignee: Timothy Chen > Upgrade to Mesos 0.21 to work around MESOS-1688 > --- > > Key: SPARK-3619 > URL: https://issues.apache.org/jira/browse/SPARK-3619 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Matei Zaharia >Assignee: Timothy Chen > > When Mesos 0.21 comes out, it will have a fix for > https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688
[ https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253150#comment-14253150 ] Andrew Ash commented on SPARK-3619: --- [~activars] Spark 1.2.0 is being released with a Mesos dependency on 0.18.1 so a fix was not included for the Spark release. [~tnachen] are you still interested in this? I'm assigning the Jira to you > Upgrade to Mesos 0.21 to work around MESOS-1688 > --- > > Key: SPARK-3619 > URL: https://issues.apache.org/jira/browse/SPARK-3619 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Matei Zaharia > > When Mesos 0.21 comes out, it will have a fix for > https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688
[ https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-3619: -- Description: The Mesos 0.21 release has a fix for https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs. (was: When Mesos 0.21 comes out, it will have a fix for https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.) > Upgrade to Mesos 0.21 to work around MESOS-1688 > --- > > Key: SPARK-3619 > URL: https://issues.apache.org/jira/browse/SPARK-3619 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Matei Zaharia >Assignee: Timothy Chen > > The Mesos 0.21 release has a fix for > https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4886) Support cache control for each partition of a Hive partitioned table
[ https://issues.apache.org/jira/browse/SPARK-4886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253154#comment-14253154 ] Xudong Zheng commented on SPARK-4886: - Hi Guowei, "CACHE TABLE ... AS SELECT..." will create a new cached table instead of caching the partitions of the original table. Queries on the original table will still go to HDFS. It is also inconvenient for the appending scenario, because we would need to create a new table every time we add a new partition. And it is still table-level cache control, not partition-level. > Support cache control for each partition of a Hive partitioned table > > > Key: SPARK-4886 > URL: https://issues.apache.org/jira/browse/SPARK-4886 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Xudong Zheng > > SparkSQL currently doesn't support cache control for individual partitions of a Hive > partitioned table. If we could add this feature, users could have finer > cache control over a cached table. And in many scenarios, the data is > periodically appended to a table as a new partition; with this feature, > users could easily keep a sliding window of data cached in memory.
[jira] [Created] (SPARK-4899) Support Mesos features: roles and checkpoints
Andrew Ash created SPARK-4899: - Summary: Support Mesos features: roles and checkpoints Key: SPARK-4899 URL: https://issues.apache.org/jira/browse/SPARK-4899 Project: Spark Issue Type: New Feature Components: Mesos Affects Versions: 1.2.0 Reporter: Andrew Ash Inspired by https://github.com/apache/spark/pull/60 Mesos has two features that would be nice for Spark to take advantage of: 1. Roles -- a way to specify ACLs and priorities for users 2. Checkpoints -- a way to restart a failed Mesos slave without losing all the work that was happening on the box Some of these may require a Mesos upgrade past our current 0.18.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4872) Provide sample format of training/test data in MLlib programming guide
[ https://issues.apache.org/jira/browse/SPARK-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253171#comment-14253171 ] Sean Owen commented on SPARK-4872: -- [~zhjunwei] This is not at all specific to Spark. No, you can certainly use features with 3 values. You should 1-hot encode them though. You will have "Weather-Sunny", "Weather-Cloudy", "Weather-Rainy" binary features instead of one "Weather" feature for example. Although there is some separate support in Spark for this, it's pretty simple to translate this yourself. Is there a remaining action on this issue or did that clarify the usage for you? > Provide sample format of training/test data in MLlib programming guide > -- > > Key: SPARK-4872 > URL: https://issues.apache.org/jira/browse/SPARK-4872 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.1.1 >Reporter: zhang jun wei > Labels: documentation > > I suggest: in the samples of the online MLlib programming guide, it's better > to give examples with real-life data, and to list the translated data format > for the model to consume. > The problem blocking me is how to translate real-life data into a > format which MLlib can understand correctly. 
> Here is one sample, I want to use NaiveBayes to train and predict tennis-play > decision, the original data is: > Weather | Temperature | Humidity | Wind => Decision to play tennis > Sunny | Hot | High | No => No > Sunny | Hot | High | Yes=> No > Cloudy| Normal | Normal | No => Yes > Rainy | Cold | Normal | Yes=> No > Now, from my understanding, one potential translation is: > 1) put every feature value word into a line: > Sunny Cloudy Rainy Hot Normal Cold High Normal Yes No > 2) map them to numbers: > 1 2 3 4 5 6 7 8 9 10 > 3) map decision labels to numbers: > 0 - No > 1 - Yes > 4) set the value to 1 if it appears, or 0 if not, for the above example, here > is the data format for MLUtils.loadLibSVMFile to use: > 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:0 10:1 > 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:1 10:0 > 1 1:0 2:1 3:0 4:0 5:1 6:0 7:0 8:1 9:0 10:1 > 0 1:0 2:0 3:1 4:0 5:0 6:1 7:0 8:1 9:1 10:0 > ==> Is this a correct understanding? > And another way I can image is: > 1) put every feature name into a line: > Weather Temperature Humidity Wind > 2) map them to numbers: > 1 2 3 4 > 3) map decision labels to numbers: > 0 - No > 1 - Yes > 4) map each value of each feature to a number (e.g. Sunny to 1, Cloudy to 2, > Rainy to 3; Hot to 1, Normal to 2, Cold to 3; High to 1, Normal to 2; Yes to > 1, No to 2) for the above example, here is the data format for > MLUtils.loadLibSVMFile to use: > 0 1:1 2:1 3:1 4:2 > 0 1:1 2:1 3:1 4:1 > 1 1:2 2:2 3:2 4:2 > 0 1:3 2:3 3:2 4:1 > ==> but when I read the source code in NaiveBayes.scala, seems this is not > correct, I am not sure though... > So which data format translation way is correct? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
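(Not MLlib code; a plain-Python sketch of the 1-hot translation Sean describes, applied to the reporter's tennis example. The 1-based column numbering is one reasonable choice, not the only one.) Note also that the LIBSVM format read by {{MLUtils.loadLibSVMFile}} is sparse, so the explicit zero entries like {{2:0}} in the first translation above can simply be omitted:

```python
# Each (feature, value) pair becomes its own binary column, following the
# order the reporter listed: Sunny..Rainy, Hot..Cold, High/Normal, Yes/No.
CATEGORIES = {
    "Weather": ["Sunny", "Cloudy", "Rainy"],
    "Temperature": ["Hot", "Normal", "Cold"],
    "Humidity": ["High", "Normal"],
    "Wind": ["Yes", "No"],
}

# Assign every (feature, value) pair a 1-based column index, in order.
INDEX = {}
for feat, values in CATEGORIES.items():
    for v in values:
        INDEX[(feat, v)] = len(INDEX) + 1

def to_libsvm(label, row):
    """Encode e.g. {'Weather': 'Sunny', ...} as a sparse LIBSVM line."""
    cols = sorted(INDEX[(f, v)] for f, v in row.items())
    return str(label) + " " + " ".join(f"{i}:1" for i in cols)

# First row of the tennis table: Sunny | Hot | High | No => No (label 0)
line = to_libsvm(0, {"Weather": "Sunny", "Temperature": "Hot",
                     "Humidity": "High", "Wind": "No"})
```

With this indexing, the first row encodes as `0 1:1 4:1 7:1 10:1`, matching the reporter's first translation once the zero columns are dropped.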
[jira] [Commented] (SPARK-4094) checkpoint should still be available after rdd actions
[ https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253191#comment-14253191 ] Sean Owen commented on SPARK-4094: -- [~liyezhang556520] But this is exactly what the doc says is not permitted. By invoking action C, you necessarily execute the job for RDD B, after which you can't checkpoint it. My question, if you're proposing to loosen the restriction: what problem was there originally with allowing this, and why does the change resolve it? > checkpoint should still be available after rdd actions > -- > > Key: SPARK-4094 > URL: https://issues.apache.org/jira/browse/SPARK-4094 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Zhang, Liye >Assignee: Zhang, Liye > > rdd.checkpoint() must be called before any actions on this rdd; if any > other action comes first, the checkpoint will never succeed. Take the following > code as an example: > *rdd = sc.makeRDD(...)* > *rdd.collect()* > *rdd.checkpoint()* > *rdd.count()* > This rdd will never be checkpointed. Algorithms with many > iterations have a problem here. Graph algorithms, for example, run > many iterations, which makes the RDD lineage very long, so the RDD may need a > checkpoint after a certain number of iterations. And if there is also any action > within the iteration loop, the checkpoint() operation will never work for the > iterations after the one which calls the action. > But this does not happen for RDD cache: caching always takes effect > before rdd actions, no matter whether any action ran before > cache(). > So rdd.checkpoint() should have the same behavior as rdd.cache().
[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253203#comment-14253203 ] Sean Owen commented on SPARK-2075: -- [~sunrui] From digging in to the various reports of this issue, it seemed to me that in each case the Hadoop version did not match. That is, I do not know that it's true that the issue manifests when the Hadoop version matches; that would indeed be strange. I could have missed it; this is a bit hard to follow. But do you see evidence of this? I don't think publishing two versions fixes anything, really. The PR might get at the heart of the difference here and resolve it for real. It doesn't happen if you match binaries, which is good practice anyway. > Anonymous classes are missing from Spark distribution > - > > Key: SPARK-2075 > URL: https://issues.apache.org/jira/browse/SPARK-2075 > Project: Spark > Issue Type: Bug > Components: Build, Spark Core >Affects Versions: 1.0.0 >Reporter: Paul R. Brown >Priority: Critical > > Running a job built against the Maven dep for 1.0.0 and the hadoop1 > distribution produces: > {code} > java.lang.ClassNotFoundException: > org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 > {code} > Here's what's in the Maven dep as of 1.0.0: > {code} > jar tvf > ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar > | grep 'rdd/RDD' | grep 'saveAs' > 1519 Mon May 26 13:57:58 PDT 2014 > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class > 1560 Mon May 26 13:57:58 PDT 2014 > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class > {code} > And here's what's in the hadoop1 distribution: > {code} > jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' > {code} > I.e., it's not there. 
It is in the hadoop2 distribution: > {code} > jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' > 1519 Mon May 26 07:29:54 PDT 2014 > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class > 1560 Mon May 26 07:29:54 PDT 2014 > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4900) MLlib SingularValueDecomposition ARPACK IllegalStateException
Mike Beyer created SPARK-4900: - Summary: MLlib SingularValueDecomposition ARPACK IllegalStateException Key: SPARK-4900 URL: https://issues.apache.org/jira/browse/SPARK-4900 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1 Environment: Ubuntu 1410, Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode) Reporter: Mike Beyer Priority: Blocker java.lang.reflect.InvocationTargetException ... Caused by: java.lang.IllegalStateException: ARPACK returns non-zero info = 3 Please refer ARPACK user guide for error message. at org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:120) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:235) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:171) ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4901) Hot fix for the BytesWritable.copyBytes not exists in Hadoop1
Cheng Hao created SPARK-4901: Summary: Hot fix for the BytesWritable.copyBytes not exists in Hadoop1 Key: SPARK-4901 URL: https://issues.apache.org/jira/browse/SPARK-4901 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor HiveInspectors.scala fails to compile with Hadoop 1, as BytesWritable.copyBytes is not available in Hadoop 1.
[jira] [Commented] (SPARK-4901) Hot fix for the BytesWritable.copyBytes not exists in Hadoop1
[ https://issues.apache.org/jira/browse/SPARK-4901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253306#comment-14253306 ] Apache Spark commented on SPARK-4901: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3742 > Hot fix for the BytesWritable.copyBytes not exists in Hadoop1 > - > > Key: SPARK-4901 > URL: https://issues.apache.org/jira/browse/SPARK-4901 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Priority: Minor > > HiveInspectors.scala failed in compiling with Hadoop 1, as the > BytesWritable.copyBytes is not available in Hadoop 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4900) MLlib SingularValueDecomposition ARPACK IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Beyer updated SPARK-4900: -- Priority: Major (was: Blocker) > MLlib SingularValueDecomposition ARPACK IllegalStateException > -- > > Key: SPARK-4900 > URL: https://issues.apache.org/jira/browse/SPARK-4900 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.1 > Environment: Ubuntu 1410, Java HotSpot(TM) 64-Bit Server VM (build > 25.25-b02, mixed mode) >Reporter: Mike Beyer > > java.lang.reflect.InvocationTargetException > ... > Caused by: java.lang.IllegalStateException: ARPACK returns non-zero info = 3 > Please refer ARPACK user guide for error message. > at > org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:120) > at > org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:235) > at > org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:171) > ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3373) Filtering operations should optionally rebuild routing tables
[ https://issues.apache.org/jira/browse/SPARK-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3373: Target Version/s: 1.3.0, 1.2.1 (was: 1.1.1, 1.2.0) Affects Version/s: (was: 1.0.2) (was: 1.0.0) 1.1.0 1.1.1 > Filtering operations should optionally rebuild routing tables > - > > Key: SPARK-3373 > URL: https://issues.apache.org/jira/browse/SPARK-3373 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.1.0, 1.1.1 >Reporter: uncleGen >Priority: Minor > > Graph operations that filter the edges (subgraph, mask, groupEdges) currently > reuse the existing routing table to avoid the shuffle which would be required > to build a new one. However, this may be inefficient when the filtering is > highly selective. Vertices will be sent to more partitions than necessary, > and the extra routing information may take up excessive space. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3373) Filtering operations should optionally rebuild routing tables
[ https://issues.apache.org/jira/browse/SPARK-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3373: Priority: Major (was: Minor) > Filtering operations should optionally rebuild routing tables > - > > Key: SPARK-3373 > URL: https://issues.apache.org/jira/browse/SPARK-3373 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.1.0, 1.1.1 >Reporter: uncleGen > > Graph operations that filter the edges (subgraph, mask, groupEdges) currently > reuse the existing routing table to avoid the shuffle which would be required > to build a new one. However, this may be inefficient when the filtering is > highly selective. Vertices will be sent to more partitions than necessary, > and the extra routing information may take up excessive space. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4902) gap-sampling performance optimization
Guoqiang Li created SPARK-4902: -- Summary: gap-sampling performance optimization Key: SPARK-4902 URL: https://issues.apache.org/jira/browse/SPARK-4902 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Guoqiang Li {{CacheManager.getOrCompute}} returns an instance of InterruptibleIterator that contains either an array or an iterator (when there is not enough memory). The GapSamplingIterator implementation is as follows {code} private val iterDrop: Int => Unit = { val arrayClass = Array.empty[T].iterator.getClass val arrayBufferClass = ArrayBuffer.empty[T].iterator.getClass data.getClass match { case `arrayClass` => ((n: Int) => { data = data.drop(n) }) case `arrayBufferClass` => ((n: Int) => { data = data.drop(n) }) case _ => ((n: Int) => { var j = 0 while (j < n && data.hasNext) { data.next() j += 1 } }) } } {code} The code does not handle InterruptibleIterator, so the following code can't use the {{Iterator.drop}} method {code} rdd.cache() data.sample(false,0.1) {code}
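(Not Spark code; a pure-Python analog of the problem, with hypothetical names.) A wrapper iterator hides the concrete type of its delegate, so type-based dispatch like the match above falls back to the slow element-by-element path; unwrapping the delegate first, as the fix proposes for InterruptibleIterator, restores the fast drop:

```python
from itertools import islice

class InterruptibleIterator:
    """Hypothetical stand-in for Spark's wrapper: delegates to another iterator."""
    def __init__(self, delegate):
        self.delegate = delegate
    def __iter__(self):
        return self
    def __next__(self):
        return next(self.delegate)

def drop(it, n):
    # Dispatching on type(it) alone (as GapSamplingIterator does) would see
    # only the wrapper and miss the fast path; unwrap the delegate first.
    inner = it.delegate if isinstance(it, InterruptibleIterator) else it
    next(islice(inner, n, n), None)  # itertools "consume" recipe: skip n items
    return it

it = InterruptibleIterator(iter(range(10)))
drop(it, 3)
first = next(it)  # the first 3 elements were skipped
```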
[jira] [Resolved] (SPARK-4844) SGD should support custom sampling.
[ https://issues.apache.org/jira/browse/SPARK-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li resolved SPARK-4844. Resolution: Won't Fix See: SPARK-4902 > SGD should support custom sampling. > --- > > Key: SPARK-4844 > URL: https://issues.apache.org/jira/browse/SPARK-4844 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Guoqiang Li > Fix For: 1.3.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688
[ https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253394#comment-14253394 ] Jing Dong commented on SPARK-3619: -- Has anyone succeeded in running Spark 1.1.1 on Mesos 0.21? What are the known issues when running Spark on the latest Mesos? > Upgrade to Mesos 0.21 to work around MESOS-1688 > --- > > Key: SPARK-3619 > URL: https://issues.apache.org/jira/browse/SPARK-3619 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Matei Zaharia >Assignee: Timothy Chen > > The Mesos 0.21 release has a fix for > https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.
[jira] [Updated] (SPARK-4902) gap-sampling performance optimization
[ https://issues.apache.org/jira/browse/SPARK-4902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-4902: --- Description: {{CacheManager.getOrCompute}} returns an instance of InterruptibleIterator that contains an array or a iterator(when the memory is not enough). The GapSamplingIterator implementation is as follows {code} private val iterDrop: Int => Unit = { val arrayClass = Array.empty[T].iterator.getClass val arrayBufferClass = ArrayBuffer.empty[T].iterator.getClass data.getClass match { case `arrayClass` => ((n: Int) => { data = data.drop(n) }) case `arrayBufferClass` => ((n: Int) => { data = data.drop(n) }) case _ => ((n: Int) => { var j = 0 while (j < n && data.hasNext) { data.next() j += 1 } }) } } {code} The code does not deal with InterruptibleIterator. This leads to the following code can't use the {{Iterator.drop}} method {code} rdd.cache() rdd.sample(false,0.1) {code} was: {{CacheManager.getOrCompute}} returns an instance of InterruptibleIterator that contains an array or a iterator(when the memory is not enough). The GapSamplingIterator implementation is as follows {code} private val iterDrop: Int => Unit = { val arrayClass = Array.empty[T].iterator.getClass val arrayBufferClass = ArrayBuffer.empty[T].iterator.getClass data.getClass match { case `arrayClass` => ((n: Int) => { data = data.drop(n) }) case `arrayBufferClass` => ((n: Int) => { data = data.drop(n) }) case _ => ((n: Int) => { var j = 0 while (j < n && data.hasNext) { data.next() j += 1 } }) } } {code} The code does not deal with InterruptibleIterator. 
This leads to the following code can't use the {{Iterator.drop}} method {code} rdd.cache() data.sample(false,0.1) {code} > gap-sampling performance optimization > - > > Key: SPARK-4902 > URL: https://issues.apache.org/jira/browse/SPARK-4902 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Guoqiang Li > > {{CacheManager.getOrCompute}} returns an instance of InterruptibleIterator > that contains an array or a iterator(when the memory is not enough). > The GapSamplingIterator implementation is as follows > {code} > private val iterDrop: Int => Unit = { > val arrayClass = Array.empty[T].iterator.getClass > val arrayBufferClass = ArrayBuffer.empty[T].iterator.getClass > data.getClass match { > case `arrayClass` => ((n: Int) => { data = data.drop(n) }) > case `arrayBufferClass` => ((n: Int) => { data = data.drop(n) }) > case _ => ((n: Int) => { > var j = 0 > while (j < n && data.hasNext) { > data.next() > j += 1 > } > }) > } > } > {code} > The code does not deal with InterruptibleIterator. > This leads to the following code can't use the {{Iterator.drop}} method > {code} > rdd.cache() > rdd.sample(false,0.1) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4903) RDD remains cached after "DROP TABLE"
Evert Lammerts created SPARK-4903: - Summary: RDD remains cached after "DROP TABLE" Key: SPARK-4903 URL: https://issues.apache.org/jira/browse/SPARK-4903 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Spark master @ Dec 17 (3cd516191baadf8496ccdae499771020e89acd7e) Reporter: Evert Lammerts Priority: Critical In beeline, when I run: {code:sql} CREATE TABLE test AS select col from table; CACHE TABLE test DROP TABLE test {code} The table is removed but the RDD is still cached. Running UNCACHE is no longer possible (table not found from thriftserver).
[jira] [Updated] (SPARK-4903) RDD remains cached after "DROP TABLE"
[ https://issues.apache.org/jira/browse/SPARK-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Evert Lammerts updated SPARK-4903: -- Description: In beeline, when I run: {code:sql} CREATE TABLE test AS select col from table; CACHE TABLE test DROP TABLE test {code} The table is removed but the RDD is still cached. Running UNCACHE is no longer possible (the table is not found by the metastore). was: In beeline, when I run: {code:sql} CREATE TABLE test AS select col from table; CACHE TABLE test DROP TABLE test {code} The table is removed but the RDD is still cached. Running UNCACHE is no longer possible (the table is not found by the thriftserver). > RDD remains cached after "DROP TABLE" > - > > Key: SPARK-4903 > URL: https://issues.apache.org/jira/browse/SPARK-4903 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: Spark master @ Dec 17 > (3cd516191baadf8496ccdae499771020e89acd7e) >Reporter: Evert Lammerts >Priority: Critical > > In beeline, when I run: > {code:sql} > CREATE TABLE test AS select col from table; > CACHE TABLE test > DROP TABLE test > {code} > The table is removed but the RDD is still cached. Running UNCACHE is no > longer possible (the table is not found by the metastore). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
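The essence of the bug above is an ordering problem: once the metastore entry is gone, UNCACHE can no longer resolve the table name, so the cached data is stranded. A minimal toy model in plain Scala (no Spark; `TableCatalog` and its members are invented for illustration, not Spark's actual classes) shows the ordering a fix would need — evict the cache entry while the name can still be resolved:

```scala
import scala.collection.mutable

// Toy model of the SPARK-4903 ordering issue. Hypothetical names throughout:
// `metastore` stands in for the Hive metastore, `cachedRdds` for the cache.
object TableCatalog {
  val metastore = mutable.Set[String]()
  val cachedRdds = mutable.Map[String, Seq[Int]]()

  def createAndCache(name: String, data: Seq[Int]): Unit = {
    metastore += name
    cachedRdds(name) = data
  }

  // Proposed ordering for DROP TABLE: uncache first, then remove the
  // metastore entry, so no cached RDD is left behind with no name.
  def dropTable(name: String): Unit = {
    cachedRdds.remove(name)
    metastore.remove(name)
  }
}
```

A drop implemented the other way around (metastore entry removed first, cache untouched) reproduces the reported symptom: the cache still holds the data, but no command can reach it by name.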
[jira] [Commented] (SPARK-4902) gap-sampling performance optimization
[ https://issues.apache.org/jira/browse/SPARK-4902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253409#comment-14253409 ] Apache Spark commented on SPARK-4902: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/3744 > gap-sampling performance optimization > - > > Key: SPARK-4902 > URL: https://issues.apache.org/jira/browse/SPARK-4902 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Guoqiang Li > > {{CacheManager.getOrCompute}} returns an instance of InterruptibleIterator > that wraps either an array or, when there is not enough memory, an iterator. > The GapSamplingIterator implementation is as follows: > {code} > private val iterDrop: Int => Unit = { > val arrayClass = Array.empty[T].iterator.getClass > val arrayBufferClass = ArrayBuffer.empty[T].iterator.getClass > data.getClass match { > case `arrayClass` => ((n: Int) => { data = data.drop(n) }) > case `arrayBufferClass` => ((n: Int) => { data = data.drop(n) }) > case _ => ((n: Int) => { > var j = 0 > while (j < n && data.hasNext) { > data.next() > j += 1 > } > }) > } > } > {code} > The code does not handle InterruptibleIterator. > As a result, the following code cannot use the {{Iterator.drop}} method: > {code} > rdd.cache() > rdd.sample(false, 0.1) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
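The class-dispatch code quoted above fails when the fast iterator is hidden inside a wrapper. A self-contained sketch of the idea behind a fix (plain Scala, no Spark dependency; `WrappingIterator` is a hypothetical stand-in for Spark's InterruptibleIterator) is to unwrap the delegate before choosing a drop strategy:

```scala
object GapDropSketch {
  // Hypothetical stand-in for Spark's InterruptibleIterator: a thin
  // wrapper that forwards to an underlying iterator.
  class WrappingIterator[T](val delegate: Iterator[T]) extends Iterator[T] {
    def hasNext: Boolean = delegate.hasNext
    def next(): T = delegate.next()
  }

  // Unwrap before dropping, so an efficient Iterator.drop on the
  // underlying (e.g. array-backed) iterator is still reachable even
  // when the iterator arrives wrapped.
  def fastDrop[T](it: Iterator[T], n: Int): Iterator[T] = it match {
    case w: WrappingIterator[T] => fastDrop(w.delegate, n)
    case other                  => other.drop(n)
  }
}
```

With the dispatch in the quoted code, a wrapped array iterator would fall into the element-by-element `while` loop; unwrapping first restores the fast path. This is only a sketch of the approach, not the actual patch in the linked pull request.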
[jira] [Created] (SPARK-4904) Remove the foldable checking in HiveGenericUdf.eval
Cheng Hao created SPARK-4904: Summary: Remove the foldable checking in HiveGenericUdf.eval Key: SPARK-4904 URL: https://issues.apache.org/jira/browse/SPARK-4904 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor Since https://github.com/apache/spark/pull/3429 has been merged, the bug of wrapping values to Writable for HiveGenericUDF is resolved, so we can safely remove the foldable check in `HiveGenericUdf.eval`, which was discussed in https://github.com/apache/spark/pull/2802. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4904) Remove the foldable checking in HiveGenericUdf.eval
[ https://issues.apache.org/jira/browse/SPARK-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253418#comment-14253418 ] Apache Spark commented on SPARK-4904: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3745 > Remove the foldable checking in HiveGenericUdf.eval > --- > > Key: SPARK-4904 > URL: https://issues.apache.org/jira/browse/SPARK-4904 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Cheng Hao >Priority: Minor > > Since https://github.com/apache/spark/pull/3429 has been merged, the bug of > wrapping values to Writable for HiveGenericUDF is resolved, so we can safely > remove the foldable check in `HiveGenericUdf.eval`, which was discussed in > https://github.com/apache/spark/pull/2802. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4867) UDF clean up
[ https://issues.apache.org/jira/browse/SPARK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253579#comment-14253579 ] William Benton commented on SPARK-4867: --- [~marmbrus] I actually think exposing an interface that looks something like overloading might be the right approach. (To be clear, I think polymorphism poses a far greater difficulty with implicit coercion than without it, but it might be possible to solve the ambiguity there by letting users register functions in a priority order.) > UDF clean up > > > Key: SPARK-4867 > URL: https://issues.apache.org/jira/browse/SPARK-4867 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Priority: Blocker > > Right now our support and internal implementation of many functions have a few > issues. Specifically: > - UDFs don't know their input types and thus don't do type coercion. > - We hard-code a bunch of built-in functions into the parser. This is bad > because in SQL it creates new reserved words for things that aren't actually > keywords. It also means that for each function we need to add support to > both SQLContext and HiveContext separately. > For this JIRA I propose we do the following: > - Change the interfaces for registerFunction and ScalaUdf to include types > for the input arguments as well as the output type. > - Add a rule to analysis that does type coercion for UDFs. > - Add a parse rule for functions to SQLParser. > - Rewrite all the UDFs that are currently hacked into the various parsers > using this new functionality. > Depending on how big this refactoring becomes, we could split parts 1 & 2 from > part 3 above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
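The proposal quoted above — registering UDFs with input types so an analysis rule can insert coercions — can be sketched in a few lines of plain Scala. Everything here is an invented illustration (`UdfRegistrySketch`, `TypedUdf`, the `DataType` objects), not Spark's actual registerFunction/ScalaUdf API:

```scala
import scala.collection.mutable

// Invented sketch of a typed UDF registry: each function records its input
// types, so a coercion step can cast arguments before invocation instead of
// failing at runtime.
object UdfRegistrySketch {
  sealed trait DataType
  case object IntType extends DataType
  case object StringType extends DataType

  case class TypedUdf(inputTypes: Seq[DataType],
                      returnType: DataType,
                      fn: Seq[Any] => Any)

  private val registry = mutable.Map[String, TypedUdf]()

  def registerFunction(name: String, udf: TypedUdf): Unit =
    registry(name) = udf

  // Stand-in for the analyzer's type-coercion rule: cast string arguments
  // to Int where the UDF declares an IntType input.
  def invoke(name: String, args: Seq[Any]): Any = {
    val udf = registry(name)
    val coerced = args.zip(udf.inputTypes).map {
      case (v: String, IntType) => v.toInt
      case (v, _)               => v
    }
    udf.fn(coerced)
  }
}
```

In real Spark the coercion would be an analyzer rule rewriting the plan rather than a runtime cast, but the data flow — declared input types driving inserted casts — is the same idea.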
[jira] [Updated] (SPARK-4901) Hot fix for the BytesWritable.copyBytes not exists in Hadoop1
[ https://issues.apache.org/jira/browse/SPARK-4901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4901: -- Assignee: Cheng Hao > Hot fix for the BytesWritable.copyBytes not exists in Hadoop1 > - > > Key: SPARK-4901 > URL: https://issues.apache.org/jira/browse/SPARK-4901 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao >Priority: Minor > Fix For: 1.3.0 > > > HiveInspectors.scala failed to compile with Hadoop 1, as > BytesWritable.copyBytes is not available there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4901) Hot fix for the BytesWritable.copyBytes not exists in Hadoop1
[ https://issues.apache.org/jira/browse/SPARK-4901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4901. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3742 [https://github.com/apache/spark/pull/3742] > Hot fix for the BytesWritable.copyBytes not exists in Hadoop1 > - > > Key: SPARK-4901 > URL: https://issues.apache.org/jira/browse/SPARK-4901 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Priority: Minor > Fix For: 1.3.0 > > > HiveInspectors.scala failed to compile with Hadoop 1, as > BytesWritable.copyBytes is not available there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253586#comment-14253586 ] Ted Malaska commented on SPARK-2447: Hey guys, just wanted to update this JIRA. In summary, the Spark committers are still deciding whether this will be included in the external part of Spark. For now, because the demand is there and because the solution works, I'm going to host the solution on Cloudera Labs. Here is the blog post that walks through the solution. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/ > Add common solution for sending upsert actions to HBase (put, deletes, and > increment) > - > > Key: SPARK-2447 > URL: https://issues.apache.org/jira/browse/SPARK-2447 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Streaming >Reporter: Ted Malaska >Assignee: Ted Malaska > > Going to review the design with Tdas today. > But the first thought is to have an extension of VoidFunction that handles the > connection to HBase and allows for options such as turning auto-flush off for > higher throughput. > Need to answer the following questions first. > - Can it be written in Java or should it be written in Scala? > - What is the best way to add the HBase dependency? (will review how Flume > does this as the first option) > - What is the best way to do testing? (will review how Flume does this as the > first option) > - How to support Python? (Python may be a separate JIRA; unknown at this time) > Goals: > - Simple to use > - Stable > - Supports high load > - Documented (may be a separate JIRA; need to ask Tdas) > - Supports Java, Scala, and hopefully Python > - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3686) flume.SparkSinkSuite.Success is flaky
[ https://issues.apache.org/jira/browse/SPARK-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3686: -- Labels: flaky-test (was: ) > flume.SparkSinkSuite.Success is flaky > - > > Key: SPARK-3686 > URL: https://issues.apache.org/jira/browse/SPARK-3686 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Patrick Wendell >Assignee: Hari Shreedharan >Priority: Blocker > Labels: flaky-test > Fix For: 1.2.0 > > > {code} > Error Message > 4000 did not equal 5000 > Stacktrace > sbt.ForkMain$ForkError: 4000 did not equal 5000 > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:498) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1559) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:416) > at > org.apache.spark.streaming.flume.sink.SparkSinkSuite.org$apache$spark$streaming$flume$sink$SparkSinkSuite$$assertChannelIsEmpty(SparkSinkSuite.scala:195) > at > org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply$mcV$sp(SparkSinkSuite.scala:54) > at > org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40) > at > org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40) > at > org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at > org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158) > at org.scalatest.Suite$class.withFixture(Suite.scala:1121) > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155) > at > 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1559) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:318) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1559) > at org.scalatest.Suite$class.run(Suite.scala:1423) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204) > at org.scalatest.FunSuite.run(FunSuite.scala:1559) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:444) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:651) > at sbt.ForkMain$Run$2.call(ForkMain.java:294) > at sbt.ForkMain$Run$2.call(ForkMain.java:284) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Example test result (this will stop working in a few days): > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/719/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming.flume.sink/SparkSinkSuite/Success_with_ack/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SPARK-3912) FlumeStreamSuite is flaky, fails either with port binding issues or data not being reliably sent
[ https://issues.apache.org/jira/browse/SPARK-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3912: -- Labels: flaky-test (was: ) > FlumeStreamSuite is flaky, fails either with port binding issues or data not > being reliably sent > > > Key: SPARK-3912 > URL: https://issues.apache.org/jira/browse/SPARK-3912 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > Labels: flaky-test > Fix For: 1.2.0 > > > Three problems. > 1. Attempting to start the service on different possible ports (to > avoid bind failures) was incorrect, as the service actually starts lazily > (when the receiver starts, not when the flume input stream is created). > 2. Lots of Thread.sleep calls were used to improve the probability that data sent > through Avro to the Flume receiver actually arrived. However, the sending may fail > for various unknown reasons, causing the test to fail. > 3. Thread.sleep was also used to send one record per batch, and checks were > made on whether only one record was received in each batch. This was > overkill, because all this unit test needs to verify is whether data is > being sent and received, not the timing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1603) flaky test case in StreamingContextSuite
[ https://issues.apache.org/jira/browse/SPARK-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-1603: -- Labels: flaky-test (was: ) > flaky test case in StreamingContextSuite > > > Key: SPARK-1603 > URL: https://issues.apache.org/jira/browse/SPARK-1603 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 0.9.0, 0.9.1, 1.0.0 >Reporter: Nan Zhu >Assignee: Nan Zhu > Labels: flaky-test > > When Jenkins was testing 5 PRs at the same time, the test results in my PR > shows that stop gracefully in StreamingContextSuite failed, > the stacktrace is as > {quote} > stop gracefully *** FAILED *** (8 seconds, 350 milliseconds) > [info] akka.actor.InvalidActorNameException: actor name [JobScheduler] is > not unique! > [info] at > akka.actor.dungeon.ChildrenContainer$TerminatingChildrenContainer.reserve(ChildrenContainer.scala:192) > [info] at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77) > [info] at akka.actor.ActorCell.reserveChild(ActorCell.scala:338) > [info] at akka.actor.dungeon.Children$class.makeChild(Children.scala:186) > [info] at akka.actor.dungeon.Children$class.attachChild(Children.scala:42) > [info] at akka.actor.ActorCell.attachChild(ActorCell.scala:338) > [info] at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:518) > [info] at > org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:57) > [info] at > org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:434) > [info] at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$14$$anonfun$apply$mcV$sp$3.apply$mcVI$sp(StreamingContextSuite.scala:174) > [info] at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > [info] at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply$mcV$sp(StreamingContextSuite.scala:163) > [info] at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply(StreamingContextSuite.scala:159) > [info] at > 
org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply(StreamingContextSuite.scala:159) > [info] at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265) > [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1974) > [info] at > org.apache.spark.streaming.StreamingContextSuite.withFixture(StreamingContextSuite.scala:34) > [info] at > org.scalatest.FunSuite$class.invokeWithFixture$1(FunSuite.scala:1262) > [info] at > org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271) > [info] at > org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:198) > [info] at org.scalatest.FunSuite$class.runTest(FunSuite.scala:1271) > [info] at > org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:34) > [info] at > org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:171) > [info] at > org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:34) > [info] at > org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304) > [info] at > org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304) > [info] at > org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:260) > [info] at > org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:249) > [info] at scala.collection.immutable.List.foreach(List.scala:318) > [info] at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:249) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:326) > [info] at org.scalatest.FunSuite$class.runTests(FunSuite.scala:1304) > [info] at > org.apache.spark.streaming.StreamingContextSuite.runTests(StreamingContextSuite.scala:34) > [info] at org.scalatest.Suite$class.run(Suite.scala:2303) > [info] at > 
org.apache.spark.streaming.StreamingContextSuite.org$scalatest$FunSuite$$super$run(StreamingContextSuite.scala:34) > [info] at org.scalatest.FunSuite$$anonfun$run$1.apply(FunSuite.scala:1310) > [info] at org.scalatest.FunSuite$$anonfun$run$1.apply(FunSuite.scala:1310) > [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:362) > [info] at org.scalatest.FunSuite$class.run(FunSuite.scala:1310) > [info] at > org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingContextSuite.scala:34) > [info] at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:208) > [info] at > org.apache.spark.streaming.StreamingContextSuite.run(StreamingContextSuite.scala:34) > [info] at > org.scalatest.tools.Scal
[jira] [Updated] (SPARK-4053) Block generator throttling in NetworkReceiverSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4053: -- Labels: flaky-test (was: ) > Block generator throttling in NetworkReceiverSuite is flaky > --- > > Key: SPARK-4053 > URL: https://issues.apache.org/jira/browse/SPARK-4053 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Minor > Labels: flaky-test > Fix For: 1.2.0 > > > In the unit test that checked whether blocks generated by throttled block > generator had expected number of records, the thresholds are too tight, which > sometimes led to the test failing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1158) Fix flaky RateLimitedOutputStreamSuite
[ https://issues.apache.org/jira/browse/SPARK-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-1158: -- Labels: flaky-test (was: ) > Fix flaky RateLimitedOutputStreamSuite > -- > > Key: SPARK-1158 > URL: https://issues.apache.org/jira/browse/SPARK-1158 > Project: Spark > Issue Type: Bug >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: flaky-test > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
Josh Rosen created SPARK-4905: - Summary: Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream Key: SPARK-4905 URL: https://issues.apache.org/jira/browse/SPARK-4905 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Josh Rosen It looks like the "org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream" test might be flaky ([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]): {code} Error Message The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer("", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "") was not equal to Vector("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73", "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85", "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97", "98", "99", "100"). Stacktrace sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. 
Last failure message: ArrayBuffer("", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "") was not equal to Vector("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73", "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85", "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97", "98", "99", "100"). 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.r
[jira] [Commented] (SPARK-4869) The variable names in IF statement of Spark SQL doesn't resolve to its value.
[ https://issues.apache.org/jira/browse/SPARK-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253659#comment-14253659 ] Arnab commented on SPARK-4869: -- Can you kindly clarify what DAYS_30 refers to? I tried out a nested IF statement and it seems to work fine in Spark SQL. val child = sqlContext.sql("select name,age,IF(age < 20,IF(age<12,0,1),1) as child from people") child.collect.foreach(println) > The variable names in IF statement of Spark SQL doesn't resolve to its value. > -- > > Key: SPARK-4869 > URL: https://issues.apache.org/jira/browse/SPARK-4869 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1 >Reporter: Ajay >Priority: Blocker > > We got stuck with the "IF-THEN" statement in Spark SQL. As per our use case, we > have to have nested "if" statements. But Spark SQL is not able to resolve > variable names in the final evaluation, while literal values work. An > "Unresolved Attributes" error is being thrown. Please fix this bug. > This works: > sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE ,IF( PAST_DUE = 'CURRENT_MONTH', > 0,1) as ROLL_BACKWARD FROM OUTER_RDD") > This doesn't: > sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE ,IF( PAST_DUE = 'CURRENT_MONTH', > 0,DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4906) Spark master OOMs with exception stack trace stored in JobProgressListener
Mingyu Kim created SPARK-4906: - Summary: Spark master OOMs with exception stack trace stored in JobProgressListener Key: SPARK-4906 URL: https://issues.apache.org/jira/browse/SPARK-4906 Project: Spark Issue Type: Bug Affects Versions: 1.1.1 Reporter: Mingyu Kim Spark master was OOMing with a lot of stack traces retained in JobProgressListener. The object dependency chain is as follows: JobProgressListener.stageIdToData => StageUIData.taskData => TaskUIData.errorMessage Each error message is ~10kb since it contains the entire stack trace. Because we have a lot of tasks, when all of the tasks across multiple stages go bad, these error messages accounted for 0.5GB of heap at one point. Please correct me if I'm wrong, but it looks like all the task info for running applications is kept in memory, which means it's almost always bound to OOM for long-running applications. Would it make sense to fix this, for example, by spilling some UI state to disk? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
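Besides spilling to disk, a simpler bounded-memory mitigation for the report above would be to cap the retained error string per task. This is only a sketch: `MaxRetainedErrorChars` is an assumed constant, not an actual Spark setting.

```scala
object ErrorRetentionSketch {
  // Assumed cap for illustration; Spark has no constant by this name.
  val MaxRetainedErrorChars = 1024

  // Keep a bounded prefix of each task's error (which usually includes the
  // exception message and the top stack frames), so thousands of ~10kb
  // failures cannot exhaust the driver heap.
  def truncate(error: String): String =
    if (error.length <= MaxRetainedErrorChars) error
    else error.take(MaxRetainedErrorChars) +
      s"... [${error.length - MaxRetainedErrorChars} chars dropped]"
}
```

The trade-off is losing the deepest stack frames in the UI, but with a 1kb cap the 0.5GB figure cited above would shrink by roughly an order of magnitude.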
[jira] [Commented] (SPARK-4896) Don't redundantly copy executor dependencies in Utils.fetchFile
[ https://issues.apache.org/jira/browse/SPARK-4896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253779#comment-14253779 ] Apache Spark commented on SPARK-4896: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/2848 > Don't redundantly copy executor dependencies in Utils.fetchFile > --- > > Key: SPARK-4896 > URL: https://issues.apache.org/jira/browse/SPARK-4896 > Project: Spark > Issue Type: Improvement >Reporter: Josh Rosen > > This JIRA is spun off from a comment by [~rdub] on SPARK-3967, quoted here: > {quote} > I've been debugging this issue as well and I think I've found an issue in > {{org.apache.spark.util.Utils}} that is contributing to / causing the problem: > {{Files.move}} on [line > 390|https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/util/Utils.scala#L390] > is called even if {{targetFile}} exists and {{tempFile}} and {{targetFile}} > are equal. > The check on [line > 379|https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/util/Utils.scala#L379] > seems to imply the desire to skip a redundant overwrite if the file is > already there and has the contents that it should have. > Gating the {{Files.move}} call on a further {{if (!targetFile.exists)}} fixes > the issue for me; attached is a patch of the change. > In practice all of my executors that hit this code path are finding every > dependency JAR to already exist and be exactly equal to what they need it to > be, meaning they were all needlessly overwriting all of their dependency > JARs, and now are all basically no-op-ing in {{Utils.fetchFile}}; I've not > determined who/what is putting the JARs there, why the issue only crops up in > {{yarn-cluster}} mode (or {{--master yarn --deploy-mode cluster}}), etc., but > it seems like either way this patch is probably desirable. 
> {quote} > I'm spinning this off into its own JIRA so that we can track the merging of > https://github.com/apache/spark/pull/2848 separately (since we have multiple > PRs that contribute to fixing the original issue). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4903) RDD remains cached after "DROP TABLE"
[ https://issues.apache.org/jira/browse/SPARK-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4903: Target Version/s: 1.3.0 > RDD remains cached after "DROP TABLE" > - > > Key: SPARK-4903 > URL: https://issues.apache.org/jira/browse/SPARK-4903 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: Spark master @ Dec 17 > (3cd516191baadf8496ccdae499771020e89acd7e) >Reporter: Evert Lammerts >Priority: Critical > > In beeline, when I run: > {code:sql} > CREATE TABLE test AS select col from table; > CACHE TABLE test > DROP TABLE test > {code} > The table is removed but the RDD is still cached, and running UNCACHE is no longer > possible (the table is not found in the metastore).
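The behavior the report asks for is that DROP TABLE also evicts the cached entry. A minimal, Spark-independent sketch of that invariant (the {{Catalog}} class here is illustrative, not Spark's actual catalog API):

```python
class Catalog:
    """Toy metastore plus cache manager illustrating the invariant that
    dropping a table must also evict its cached, materialized data."""

    def __init__(self):
        self.tables = {}  # name -> table definition/data
        self.cache = {}   # name -> cached (materialized) data

    def cache_table(self, name):
        self.cache[name] = self.tables[name]

    def drop_table(self, name):
        self.tables.pop(name, None)
        # Without this eviction we reproduce the bug: the data stays
        # cached while UNCACHE can no longer resolve the table name.
        self.cache.pop(name, None)
```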
[jira] [Commented] (SPARK-4892) java.io.FileNotFound exceptions when creating EXTERNAL hive tables
[ https://issues.apache.org/jira/browse/SPARK-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253809#comment-14253809 ] Michael Armbrust commented on SPARK-4892: - I'll add that the right fix here is probably to just set that automatically when we detect hive 13 mode, since afaict this is a Hive bug. > java.io.FileNotFound exceptions when creating EXTERNAL hive tables > -- > > Key: SPARK-4892 > URL: https://issues.apache.org/jira/browse/SPARK-4892 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Patrick Wendell >Assignee: Michael Armbrust > Labels: starter >
[jira] [Updated] (SPARK-4892) java.io.FileNotFound exceptions when creating EXTERNAL hive tables
[ https://issues.apache.org/jira/browse/SPARK-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4892: Target Version/s: 1.3.0 > java.io.FileNotFound exceptions when creating EXTERNAL hive tables > -- > > Key: SPARK-4892 > URL: https://issues.apache.org/jira/browse/SPARK-4892 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Patrick Wendell >Assignee: Michael Armbrust > Labels: starter >
[jira] [Updated] (SPARK-4892) java.io.FileNotFound exceptions when creating EXTERNAL hive tables
[ https://issues.apache.org/jira/browse/SPARK-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4892: Labels: starter (was: ) > java.io.FileNotFound exceptions when creating EXTERNAL hive tables > -- > > Key: SPARK-4892 > URL: https://issues.apache.org/jira/browse/SPARK-4892 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Patrick Wendell >Assignee: Michael Armbrust > Labels: starter >
[jira] [Updated] (SPARK-4520) SparkSQL exception when reading certain columns from a parquet file
[ https://issues.apache.org/jira/browse/SPARK-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4520: Target Version/s: 1.3.0 (was: 1.2.0) > SparkSQL exception when reading certain columns from a parquet file > --- > > Key: SPARK-4520 > URL: https://issues.apache.org/jira/browse/SPARK-4520 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: sadhan sood >Priority: Critical > Attachments: part-r-0.parquet > > > I am seeing this issue with spark sql throwing an exception when trying to > read selective columns from a thrift parquet file and also when caching them. > On some further digging, I was able to narrow it down to at least one > particular column type, map<string, list<string>>, as causing this issue. To > reproduce this I created a test thrift file with a very basic schema and > stored some sample data in a parquet file: > Test.thrift > === > {code} > typedef binary SomeId > enum SomeExclusionCause { > WHITELIST = 1, > HAS_PURCHASE = 2, > } > struct SampleThriftObject { > 10: string col_a; > 20: string col_b; > 30: string col_c; > 40: optional map<string, list<string>> col_d; > } > {code} > = > And loading the data in spark through schemaRDD: > {code} > import org.apache.spark.sql.SchemaRDD > val sqlContext = new org.apache.spark.sql.SQLContext(sc); > val parquetFile = "/path/to/generated/parquet/file" > val parquetFileRDD = sqlContext.parquetFile(parquetFile) > parquetFileRDD.printSchema > root > |-- col_a: string (nullable = true) > |-- col_b: string (nullable = true) > |-- col_c: string (nullable = true) > |-- col_d: map (nullable = true) > ||-- key: string > ||-- value: array (valueContainsNull = true) > |||-- element: string (containsNull = false) > parquetFileRDD.registerTempTable("test") > sqlContext.cacheTable("test") > sqlContext.sql("select col_a from test").collect() <-- see the exception > stack here > {code} > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): parquet.io.ParquetDecodingException: Can not read value > at 0 in block -1 in file file:/tmp/xyz/part-r-0.parquet > at > parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) > at > parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at 
org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 > at java.util.ArrayList.elementData(ArrayList.java:418) > at java.util.ArrayList.get(ArrayList.java:431) > at parquet.io.GroupColumnIO.getLast(GroupColumnIO.ja
[jira] [Updated] (SPARK-4850) "GROUP BY" can't work if the schema of SchemaRDD contains struct or array type
[ https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4850: Description: Code in Spark Shell as follows: {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) val path = "path/to/json" sqlContext.jsonFile(path).register("Table") val t = sqlContext.sql("select * from Table group by a") t.collect {code} Let's look into the schema of `Table` {code} root |-- a: integer (nullable = true) |-- arr: array (nullable = true) ||-- element: integer (containsNull = false) |-- createdAt: string (nullable = true) |-- f: struct (nullable = true) ||-- __type: string (nullable = true) ||-- className: string (nullable = true) ||-- objectId: string (nullable = true) |-- objectId: string (nullable = true) |-- s: string (nullable = true) |-- updatedAt: string (nullable = true) {code} An exception will be thrown: {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: arr#9, tree: Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14] Subquery TestImport LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], MappedRDD[18] at map at JsonRDD.scala:47 at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125) at scala.Option.foreach(Option.scala:236) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at $iwC$$iwC$$iwC$$iwC.(:17) at 
$iwC$$iwC$$iwC.(:22) at $iwC$$iwC.(:24) at $iwC.(:26) at (:28) at .(:32) at .() at .(:7) at .() at $print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125) at org.apache.spark.repl.SparkIMain.loadAndR
[jira] [Updated] (SPARK-4850) "GROUP BY" can't work if the schema of SchemaRDD contains struct or array type
[ https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4850: Assignee: Cheng Lian > "GROUP BY" can't work if the schema of SchemaRDD contains struct or array type > -- > > Key: SPARK-4850 > URL: https://issues.apache.org/jira/browse/SPARK-4850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2 >Reporter: Chaozhong Yang >Assignee: Cheng Lian > Labels: group, sql > Original Estimate: 96h > Remaining Estimate: 96h
[jira] [Updated] (SPARK-4850) "GROUP BY" can't work if the schema of SchemaRDD contains struct or array type
[ https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4850: Target Version/s: 1.3.0 (was: 1.2.0) > "GROUP BY" can't work if the schema of SchemaRDD contains struct or array type > -- > > Key: SPARK-4850 > URL: https://issues.apache.org/jira/browse/SPARK-4850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2 >Reporter: Chaozhong Yang >Assignee: Cheng Lian > Labels: group, sql > Original Estimate: 96h > Remaining Estimate: 96h
[jira] [Updated] (SPARK-4811) Custom UDTFs not working in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4811: Target Version/s: 1.3.0 (was: 1.2.0) > Custom UDTFs not working in Spark SQL > - > > Key: SPARK-4811 > URL: https://issues.apache.org/jira/browse/SPARK-4811 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0, 1.1.1 >Reporter: Saurabh Santhosh >Priority: Critical > > I am using the Thrift server interface to Spark SQL and using beeline to > connect to it. > I tried Spark SQL versions 1.1.0 and 1.1.1, and both throw the > following exception when using any custom UDTF. > These are the steps I took: > *Created a UDTF 'com.x.y.xxx'.* > Registered the UDTF using the following query: > *create temporary function xxx as 'com.x.y.xxx'* > The registration went through without any errors, but when I tried executing > the UDTF I got the following error: > *java.lang.ClassNotFoundException: xxx* > Oddly, it is trying to load the function name instead of the function class. The exception is at *line no: 81 in hiveUdfs.scala* > I have been at it for quite a long time.
[jira] [Updated] (SPARK-4553) query for parquet table with string fields in spark sql hive get binary result
[ https://issues.apache.org/jira/browse/SPARK-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4553: Target Version/s: 1.3.0 (was: 1.2.0) > query for parquet table with string fields in spark sql hive get binary result > -- > > Key: SPARK-4553 > URL: https://issues.apache.org/jira/browse/SPARK-4553 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangfei > Fix For: 1.2.0 > > > Run: > create table test_parquet(key int, value string) stored as parquet; > insert into table test_parquet select * from src; > select * from test_parquet; > The result is as follows: > ... > 282 [B@38fda3b > 138 [B@1407a24 > 238 [B@12de6fb > 419 [B@6c97695 > 15 [B@4885067 > 118 [B@156a8d3 > 72 [B@65d20dd > 90 [B@4c18906 > 307 [B@60b24cc > 19 [B@59cf51b > 435 [B@39fdf37 > 10 [B@4f799d7 > 277 [B@3950951 > 273 [B@596bf4b > 306 [B@3e91557 > 224 [B@3781d61 > 309 [B@2d0d128
[jira] [Updated] (SPARK-3863) Cache broadcasted tables and reuse them across queries
[ https://issues.apache.org/jira/browse/SPARK-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3863: Target Version/s: 1.3.0 (was: 1.2.0) > Cache broadcasted tables and reuse them across queries > -- > > Key: SPARK-3863 > URL: https://issues.apache.org/jira/browse/SPARK-3863 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > There is no point re-broadcasting the same dataset every time.
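One way to implement this, sketched with plain Python and illustrative names (not Spark internals): memoize the broadcast handle by dataset identity so repeated queries reuse it.

```python
class BroadcastCache:
    """Memoize broadcast payloads by dataset id so repeated queries
    over the same dimension table trigger only one broadcast."""

    def __init__(self, broadcast_fn):
        self._broadcast = broadcast_fn  # e.g. ships data to executors
        self._cached = {}               # dataset id -> broadcast handle
        self.broadcast_count = 0        # for observing reuse

    def get(self, dataset_id, data):
        if dataset_id not in self._cached:
            self.broadcast_count += 1
            self._cached[dataset_id] = self._broadcast(data)
        return self._cached[dataset_id]
```

A real implementation would also need invalidation when the underlying table changes, which the sketch omits.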
[jira] [Updated] (SPARK-3862) MultiWayBroadcastInnerHashJoin
[ https://issues.apache.org/jira/browse/SPARK-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3862: Target Version/s: 1.3.0 (was: 1.2.0) > MultiWayBroadcastInnerHashJoin > -- > > Key: SPARK-3862 > URL: https://issues.apache.org/jira/browse/SPARK-3862 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > It is common to have a single fact table inner join many small dimension > tables. We can exploit this fact and create a MultiWayBroadcastInnerHashJoin > (or maybe just MultiwayDimensionJoin) operator that optimizes for this > pattern.
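The access pattern such an operator would exploit can be sketched as a single pass over the fact table, probing one hash map per dimension (plain Python, illustrative only; the real operator would work on Spark's internal row format):

```python
def multiway_broadcast_inner_join(fact_rows, dims):
    """fact_rows: list of dicts; dims: {key_col: {key: dim_row_dict}}.

    Probes every dimension hash table for each fact row in one pass,
    emitting a merged row only when all lookups succeed (inner join).
    """
    out = []
    for row in fact_rows:
        merged = dict(row)
        for key_col, table in dims.items():
            match = table.get(row[key_col])
            if match is None:
                break  # inner join: drop rows unmatched in any dimension
            merged.update(match)
        else:
            out.append(merged)
    return out
```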
[jira] [Updated] (SPARK-3865) Dimension table broadcast shouldn't be eager
[ https://issues.apache.org/jira/browse/SPARK-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3865: Target Version/s: 1.3.0 (was: 1.2.0) > Dimension table broadcast shouldn't be eager > > > Key: SPARK-3865 > URL: https://issues.apache.org/jira/browse/SPARK-3865 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > We eagerly broadcast dimension tables in BroadcastJoin. This is bad because > even explain would trigger a job to execute the broadcast.
[jira] [Updated] (SPARK-3864) Specialize join for tables with unique integer keys
[ https://issues.apache.org/jira/browse/SPARK-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3864: Target Version/s: 1.3.0 (was: 1.2.0) > Specialize join for tables with unique integer keys > --- > > Key: SPARK-3864 > URL: https://issues.apache.org/jira/browse/SPARK-3864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > We can create a new operator that uses an array as the underlying storage to > avoid hash lookups entirely for dimension tables that have integer keys.
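A sketch of the idea, assuming unique non-negative integer keys with a known small bound (illustrative Python, not Spark code): the dimension table is stored in a plain array indexed by key, so probing is a bounds check plus an index instead of a hash lookup.

```python
def build_dense_lookup(dim_rows, key_fn, max_key):
    """Array-backed 'hash table' for a dimension table whose keys are
    unique integers in [0, max_key]: index directly instead of hashing."""
    table = [None] * (max_key + 1)
    for row in dim_rows:
        table[key_fn(row)] = row
    return table

def join_fact(fact_rows, key_fn, table):
    """Inner-join fact rows against the dense lookup array."""
    out = []
    for row in fact_rows:
        k = key_fn(row)
        dim = table[k] if 0 <= k < len(table) else None
        if dim is not None:
            out.append((row, dim))
    return out
```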
[jira] [Commented] (SPARK-4794) Wrong parse of GROUP BY query
[ https://issues.apache.org/jira/browse/SPARK-4794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253964#comment-14253964 ] Michael Armbrust commented on SPARK-4794: - Ping. > Wrong parse of GROUP BY query > - > > Key: SPARK-4794 > URL: https://issues.apache.org/jira/browse/SPARK-4794 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Damien Carol > > Spark is not able to parse this query : > {code:sql} > select > `cf_encaissement_fact_pq`.`annee` as `Annee`, > `cf_encaissement_fact_pq`.`mois` as `Mois`, > `cf_encaissement_fact_pq`.`jour` as `Jour`, > `cf_encaissement_fact_pq`.`heure` as `Heure`, > `cf_encaissement_fact_pq`.`nom_societe` as `Societe`, > `cf_encaissement_fact_pq`.`id_magasin` as `Magasin`, > `cf_encaissement_fact_pq`.`CarteFidelitePresentee` as `CF_Presentee`, > `cf_encaissement_fact_pq`.`CompteCarteFidelite` as `CompteCarteFidelite`, > `cf_encaissement_fact_pq`.`NbCompteCarteFidelite` as > `NbCompteCarteFidelite`, > `cf_encaissement_fact_pq`.`DetentionCF` as `DetentionCF`, > `cf_encaissement_fact_pq`.`NbCarteFidelite` as `NbCarteFidelite`, > `cf_encaissement_fact_pq`.`Id_CF_Dim_DUCB` as `Plage_DUCB`, > `cf_encaissement_fact_pq`.`NbCheque` as `NbCheque`, > `cf_encaissement_fact_pq`.`CACheque` as `CACheque`, > `cf_encaissement_fact_pq`.`NbImpaye` as `NbImpaye`, > `cf_encaissement_fact_pq`.`Id_Ensemble` as `NbEnsemble`, > `cf_encaissement_fact_pq`.`ZIBZIN` as `NbCompte`, > `cf_encaissement_fact_pq`.`ResteDuImpaye` as `ResteDuImpaye` > from > `testsimon3`.`cf_encaissement_fact_pq` as `cf_encaissement_fact_pq` > where > `cf_encaissement_fact_pq`.`annee` = 2013 > and > `cf_encaissement_fact_pq`.`mois` = 7 > and > `cf_encaissement_fact_pq`.`jour` = 12 > order by > `cf_encaissement_fact_pq`.`annee` ASC, > `cf_encaissement_fact_pq`.`mois` ASC, > `cf_encaissement_fact_pq`.`jour` ASC, > `cf_encaissement_fact_pq`.`heure` ASC, > `cf_encaissement_fact_pq`.`nom_societe` ASC, > 
`cf_encaissement_fact_pq`.`id_magasin` ASC, > `cf_encaissement_fact_pq`.`CarteFidelitePresentee` ASC, > `cf_encaissement_fact_pq`.`CompteCarteFidelite` ASC, > `cf_encaissement_fact_pq`.`NbCompteCarteFidelite` ASC, > `cf_encaissement_fact_pq`.`DetentionCF` ASC, > `cf_encaissement_fact_pq`.`NbCarteFidelite` ASC, > `cf_encaissement_fact_pq`.`Id_CF_Dim_DUCB` ASC > {code} > If I remove table name in ORDER BY conditions, Spark can handle it. > {code:sql} > select > `cf_encaissement_fact_pq`.`annee` as `Annee`, > `cf_encaissement_fact_pq`.`mois` as `Mois`, > `cf_encaissement_fact_pq`.`jour` as `Jour`, > `cf_encaissement_fact_pq`.`heure` as `Heure`, > `cf_encaissement_fact_pq`.`nom_societe` as `Societe`, > `cf_encaissement_fact_pq`.`id_magasin` as `Magasin`, > `cf_encaissement_fact_pq`.`CarteFidelitePresentee` as `CFPresentee`, > `cf_encaissement_fact_pq`.`CompteCarteFidelite` as `CompteCarteFidelite`, > `cf_encaissement_fact_pq`.`NbCompteCarteFidelite` as > `NbCompteCarteFidelite`, > `cf_encaissement_fact_pq`.`DetentionCF` as `DetentionCF`, > `cf_encaissement_fact_pq`.`NbCarteFidelite` as `NbCarteFidelite`, > `cf_encaissement_fact_pq`.`Id_CF_Dim_DUCB` as `PlageDUCB`, > `cf_encaissement_fact_pq`.`NbCheque` as `NbCheque`, > `cf_encaissement_fact_pq`.`CACheque` as `CACheque`, > `cf_encaissement_fact_pq`.`NbImpaye` as `NbImpaye`, > `cf_encaissement_fact_pq`.`Id_Ensemble` as `NbEnsemble`, > `cf_encaissement_fact_pq`.`ZIBZIN` as `NbCompte`, > `cf_encaissement_fact_pq`.`ResteDuImpaye` as `ResteDuImpaye` > from > `testsimon3`.`cf_encaissement_fact_pq` as `cf_encaissement_fact_pq` > where > `cf_encaissement_fact_pq`.`annee` = 2013 > and > `cf_encaissement_fact_pq`.`mois` = 7 > and > `cf_encaissement_fact_pq`.`jour` = 12 > order by > `annee` ASC, > `mois` ASC, > `jour` ASC, > `heure` ASC, > `nom_societe` ASC, > `id_magasin` ASC, > `CarteFidelitePresentee` ASC, > `CompteCarteFidelite` ASC, > `NbCompteCarteFidelite` ASC, > `DetentionCF` ASC, > `NbCarteFidelite` ASC, > 
`Id_CF_Dim_DUCB` ASC > {code} > I'm using Spark Master with Thrift server (HIVE 0.12) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4904) Remove the foldable checking in HiveGenericUdf.eval
[ https://issues.apache.org/jira/browse/SPARK-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4904: Target Version/s: 1.3.0 > Remove the foldable checking in HiveGenericUdf.eval > --- > > Key: SPARK-4904 > URL: https://issues.apache.org/jira/browse/SPARK-4904 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Cheng Hao >Priority: Minor > > Since https://github.com/apache/spark/pull/3429 has been merged, the bug of > wrapping to Writable for HiveGenericUDF is resolved, so we can safely remove the > foldable checking in `HiveGenericUdf.eval`, which was discussed in > https://github.com/apache/spark/pull/2802.
[jira] [Updated] (SPARK-4689) Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java
[ https://issues.apache.org/jira/browse/SPARK-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4689: Labels: 1.0.3 (was: ) > Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java > -- > > Key: SPARK-4689 > URL: https://issues.apache.org/jira/browse/SPARK-4689 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Chris Fregly >Priority: Minor > Labels: 1.0.3 > > Currently, you need to use unionAll() in Scala. > Python does not expose this functionality at the moment. > The current workaround is to use the UNION ALL HiveQL functionality detailed > here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union
[jira] [Updated] (SPARK-4801) Add CTE capability to HiveContext
[ https://issues.apache.org/jira/browse/SPARK-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4801: Description: This is a request to add CTE functionality to HiveContext. Common Table Expressions were added in Hive 0.13.0 with HIVE-1180. Using CTE style syntax within HiveContext currently results in the following "Caused by" message: {code} Caused by: scala.MatchError: TOK_CTE (of class org.apache.hadoop.hive.ql.parse.ASTNode) at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500) at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:500) at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:248) {code} was: This is a request to add CTE functionality to HiveContext. Common Table Expressions were added in Hive 0.13.0 with HIVE-1180. 
Using CTE style syntax within HiveContext currently results in the following "Caused by" message: Caused by: scala.MatchError: TOK_CTE (of class org.apache.hadoop.hive.ql.parse.ASTNode) at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500) at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:500) at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:248) > Add CTE capability to HiveContext > - > > Key: SPARK-4801 > URL: https://issues.apache.org/jira/browse/SPARK-4801 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Jacob Davis > > This is a request to add CTE functionality to HiveContext. Common Table > Expressions were added in Hive 0.13.0 with HIVE-1180. 
Using CTE style syntax > within HiveContext currently results in the following "Caused by" message: > {code} > Caused by: scala.MatchError: TOK_CTE (of class > org.apache.hadoop.hive.ql.parse.ASTNode) > at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500) > at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:500) > at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:248) > {code}
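For readers unfamiliar with the syntax that fails to parse: a Common Table Expression names a subquery with a WITH clause. The sketch below illustrates the shape of such a query using SQLite purely for convenience (the table and data are invented for illustration); it is not Hive or Spark code.

```python
import sqlite3

# Minimal CTE (WITH ... AS) illustration. SQLite is used here only
# because it is readily available; the WITH syntax is the same shape
# that HIVE-1180 added and that HiveQl's nodeToPlan fails to match.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10), ("east", 20), ("west", 5)])

query = """
WITH region_totals AS (
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
)
SELECT region, total FROM region_totals ORDER BY total DESC
"""
result = conn.execute(query).fetchall()
print(result)  # [('east', 30), ('west', 5)]
```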
[jira] [Updated] (SPARK-4801) Add CTE capability to HiveContext
[ https://issues.apache.org/jira/browse/SPARK-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4801: Target Version/s: 1.3.0 > Add CTE capability to HiveContext > - > > Key: SPARK-4801 > URL: https://issues.apache.org/jira/browse/SPARK-4801 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Jacob Davis > > This is a request to add CTE functionality to HiveContext. Common Table > Expressions were added in Hive 0.13.0 with HIVE-1180. Using CTE style syntax > within HiveContext currently results in the following "Caused by" message: > {code} > Caused by: scala.MatchError: TOK_CTE (of class > org.apache.hadoop.hive.ql.parse.ASTNode) > at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500) > at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:500) > at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:248) > {code}
[jira] [Resolved] (SPARK-4735) Spark SQL UDF doesn't support 0 arguments.
[ https://issues.apache.org/jira/browse/SPARK-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4735. - Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Cheng Hao > Spark SQL UDF doesn't support 0 arguments. > -- > > Key: SPARK-4735 > URL: https://issues.apache.org/jira/browse/SPARK-4735 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao >Priority: Minor > Fix For: 1.3.0 > > > To reproduce: > val udf = () => {Seq(1,2,3)} > sqlCtx.registerFunction("myudf", udf) > sqlCtx.sql("select myudf() from tbl limit 1").collect.foreach(println)
[jira] [Created] (SPARK-4907) Inconsistent loss and gradient in LeastSquaresGradient compared with R
DB Tsai created SPARK-4907: -- Summary: Inconsistent loss and gradient in LeastSquaresGradient compared with R Key: SPARK-4907 URL: https://issues.apache.org/jira/browse/SPARK-4907 Project: Spark Issue Type: Bug Components: MLlib Reporter: DB Tsai In most academic papers and algorithm implementations, people use L = 1/2n ||A weights-y||^2 instead of L = 1/n ||A weights-y||^2 for least-squares loss. See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf Since MLlib uses a different convention, this will result in different residuals, and all the stats properties will differ from the GLMNET package in R. The model coefficients will still be the same under this change.
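The two conventions differ only by a constant factor of 2, which is why the fitted coefficients are unchanged while loss-derived statistics shift. A small self-contained sketch in plain Python on 1-D data (not MLlib code; the data are made up):

```python
# The two least-squares loss conventions differ by a factor of 2, so any
# minimizer of one minimizes the other; only the reported loss value
# (and statistics derived from it) changes.
def loss(w, xs, ys, half=True):
    n = len(xs)
    sq = sum((w * x - y) ** 2 for x, y in zip(xs, ys))
    return sq / (2 * n) if half else sq / n

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exact fit at w = 2

# Closed-form 1-D least squares; the 1/n vs 1/(2n) factor cancels out,
# so the minimizer does not depend on the convention.
w_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

l_half = loss(1.0, xs, ys, half=True)   # 1/(2n) convention (GLMNET-style)
l_full = loss(1.0, xs, ys, half=False)  # 1/n convention
print(w_star, l_full / l_half)  # 2.0 2.0
```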
[jira] [Commented] (SPARK-4907) Inconsistent loss and gradient in LeastSquaresGradient compared with R
[ https://issues.apache.org/jira/browse/SPARK-4907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253978#comment-14253978 ] Apache Spark commented on SPARK-4907: - User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/3746 > Inconsistent loss and gradient in LeastSquaresGradient compared with R > -- > > Key: SPARK-4907 > URL: https://issues.apache.org/jira/browse/SPARK-4907 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: DB Tsai > > In most academic papers and algorithm implementations, people use L = > 1/2n ||A weights-y||^2 instead of L = 1/n ||A weights-y||^2 for least-squares > loss. See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf > Since MLlib uses a different convention, this will result in different residuals > and all the stats properties will differ from the GLMNET package in R. The > model coefficients will still be the same under this change.
[jira] [Commented] (SPARK-4865) rdds exposed to sql context via registerTempTable are not listed via thrift jdbc show tables
[ https://issues.apache.org/jira/browse/SPARK-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253979#comment-14253979 ] Michael Armbrust commented on SPARK-4865: - Temporary tables are tied to a specific SQLContext and thus can't be seen or queried across different JVMs. Is that the issue you are reporting? This is a fundamental design thing that we are not going to change. Or are you creating a JDBC server with an existing HiveContext and then not seeing the tables (a separate issue that I do want to fix). > rdds exposed to sql context via registerTempTable are not listed via thrift > jdbc show tables > > > Key: SPARK-4865 > URL: https://issues.apache.org/jira/browse/SPARK-4865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Misha Chernetsov >
[jira] [Resolved] (SPARK-4762) Add support for tuples in 'where in' clause query
[ https://issues.apache.org/jira/browse/SPARK-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4762. - Resolution: Won't Fix This issue can be reopened if the hive parser is ever extended to support this syntax. > Add support for tuples in 'where in' clause query > - > > Key: SPARK-4762 > URL: https://issues.apache.org/jira/browse/SPARK-4762 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yash Datta > Fix For: 1.3.0 > > > Currently, in the where in clause the filter is applied only on a single > column. We can enhance it to accept a filter on multiple columns. > So current support is for queries like: > Select * from table where c1 in (value1,value2,...value n); > Need to add support for queries like: > Select * from table where (c1,c2,... cn) in ((value1,value2...value n), > (value1' , value2' ... ,value n') ) > Also, we can add an optimized version of the where in clause for tuples, where we > create a hashset of the filter tuples for matching rows. > This also requires a change in the hive parser since currently there is no > support for multiple columns in the IN clause.
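The proposed hashset optimization amounts to the following. This is a language-agnostic sketch in Python with made-up data, not the actual Spark implementation:

```python
# Sketch of evaluating WHERE (c1, c2) IN ((...), (...)) with a hash set:
# membership is O(1) per row instead of a scan over all filter tuples.
rows = [
    ("a", 1, 10),
    ("b", 2, 20),
    ("a", 3, 30),
]
filter_tuples = {("a", 1), ("b", 2)}  # the IN-list, stored as a set of tuples

# Keep each row whose first two columns form a tuple in the set.
matched = [r for r in rows if (r[0], r[1]) in filter_tuples]
print(matched)  # [('a', 1, 10), ('b', 2, 20)]
```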
[jira] [Updated] (SPARK-2075) Anonymous classes are missing from Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2075: --- Assignee: Shixiong Zhu > Anonymous classes are missing from Spark distribution > - > > Key: SPARK-2075 > URL: https://issues.apache.org/jira/browse/SPARK-2075 > Project: Spark > Issue Type: Bug > Components: Build, Spark Core >Affects Versions: 1.0.0 >Reporter: Paul R. Brown >Assignee: Shixiong Zhu >Priority: Critical > > Running a job built against the Maven dep for 1.0.0 and the hadoop1 > distribution produces: > {code} > java.lang.ClassNotFoundException: > org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 > {code} > Here's what's in the Maven dep as of 1.0.0: > {code} > jar tvf > ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar > | grep 'rdd/RDD' | grep 'saveAs' > 1519 Mon May 26 13:57:58 PDT 2014 > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class > 1560 Mon May 26 13:57:58 PDT 2014 > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class > {code} > And here's what's in the hadoop1 distribution: > {code} > jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' > {code} > I.e., it's not there. It is in the hadoop2 distribution: > {code} > jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' > 1519 Mon May 26 07:29:54 PDT 2014 > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class > 1560 Mon May 26 07:29:54 PDT 2014 > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class > {code}
[jira] [Commented] (SPARK-4865) rdds exposed to sql context via registerTempTable are not listed via thrift jdbc show tables
[ https://issues.apache.org/jira/browse/SPARK-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253994#comment-14253994 ] Misha Chernetsov commented on SPARK-4865: - > Or are you creating a JDBC server with an existing HiveContext and then not > seeing the tables (a separate issue that I do want to fix). I am reporting that one. > rdds exposed to sql context via registerTempTable are not listed via thrift > jdbc show tables > > > Key: SPARK-4865 > URL: https://issues.apache.org/jira/browse/SPARK-4865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Misha Chernetsov >
[jira] [Updated] (SPARK-4636) Cluster By & Distribute By output different with Hive
[ https://issues.apache.org/jira/browse/SPARK-4636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4636: Target Version/s: 1.3.0 > Cluster By & Distribute By output different with Hive > - > > Key: SPARK-4636 > URL: https://issues.apache.org/jira/browse/SPARK-4636 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao > > This is a very interesting bug. > Semantically, Cluster By & Distribute By will not cause a global ordering, as > described in the Hive wiki: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy > However, the partition keys are sorted in MapReduce after the shuffle, so from > the user's point of view, the partition key itself is globally ordered, and it > may look like: > http://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by
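The per-partition ordering effect described above can be simulated outside Hive. In this sketch (plain Python, hypothetical data), records are hash-distributed across partitions and then sorted by key within each partition, as the MapReduce shuffle does; each partition comes out key-ordered even though nothing orders the partitions relative to each other:

```python
# Simulate DISTRIBUTE BY followed by the shuffle's within-partition sort.
records = [("b", 1), ("a", 2), ("c", 3), ("a", 4), ("d", 5)]
num_partitions = 2

# Hash-partition on the key (what DISTRIBUTE BY specifies).
partitions = [[] for _ in range(num_partitions)]
for key, value in records:
    partitions[hash(key) % num_partitions].append((key, value))

# MapReduce sorts by key within each reduce partition after the shuffle.
for p in partitions:
    p.sort(key=lambda kv: kv[0])

# Every partition's output is key-ordered, which is why the keys *look*
# globally ordered to a user reading one partition's output -- but there
# is still no total order across partitions.
assert all(p == sorted(p, key=lambda kv: kv[0]) for p in partitions)
```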
[jira] [Commented] (SPARK-4589) ML add-ons to SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254002#comment-14254002 ] Michael Armbrust commented on SPARK-4589: - Can you elaborate on what you are thinking about? Is this something like: {code} def transformColumn[A,B](columnName: String, f: A => B) {code} Is there anything else? > ML add-ons to SchemaRDD > --- > > Key: SPARK-4589 > URL: https://issues.apache.org/jira/browse/SPARK-4589 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib, SQL >Reporter: Xiangrui Meng > > One piece of feedback we received from the Pipeline API (SPARK-3530) is about the > boilerplate code in the implementation. We can add more Scala DSL to simplify > the code for the operations we need in ML. Those operations could live under > spark.ml via implicits, or be added to SchemaRDD directly if they are also > useful for general purpose.
[jira] [Updated] (SPARK-2973) Add a way to show tables without executing a job
[ https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2973: Target Version/s: 1.3.0 (was: 1.2.0) > Add a way to show tables without executing a job > > > Key: SPARK-2973 > URL: https://issues.apache.org/jira/browse/SPARK-2973 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Aaron Davidson >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.2.0 > > > Right now, sql("show tables").collect() will start a Spark job which shows up > in the UI. There should be a way to get these without this step.
[jira] [Commented] (SPARK-2973) Add a way to show tables without executing a job
[ https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254006#comment-14254006 ] Michael Armbrust commented on SPARK-2973: - I think the solution here is to also special-case take in SparkPlan and use that from SchemaRDD. > Add a way to show tables without executing a job > > > Key: SPARK-2973 > URL: https://issues.apache.org/jira/browse/SPARK-2973 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Aaron Davidson >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.2.0 > > > Right now, sql("show tables").collect() will start a Spark job which shows up > in the UI. There should be a way to get these without this step.
[jira] [Updated] (SPARK-4865) Include temporary tables in SHOW TABLES
[ https://issues.apache.org/jira/browse/SPARK-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4865: Summary: Include temporary tables in SHOW TABLES (was: rdds exposed to sql context via registerTempTable are not listed via thrift jdbc show tables) > Include temporary tables in SHOW TABLES > --- > > Key: SPARK-4865 > URL: https://issues.apache.org/jira/browse/SPARK-4865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Misha Chernetsov >
[jira] [Updated] (SPARK-4865) Include temporary tables in SHOW TABLES
[ https://issues.apache.org/jira/browse/SPARK-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4865: Priority: Critical (was: Major) > Include temporary tables in SHOW TABLES > --- > > Key: SPARK-4865 > URL: https://issues.apache.org/jira/browse/SPARK-4865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Misha Chernetsov >Priority: Critical >
[jira] [Updated] (SPARK-4865) Include temporary tables in SHOW TABLES
[ https://issues.apache.org/jira/browse/SPARK-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4865: Target Version/s: 1.3.0 > Include temporary tables in SHOW TABLES > --- > > Key: SPARK-4865 > URL: https://issues.apache.org/jira/browse/SPARK-4865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Misha Chernetsov >
[jira] [Updated] (SPARK-4629) Spark SQL uses Hadoop Configuration in a thread-unsafe manner when writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-4629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4629: Target Version/s: 1.3.0 > Spark SQL uses Hadoop Configuration in a thread-unsafe manner when writing > Parquet files > > > Key: SPARK-4629 > URL: https://issues.apache.org/jira/browse/SPARK-4629 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Michael Allman > > The method {{ParquetRelation.createEmpty}} mutates its given Hadoop > {{Configuration}} instance to set the Parquet writer compression level (cf. > https://github.com/apache/spark/blob/v1.1.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala#L149). > This can lead to a {{ConcurrentModificationException}} when running > concurrent jobs sharing a single {{SparkContext}} which involve saving > Parquet files. > Our "fix" was to simply remove the line in question and set the compression > level in the hadoop configuration before starting our jobs.
[jira] [Updated] (SPARK-4760) "ANALYZE TABLE table COMPUTE STATISTICS noscan" failed estimating table size for tables created from Parquet files
[ https://issues.apache.org/jira/browse/SPARK-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4760: Target Version/s: 1.3.0 Affects Version/s: (was: 1.3.0) > "ANALYZE TABLE table COMPUTE STATISTICS noscan" failed estimating table size > for tables created from Parquet files > -- > > Key: SPARK-4760 > URL: https://issues.apache.org/jira/browse/SPARK-4760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Jianshi Huang > > In an older Spark version built around Oct. 12, I was able to use > ANALYZE TABLE table COMPUTE STATISTICS noscan > to get estimated table size, which is important for optimizing joins. (I'm > joining 15 small dimension tables, and this is crucial to me). > In the more recent Spark builds, it fails to estimate the table size unless I > remove "noscan". > Here are the statistics I got using DESC EXTENDED: > old: > parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166} > new: > parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, > COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1} > And I've tried turning off spark.sql.hive.convertMetastoreParquet in my > spark-defaults.conf and the result is unaffected (in both versions). > Looks like the Parquet support in new Hive (0.13.1) is broken? > Jianshi
[jira] [Updated] (SPARK-4689) Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java
[ https://issues.apache.org/jira/browse/SPARK-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4689: Labels: starter (was: 1.0.3) > Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java > -- > > Key: SPARK-4689 > URL: https://issues.apache.org/jira/browse/SPARK-4689 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Chris Fregly >Priority: Minor > Labels: starter > > Currently, you need to use unionAll() in Scala. > Python does not expose this functionality at the moment. > The current workaround is to use the UNION ALL HiveQL functionality detailed > here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union
[jira] [Updated] (SPARK-4760) "ANALYZE TABLE table COMPUTE STATISTICS noscan" failed estimating table size for tables created from Parquet files
[ https://issues.apache.org/jira/browse/SPARK-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4760: Priority: Critical (was: Major) > "ANALYZE TABLE table COMPUTE STATISTICS noscan" failed estimating table size > for tables created from Parquet files > -- > > Key: SPARK-4760 > URL: https://issues.apache.org/jira/browse/SPARK-4760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Jianshi Huang >Priority: Critical > > In an older Spark version built around Oct. 12, I was able to use > ANALYZE TABLE table COMPUTE STATISTICS noscan > to get estimated table size, which is important for optimizing joins. (I'm > joining 15 small dimension tables, and this is crucial to me). > In the more recent Spark builds, it fails to estimate the table size unless I > remove "noscan". > Here are the statistics I got using DESC EXTENDED: > old: > parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166} > new: > parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, > COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1} > And I've tried turning off spark.sql.hive.convertMetastoreParquet in my > spark-defaults.conf and the result is unaffected (in both versions). > Looks like the Parquet support in new Hive (0.13.1) is broken? > Jianshi
[jira] [Updated] (SPARK-4689) Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java
[ https://issues.apache.org/jira/browse/SPARK-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4689: Target Version/s: 1.3.0 > Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java > -- > > Key: SPARK-4689 > URL: https://issues.apache.org/jira/browse/SPARK-4689 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Chris Fregly >Priority: Minor > Labels: starter > > Currently, you need to use unionAll() in Scala. > Python does not expose this functionality at the moment. > The current workaround is to use the UNION ALL HiveQL functionality detailed > here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union
[jira] [Updated] (SPARK-4648) Support COALESCE function in Spark SQL and HiveQL
[ https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4648: Target Version/s: 1.3.0 Assignee: Ravindra Pesala > Support COALESCE function in Spark SQL and HiveQL > - > > Key: SPARK-4648 > URL: https://issues.apache.org/jira/browse/SPARK-4648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Ravindra Pesala >Assignee: Ravindra Pesala > > Support Coalesce function in Spark SQL. > Support type widening in Coalesce function. > And replace Coalesce UDF in Spark Hive with local Coalesce function since it > is memory efficient and faster.
[jira] [Commented] (SPARK-4564) SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema
[ https://issues.apache.org/jira/browse/SPARK-4564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254016#comment-14254016 ] Michael Armbrust commented on SPARK-4564: - It is, however, consistent with SQL, where GROUP BY expressions are only included if they are part of the SELECT clause. Since the goal here is to provide programmatic SQL I'm inclined to stick with the current semantics. Changing this would also be a fairly major breaking change to the API if people were dependent on the position of columns in the result. > SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the > groupingExprs as part of the output schema > -- > > Key: SPARK-4564 > URL: https://issues.apache.org/jira/browse/SPARK-4564 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 > Environment: Mac OSX, local mode, but should hold true for all > environments >Reporter: Dean Wampler > > In the following example, I would expect the "grouped" schema to contain two > fields, the String name and the Long count, but it only contains the Long > count. > {code} > // Assumes val sc = new SparkContext(...), e.g., in Spark Shell > import org.apache.spark.sql.{SQLContext, SchemaRDD} > import org.apache.spark.sql.catalyst.expressions._ > val sqlc = new SQLContext(sc) > import sqlc._ > case class Record(name: String, n: Int) > val records = List( > Record("three", 1), > Record("three", 2), > Record("two", 3), > Record("three", 4), > Record("two", 5)) > val recs = sc.parallelize(records) > recs.registerTempTable("records") > val grouped = recs.select('name, 'n).groupBy('name)(Count('n) as 'count) > grouped.printSchema > // root > // |-- count: long (nullable = false) > grouped foreach println > // [2] > // [3] > {code}
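The SQL behavior referred to above can be checked in any SQL engine. A quick illustration with SQLite (chosen only because it is readily available; the data mirror the report's example):

```python
import sqlite3

# In SQL, a grouping expression shows up in the result only when it is
# listed in the SELECT clause -- the semantics groupBy currently mirrors.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (name TEXT, n INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [("three", 1), ("three", 2), ("two", 3),
                  ("three", 4), ("two", 5)])

# Grouping column not selected: only the aggregate comes back.
only_counts = conn.execute(
    "SELECT COUNT(n) FROM records GROUP BY name ORDER BY name").fetchall()
print(only_counts)  # [(3,), (2,)]

# Selecting the grouping column is what includes it in the output.
with_names = conn.execute(
    "SELECT name, COUNT(n) FROM records GROUP BY name ORDER BY name").fetchall()
print(with_names)  # [('three', 3), ('two', 2)]
```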
[jira] [Resolved] (SPARK-4564) SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema
[ https://issues.apache.org/jira/browse/SPARK-4564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4564. - Resolution: Won't Fix I'm going to close this wontfix unless there is major objection. Happy to accept PRs to clarify the documentation though :) > SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the > groupingExprs as part of the output schema > -- > > Key: SPARK-4564 > URL: https://issues.apache.org/jira/browse/SPARK-4564 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 > Environment: Mac OSX, local mode, but should hold true for all > environments >Reporter: Dean Wampler > > In the following example, I would expect the "grouped" schema to contain two > fields, the String name and the Long count, but it only contains the Long > count. > {code} > // Assumes val sc = new SparkContext(...), e.g., in Spark Shell > import org.apache.spark.sql.{SQLContext, SchemaRDD} > import org.apache.spark.sql.catalyst.expressions._ > val sqlc = new SQLContext(sc) > import sqlc._ > case class Record(name: String, n: Int) > val records = List( > Record("three", 1), > Record("three", 2), > Record("two", 3), > Record("three", 4), > Record("two", 5)) > val recs = sc.parallelize(records) > recs.registerTempTable("records") > val grouped = recs.select('name, 'n).groupBy('name)(Count('n) as 'count) > grouped.printSchema > // root > // |-- count: long (nullable = false) > grouped foreach println > // [2] > // [3] > {code}
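The SQL behavior Michael cites (grouping columns appear in the output only when they are also selected) is easy to verify outside Spark. A minimal sketch using Python's built-in sqlite3, mirroring the Record(name, n) data above; this is plain SQL, not the SchemaRDD API:

```python
import sqlite3

# In-memory table mirroring the Record(name, n) example from the report.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (name TEXT, n INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [("three", 1), ("three", 2), ("two", 3),
                  ("three", 4), ("two", 5)])

# Grouping column omitted from SELECT: the result schema has only the count,
# matching the current SchemaRDD semantics.
counts_only = conn.execute(
    "SELECT COUNT(n) FROM records GROUP BY name").fetchall()

# Grouping column selected explicitly: it appears in the output.
with_name = conn.execute(
    "SELECT name, COUNT(n) FROM records GROUP BY name ORDER BY name").fetchall()

print(counts_only)  # counts only, e.g. [(3,), (2,)]
print(with_name)    # [('three', 3), ('two', 2)]
```

So the workaround on the Spark side is simply to include the grouping expression among the aggregate expressions.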
[jira] [Updated] (SPARK-4502) Spark SQL reads unnecessary fields from Parquet
[ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4502: Priority: Critical (was: Major) Target Version/s: 1.3.0 > Spark SQL reads unnecessary fields from Parquet > --- > > Key: SPARK-4502 > URL: https://issues.apache.org/jira/browse/SPARK-4502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Liwen Sun >Priority: Critical > > When reading a field of a nested column from Parquet, Spark SQL reads and > assembles all the fields of that nested column. This is unnecessary, as > Parquet supports fine-grained field reads out of a nested column. This may > degrade performance significantly when a nested column has many fields. > For example, I loaded json tweets data into SparkSQL and ran the following > query: > {{SELECT User.contributors_enabled from Tweets;}} > User is a nested structure that has 38 primitive fields (for the Tweets schema, > see: https://dev.twitter.com/overview/api/tweets); here is the log message: > {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 > cell/ms}} > For comparison, I also ran: > {{SELECT User FROM Tweets;}} > And here is the log message: > {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}} > So both queries load 38 columns from Parquet, while the first query only > needs 1 column. I also measured the bytes read within Parquet. In these two > cases, the same number of bytes (99365194 bytes) were read.
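The cost Liwen describes comes from assembling every field of the struct when only one leaf is needed. A toy Python sketch of leaf-level projection, illustrative only and not Parquet's actual reader API:

```python
# Extract a single dotted path (e.g. 'User.contributors_enabled') from
# nested records, instead of materializing all 38 fields of the struct.
def read_leaf(rows, path):
    keys = path.split(".")
    out = []
    for row in rows:
        value = row
        for key in keys:
            value = value[key]  # descend one level per path component
        out.append(value)
    return out

# Two toy tweets; only the requested leaf is touched by read_leaf.
tweets = [
    {"User": {"contributors_enabled": True,  "followers_count": 10}},
    {"User": {"contributors_enabled": False, "followers_count": 99}},
]

print(read_leaf(tweets, "User.contributors_enabled"))  # [True, False]
```

In columnar storage the saving is larger still, since the untouched leaves need not even be read from disk.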
[jira] [Updated] (SPARK-4476) Use MapType for dict in json which has unique keys in each row.
[ https://issues.apache.org/jira/browse/SPARK-4476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4476: Target Version/s: 1.3.0 > Use MapType for dict in json which has unique keys in each row. > --- > > Key: SPARK-4476 > URL: https://issues.apache.org/jira/browse/SPARK-4476 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Priority: Critical > > For the jsonRDD like this: > {code} > """ {a: 1} """ > """ {b: 2} """ > """ {c: 3} """ > """ {d: 4} """ > """ {e: 5} """ > {code} > It will create a StructType with 5 fields in it, each field coming from a > different row. This will be a problem if the RDD is large. A StructType with > thousands or millions of fields is hard to work with (it will cause a stack > overflow during serialization). > It should be MapType for this case. We need a clear rule to decide whether > StructType or MapType will be used for a dict in json data. > cc [~yhuai] [~marmbrus]
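One possible rule for the decision Davies raises: if the union of keys across rows is far larger than any single row, the keys are effectively row-unique and a MapType fits better than a StructType with one field per distinct key. A hedged sketch; the function name and the field-count threshold are illustrative, not Spark's API:

```python
# Choose between "StructType" and "MapType" for a column of JSON dicts,
# based on how many distinct keys appear across all rows.
def infer_dict_type(rows, struct_field_limit=100):
    all_keys = set()
    for row in rows:
        all_keys.update(row)
    if len(all_keys) > struct_field_limit:
        return "MapType"      # avoids a StructType with thousands of fields
    return "StructType"

unique_key_rows = [{"k%d" % i: i} for i in range(1000)]  # one new key per row
shared_key_rows = [{"a": 1}, {"a": 2}]                   # same key every row

print(infer_dict_type(unique_key_rows))  # MapType
print(infer_dict_type(shared_key_rows))  # StructType
```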
[jira] [Commented] (SPARK-4367) Process the "distinct" value before shuffling for aggregation
[ https://issues.apache.org/jira/browse/SPARK-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254020#comment-14254020 ] Michael Armbrust commented on SPARK-4367: - So we already do this for SUM and COUNT, and I don't think there is an AVG DISTINCT currently. Should we close this or is there more to it? > Process the "distinct" value before shuffling for aggregation > - > > Key: SPARK-4367 > URL: https://issues.apache.org/jira/browse/SPARK-4367 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao > > Most aggregate functions (e.g. average) with a "distinct" value require all > of the records in the same group to be shuffled to a single node. However, > as part of the optimization, those records can be partially aggregated before > shuffling, which probably reduces the overhead of shuffling significantly.
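The proposed optimization amounts to map-side deduplication: each task drops duplicate (group, value) pairs before the shuffle, so the shuffle carries only distinct pairs. A pure-Python sketch in which plain lists stand in for partitions; this is a concept sketch, not Spark code:

```python
# Local dedup on each "map task" before anything crosses the network.
def partial_distinct(partition):
    return set(partition)

# "Reduce side": merge the already-deduplicated sets, then count
# distinct values per group.
def count_distinct(partitions):
    merged = set()
    for part in partitions:
        merged |= partial_distinct(part)
    counts = {}
    for group, value in merged:
        counts[group] = counts.get(group, 0) + 1
    return counts

# Partition 1 ships 2 pairs instead of 3; partition 2 ships 2 instead of 2.
parts = [[("a", 1), ("a", 1), ("b", 2)], [("a", 1), ("a", 3)]]
print(count_distinct(parts))  # a -> 2 distinct values, b -> 1
```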
[jira] [Resolved] (SPARK-4469) Move the SemanticAnalyzer from Physical Execution to Analysis
[ https://issues.apache.org/jira/browse/SPARK-4469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4469. - Resolution: Fixed Assignee: Cheng Hao > Move the SemanticAnalyzer from Physical Execution to Analysis > - > > Key: SPARK-4469 > URL: https://issues.apache.org/jira/browse/SPARK-4469 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao >Priority: Minor > > This is the code refactor and follow ups for > "https://github.com/apache/spark/pull/2570"
[jira] [Updated] (SPARK-4657) Support storing decimals in Parquet that don't fit in a LONG
[ https://issues.apache.org/jira/browse/SPARK-4657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4657: Summary: Suport storing decimals in Parquet that don't fit in a LONG (was: RuntimeException: Unsupported datatype DecimalType()) > Suport storing decimals in Parquet that don't fit in a LONG > --- > > Key: SPARK-4657 > URL: https://issues.apache.org/jira/browse/SPARK-4657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: pengyanhong > > execute a query statement on a Hive table which contains decimal data type > field, than save the result into tachyon as parquet file, got error as below: > {quote} > java.lang.RuntimeException: Unsupported datatype DecimalType() > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361) > at > 
org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407) > at > org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:151) > at > org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:130) > at > org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:424) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:424) > at > org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76) > at > org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:103) > at com.jd.jddp.spark.hive.Cache$.cacheTable(Cache.scala:33) > at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:61) > at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:59) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) 
> at com.jd.jddp.spark.hive.Cache$.main(Cache.scala:59) > at com.jd.jddp.spark.hive.Cache.main(Cache.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$a
[jira] [Updated] (SPARK-4657) Support storing decimals in Parquet that don't fit in a LONG
[ https://issues.apache.org/jira/browse/SPARK-4657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4657: Target Version/s: 1.3.0 Issue Type: Improvement (was: Bug) > Suport storing decimals in Parquet that don't fit in a LONG > --- > > Key: SPARK-4657 > URL: https://issues.apache.org/jira/browse/SPARK-4657 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 >Reporter: pengyanhong > > execute a query statement on a Hive table which contains decimal data type > field, than save the result into tachyon as parquet file, got error as below: > {quote} > java.lang.RuntimeException: Unsupported datatype DecimalType() > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407) > 
at > org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:151) > at > org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:130) > at > org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:424) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:424) > at > org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76) > at > org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:103) > at com.jd.jddp.spark.hive.Cache$.cacheTable(Cache.scala:33) > at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:61) > at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:59) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at com.jd.jddp.spark.hive.Cache$.main(Cache.scala:59) > at 
com.jd.jddp.spark.hive.Cache.main(Cache.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:459) > {quote} -
[jira] [Updated] (SPARK-4176) Support decimals with precision > 18 in Parquet
[ https://issues.apache.org/jira/browse/SPARK-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4176: Target Version/s: 1.3.0 > Support decimals with precision > 18 in Parquet > --- > > Key: SPARK-4176 > URL: https://issues.apache.org/jira/browse/SPARK-4176 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Matei Zaharia > > After https://issues.apache.org/jira/browse/SPARK-3929, only decimals with > precisions <= 18 (that can be read into a Long) will be readable from > Parquet, so we still need more work to support these larger ones.
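A decimal wider than 18 digits no longer fits in a 64-bit unscaled value; one way Parquet can store such decimals is as a fixed-length byte array holding the big-endian two's-complement unscaled integer. A hedged Python sketch of that encoding; the 16-byte width here is illustrative, a real writer would size it from the declared precision:

```python
from decimal import Decimal

# Encode: shift the decimal point away (scale), then serialize the
# unscaled integer into a fixed number of bytes.
def encode_decimal(d, scale, width):
    unscaled = int(d.scaleb(scale))
    return unscaled.to_bytes(width, byteorder="big", signed=True)

# Decode: read the signed integer back and restore the scale.
def decode_decimal(raw, scale):
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

d = Decimal("12345678901234567890.12")  # 22 digits: too wide for a LONG
raw = encode_decimal(d, scale=2, width=16)
print(decode_decimal(raw, scale=2))  # 12345678901234567890.12
```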
[jira] [Resolved] (SPARK-4657) Support storing decimals in Parquet that don't fit in a LONG
[ https://issues.apache.org/jira/browse/SPARK-4657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4657. - Resolution: Duplicate > Suport storing decimals in Parquet that don't fit in a LONG > --- > > Key: SPARK-4657 > URL: https://issues.apache.org/jira/browse/SPARK-4657 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 >Reporter: pengyanhong > > execute a query statement on a Hive table which contains decimal data type > field, than save the result into tachyon as parquet file, got error as below: > {quote} > java.lang.RuntimeException: Unsupported datatype DecimalType() > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407) > at > 
org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:151) > at > org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:130) > at > org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:424) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:424) > at > org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76) > at > org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:103) > at com.jd.jddp.spark.hive.Cache$.cacheTable(Cache.scala:33) > at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:61) > at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:59) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at com.jd.jddp.spark.hive.Cache$.main(Cache.scala:59) > at 
com.jd.jddp.spark.hive.Cache.main(Cache.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:459) > {quote}
[jira] [Updated] (SPARK-4512) Unresolved Attribute Exception for sort by
[ https://issues.apache.org/jira/browse/SPARK-4512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4512: Target Version/s: 1.3.0 > Unresolved Attribute Exception for sort by > -- > > Key: SPARK-4512 > URL: https://issues.apache.org/jira/browse/SPARK-4512 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao > > It will cause an exception when executing a query like: > SELECT key+key FROM src sort by value;
[jira] [Updated] (SPARK-4302) Make jsonRDD/jsonFile support more field data types
[ https://issues.apache.org/jira/browse/SPARK-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4302: Target Version/s: 1.3.0 > Make jsonRDD/jsonFile support more field data types > --- > > Key: SPARK-4302 > URL: https://issues.apache.org/jira/browse/SPARK-4302 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yin Huai > > Since we allow users to specify schemas, jsonRDD/jsonFile should support all > Spark SQL data types in the provided schema. > A related post in mailing list: > http://apache-spark-user-list.1001560.n3.nabble.com/jsonRdd-and-MapType-td18376.html
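Supporting a user-provided schema means coercing parsed JSON values to the declared types. A hedged Python sketch of that walk; the schema encoding below (dicts for structs, one-element lists for arrays, constructors for primitives) is illustrative and is not Spark SQL's type system:

```python
import json

# Walk a declared schema and coerce the parsed JSON value to it.
def coerce(value, schema):
    if isinstance(schema, dict):   # struct: field name -> field schema
        return {k: coerce(value.get(k), s) for k, s in schema.items()}
    if isinstance(schema, list):   # array with a single element type
        return [coerce(v, schema[0]) for v in value]
    return schema(value)           # primitive: a Python type constructor

schema = {"name": str, "scores": [float]}
row = coerce(json.loads('{"name": "a", "scores": [1, 2]}'), schema)
print(row)  # {'name': 'a', 'scores': [1.0, 2.0]}
```

The same idea generalizes to the remaining Spark SQL types (dates, decimals, maps) by adding branches to the walk.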
[jira] [Updated] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4296: Priority: Critical (was: Major) > Throw "Expression not in GROUP BY" when using same expression in group by > clause and select clause > --- > > Key: SPARK-4296 > URL: https://issues.apache.org/jira/browse/SPARK-4296 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Shixiong Zhu >Assignee: Cheng Lian >Priority: Critical > > When the input data has a complex structure, using the same expression in the > group by clause and the select clause will throw "Expression not in GROUP BY". > {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Birthday(date: String) > case class Person(name: String, birthday: Birthday) > val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), > Person("Jim", Birthday("1980-02-28")))) > people.registerTempTable("people") > val year = sqlContext.sql("select count(*), upper(birthday.date) from people > group by upper(birthday.date)") > year.collect > {code} > Here is the plan of year: > {code:java} > SchemaRDD[3] at RDD at SchemaRDD.scala:105 > == Query Plan == > == Physical Plan == > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression > not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree: > Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date > AS date#9) AS c1#3] > Subquery people > LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at > ExistingRDD.scala:36 > {code} > The bug is the equality test for `Upper(birthday#1.date)` and > `Upper(birthday#1.date AS date#9)`. > Maybe Spark SQL needs a mechanism to compare Alias expressions and non-Alias > expressions.
[jira] [Updated] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4296: Target Version/s: 1.3.0 > Throw "Expression not in GROUP BY" when using same expression in group by > clause and select clause > --- > > Key: SPARK-4296 > URL: https://issues.apache.org/jira/browse/SPARK-4296 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Shixiong Zhu >Assignee: Cheng Lian > > When the input data has a complex structure, using the same expression in the > group by clause and the select clause will throw "Expression not in GROUP BY". > {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Birthday(date: String) > case class Person(name: String, birthday: Birthday) > val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), > Person("Jim", Birthday("1980-02-28")))) > people.registerTempTable("people") > val year = sqlContext.sql("select count(*), upper(birthday.date) from people > group by upper(birthday.date)") > year.collect > {code} > Here is the plan of year: > {code:java} > SchemaRDD[3] at RDD at SchemaRDD.scala:105 > == Query Plan == > == Physical Plan == > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression > not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree: > Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date > AS date#9) AS c1#3] > Subquery people > LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at > ExistingRDD.scala:36 > {code} > The bug is the equality test for `Upper(birthday#1.date)` and > `Upper(birthday#1.date AS date#9)`. > Maybe Spark SQL needs a mechanism to compare Alias expressions and non-Alias > expressions.
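The fix Shixiong suggests amounts to comparing expressions after stripping aliases, so that Upper(date) matches Upper(date AS date#9). A minimal Python model of that idea; the class names are illustrative and do not mirror Catalyst's actual expression classes:

```python
# A leaf attribute reference; equal to another Attr with the same name.
class Attr:
    def __init__(self, name):
        self.name = name
    def strip_alias(self):
        return self
    def __eq__(self, other):
        return type(other) is Attr and self.name == other.name

# An alias node wrapping a child expression; stripping removes it.
class Alias:
    def __init__(self, child, name):
        self.child, self.name = child, name
    def strip_alias(self):
        return self.child.strip_alias()

# A unary function; equality recurses into the child.
class Upper:
    def __init__(self, child):
        self.child = child
    def strip_alias(self):
        return Upper(self.child.strip_alias())
    def __eq__(self, other):
        return type(other) is Upper and self.child == other.child

group_expr = Upper(Attr("birthday#1.date"))
select_expr = Upper(Alias(Attr("birthday#1.date"), "date#9"))

print(group_expr == select_expr)                              # False: naive equality
print(group_expr.strip_alias() == select_expr.strip_alias())  # True: alias-stripped
```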
[jira] [Resolved] (SPARK-4209) Support UDT in UDF
[ https://issues.apache.org/jira/browse/SPARK-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4209. - Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Michael Armbrust Fixed here: https://github.com/apache/spark/commit/15b58a2234ab7ba30c9c0cbb536177a3c725e350 > Support UDT in UDF > -- > > Key: SPARK-4209 > URL: https://issues.apache.org/jira/browse/SPARK-4209 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Xiangrui Meng >Assignee: Michael Armbrust > Fix For: 1.2.0 > > > UDF doesn't recognize functions defined with UDTs. Before execution, an SQL > internal datum should be converted to Scala types, and after execution, the > result should be converted back to internal format (maybe this part is > already done).
[jira] [Resolved] (SPARK-4201) Can't use concat() on partition column in where condition (Hive compatibility problem)
[ https://issues.apache.org/jira/browse/SPARK-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4201. - Resolution: Fixed Fix Version/s: 1.2.0 Since this was reported working in master I'm going to close. Please reopen if you are still having problems. > Can't use concat() on partition column in where condition (Hive compatibility > problem) > -- > > Key: SPARK-4201 > URL: https://issues.apache.org/jira/browse/SPARK-4201 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0, 1.1.0 > Environment: Hive 0.12+hadoop 2.4/hadoop 2.2 +spark 1.1 >Reporter: dongxu >Priority: Minor > Labels: com > Fix For: 1.2.0 > > > The team used Hive for queries; we are trying to move to Spark SQL. > When I run a query like: > select count(1) from gulfstream_day_driver_base_2 where > concat(year,month,day) = '20140929'; > it fails, but it works well in Hive. > I have to rewrite the SQL as "select count(1) from > gulfstream_day_driver_base_2 where year = 2014 and month = 09 and day = 29". > Here is the error log:
> 14/11/03 15:05:03 ERROR SparkSQLDriver: Failed in [select count(1) from > gulfstream_day_driver_base_2 where concat(year,month,day) = '20140929'] > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Aggregate false, [], [SUM(PartialCount#1390L) AS c_0#1337L] > Exchange SinglePartition > Aggregate true, [], [COUNT(1) AS PartialCount#1390L] >HiveTableScan [], (MetastoreRelation default, > gulfstream_day_driver_base_2, None), > Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) > = 20140929)) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) > at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:415) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291) > at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused 
by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: > execute, tree: > Exchange SinglePartition > Aggregate true, [], [COUNT(1) AS PartialCount#1390L] > HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, > None), > Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) > = 20140929)) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) > at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44) > at > org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128) > at > org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46) > ... 16 more > Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: > execute, tree: > Aggregate true, [], [COUNT(1) AS PartialCount#1390L] > HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, > None), > Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) > = 20140929)) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) > at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126) > at > org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:86) > at > or
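Part of why the original query is cheap in Hive is that concat(year, month, day) = '20140929' can be evaluated against partition-column values alone, pruning partitions before any data is scanned. A pure-Python sketch of that predicate evaluation; the partition layout and column values here are illustrative:

```python
# Each partition of the table is identified by its partition-column values.
partitions = [
    {"year": "2014", "month": "09", "day": "28"},
    {"year": "2014", "month": "09", "day": "29"},
    {"year": "2014", "month": "10", "day": "01"},
]

# The concat predicate from the report, evaluated per partition.
def concat_pred(p):
    return p["year"] + p["month"] + p["day"] == "20140929"

# Only matching partitions would be scanned.
selected = [p for p in partitions if concat_pred(p)]
print(selected)  # [{'year': '2014', 'month': '09', 'day': '29'}]
```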
[jira] [Resolved] (SPARK-4135) Error reading Parquet file generated with SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-4135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-4135.
-
    Resolution: Won't Fix
      Assignee: Michael Armbrust

> Error reading Parquet file generated with SparkSQL
> --
>
> Key: SPARK-4135
> URL: https://issues.apache.org/jira/browse/SPARK-4135
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.1.0
> Components: SQL
> Reporter: Hossein Falaki
> Assignee: Michael Armbrust
> Attachments: _metadata, part-r-1.parquet
>
> I read a tsv version of the one million songs dataset (available here: http://tbmmsd.s3.amazonaws.com/)
> After reading it I create a SchemaRDD with the following schema:
> {code}
> root
>  |-- track_id: string (nullable = true)
>  |-- analysis_sample_rate: string (nullable = true)
>  |-- artist_7digitalid: string (nullable = true)
>  |-- artist_familiarity: double (nullable = true)
>  |-- artist_hotness: double (nullable = true)
>  |-- artist_id: string (nullable = true)
>  |-- artist_latitude: string (nullable = true)
>  |-- artist_location: string (nullable = true)
>  |-- artist_longitude: string (nullable = true)
>  |-- artist_mbid: string (nullable = true)
>  |-- artist_mbtags: array (nullable = true)
>  |    |-- element: string (containsNull = true)
>  |-- artist_mbtags_count: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- artist_name: string (nullable = true)
>  |-- artist_playmeid: string (nullable = true)
>  |-- artist_terms: array (nullable = true)
>  |    |-- element: string (containsNull = true)
>  |-- artist_terms_freq: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- artist_terms_weight: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- audio_md5: string (nullable = true)
>  |-- bars_confidence: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- bars_start: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- beats_confidence: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- beats_start: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- danceability: double (nullable = true)
>  |-- duration: double (nullable = true)
>  |-- end_of_fade_in: double (nullable = true)
>  |-- energy: double (nullable = true)
>  |-- key: string (nullable = true)
>  |-- key_confidence: double (nullable = true)
>  |-- loudness: double (nullable = true)
>  |-- mode: double (nullable = true)
>  |-- mode_confidence: double (nullable = true)
>  |-- release: string (nullable = true)
>  |-- release_7digitalid: string (nullable = true)
>  |-- sections_confidence: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- sections_start: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- segments_confidence: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- segments_loudness_max: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- segments_loudness_max_time: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- segments_loudness_start: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- segments_pitches: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- segments_start: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- segments_timbre: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- similar_artists: array (nullable = true)
>  |    |-- element: string (containsNull = true)
>  |-- song_hotness: double (nullable = true)
>  |-- song_id: string (nullable = true)
>  |-- start_of_fade_out: double (nullable = true)
>  |-- tatums_confidence: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- tatums_start: array (nullable = true)
>  |    |-- element: double (containsNull = true)
>  |-- tempo: double (nullable = true)
>  |-- time_signature: double (nullable = true)
>  |-- time_signature_confidence: double (nullable = true)
>  |-- title: string (nullable = true)
>  |-- track_7digitalid: string (nullable = true)
>  |-- year: double (nullable = true)
> {code}
> I select a single record from it and save it using saveAsParquetFile().
> When I read it later and try to query it I get the following exception:
> {code}
> Error in SQL statement: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
>   at sun.reflect.GeneratedMethodAccessor208.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Meth
[jira] [Commented] (SPARK-4135) Error reading Parquet file generated with SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-4135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254035#comment-14254035 ]

Michael Armbrust commented on SPARK-4135:
-

The problem here is that you have two columns with the same name, "beats_start". The new version of Parquet gives you a better error message.

> Error reading Parquet file generated with SparkSQL
> --
>
> Key: SPARK-4135
> URL: https://issues.apache.org/jira/browse/SPARK-4135
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.1.0
> Components: SQL
> Reporter: Hossein Falaki
> Attachments: _metadata, part-r-1.parquet
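Per the comment above, the root cause was two columns sharing the name "beats_start". A cheap guard is to scan a schema's field names for duplicates before writing Parquet. The sketch below is plain illustrative Python, not Spark's API; the helper name and the abbreviated field list are hypothetical:

```python
from collections import Counter

def find_duplicate_columns(field_names):
    """Return column names that appear more than once (case-insensitive)."""
    counts = Counter(name.lower() for name in field_names)
    return sorted(name for name, n in counts.items() if n > 1)

# Hypothetical excerpt of the reported schema, with the duplicated
# "beats_start" column that triggered the Parquet read error.
fields = ["track_id", "beats_confidence", "beats_start",
          "beats_start", "danceability"]

dupes = find_duplicate_columns(fields)
if dupes:
    print("Duplicate columns, refusing to write Parquet:", dupes)
```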
[jira] [Resolved] (SPARK-4248) [SQL] spark sql not support add jar
[ https://issues.apache.org/jira/browse/SPARK-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-4248.
-
    Resolution: Fixed
    Fix Version/s: 1.2.0

> [SQL] spark sql not support add jar
>
> Key: SPARK-4248
> URL: https://issues.apache.org/jira/browse/SPARK-4248
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.1.1
> Environment: java: 1.7
> hadoop: 2.3.0-cdh5.0.0
> spark: 1.1.1
> thriftserver-with-hive: 0.12
> hive metastore: 0.13.1
> Reporter: qiaohaijun
> Fix For: 1.2.0
>
> ADD JAR is not supported; the UDF jar needs to be uploaded with --jars instead.
[jira] [Commented] (SPARK-4317) Error querying Avro files imported by Sqoop: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes
[ https://issues.apache.org/jira/browse/SPARK-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254039#comment-14254039 ]

Michael Armbrust commented on SPARK-4317:
-

Is this still a problem in recent versions? There has been quite a bit of work in this part of the code.

> Error querying Avro files imported by Sqoop: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes
> --
>
> Key: SPARK-4317
> URL: https://issues.apache.org/jira/browse/SPARK-4317
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.1.0
> Environment: Spark 1.1.0, Sqoop 1.4.5, PostgreSQL 9.3
> Reporter: Hendy Irawan
>
> After importing a table from PostgreSQL 9.3 to an Avro file using Sqoop 1.4.5, Spark SQL 1.1.0 is unable to process it
> (note that Hive 0.13 can process the Avro file just fine):
> {code}
> spark-sql> select city from place;
> 14/11/10 10:15:08 INFO ParseDriver: Parsing command: select city from place
> 14/11/10 10:15:08 INFO ParseDriver: Parse Completed
> 14/11/10 10:15:08 INFO HiveMetaStore: 0: get_table : db=default tbl=place
> 14/11/10 10:15:08 INFO audit: ugi=ceefour ip=unknown-ip-addr cmd=get_table : db=default tbl=place
> 14/11/10 10:15:08 ERROR SparkSQLDriver: Failed in [select city from place]
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'city, tree:
> Project ['city]
>  LowerCaseSchema
>   MetastoreRelation default, place, None
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:68)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
>   at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>   at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
>   at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:397)
>   at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:397)
>   at org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan$lzycompute(HiveContext.scala:358)
>   at org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan(HiveContext.scala:357)
>   at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
>   at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
>   at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
>   at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
>   at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:406)
>   at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
>   at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
>   at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
>   at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMeth
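An "Unresolved attributes" error means the analyzer could not match a name in the query (here 'city) against any attribute of the relation. The toy sketch below mimics that resolution step in plain Python to show how a case mismatch between lowercased Hive identifiers and the Avro schema's field casing could leave an attribute unresolved; the column names and the `resolve` helper are hypothetical, not Spark internals:

```python
def resolve(name, columns, case_sensitive=True):
    """Return the schema column matching `name`, or None if unresolved."""
    for col in columns:
        if col == name or (not case_sensitive and col.lower() == name.lower()):
            return col
    return None

avro_columns = ["City", "Country"]  # hypothetical Avro field casing

# Hive lowercases identifiers, so a case-sensitive lookup fails ...
assert resolve("city", avro_columns, case_sensitive=True) is None
# ... while a case-insensitive analyzer resolves the attribute.
assert resolve("city", avro_columns, case_sensitive=False) == "City"
```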
[jira] [Updated] (SPARK-3851) Support for reading parquet files with different but compatible schema
[ https://issues.apache.org/jira/browse/SPARK-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-3851:
-
    Priority: Critical (was: Major)
    Target Version/s: 1.3.0
    Issue Type: Improvement (was: Bug)

> Support for reading parquet files with different but compatible schema
> --
>
> Key: SPARK-3851
> URL: https://issues.apache.org/jira/browse/SPARK-3851
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Michael Armbrust
> Priority: Critical
>
> Right now it is required that all of the parquet files have the same schema. It would be nice to support some safe subset of cases where the schemas of the files are different. For example:
> - Adding and removing nullable columns.
> - Widening types (a column that is of both Int and Long type)
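The two "safe" cases SPARK-3851 asks for (adding/removing nullable columns, widening Int/Long) can be sketched as a plain-Python schema merge. This is only an illustration of the requested behavior under an assumed widening rule, not Spark's implementation:

```python
# Each schema is modeled as a dict of column name -> type name.
# Assumed widening rule: Int and Long reconcile to Long.
WIDENING = {("int", "long"): "long", ("long", "int"): "long"}

def merge_schemas(a, b):
    """Merge two compatible schemas: union of columns, widened types."""
    merged = dict(a)
    for name, typ in b.items():
        if name not in merged:
            merged[name] = typ          # column present in only one file
        elif merged[name] != typ:
            wide = WIDENING.get((merged[name], typ))
            if wide is None:
                raise ValueError(
                    f"incompatible types for {name}: {merged[name]} vs {typ}")
            merged[name] = wide         # widen Int/Long to Long
    return merged

old = {"id": "int", "name": "string"}
new = {"id": "long", "name": "string", "score": "double"}
print(merge_schemas(old, new))
# {'id': 'long', 'name': 'string', 'score': 'double'}
```

Anything outside the widening table (say, Int vs String) is rejected, matching the "safe subset" framing above.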
[jira] [Resolved] (SPARK-3295) [Spark SQL] schemaRdd1 ++ schemaRdd2 does not return another SchemaRdd
[ https://issues.apache.org/jira/browse/SPARK-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-3295.
-
    Resolution: Won't Fix

These are actually different operations. unionAll is similar to the SQL command and will fail if the two schemas are different; union and ++ will not.

> [Spark SQL] schemaRdd1 ++ schemaRdd2 does not return another SchemaRdd
> ---
>
> Key: SPARK-3295
> URL: https://issues.apache.org/jira/browse/SPARK-3295
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.0.2
> Reporter: Evan Chan
> Priority: Minor
>
> Right now,
> schemaRdd1.unionAll(schemaRdd2) returns a SchemaRDD.
> However,
> schemaRdd1 ++ schemaRdd2 returns an RDD[Row].
> Similarly,
> schemaRdd1.union(schemaRdd2) returns an RDD[Row].
> This is inconsistent. Let's make ++ and union have the same behavior as unionAll.
> Actually, not sure there needs to be both union and unionAll.
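To make the resolution's distinction concrete: unionAll checks that the two inputs share a schema and preserves it, while ++/union concatenate rows with no schema check (hence the plain RDD[Row] result). The sketch below models that behavior with a toy Python stand-in, not the actual SchemaRDD API:

```python
class ToySchemaRDD:
    """Toy stand-in for a SchemaRDD: rows plus a column-name schema."""

    def __init__(self, schema, rows):
        self.schema, self.rows = schema, rows

    def unionAll(self, other):
        # SQL-style union: schemas must match, result keeps the schema.
        if self.schema != other.schema:
            raise ValueError("unionAll requires identical schemas")
        return ToySchemaRDD(self.schema, self.rows + other.rows)

    def __add__(self, other):
        # ++ / union: plain row concatenation, schema information is lost.
        return self.rows + other.rows

a = ToySchemaRDD(["id", "name"], [(1, "x")])
b = ToySchemaRDD(["id", "name"], [(2, "y")])
c = ToySchemaRDD(["id"], [(3,)])

print(a.unionAll(b).rows)   # [(1, 'x'), (2, 'y')] -- still carries a schema
print(a + c)                # [(1, 'x'), (3,)] -- no schema check, bare rows
```

a.unionAll(c) would raise, since the schemas differ, which is exactly the safety the ticket wanted extended to ++ and union.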