[jira] [Created] (SPARK-4898) Replace cloudpickle with Dill

2014-12-19 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-4898:
-

 Summary: Replace cloudpickle with Dill
 Key: SPARK-4898
 URL: https://issues.apache.org/jira/browse/SPARK-4898
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Josh Rosen


We should consider replacing our modified version of {{cloudpickle}} with 
[Dill|https://github.com/uqfoundation/dill], since it supports both Python 2 
and 3 and might do a better job of handling certain corner-cases.

I attempted to do this a few months ago but ran into cases where Dill had 
issues pickling objects defined in doctests, which broke our test suite: 
https://github.com/uqfoundation/dill/issues/50.  This issue may have been 
resolved now; I haven't checked.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4897) Python 3 support

2014-12-19 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4897:
--
Description: 
It would be nice to have Python 3 support in PySpark, provided that we can do 
it in a way that maintains backwards-compatibility with Python 2.6.

I started looking into porting this; my WIP work can be found at 
https://github.com/JoshRosen/spark/compare/python3

I was able to use the 
[futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] 
tool to handle the basic conversion of things like {{print}} statements, etc. 
and had to manually fix up a few imports for packages that moved / were 
renamed, but the major blocker that I hit was {{cloudpickle}}:

{code}
[joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark
Python 3.4.2 (default, Oct 19 2014, 17:52:17)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/Users/joshrosen/Documents/Spark/python/pyspark/shell.py", line 28, in 

import pyspark
  File "/Users/joshrosen/Documents/spark/python/pyspark/__init__.py", line 41, 
in 
from pyspark.context import SparkContext
  File "/Users/joshrosen/Documents/spark/python/pyspark/context.py", line 26, 
in 
from pyspark import accumulators
  File "/Users/joshrosen/Documents/spark/python/pyspark/accumulators.py", line 
97, in 
from pyspark.cloudpickle import CloudPickler
  File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 
120, in 
class CloudPickler(pickle.Pickler):
  File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 
122, in CloudPickler
dispatch = pickle.Pickler.dispatch.copy()
AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch'
{code}

This code looks like it will be difficult to port to Python 3, so this 
might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] 
for Python serialization.

  was:
It would be nice to have Python 3 support in PySpark, provided that we can do 
it in a way that maintains backwards-compatibility with Python 2.6.

I started looking into porting this; my WIP work can be found at 
https://github.com/JoshRosen/spark/compare/python3?expand=1

I was able to use the 
[futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] 
tool to handle the basic conversion of things like {{print}} statements, etc. 
and had to manually fix up a few imports for packages that moved / were 
renamed, but the major blocker that I hit was {{cloudpickle}}:

{code}
[joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark
Python 3.4.2 (default, Oct 19 2014, 17:52:17)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/Users/joshrosen/Documents/Spark/python/pyspark/shell.py", line 28, in 

import pyspark
  File "/Users/joshrosen/Documents/spark/python/pyspark/__init__.py", line 41, 
in 
from pyspark.context import SparkContext
  File "/Users/joshrosen/Documents/spark/python/pyspark/context.py", line 26, 
in 
from pyspark import accumulators
  File "/Users/joshrosen/Documents/spark/python/pyspark/accumulators.py", line 
97, in 
from pyspark.cloudpickle import CloudPickler
  File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 
120, in 
class CloudPickler(pickle.Pickler):
  File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 
122, in CloudPickler
dispatch = pickle.Pickler.dispatch.copy()
AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch'
{code}

This code looks like it will be difficult to port to Python 3, so this 
might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] 
for Python serialization.


> Python 3 support
> 
>
> Key: SPARK-4897
> URL: https://issues.apache.org/jira/browse/SPARK-4897
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Josh Rosen
>Priority: Minor
>
> It would be nice to have Python 3 support in PySpark, provided that we can do 
> it in a way that maintains backwards-compatibility with Python 2.6.
> I started looking into porting this; my WIP work can be found at 
> https://github.com/JoshRosen/spark/compare/python3
> I was able to use the 
> [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] 
> tool to handle the basic conversion of things like {{print}} statements, etc. 
> and had to manually fix up a few imports for packages that moved / were 
> renamed, but the major blocker that I hit was {{cloudpickle}}:
> {code}
> [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ..

[jira] [Commented] (SPARK-4886) Support cache control for each partition of a Hive partitioned table

2014-12-19 Thread guowei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253138#comment-14253138
 ] 

guowei commented on SPARK-4886:
---

use  "CACHE TABLE ... AS SELECT..."

> Support cache control for each partition of a Hive partitioned table
> 
>
> Key: SPARK-4886
> URL: https://issues.apache.org/jira/browse/SPARK-4886
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xudong Zheng
>
> Spark SQL currently doesn't support cache control for each partition of a Hive 
> partitioned table. If we could add this feature, users could have finer cache 
> control over a cached table. In many scenarios, data is periodically appended 
> to a table as a new partition; with this feature, users could easily keep a 
> sliding window of data cached in memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688

2014-12-19 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3619:
--
Assignee: Timothy Chen

> Upgrade to Mesos 0.21 to work around MESOS-1688
> ---
>
> Key: SPARK-3619
> URL: https://issues.apache.org/jira/browse/SPARK-3619
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Matei Zaharia
>Assignee: Timothy Chen
>
> When Mesos 0.21 comes out, it will have a fix for 
> https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688

2014-12-19 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253150#comment-14253150
 ] 

Andrew Ash commented on SPARK-3619:
---

[~activars] Spark 1.2.0 is being released with a Mesos dependency on 0.18.1, so 
a fix was not included in that Spark release.

[~tnachen], are you still interested in this? I'm assigning the JIRA to you.

> Upgrade to Mesos 0.21 to work around MESOS-1688
> ---
>
> Key: SPARK-3619
> URL: https://issues.apache.org/jira/browse/SPARK-3619
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Matei Zaharia
>
> When Mesos 0.21 comes out, it will have a fix for 
> https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688

2014-12-19 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3619:
--
Description: The Mesos 0.21 release has a fix for 
https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.  
(was: When Mesos 0.21 comes out, it will have a fix for 
https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.)

> Upgrade to Mesos 0.21 to work around MESOS-1688
> ---
>
> Key: SPARK-3619
> URL: https://issues.apache.org/jira/browse/SPARK-3619
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Matei Zaharia
>Assignee: Timothy Chen
>
> The Mesos 0.21 release has a fix for 
> https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4886) Support cache control for each partition of a Hive partitioned table

2014-12-19 Thread Xudong Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253154#comment-14253154
 ] 

Xudong Zheng commented on SPARK-4886:
-

Hi Guowei,

"CACHE TABLE ... AS SELECT..." will create a new cache table instead of caching 
the partition of the original table. The query on original table will still go 
to HDFS. And this is not convenient for appending scenario, because will need 
to create a new table every time we add a new partition.  Actually, that is 
still table level cache control.
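
A minimal sketch of the kind of table-level workaround available today, which 
also shows the limitation described above: the cached slice is a separate 
table, so queries against the original partitioned table still read from HDFS. 
The {{HiveContext}} name and the {{dt}} partition column are assumptions for 
illustration only.

{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Select one partition (assumed partition column `dt`) and cache it under a
// separate temporary table name; queries must target events_latest, not events.
val latest = hiveContext.sql("SELECT * FROM events WHERE dt = '2014-12-19'")
latest.registerTempTable("events_latest")
hiveContext.cacheTable("events_latest")
{code}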

> Support cache control for each partition of a Hive partitioned table
> 
>
> Key: SPARK-4886
> URL: https://issues.apache.org/jira/browse/SPARK-4886
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xudong Zheng
>
> Spark SQL currently doesn't support cache control for each partition of a Hive 
> partitioned table. If we could add this feature, users could have finer cache 
> control over a cached table. In many scenarios, data is periodically appended 
> to a table as a new partition; with this feature, users could easily keep a 
> sliding window of data cached in memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4899) Support Mesos features: roles and checkpoints

2014-12-19 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-4899:
-

 Summary: Support Mesos features: roles and checkpoints
 Key: SPARK-4899
 URL: https://issues.apache.org/jira/browse/SPARK-4899
 Project: Spark
  Issue Type: New Feature
  Components: Mesos
Affects Versions: 1.2.0
Reporter: Andrew Ash


Inspired by https://github.com/apache/spark/pull/60

Mesos has two features that would be nice for Spark to take advantage of:

1. Roles -- a way to specify ACLs and priorities for users
2. Checkpoints -- a way to restart a failed Mesos slave without losing all the 
work that was happening on the box

Some of these may require a Mesos upgrade past our current 0.18.1 dependency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4872) Provide sample format of training/test data in MLlib programming guide

2014-12-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253171#comment-14253171
 ] 

Sean Owen commented on SPARK-4872:
--

[~zhjunwei] This is not at all specific to Spark. You can certainly use 
features with 3 values, but you should 1-hot encode them. For example, you would 
have "Weather-Sunny", "Weather-Cloudy", and "Weather-Rainy" binary features 
instead of one "Weather" feature. Although there is some separate support in 
Spark for this, it's pretty simple to do the translation yourself.

Is there a remaining action on this issue or did that clarify the usage to you?
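
To make the 1-hot suggestion concrete, here is a small Scala sketch (names and 
values are illustrative only, and only the three-valued Weather feature is 
encoded):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Give each Weather value its own binary slot.
val weatherIndex = Map("Sunny" -> 0, "Cloudy" -> 1, "Rainy" -> 2)

// ("Cloudy", play = true) becomes label 1.0 with a single 1.0 in the Cloudy slot.
def encode(weather: String, play: Boolean): LabeledPoint = {
  val label = if (play) 1.0 else 0.0
  LabeledPoint(label, Vectors.sparse(3, Array(weatherIndex(weather)), Array(1.0)))
}

val points = sc.parallelize(Seq(
  encode("Sunny", play = false),
  encode("Cloudy", play = true),
  encode("Rainy", play = false)))
// points can now be fed to NaiveBayes.train; note that MLUtils.loadLibSVMFile
// uses 1-based indices on disk but produces 0-based vectors like the ones above.
{code}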

> Provide sample format of training/test data in MLlib programming guide
> --
>
> Key: SPARK-4872
> URL: https://issues.apache.org/jira/browse/SPARK-4872
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.1.1
>Reporter: zhang jun wei
>  Labels: documentation
>
> I suggest that the samples in the online MLlib programming guide use real-life 
> data and list the translated data format for the model to consume. 
> The problem blocking me is how to translate real-life data into a format that 
> MLlib can understand correctly. 
> Here is one sample, I want to use NaiveBayes to train and predict tennis-play 
> decision, the original data is:
> Weather | Temperature | Humidity | Wind => Decision to play tennis
> Sunny   | Hot         | High     | No   => No
> Sunny   | Hot         | High     | Yes  => No
> Cloudy  | Normal      | Normal   | No   => Yes
> Rainy   | Cold        | Normal   | Yes  => No
> Now, from my understanding, one potential translation is:
> 1) put every feature value word into a line:
> Sunny Cloudy Rainy Hot Normal Cold High Normal Yes No
> 2) map them to numbers:
> 1 2 3 4 5 6 7 8 9 10
> 3) map decision labels to numbers:
> 0 - No
> 1 - Yes
> 4) set the value to 1 if it appears, or 0 if not. For the above example, here 
> is the data format for MLUtils.loadLibSVMFile to use:
> 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:0 10:1
> 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:1 10:0
> 1 1:0 2:1 3:0 4:0 5:1 6:0 7:0 8:1 9:0 10:1
> 0 1:0 2:0 3:1 4:0 5:0 6:1 7:0 8:1 9:1 10:0
> ==> Is this a correct understanding?
> And another way I can image is:
> 1) put every feature name into a line:
> Weather  Temperature  Humidity  Wind
> 2) map them to numbers:
> 1 2 3 4 
> 3) map decision labels to numbers:
> 0 - No
> 1 - Yes
> 4) map each value of each feature to a number (e.g. Sunny to 1, Cloudy to 2, 
> Rainy to 3; Hot to 1, Normal to 2, Cold to 3; High to 1, Normal to 2; Yes to 
> 1, No to 2). For the above example, here is the data format for 
> MLUtils.loadLibSVMFile to use:
> 0 1:1 2:1 3:1 4:2
> 0 1:1 2:1 3:1 4:1
> 1 1:2 2:2 3:2 4:2
> 0 1:3 2:3 3:2 4:1
> ==> but when I read the source code in NaiveBayes.scala, it seems this is not 
> correct; I am not sure though...
> So which data format translation way is correct?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4094) checkpoint should still be available after rdd actions

2014-12-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253191#comment-14253191
 ] 

Sean Owen commented on SPARK-4094:
--

[~liyezhang556520] But this is exactly what the doc says is not permitted. By 
invoking action C, you necessarily execute the job for RDD B, after which 
you can't checkpoint it.

My question, if you're proposing to loosen the restriction: what problem was 
there originally with allowing this, and why does the change resolve it?
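
For context, a minimal sketch of the ordering the current API expects (the 
checkpoint directory path is an assumption for illustration):

{code}
sc.setCheckpointDir("/tmp/checkpoints")   // illustrative path

val rdd = sc.makeRDD(1 to 100)
rdd.checkpoint()   // mark for checkpointing before any action runs
rdd.count()        // this first job also materializes the checkpoint

// If an action such as rdd.collect() had already run before checkpoint(),
// the RDD would never be checkpointed -- the behavior this issue questions.
{code}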

> checkpoint should still be available after rdd actions
> --
>
> Key: SPARK-4094
> URL: https://issues.apache.org/jira/browse/SPARK-4094
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
>
> rdd.checkpoint() must be called before any actions on this rdd, if there is 
> any other actions before, checkpoint would never succeed. For the following 
> code as example:
> *rdd = sc.makeRDD(...)*
> *rdd.collect()*
> *rdd.checkpoint()*
> *rdd.count()*
> This rdd would never be checkpointed. This is a problem for algorithms with 
> many iterations. For example, graph algorithms run many iterations, which 
> makes the RDD lineage very long, so the RDD may need to be checkpointed after 
> a certain number of iterations. If there is also an action inside the 
> iteration loop, the checkpoint() operation will never work for the iterations 
> after the one that calls the action.
> This does not happen for RDD caching: cache() always takes effect before RDD 
> actions, no matter whether any action ran before cache().
> So rdd.checkpoint() should behave the same way as rdd.cache().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-12-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253203#comment-14253203
 ] 

Sean Owen commented on SPARK-2075:
--

[~sunrui] From digging into the various reports of this issue, it seemed to me 
that in each case the Hadoop version did not match. That is, I do not know that 
it's true that the issue manifests when the Hadoop version matches; that would 
indeed be strange. I could have missed it; this is a bit hard to follow. But do 
you see evidence of this?

I don't think publishing two versions fixes anything, really. The PR might get 
at the heart of the difference here and resolve it for real. It doesn't happen 
if you match binaries, which is good practice anyway.

> Anonymous classes are missing from Spark distribution
> -
>
> Key: SPARK-2075
> URL: https://issues.apache.org/jira/browse/SPARK-2075
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.0.0
>Reporter: Paul R. Brown
>Priority: Critical
>
> Running a job built against the Maven dep for 1.0.0 and the hadoop1 
> distribution produces:
> {code}
> java.lang.ClassNotFoundException:
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
> {code}
> Here's what's in the Maven dep as of 1.0.0:
> {code}
> jar tvf 
> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>  | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}
> And here's what's in the hadoop1 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
> {code}
> I.e., it's not there.  It is in the hadoop2 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4900) MLlib SingularValueDecomposition ARPACK IllegalStateException

2014-12-19 Thread Mike Beyer (JIRA)
Mike Beyer created SPARK-4900:
-

 Summary: MLlib SingularValueDecomposition ARPACK 
IllegalStateException 
 Key: SPARK-4900
 URL: https://issues.apache.org/jira/browse/SPARK-4900
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.1
 Environment: Ubuntu 14.10, Java HotSpot(TM) 64-Bit Server VM (build 
25.25-b02, mixed mode)
Reporter: Mike Beyer
Priority: Blocker




java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.IllegalStateException: ARPACK returns non-zero info = 3 
Please refer ARPACK user guide for error message.
at 
org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:120)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:235)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:171)
...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4901) Hot fix for the BytesWritable.copyBytes not exists in Hadoop1

2014-12-19 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-4901:


 Summary: Hot fix for the BytesWritable.copyBytes not exists in 
Hadoop1
 Key: SPARK-4901
 URL: https://issues.apache.org/jira/browse/SPARK-4901
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Minor


HiveInspectors.scala fails to compile against Hadoop 1, because 
BytesWritable.copyBytes is not available in Hadoop 1. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4901) Hot fix for the BytesWritable.copyBytes not exists in Hadoop1

2014-12-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253306#comment-14253306
 ] 

Apache Spark commented on SPARK-4901:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/3742

> Hot fix for the BytesWritable.copyBytes not exists in Hadoop1
> -
>
> Key: SPARK-4901
> URL: https://issues.apache.org/jira/browse/SPARK-4901
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> HiveInspectors.scala failed in compiling with Hadoop 1, as the 
> BytesWritable.copyBytes is not available in Hadoop 1. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4900) MLlib SingularValueDecomposition ARPACK IllegalStateException

2014-12-19 Thread Mike Beyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Beyer updated SPARK-4900:
--
Priority: Major  (was: Blocker)

> MLlib SingularValueDecomposition ARPACK IllegalStateException 
> --
>
> Key: SPARK-4900
> URL: https://issues.apache.org/jira/browse/SPARK-4900
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.1
> Environment: Ubuntu 14.10, Java HotSpot(TM) 64-Bit Server VM (build 
> 25.25-b02, mixed mode)
>Reporter: Mike Beyer
>
> java.lang.reflect.InvocationTargetException
> ...
> Caused by: java.lang.IllegalStateException: ARPACK returns non-zero info = 3 
> Please refer ARPACK user guide for error message.
> at 
> org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:120)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:235)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:171)
>   ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3373) Filtering operations should optionally rebuild routing tables

2014-12-19 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3373:

 Target Version/s: 1.3.0, 1.2.1  (was: 1.1.1, 1.2.0)
Affects Version/s: (was: 1.0.2)
   (was: 1.0.0)
   1.1.0
   1.1.1

> Filtering operations should optionally rebuild routing tables
> -
>
> Key: SPARK-3373
> URL: https://issues.apache.org/jira/browse/SPARK-3373
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.1.0, 1.1.1
>Reporter: uncleGen
>Priority: Minor
>
> Graph operations that filter the edges (subgraph, mask, groupEdges) currently 
> reuse the existing routing table to avoid the shuffle which would be required 
> to build a new one. However, this may be inefficient when the filtering is 
> highly selective. Vertices will be sent to more partitions than necessary, 
> and the extra routing information may take up excessive space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3373) Filtering operations should optionally rebuild routing tables

2014-12-19 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3373:

Priority: Major  (was: Minor)

> Filtering operations should optionally rebuild routing tables
> -
>
> Key: SPARK-3373
> URL: https://issues.apache.org/jira/browse/SPARK-3373
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.1.0, 1.1.1
>Reporter: uncleGen
>
> Graph operations that filter the edges (subgraph, mask, groupEdges) currently 
> reuse the existing routing table to avoid the shuffle which would be required 
> to build a new one. However, this may be inefficient when the filtering is 
> highly selective. Vertices will be sent to more partitions than necessary, 
> and the extra routing information may take up excessive space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4902) gap-sampling performance optimization

2014-12-19 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-4902:
--

 Summary: gap-sampling performance optimization
 Key: SPARK-4902
 URL: https://issues.apache.org/jira/browse/SPARK-4902
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Guoqiang Li


{{CacheManager.getOrCompute}} returns an InterruptibleIterator that wraps either 
an array-backed iterator or a plain iterator (when there is not enough memory). 
The GapSamplingIterator implementation is as follows:
{code}
private val iterDrop: Int => Unit = {
val arrayClass = Array.empty[T].iterator.getClass
val arrayBufferClass = ArrayBuffer.empty[T].iterator.getClass
data.getClass match {
  case `arrayClass` => ((n: Int) => { data = data.drop(n) })
  case `arrayBufferClass` => ((n: Int) => { data = data.drop(n) })
  case _ => ((n: Int) => {
  var j = 0
  while (j < n && data.hasNext) {
data.next()
j += 1
  }
})
}
  }
{code}

The code does not handle InterruptibleIterator.
As a result, the following code can't use the fast {{Iterator.drop}} path:
{code}
rdd.cache()
rdd.sample(false,0.1)
{code}
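
One possible direction, sketched here with no claim to match the actual pull 
request: look through InterruptibleIterator so a cached partition's underlying 
array iterator is still recognized and dropped via its cheap {{drop}} 
implementation.

{code}
import org.apache.spark.InterruptibleIterator

// Recurse into the wrapper's delegate; otherwise fall back to plain drop(n).
def dropFast[T](it: Iterator[T], n: Int): Iterator[T] = it match {
  case ii: InterruptibleIterator[T @unchecked] =>
    new InterruptibleIterator(ii.context, dropFast(ii.delegate, n))
  case other => other.drop(n)  // array-backed iterators drop efficiently
}
{code}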




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4844) SGD should support custom sampling.

2014-12-19 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li resolved SPARK-4844.

Resolution: Won't Fix

See: SPARK-4902

> SGD should support custom sampling.
> ---
>
> Key: SPARK-4844
> URL: https://issues.apache.org/jira/browse/SPARK-4844
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Guoqiang Li
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688

2014-12-19 Thread Jing Dong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253394#comment-14253394
 ] 

Jing Dong commented on SPARK-3619:
--

Has anyone succeeded in running Spark 1.1.1 on Mesos 0.21? What are the known 
issues running Spark on the latest Mesos?

> Upgrade to Mesos 0.21 to work around MESOS-1688
> ---
>
> Key: SPARK-3619
> URL: https://issues.apache.org/jira/browse/SPARK-3619
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Matei Zaharia
>Assignee: Timothy Chen
>
> The Mesos 0.21 release has a fix for 
> https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4902) gap-sampling performance optimization

2014-12-19 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-4902:
---
Description: 
{{CacheManager.getOrCompute}} returns an InterruptibleIterator that wraps either 
an array-backed iterator or a plain iterator (when there is not enough memory). 
The GapSamplingIterator implementation is as follows:
{code}
private val iterDrop: Int => Unit = {
val arrayClass = Array.empty[T].iterator.getClass
val arrayBufferClass = ArrayBuffer.empty[T].iterator.getClass
data.getClass match {
  case `arrayClass` => ((n: Int) => { data = data.drop(n) })
  case `arrayBufferClass` => ((n: Int) => { data = data.drop(n) })
  case _ => ((n: Int) => {
  var j = 0
  while (j < n && data.hasNext) {
data.next()
j += 1
  }
})
}
  }
{code}

The code does not handle InterruptibleIterator.
As a result, the following code can't use the fast {{Iterator.drop}} path:
{code}
rdd.cache()
rdd.sample(false,0.1)
{code}


  was:
{{CacheManager.getOrCompute}} returns an instance of InterruptibleIterator that 
contains an array or a iterator(when the memory is not enough). 
The GapSamplingIterator implementation is as follows
{code}
private val iterDrop: Int => Unit = {
val arrayClass = Array.empty[T].iterator.getClass
val arrayBufferClass = ArrayBuffer.empty[T].iterator.getClass
data.getClass match {
  case `arrayClass` => ((n: Int) => { data = data.drop(n) })
  case `arrayBufferClass` => ((n: Int) => { data = data.drop(n) })
  case _ => ((n: Int) => {
  var j = 0
  while (j < n && data.hasNext) {
data.next()
j += 1
  }
})
}
  }
{code}

The code does not deal with InterruptibleIterator.
This leads to the following code can't use the {{Iterator.drop}} method
{code}
rdd.cache()
data.sample(false,0.1)
{code}



> gap-sampling performance optimization
> -
>
> Key: SPARK-4902
> URL: https://issues.apache.org/jira/browse/SPARK-4902
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> {{CacheManager.getOrCompute}} returns an instance of InterruptibleIterator 
> that contains an array or a iterator(when the memory is not enough). 
> The GapSamplingIterator implementation is as follows
> {code}
> private val iterDrop: Int => Unit = {
> val arrayClass = Array.empty[T].iterator.getClass
> val arrayBufferClass = ArrayBuffer.empty[T].iterator.getClass
> data.getClass match {
>   case `arrayClass` => ((n: Int) => { data = data.drop(n) })
>   case `arrayBufferClass` => ((n: Int) => { data = data.drop(n) })
>   case _ => ((n: Int) => {
>   var j = 0
>   while (j < n && data.hasNext) {
> data.next()
> j += 1
>   }
> })
> }
>   }
> {code}
> The code does not deal with InterruptibleIterator.
> This leads to the following code can't use the {{Iterator.drop}} method
> {code}
> rdd.cache()
> rdd.sample(false,0.1)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4903) RDD remains cached after "DROP TABLE"

2014-12-19 Thread Evert Lammerts (JIRA)
Evert Lammerts created SPARK-4903:
-

 Summary: RDD remains cached after "DROP TABLE"
 Key: SPARK-4903
 URL: https://issues.apache.org/jira/browse/SPARK-4903
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: Spark master @ Dec 17 
(3cd516191baadf8496ccdae499771020e89acd7e)
Reporter: Evert Lammerts
Priority: Critical


In beeline, when I run:
{code:sql}
CREATE TABLE test AS select col from table;
CACHE TABLE test
DROP TABLE test
{code}
The table is removed but the RDD is still cached. Running UNCACHE is no longer 
possible (table not found by the thriftserver).
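
Until this is fixed, a workaround sketch from the Scala API rather than beeline 
(assuming a {{HiveContext}} named {{hiveContext}}) is to release the cache 
explicitly while the table still exists in the metastore:

{code}
hiveContext.sql("CACHE TABLE test")
// ... use the cached table ...
hiveContext.sql("UNCACHE TABLE test")  // must run before the DROP below
hiveContext.sql("DROP TABLE test")
{code}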



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4903) RDD remains cached after "DROP TABLE"

2014-12-19 Thread Evert Lammerts (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evert Lammerts updated SPARK-4903:
--
Description: 
In beeline, when I run:
{code:sql}
CREATE TABLE test AS select col from table;
CACHE TABLE test
DROP TABLE test
{code}
The table is removed but the RDD is still cached. Running UNCACHE is no longer 
possible (table not found in the metastore).

  was:
In beeline, when I run:
{code:sql}
CREATE TABLE test AS select col from table;
CACHE TABLE test
DROP TABLE test
{code}
The the table is removed but the RDD is still cached. Running UNCACHE is not 
possible anymore (table not found from thriftserver).


> RDD remains cached after "DROP TABLE"
> -
>
> Key: SPARK-4903
> URL: https://issues.apache.org/jira/browse/SPARK-4903
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Spark master @ Dec 17 
> (3cd516191baadf8496ccdae499771020e89acd7e)
>Reporter: Evert Lammerts
>Priority: Critical
>
> In beeline, when I run:
> {code:sql}
> CREATE TABLE test AS select col from table;
> CACHE TABLE test
> DROP TABLE test
> {code}
> The table is removed but the RDD is still cached. Running UNCACHE is no longer 
> possible (table not found in the metastore).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4902) gap-sampling performance optimization

2014-12-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253409#comment-14253409
 ] 

Apache Spark commented on SPARK-4902:
-

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/3744

> gap-sampling performance optimization
> -
>
> Key: SPARK-4902
> URL: https://issues.apache.org/jira/browse/SPARK-4902
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> {{CacheManager.getOrCompute}} returns an instance of InterruptibleIterator 
> that contains an array or a iterator(when the memory is not enough). 
> The GapSamplingIterator implementation is as follows
> {code}
> private val iterDrop: Int => Unit = {
> val arrayClass = Array.empty[T].iterator.getClass
> val arrayBufferClass = ArrayBuffer.empty[T].iterator.getClass
> data.getClass match {
>   case `arrayClass` => ((n: Int) => { data = data.drop(n) })
>   case `arrayBufferClass` => ((n: Int) => { data = data.drop(n) })
>   case _ => ((n: Int) => {
>   var j = 0
>   while (j < n && data.hasNext) {
> data.next()
> j += 1
>   }
> })
> }
>   }
> {code}
> The code does not deal with InterruptibleIterator.
> This leads to the following code can't use the {{Iterator.drop}} method
> {code}
> rdd.cache()
> rdd.sample(false,0.1)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4904) Remove the foldable checking in HiveGenericUdf.eval

2014-12-19 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-4904:


 Summary: Remove the foldable checking in HiveGenericUdf.eval
 Key: SPARK-4904
 URL: https://issues.apache.org/jira/browse/SPARK-4904
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Minor


Since https://github.com/apache/spark/pull/3429 has been merged, the bug of 
wrapping values to Writable for HiveGenericUDF is resolved, so we can safely 
remove the foldable check in `HiveGenericUdf.eval`, which was discussed in 
https://github.com/apache/spark/pull/2802.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4904) Remove the foldable checking in HiveGenericUdf.eval

2014-12-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253418#comment-14253418
 ] 

Apache Spark commented on SPARK-4904:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/3745

> Remove the foldable checking in HiveGenericUdf.eval
> ---
>
> Key: SPARK-4904
> URL: https://issues.apache.org/jira/browse/SPARK-4904
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> Since https://github.com/apache/spark/pull/3429 has been merged, the bug of 
> wrapping to Writable for HiveGenericUDF is resolved, we can safely remove the 
> foldable checking in `HiveGenericUdf.eval`, which discussed in
> https://github.com/apache/spark/pull/2802.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4867) UDF clean up

2014-12-19 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253579#comment-14253579
 ] 

William Benton commented on SPARK-4867:
---

[~marmbrus] I actually think exposing an interface that looks something like 
overloading might be the right approach.  (To be clear, I think polymorphism 
poses a far greater difficulty with implicit coercion than without it, but it 
might be possible to solve the ambiguity there by letting users register 
functions in a priority order.)

> UDF clean up
> 
>
> Key: SPARK-4867
> URL: https://issues.apache.org/jira/browse/SPARK-4867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Blocker
>
> Right now our support for, and internal implementation of, many functions have 
> a few issues.  Specifically:
>  - UDFs don't know their input types and thus don't do type coercion.
>  - We hard code a bunch of built in functions into the parser.  This is bad 
> because in SQL it creates new reserved words for things that aren't actually 
> keywords.  Also it means that for each function we need to add support to 
> both SQLContext and HiveContext separately.
> For this JIRA I propose we do the following:
>  - Change the interfaces for registerFunction and ScalaUdf to include types 
> for the input arguments as well as the output type.
>  - Add a rule to analysis that does type coercion for UDFs.
>  - Add a parse rule for functions to SQLParser.
>  - Rewrite all the UDFs that are currently hacked into the various parsers 
> using this new functionality.
> Depending on how big this refactoring becomes we could split parts 1&2 from 
> part 3 above.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4901) Hot fix for the BytesWritable.copyBytes not exists in Hadoop1

2014-12-19 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4901:
--
Assignee: Cheng Hao

> Hot fix for the BytesWritable.copyBytes not exists in Hadoop1
> -
>
> Key: SPARK-4901
> URL: https://issues.apache.org/jira/browse/SPARK-4901
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Minor
> Fix For: 1.3.0
>
>
> HiveInspectors.scala failed in compiling with Hadoop 1, as the 
> BytesWritable.copyBytes is not available in Hadoop 1. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4901) Hot fix for the BytesWritable.copyBytes not exists in Hadoop1

2014-12-19 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4901.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3742
[https://github.com/apache/spark/pull/3742]

> Hot fix for the BytesWritable.copyBytes not exists in Hadoop1
> -
>
> Key: SPARK-4901
> URL: https://issues.apache.org/jira/browse/SPARK-4901
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
> Fix For: 1.3.0
>
>
> HiveInspectors.scala failed in compiling with Hadoop 1, as the 
> BytesWritable.copyBytes is not available in Hadoop 1. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)

2014-12-19 Thread Ted Malaska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253586#comment-14253586
 ] 

Ted Malaska commented on SPARK-2447:


Hey guys,

Just wanted to update this JIRA.  In summary, the Spark committers are still 
deciding whether or not this will be included in the external part of Spark.

For now, because the demand is there and the solution works, I'm going to 
host the solution on Cloudera Labs.
Here is the blog post that walks through the solution:

http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

> Add common solution for sending upsert actions to HBase (put, deletes, and 
> increment)
> -
>
> Key: SPARK-2447
> URL: https://issues.apache.org/jira/browse/SPARK-2447
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Streaming
>Reporter: Ted Malaska
>Assignee: Ted Malaska
>
> Going to review the design with Tdas today.  
> But the first thought is to have an extension of VoidFunction that handles the 
> connection to HBase and allows for options such as turning auto-flush off for 
> higher throughput.
> Need to answer the following questions first.
> - Can it be written in Java or should it be written in Scala?
> - What is the best way to add the HBase dependency? (will review how Flume 
> does this as the first option)
> - What is the best way to do testing? (will review how Flume does this as the 
> first option)
> - How to support python? (python may be a different Jira it is unknown at 
> this time)
> Goals:
> - Simple to use
> - Stable
> - Supports high load
> - Documented (May be in a separate Jira need to ask Tdas)
> - Supports Java, Scala, and hopefully Python
> - Supports Streaming and normal Spark
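
A rough Scala sketch of the per-partition connection idea described above, 
written independently of the SparkOnHBase code; the table name, column family, 
and the assumption that {{rdd}} is an {{RDD[(String, String)]}} are 
illustrative only.

{code}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

rdd.foreachPartition { rows =>
  // One connection per partition; auto-flush off for higher throughput.
  val table = new HTable(HBaseConfiguration.create(), "events")
  table.setAutoFlush(false)
  rows.foreach { case (key, value) =>
    val put = new Put(Bytes.toBytes(key))
    put.add(Bytes.toBytes("f"), Bytes.toBytes("v"), Bytes.toBytes(value))
    table.put(put)
  }
  table.flushCommits()
  table.close()
}
{code}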



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3686) flume.SparkSinkSuite.Success is flaky

2014-12-19 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3686:
--
Labels: flaky-test  (was: )

> flume.SparkSinkSuite.Success is flaky
> -
>
> Key: SPARK-3686
> URL: https://issues.apache.org/jira/browse/SPARK-3686
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>Assignee: Hari Shreedharan
>Priority: Blocker
>  Labels: flaky-test
> Fix For: 1.2.0
>
>
> {code}
> Error Message
> 4000 did not equal 5000
> Stacktrace
> sbt.ForkMain$ForkError: 4000 did not equal 5000
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:498)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1559)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:416)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite.org$apache$spark$streaming$flume$sink$SparkSinkSuite$$assertChannelIsEmpty(SparkSinkSuite.scala:195)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply$mcV$sp(SparkSinkSuite.scala:54)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1559)
>   at org.scalatest.Suite$class.run(Suite.scala:1423)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204)
>   at org.scalatest.FunSuite.run(FunSuite.scala:1559)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:444)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:651)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Example test result (this will stop working in a few days):
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/719/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming.flume.sink/SparkSinkSuite/Success_with_ack/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-3912) FlumeStreamSuite is flaky, fails either with port binding issues or data not being reliably sent

2014-12-19 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3912:
--
Labels: flaky-test  (was: )

> FlumeStreamSuite is flaky, fails either with port binding issues or data not 
> being reliably sent
> 
>
> Key: SPARK-3912
> URL: https://issues.apache.org/jira/browse/SPARK-3912
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>  Labels: flaky-test
> Fix For: 1.2.0
>
>
> Three problems:
> 1. Attempting to start the service on different possible ports (to avoid bind 
> failures) was incorrect, as the service actually starts lazily (when the 
> receiver starts, not when the Flume input stream is created). 
> 2. Lots of Thread.sleep calls were used to improve the probability that data 
> sent through Avro to the Flume receiver actually arrived. However, the sending 
> may fail for various unknown reasons, causing the test to fail.
> 3. Thread.sleep was also used to send one record per batch, and checks were 
> made on whether only one record was received in every batch. This was 
> overkill because all we need to test in this unit test is whether data is 
> being sent and received, not the timing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1603) flaky test case in StreamingContextSuite

2014-12-19 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1603:
--
Labels: flaky-test  (was: )

> flaky test case in StreamingContextSuite
> 
>
> Key: SPARK-1603
> URL: https://issues.apache.org/jira/browse/SPARK-1603
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 0.9.0, 0.9.1, 1.0.0
>Reporter: Nan Zhu
>Assignee: Nan Zhu
>  Labels: flaky-test
>
> When Jenkins was testing 5 PRs at the same time, the test results in my PR 
> showed that the "stop gracefully" test in StreamingContextSuite failed; 
> the stacktrace is:
> {quote}
>  stop gracefully *** FAILED *** (8 seconds, 350 milliseconds)
> [info]   akka.actor.InvalidActorNameException: actor name [JobScheduler] is 
> not unique!
> [info]   at 
> akka.actor.dungeon.ChildrenContainer$TerminatingChildrenContainer.reserve(ChildrenContainer.scala:192)
> [info]   at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77)
> [info]   at akka.actor.ActorCell.reserveChild(ActorCell.scala:338)
> [info]   at akka.actor.dungeon.Children$class.makeChild(Children.scala:186)
> [info]   at akka.actor.dungeon.Children$class.attachChild(Children.scala:42)
> [info]   at akka.actor.ActorCell.attachChild(ActorCell.scala:338)
> [info]   at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:518)
> [info]   at 
> org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:57)
> [info]   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:434)
> [info]   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$14$$anonfun$apply$mcV$sp$3.apply$mcVI$sp(StreamingContextSuite.scala:174)
> [info]   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> [info]   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply$mcV$sp(StreamingContextSuite.scala:163)
> [info]   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply(StreamingContextSuite.scala:159)
> [info]   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply(StreamingContextSuite.scala:159)
> [info]   at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265)
> [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1974)
> [info]   at 
> org.apache.spark.streaming.StreamingContextSuite.withFixture(StreamingContextSuite.scala:34)
> [info]   at 
> org.scalatest.FunSuite$class.invokeWithFixture$1(FunSuite.scala:1262)
> [info]   at 
> org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271)
> [info]   at 
> org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:198)
> [info]   at org.scalatest.FunSuite$class.runTest(FunSuite.scala:1271)
> [info]   at 
> org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:34)
> [info]   at 
> org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:171)
> [info]   at 
> org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:34)
> [info]   at 
> org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304)
> [info]   at 
> org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304)
> [info]   at 
> org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:260)
> [info]   at 
> org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:249)
> [info]   at scala.collection.immutable.List.foreach(List.scala:318)
> [info]   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:249)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:326)
> [info]   at org.scalatest.FunSuite$class.runTests(FunSuite.scala:1304)
> [info]   at 
> org.apache.spark.streaming.StreamingContextSuite.runTests(StreamingContextSuite.scala:34)
> [info]   at org.scalatest.Suite$class.run(Suite.scala:2303)
> [info]   at 
> org.apache.spark.streaming.StreamingContextSuite.org$scalatest$FunSuite$$super$run(StreamingContextSuite.scala:34)
> [info]   at org.scalatest.FunSuite$$anonfun$run$1.apply(FunSuite.scala:1310)
> [info]   at org.scalatest.FunSuite$$anonfun$run$1.apply(FunSuite.scala:1310)
> [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:362)
> [info]   at org.scalatest.FunSuite$class.run(FunSuite.scala:1310)
> [info]   at 
> org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingContextSuite.scala:34)
> [info]   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:208)
> [info]   at 
> org.apache.spark.streaming.StreamingContextSuite.run(StreamingContextSuite.scala:34)
> [info]   at 
> org.scalatest.tools.Scal

[jira] [Updated] (SPARK-4053) Block generator throttling in NetworkReceiverSuite is flaky

2014-12-19 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4053:
--
Labels: flaky-test  (was: )

> Block generator throttling in NetworkReceiverSuite is flaky
> ---
>
> Key: SPARK-4053
> URL: https://issues.apache.org/jira/browse/SPARK-4053
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>  Labels: flaky-test
> Fix For: 1.2.0
>
>
> In the unit test that checks whether blocks generated by the throttled block 
> generator contain the expected number of records, the thresholds are too tight, 
> which sometimes causes the test to fail. 
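For illustration, a minimal sketch of the kind of tolerance-based check that avoids such tight thresholds (plain Scala with illustrative numbers, not the actual suite code):

{code}
// Allow a relative tolerance around the expected per-block record count instead
// of a tight fixed bound; the counts and tolerance below are illustrative.
val recordsPerBlock: Seq[Int] = Seq(98, 103, 99, 101)  // counts observed per generated block
val expectedPerBlock = 100                             // target rate * block interval
val tolerance = 0.2                                    // accept +/- 20% deviation
recordsPerBlock.foreach { n =>
  assert(math.abs(n - expectedPerBlock) <= expectedPerBlock * tolerance,
    s"block had $n records, expected $expectedPerBlock +/- ${expectedPerBlock * tolerance}")
}
{code}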



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1158) Fix flaky RateLimitedOutputStreamSuite

2014-12-19 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1158:
--
Labels: flaky-test  (was: )

> Fix flaky RateLimitedOutputStreamSuite
> --
>
> Key: SPARK-1158
> URL: https://issues.apache.org/jira/browse/SPARK-1158
> Project: Spark
>  Issue Type: Bug
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: flaky-test
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream

2014-12-19 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-4905:
-

 Summary: Flaky FlumeStreamSuite test: 
org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
 Key: SPARK-4905
 URL: https://issues.apache.org/jira/browse/SPARK-4905
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Josh Rosen


It looks like the "org.apache.spark.streaming.flume.FlumeStreamSuite.flume 
input stream" test might be flaky 
([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]):

{code}
Error Message

The code passed to eventually never returned normally. Attempted 106 times over 
10.045097243 seconds. Last failure message: ArrayBuffer("", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "") was not equal to 
Vector("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", 
"27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", 
"40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", 
"53", "54", "55", "56", "57", "58", "59", "60", "61", "62", "63", "64", "65", 
"66", "67", "68", "69", "70", "71", "72", "73", "74", "75", "76", "77", "78", 
"79", "80", "81", "82", "83", "84", "85", "86", "87", "88", "89", "90", "91", 
"92", "93", "94", "95", "96", "97", "98", "99", "100").
Stacktrace

sbt.ForkMain$ForkError: The code passed to eventually never returned normally. 
Attempted 106 times over 10.045097243 seconds. Last failure message: 
ArrayBuffer("", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "") was not equal to Vector("1", "2", "3", "4", "5", "6", "7", "8", 
"9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", 
"22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", 
"35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", 
"48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", 
"61", "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73", 
"74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85", "86", 
"87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97", "98", "99", 
"100").
at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at 
org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142)
at 
org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74)
at 
org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62)
at 
org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62)
at 
org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.r

[jira] [Commented] (SPARK-4869) The variable names in IF statement of Spark SQL doesn't resolve to its value.

2014-12-19 Thread Arnab (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253659#comment-14253659
 ] 

Arnab commented on SPARK-4869:
--

Can you kindly clarify what DAYS_30 refers to?
I tried out a nested IF statement and it seems to work fine in Spark SQL:

{code}
val child = sqlContext.sql("select name, age, IF(age < 20, IF(age < 12, 0, 1), 1) as child from people")
child.collect.foreach(println)
{code}

> The variable names in IF statement of Spark SQL doesn't resolve to its value. 
> --
>
> Key: SPARK-4869
> URL: https://issues.apache.org/jira/browse/SPARK-4869
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1
>Reporter: Ajay
>Priority: Blocker
>
> We got stuck with the IF-THEN statement in Spark SQL. Our use case requires 
> nested IF statements, but Spark SQL is not able to resolve column names in the 
> final evaluation, although literal values work. An "Unresolved Attributes" 
> error is thrown. Please fix this bug. 
> This works:
> sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE ,IF( PAST_DUE = 'CURRENT_MONTH', 
> 0,1) as ROLL_BACKWARD FROM OUTER_RDD")
> This doesn’t :
> sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE ,IF( PAST_DUE = 'CURRENT_MONTH', 
> 0,DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4906) Spark master OOMs with exception stack trace stored in JobProgressListener

2014-12-19 Thread Mingyu Kim (JIRA)
Mingyu Kim created SPARK-4906:
-

 Summary: Spark master OOMs with exception stack trace stored in 
JobProgressListener
 Key: SPARK-4906
 URL: https://issues.apache.org/jira/browse/SPARK-4906
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.1
Reporter: Mingyu Kim


Spark master was OOMing with a lot of stack traces retained in 
JobProgressListener. The object dependency goes like the following.

JobProgressListener.stageIdToData => StageUIData.taskData => 
TaskUIData.errorMessage

Each error message is ~10 KB since it holds the entire stack trace. Because we 
have a lot of tasks, when all of the tasks across multiple stages go bad, these 
error messages accounted for 0.5 GB of heap at some point.

Please correct me if I'm wrong, but it looks like all of the task info for 
running applications is kept in memory, which means long-running applications 
are almost always bound to OOM eventually. Would it make sense to fix this, for 
example, by spilling some UI state to disk?
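For scale, a back-of-the-envelope sketch of the numbers above (plain Scala; the task count is an assumption chosen to reproduce the reported 0.5 GB figure):

{code}
// ~10 KB of stack trace per TaskUIData.errorMessage, times many failed tasks
val bytesPerError = 10L * 1024       // ~10 KB per retained error message
val failedTasks   = 50000L           // illustrative number of failed tasks across retained stages
val heapUsedBytes = bytesPerError * failedTasks
println(f"${heapUsedBytes / (1024.0 * 1024 * 1024)}%.2f GB")  // ~0.48 GB
{code}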



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4896) Don't redundantly copy executor dependencies in Utils.fetchFile

2014-12-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253779#comment-14253779
 ] 

Apache Spark commented on SPARK-4896:
-

User 'ryan-williams' has created a pull request for this issue:
https://github.com/apache/spark/pull/2848

> Don't redundantly copy executor dependencies in Utils.fetchFile
> ---
>
> Key: SPARK-4896
> URL: https://issues.apache.org/jira/browse/SPARK-4896
> Project: Spark
>  Issue Type: Improvement
>Reporter: Josh Rosen
>
> This JIRA is spun off from a comment by [~rdub] on SPARK-3967, quoted here:
> {quote}
> I've been debugging this issue as well and I think I've found an issue in 
> {{org.apache.spark.util.Utils}} that is contributing to / causing the problem:
> {{Files.move}} on [line 
> 390|https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/util/Utils.scala#L390]
>  is called even if {{targetFile}} exists and {{tempFile}} and {{targetFile}} 
> are equal.
> The check on [line 
> 379|https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/util/Utils.scala#L379]
>  seems to imply the desire to skip a redundant overwrite if the file is 
> already there and has the contents that it should have.
> Gating the {{Files.move}} call on a further {{if (!targetFile.exists)}} fixes 
> the issue for me; attached is a patch of the change.
> In practice all of my executors that hit this code path are finding every 
> dependency JAR to already exist and be exactly equal to what they need it to 
> be, meaning they were all needlessly overwriting all of their dependency 
> JARs, and now are all basically no-op-ing in {{Utils.fetchFile}}; I've not 
> determined who/what is putting the JARs there, why the issue only crops up in 
> {{yarn-cluster}} mode (or {{--master yarn --deploy-mode cluster}}), etc., but 
> it seems like either way this patch is probably desirable.
> {quote}
> I'm spinning this off into its own JIRA so that we can track the merging of 
> https://github.com/apache/spark/pull/2848 separately (since we have multiple 
> PRs that contribute to fixing the original issue).
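For reference, a hedged sketch of the gating described in the quote (an illustrative helper, not the actual Utils.fetchFile code; it assumes a Guava Files.move-style call as on the linked lines):

{code}
import java.io.File
import com.google.common.io.Files

// Only move the freshly downloaded temp file into place if the target is missing;
// if an identical copy is already there, skip the redundant overwrite.
def moveIfNeeded(tempFile: File, targetFile: File, contentsEqual: Boolean): Unit = {
  if (!targetFile.exists) {
    Files.move(tempFile, targetFile)   // first fetch: install the downloaded file
  } else if (contentsEqual) {
    tempFile.delete()                  // same contents already in place: no-op
  } else {
    // differing contents: fall back to the existing overwrite/error behaviour (out of scope here)
  }
}
{code}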



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4903) RDD remains cached after "DROP TABLE"

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4903:

Target Version/s: 1.3.0

> RDD remains cached after "DROP TABLE"
> -
>
> Key: SPARK-4903
> URL: https://issues.apache.org/jira/browse/SPARK-4903
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Spark master @ Dec 17 
> (3cd516191baadf8496ccdae499771020e89acd7e)
>Reporter: Evert Lammerts
>Priority: Critical
>
> In beeline, when I run:
> {code:sql}
> CREATE TABLE test AS select col from table;
> CACHE TABLE test
> DROP TABLE test
> {code}
> The table is removed, but the RDD is still cached. Running UNCACHE is no 
> longer possible (the table is not found in the metastore).
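A hedged workaround sketch until this is fixed (assuming a HiveContext in the shell; uncache while the table still resolves, then drop):

{code}
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)   // sc: an existing SparkContext
sqlContext.uncacheTable("test")        // release the cached in-memory relation first
sqlContext.sql("DROP TABLE test")      // then drop the metastore table
{code}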



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4892) java.io.FileNotFound exceptions when creating EXTERNAL hive tables

2014-12-19 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253809#comment-14253809
 ] 

Michael Armbrust commented on SPARK-4892:
-

I'll add that the right fix here is probably to just set that automatically 
when we detect Hive 13 mode, since AFAICT this is a Hive bug.

> java.io.FileNotFound exceptions when creating EXTERNAL hive tables
> --
>
> Key: SPARK-4892
> URL: https://issues.apache.org/jira/browse/SPARK-4892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4892) java.io.FileNotFound exceptions when creating EXTERNAL hive tables

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4892:

Target Version/s: 1.3.0

> java.io.FileNotFound exceptions when creating EXTERNAL hive tables
> --
>
> Key: SPARK-4892
> URL: https://issues.apache.org/jira/browse/SPARK-4892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4892) java.io.FileNotFound exceptions when creating EXTERNAL hive tables

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4892:

Labels: starter  (was: )

> java.io.FileNotFound exceptions when creating EXTERNAL hive tables
> --
>
> Key: SPARK-4892
> URL: https://issues.apache.org/jira/browse/SPARK-4892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4520) SparkSQL exception when reading certain columns from a parquet file

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4520:

Target Version/s: 1.3.0  (was: 1.2.0)

> SparkSQL exception when reading certain columns from a parquet file
> ---
>
> Key: SPARK-4520
> URL: https://issues.apache.org/jira/browse/SPARK-4520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: sadhan sood
>Priority: Critical
> Attachments: part-r-0.parquet
>
>
> I am seeing Spark SQL throw an exception when trying to read selected columns 
> from a Thrift Parquet file, and also when caching them.
> On some further digging, I was able to narrow it down to at least one 
> particular column type, map>, which seems to be causing this issue. To 
> reproduce this I created a test Thrift file with a very basic schema and 
> stored some sample data in a Parquet file:
> Test.thrift
> ===
> {code}
> typedef binary SomeId
> enum SomeExclusionCause {
>   WHITELIST = 1,
>   HAS_PURCHASE = 2,
> }
> struct SampleThriftObject {
>   10: string col_a;
>   20: string col_b;
>   30: string col_c;
>   40: optional map> col_d;
> }
> {code}
> =
> And loading the data in spark through schemaRDD:
> {code}
> import org.apache.spark.sql.SchemaRDD
> val sqlContext = new org.apache.spark.sql.SQLContext(sc);
> val parquetFile = "/path/to/generated/parquet/file"
> val parquetFileRDD = sqlContext.parquetFile(parquetFile)
> parquetFileRDD.printSchema
> root
>  |-- col_a: string (nullable = true)
>  |-- col_b: string (nullable = true)
>  |-- col_c: string (nullable = true)
>  |-- col_d: map (nullable = true)
>  ||-- key: string
>  ||-- value: array (valueContainsNull = true)
>  |||-- element: string (containsNull = false)
> parquetFileRDD.registerTempTable("test")
> sqlContext.cacheTable("test")
> sqlContext.sql("select col_a from test").collect() <-- see the exception 
> stack here 
> {code}
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block -1 in file file:/tmp/xyz/part-r-0.parquet
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>   at 
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780)
>   at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
>   at java.util.ArrayList.elementData(ArrayList.java:418)
>   at java.util.ArrayList.get(ArrayList.java:431)
>   at parquet.io.GroupColumnIO.getLast(GroupColumnIO.ja

[jira] [Updated] (SPARK-4850) "GROUP BY" can't work if the schema of SchemaRDD contains struct or array type

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4850:

Description: 
Code in Spark Shell as follows:

{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val path = "path/to/json"
sqlContext.jsonFile(path).register("Table")
val t = sqlContext.sql("select * from Table group by a")
t.collect
{code}

Let's look into the schema of `Table`
{code}
root
 |-- a: integer (nullable = true)
 |-- arr: array (nullable = true)
 ||-- element: integer (containsNull = false)
 |-- createdAt: string (nullable = true)
 |-- f: struct (nullable = true)
 ||-- __type: string (nullable = true)
 ||-- className: string (nullable = true)
 ||-- objectId: string (nullable = true)
 |-- objectId: string (nullable = true)
 |-- s: string (nullable = true)
 |-- updatedAt: string (nullable = true)
{code}

The following exception will be thrown:

{code}

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not 
in GROUP BY: arr#9, tree:
Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14]
 Subquery TestImport
  LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], 
MappedRDD[18] at map at JsonRDD.scala:47

at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
at $iwC$$iwC$$iwC$$iwC.(:17)
at $iwC$$iwC$$iwC.(:22)
at $iwC$$iwC.(:24)
at $iwC.(:26)
at (:28)
at .(:32)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
at 
org.apache.spark.repl.SparkIMain.loadAndR

[jira] [Updated] (SPARK-4850) "GROUP BY" can't work if the schema of SchemaRDD contains struct or array type

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4850:

Assignee: Cheng Lian

> "GROUP BY" can't work if the schema of SchemaRDD contains struct or array type
> --
>
> Key: SPARK-4850
> URL: https://issues.apache.org/jira/browse/SPARK-4850
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2
>Reporter: Chaozhong Yang
>Assignee: Cheng Lian
>  Labels: group, sql
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Code in Spark Shell as follows:
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val path = "path/to/json"
> sqlContext.jsonFile(path).register("Table")
> val t = sqlContext.sql("select * from Table group by a")
> t.collect
> {code}
> Let's look into the schema of `Table`
> {code}
> root
>  |-- a: integer (nullable = true)
>  |-- arr: array (nullable = true)
>  ||-- element: integer (containsNull = false)
>  |-- createdAt: string (nullable = true)
>  |-- f: struct (nullable = true)
>  ||-- __type: string (nullable = true)
>  ||-- className: string (nullable = true)
>  ||-- objectId: string (nullable = true)
>  |-- objectId: string (nullable = true)
>  |-- s: string (nullable = true)
>  |-- updatedAt: string (nullable = true)
> {code}
> The following exception will be thrown:
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
> not in GROUP BY: arr#9, tree:
> Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14]
>  Subquery TestImport
>   LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], 
> MappedRDD[18] at map at JsonRDD.scala:47
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
>   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
>   at $iwC$$iwC$$iwC$$iwC.(:17)
>   at $iwC$$iwC$$iwC.(:22)
>   at $iwC$$

[jira] [Updated] (SPARK-4850) "GROUP BY" can't work if the schema of SchemaRDD contains struct or array type

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4850:

Target Version/s: 1.3.0  (was: 1.2.0)

> "GROUP BY" can't work if the schema of SchemaRDD contains struct or array type
> --
>
> Key: SPARK-4850
> URL: https://issues.apache.org/jira/browse/SPARK-4850
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2
>Reporter: Chaozhong Yang
>Assignee: Cheng Lian
>  Labels: group, sql
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Code in Spark Shell as follows:
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val path = "path/to/json"
> sqlContext.jsonFile(path).register("Table")
> val t = sqlContext.sql("select * from Table group by a")
> t.collect
> {code}
> Let's look into the schema of `Table`
> {code}
> root
>  |-- a: integer (nullable = true)
>  |-- arr: array (nullable = true)
>  ||-- element: integer (containsNull = false)
>  |-- createdAt: string (nullable = true)
>  |-- f: struct (nullable = true)
>  ||-- __type: string (nullable = true)
>  ||-- className: string (nullable = true)
>  ||-- objectId: string (nullable = true)
>  |-- objectId: string (nullable = true)
>  |-- s: string (nullable = true)
>  |-- updatedAt: string (nullable = true)
> {code}
> The following exception will be thrown:
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
> not in GROUP BY: arr#9, tree:
> Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14]
>  Subquery TestImport
>   LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], 
> MappedRDD[18] at map at JsonRDD.scala:47
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
>   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
>   at $iwC$$iwC$$iwC$$iwC.(:17)
>   at $iwC$$iwC$$iwC.(:22)

[jira] [Updated] (SPARK-4811) Custom UDTFs not working in Spark SQL

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4811:

Target Version/s: 1.3.0  (was: 1.2.0)

> Custom UDTFs not working in Spark SQL
> -
>
> Key: SPARK-4811
> URL: https://issues.apache.org/jira/browse/SPARK-4811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Saurabh Santhosh
>Priority: Critical
>
> I am using the Thrift server interface to Spark SQL and using beeline to 
> connect to it.
> I tried Spark SQL versions 1.1.0 and 1.1.1, and both throw the following 
> exception when using any custom UDTF.
> These are the steps I did:
> *Created a UDTF 'com.x.y.xxx'.*
> Registered the UDTF using the following query: 
> *create temporary function xxx as 'com.x.y.xxx'*
> The registration went through without any errors. But when I tried executing 
> the UDTF, I got the following error:
> *java.lang.ClassNotFoundException: xxx*
> The funny thing is that it is trying to load the function name instead of the 
> function class. The exception is at *line no. 81 in hiveudfs.scala*.
> I have been at it for quite a long time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4553) query for parquet table with string fields in spark sql hive get binary result

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4553:

Target Version/s: 1.3.0  (was: 1.2.0)

> query for parquet table with string fields in spark sql hive get binary result
> --
>
> Key: SPARK-4553
> URL: https://issues.apache.org/jira/browse/SPARK-4553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> Run:
> {code:sql}
> create table test_parquet(key int, value string) stored as parquet;
> insert into table test_parquet select * from src;
> select * from test_parquet;
> {code}
> The result comes back as follows:
> ...
> 282 [B@38fda3b
> 138 [B@1407a24
> 238 [B@12de6fb
> 419 [B@6c97695
> 15 [B@4885067
> 118 [B@156a8d3
> 72 [B@65d20dd
> 90 [B@4c18906
> 307 [B@60b24cc
> 19 [B@59cf51b
> 435 [B@39fdf37
> 10 [B@4f799d7
> 277 [B@3950951
> 273 [B@596bf4b
> 306 [B@3e91557
> 224 [B@3781d61
> 309 [B@2d0d128
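A hedged mitigation sketch: when the same files are read through Spark SQL's native Parquet support, the spark.sql.parquet.binaryAsString setting asks for Parquet binary columns to be interpreted as strings; whether it applies to the Hive SerDe path used above is not confirmed here.

{code}
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)  // sc: an existing SparkContext
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")  // treat Parquet binary as UTF-8 strings
sqlContext.sql("select * from test_parquet").collect().foreach(println)
{code}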



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3863) Cache broadcasted tables and reuse them across queries

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3863:

Target Version/s: 1.3.0  (was: 1.2.0)

> Cache broadcasted tables and reuse them across queries
> --
>
> Key: SPARK-3863
> URL: https://issues.apache.org/jira/browse/SPARK-3863
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> There is no point re-broadcasting the same dataset every time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3862) MultiWayBroadcastInnerHashJoin

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3862:

Target Version/s: 1.3.0  (was: 1.2.0)

> MultiWayBroadcastInnerHashJoin
> --
>
> Key: SPARK-3862
> URL: https://issues.apache.org/jira/browse/SPARK-3862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> It is common for a single fact table to be inner-joined with many small 
> dimension tables.  We can exploit this and create a 
> MultiWayBroadcastInnerHashJoin (or maybe just MultiwayDimensionJoin) operator 
> that optimizes for this pattern.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3865) Dimension table broadcast shouldn't be eager

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3865:

Target Version/s: 1.3.0  (was: 1.2.0)

> Dimension table broadcast shouldn't be eager
> 
>
> Key: SPARK-3865
> URL: https://issues.apache.org/jira/browse/SPARK-3865
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> We eagerly broadcast dimension tables in BroadcastJoin. This is bad because 
> even running explain triggers a job to execute the broadcast.
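The eager-versus-lazy point can be illustrated in plain Scala (not the actual operator code; the function below just stands in for the job that materializes the broadcast side):

{code}
def buildBroadcastSide(): Map[Int, String] = {
  println("materializing broadcast side...")   // stands in for the Spark job
  Map(1 -> "a", 2 -> "b")
}
val eagerSide = buildBroadcastSide()        // runs immediately, even if the query is only explain()-ed
lazy val lazySide = buildBroadcastSide()    // runs only when the join actually executes
{code}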



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3864) Specialize join for tables with unique integer keys

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3864:

Target Version/s: 1.3.0  (was: 1.2.0)

> Specialize join for tables with unique integer keys
> ---
>
> Key: SPARK-3864
> URL: https://issues.apache.org/jira/browse/SPARK-3864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> We can create a new operator that uses an array as the underlying storage to 
> avoid hash lookups entirely for dimension tables that have integer keys.
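A hedged sketch of the array-indexed lookup idea in plain Scala (illustrative types and values, not Catalyst/execution code):

{code}
case class DimRow(key: Int, value: String)
val dim = Seq(DimRow(0, "zero"), DimRow(1, "one"), DimRow(2, "two"))  // dimension table with dense Int keys
val maxKey = dim.map(_.key).max
val lookup = new Array[DimRow](maxKey + 1)
dim.foreach(r => lookup(r.key) = r)            // build the array index once
def probe(factKey: Int): Option[DimRow] =      // direct array access, no hashing
  if (factKey >= 0 && factKey <= maxKey) Option(lookup(factKey)) else None
println(probe(1))   // Some(DimRow(1,one))
{code}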



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4794) Wrong parse of GROUP BY query

2014-12-19 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253964#comment-14253964
 ] 

Michael Armbrust commented on SPARK-4794:
-

Ping.

> Wrong parse of GROUP BY query
> -
>
> Key: SPARK-4794
> URL: https://issues.apache.org/jira/browse/SPARK-4794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Damien Carol
>
> Spark is not able to parse this query:
> {code:sql}
> select
> `cf_encaissement_fact_pq`.`annee` as `Annee`,
> `cf_encaissement_fact_pq`.`mois` as `Mois`,
> `cf_encaissement_fact_pq`.`jour` as `Jour`,
> `cf_encaissement_fact_pq`.`heure` as `Heure`,
> `cf_encaissement_fact_pq`.`nom_societe` as `Societe`,
> `cf_encaissement_fact_pq`.`id_magasin` as `Magasin`,
> `cf_encaissement_fact_pq`.`CarteFidelitePresentee` as `CF_Presentee`,
> `cf_encaissement_fact_pq`.`CompteCarteFidelite` as `CompteCarteFidelite`,
> `cf_encaissement_fact_pq`.`NbCompteCarteFidelite` as 
> `NbCompteCarteFidelite`,
> `cf_encaissement_fact_pq`.`DetentionCF` as `DetentionCF`,
> `cf_encaissement_fact_pq`.`NbCarteFidelite` as `NbCarteFidelite`,
> `cf_encaissement_fact_pq`.`Id_CF_Dim_DUCB` as `Plage_DUCB`,
> `cf_encaissement_fact_pq`.`NbCheque` as `NbCheque`,
> `cf_encaissement_fact_pq`.`CACheque` as `CACheque`,
> `cf_encaissement_fact_pq`.`NbImpaye` as `NbImpaye`,
> `cf_encaissement_fact_pq`.`Id_Ensemble` as `NbEnsemble`,
> `cf_encaissement_fact_pq`.`ZIBZIN` as `NbCompte`,
> `cf_encaissement_fact_pq`.`ResteDuImpaye` as `ResteDuImpaye`
> from
> `testsimon3`.`cf_encaissement_fact_pq` as `cf_encaissement_fact_pq`
> where
> `cf_encaissement_fact_pq`.`annee` = 2013
> and
> `cf_encaissement_fact_pq`.`mois` = 7
> and
> `cf_encaissement_fact_pq`.`jour` = 12
> order by
> `cf_encaissement_fact_pq`.`annee` ASC,
> `cf_encaissement_fact_pq`.`mois` ASC,
> `cf_encaissement_fact_pq`.`jour` ASC,
> `cf_encaissement_fact_pq`.`heure` ASC,
> `cf_encaissement_fact_pq`.`nom_societe` ASC,
> `cf_encaissement_fact_pq`.`id_magasin` ASC,
> `cf_encaissement_fact_pq`.`CarteFidelitePresentee` ASC,
> `cf_encaissement_fact_pq`.`CompteCarteFidelite` ASC,
> `cf_encaissement_fact_pq`.`NbCompteCarteFidelite` ASC,
> `cf_encaissement_fact_pq`.`DetentionCF` ASC,
> `cf_encaissement_fact_pq`.`NbCarteFidelite` ASC,
> `cf_encaissement_fact_pq`.`Id_CF_Dim_DUCB` ASC
> {code}
> If I remove the table name from the ORDER BY clauses, Spark can handle it.
> {code:sql}
> select
> `cf_encaissement_fact_pq`.`annee` as `Annee`,
> `cf_encaissement_fact_pq`.`mois` as `Mois`,
> `cf_encaissement_fact_pq`.`jour` as `Jour`,
> `cf_encaissement_fact_pq`.`heure` as `Heure`,
> `cf_encaissement_fact_pq`.`nom_societe` as `Societe`,
> `cf_encaissement_fact_pq`.`id_magasin` as `Magasin`,
> `cf_encaissement_fact_pq`.`CarteFidelitePresentee` as `CFPresentee`,
> `cf_encaissement_fact_pq`.`CompteCarteFidelite` as `CompteCarteFidelite`,
> `cf_encaissement_fact_pq`.`NbCompteCarteFidelite` as 
> `NbCompteCarteFidelite`,
> `cf_encaissement_fact_pq`.`DetentionCF` as `DetentionCF`,
> `cf_encaissement_fact_pq`.`NbCarteFidelite` as `NbCarteFidelite`,
> `cf_encaissement_fact_pq`.`Id_CF_Dim_DUCB` as `PlageDUCB`,
> `cf_encaissement_fact_pq`.`NbCheque` as `NbCheque`,
> `cf_encaissement_fact_pq`.`CACheque` as `CACheque`,
> `cf_encaissement_fact_pq`.`NbImpaye` as `NbImpaye`,
> `cf_encaissement_fact_pq`.`Id_Ensemble` as `NbEnsemble`,
> `cf_encaissement_fact_pq`.`ZIBZIN` as `NbCompte`,
> `cf_encaissement_fact_pq`.`ResteDuImpaye` as `ResteDuImpaye`
> from
> `testsimon3`.`cf_encaissement_fact_pq` as `cf_encaissement_fact_pq`
> where
> `cf_encaissement_fact_pq`.`annee` = 2013
> and
> `cf_encaissement_fact_pq`.`mois` = 7
> and
> `cf_encaissement_fact_pq`.`jour` = 12
> order by
> `annee` ASC,
> `mois` ASC,
> `jour` ASC,
> `heure` ASC,
> `nom_societe` ASC,
> `id_magasin` ASC,
> `CarteFidelitePresentee` ASC,
> `CompteCarteFidelite` ASC,
> `NbCompteCarteFidelite` ASC,
> `DetentionCF` ASC,
> `NbCarteFidelite` ASC,
> `Id_CF_Dim_DUCB` ASC
> {code}
> I'm using Spark Master with Thrift server (HIVE 0.12)
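A distilled form of the difference reported above (hedged; database, table, and column names are illustrative, assuming a HiveContext named hiveContext):

{code}
hiveContext.sql("SELECT `t`.`a` AS `A` FROM `db`.`t` AS `t` ORDER BY `t`.`a` ASC")  // fails to parse per this report
hiveContext.sql("SELECT `t`.`a` AS `A` FROM `db`.`t` AS `t` ORDER BY `a` ASC")      // parses
{code}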



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4904) Remove the foldable checking in HiveGenericUdf.eval

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4904:

Target Version/s: 1.3.0

> Remove the foldable checking in HiveGenericUdf.eval
> ---
>
> Key: SPARK-4904
> URL: https://issues.apache.org/jira/browse/SPARK-4904
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> Since https://github.com/apache/spark/pull/3429 has been merged, the bug of 
> wrapping to Writable for HiveGenericUDF is resolved, so we can safely remove 
> the foldable checking in `HiveGenericUdf.eval`, which was discussed in
> https://github.com/apache/spark/pull/2802.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4689) Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4689:

Labels: 1.0.3  (was: )

> Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java
> --
>
> Key: SPARK-4689
> URL: https://issues.apache.org/jira/browse/SPARK-4689
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Chris Fregly
>Priority: Minor
>  Labels: 1.0.3
>
> Currently, you need to use unionAll() in Scala.  
> Python does not expose this functionality at the moment.
> The current workaround is to use the UNION ALL HiveQL functionality detailed 
> here:  https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4801) Add CTE capability to HiveContext

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4801:

Description: 
This is a request to add CTE functionality to HiveContext.  Common Table 
Expressions were added in Hive 0.13.0 with HIVE-1180.  Using CTE-style syntax 
within HiveContext currently results in the following "Caused by" message:

{code}
Caused by: scala.MatchError: TOK_CTE (of class 
org.apache.hadoop.hive.ql.parse.ASTNode)
at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500)
at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:500)
at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:248)
{code}

  was:
This is a request to add CTE functionality to HiveContext.  Common Table 
Expressions are added in Hive 0.13.0 with HIVE-1180.  Using CTE style syntax 
within HiveContext currently results in the following caused by message:

Caused by: scala.MatchError: TOK_CTE (of class 
org.apache.hadoop.hive.ql.parse.ASTNode)
at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500)
at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:500)
at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:248)


> Add CTE capability to HiveContext
> -
>
> Key: SPARK-4801
> URL: https://issues.apache.org/jira/browse/SPARK-4801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Jacob Davis
>
> This is a request to add CTE functionality to HiveContext.  Common Table 
> Expressions were added in Hive 0.13.0 with HIVE-1180.  Using CTE-style syntax 
> within HiveContext currently results in the following "Caused by" message:
> {code}
> Caused by: scala.MatchError: TOK_CTE (of class 
> org.apache.hadoop.hive.ql.parse.ASTNode)
> at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500)
> at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:500)
> at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:248)
> {code}
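A minimal example of CTE-style syntax that hits the MatchError above (hedged; it assumes a HiveContext named hiveContext and the usual src example table):

{code}
hiveContext.sql("WITH q1 AS (SELECT key FROM src WHERE key = '5') SELECT * FROM q1")
{code}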



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4801) Add CTE capability to HiveContext

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4801:

Target Version/s: 1.3.0

> Add CTE capability to HiveContext
> -
>
> Key: SPARK-4801
> URL: https://issues.apache.org/jira/browse/SPARK-4801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Jacob Davis
>
> This is a request to add CTE functionality to HiveContext.  Common Table 
> Expressions were added in Hive 0.13.0 with HIVE-1180.  Using CTE-style syntax 
> within HiveContext currently results in the following "Caused by" message:
> {code}
> Caused by: scala.MatchError: TOK_CTE (of class 
> org.apache.hadoop.hive.ql.parse.ASTNode)
> at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500)
> at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:500)
> at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:248)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4735) Spark SQL UDF doesn't support 0 arguments.

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4735.
-
   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Cheng Hao

> Spark SQL UDF doesn't support 0 arguments.
> --
>
> Key: SPARK-4735
> URL: https://issues.apache.org/jira/browse/SPARK-4735
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Minor
> Fix For: 1.3.0
>
>
> To reproduce:
> {code}
> val udf = () => Seq(1, 2, 3)
> sqlCtx.registerFunction("myudf", udf)
> sqlCtx.sql("select myudf() from tbl limit 1").collect.foreach(println)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4907) Inconsistent loss and gradient in LeastSquaresGradient compared with R

2014-12-19 Thread DB Tsai (JIRA)
DB Tsai created SPARK-4907:
--

 Summary: Inconsistent loss and gradient in LeastSquaresGradient 
compared with R
 Key: SPARK-4907
 URL: https://issues.apache.org/jira/browse/SPARK-4907
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: DB Tsai


In most academic papers and algorithm implementations, people use L = 
1/(2n) ||A weights - y||^2 instead of L = 1/n ||A weights - y||^2 for the 
least-squares loss. See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf

Since MLlib uses a different convention, the residuals and all of the derived 
statistics will differ from those of the GLMNET package in R. The model 
coefficients will still be the same under this change. 
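A plain-Scala sketch of the two conventions for a single example (x, y) with weights w (an illustrative helper, not the MLlib code; each branch pairs a loss with its matching gradient):

{code}
def lossAndGradient(x: Array[Double], y: Double, w: Array[Double],
                    halved: Boolean): (Double, Array[Double]) = {
  val diff = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
  if (halved) (0.5 * diff * diff, x.map(_ * diff))      // L = 1/2 (x.w - y)^2, grad = (x.w - y) x   (GLMNET-style)
  else        (diff * diff, x.map(_ * 2.0 * diff))      // L = (x.w - y)^2,     grad = 2 (x.w - y) x
}
{code}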



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4907) Inconsistent loss and gradient in LeastSquaresGradient compared with R

2014-12-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253978#comment-14253978
 ] 

Apache Spark commented on SPARK-4907:
-

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/3746

> Inconsistent loss and gradient in LeastSquaresGradient compared with R
> --
>
> Key: SPARK-4907
> URL: https://issues.apache.org/jira/browse/SPARK-4907
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: DB Tsai
>
> In most academic papers and algorithm implementations, people use L = 
> 1/(2n) ||A weights - y||^2 instead of L = 1/n ||A weights - y||^2 for the 
> least-squares loss. See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf
> Since MLlib uses a different convention, the residuals and all of the derived 
> statistics will differ from those of the GLMNET package in R. The model 
> coefficients will still be the same under this change. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4865) rdds exposed to sql context via registerTempTable are not listed via thrift jdbc show tables

2014-12-19 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253979#comment-14253979
 ] 

Michael Armbrust commented on SPARK-4865:
-

Temporary tables are tied to a specific SQLContext and thus can't be seen or 
queried across different JVMs.  Is that the issue you are reporting?  That is a 
fundamental design decision that we are not going to change.

Or are you creating a JDBC server with an existing HiveContext and then not 
seeing the tables (a separate issue that I do want to fix)?
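For the first interpretation, a small illustration of the scoping (hedged; the JSON path is the stock example file and the query is illustrative):

{code}
import org.apache.spark.sql.SQLContext
val ctxA = new SQLContext(sc)   // sc: an existing SparkContext
val ctxB = new SQLContext(sc)
ctxA.jsonFile("examples/src/main/resources/people.json").registerTempTable("people")
ctxA.sql("SELECT count(*) FROM people").collect()   // resolves: the temp table is registered on ctxA
ctxB.sql("SELECT count(*) FROM people").collect()   // fails: not visible from a different SQLContext
{code}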

> rdds exposed to sql context via registerTempTable are not listed via thrift 
> jdbc show tables
> 
>
> Key: SPARK-4865
> URL: https://issues.apache.org/jira/browse/SPARK-4865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Misha Chernetsov
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4762) Add support for tuples in 'where in' clause query

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4762.
-
Resolution: Won't Fix

This issue can be reopened if the hive parser is ever extended to support this 
syntax.

> Add support for tuples in 'where in' clause query
> -
>
> Key: SPARK-4762
> URL: https://issues.apache.org/jira/browse/SPARK-4762
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Yash Datta
> Fix For: 1.3.0
>
>
> Currently, in the where in clause the filter is applied only to a single 
> column. We can enhance it to accept a filter on multiple columns.
> Current support is for queries like:
> Select * from table where c1 in (value1, value2, ... value n);
> We need to add support for queries like:
> Select * from table where (c1, c2, ... cn) in ((value1, value2, ... value n), 
> (value1', value2', ... value n'))
> We can also add an optimized version of the where in clause for tuples, where 
> we build a hash set of the filter tuples to match rows (see the sketch below).
> This also requires a change in the Hive parser, since there is currently no 
> support for multiple columns in an IN clause.
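
A minimal plain-Scala sketch of the hash-set idea (illustrative data only, not 
the Catalyst implementation):

{code}
// Build a hash set of the IN-list tuples once, then probe it for each row.
val inList: Set[(Int, Int)] = Set((1, 10), (2, 20))
val rows = Seq((1, 10, "a"), (2, 99, "b"), (2, 20, "c"))
val matched = rows.filter { case (c1, c2, _) => inList.contains((c1, c2)) }
// matched == List((1,10,a), (2,20,c))
{code}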



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-12-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2075:
---
Assignee: Shixiong Zhu

> Anonymous classes are missing from Spark distribution
> -
>
> Key: SPARK-2075
> URL: https://issues.apache.org/jira/browse/SPARK-2075
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.0.0
>Reporter: Paul R. Brown
>Assignee: Shixiong Zhu
>Priority: Critical
>
> Running a job built against the Maven dep for 1.0.0 and the hadoop1 
> distribution produces:
> {code}
> java.lang.ClassNotFoundException:
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
> {code}
> Here's what's in the Maven dep as of 1.0.0:
> {code}
> jar tvf 
> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>  | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}
> And here's what's in the hadoop1 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
> {code}
> I.e., it's not there.  It is in the hadoop2 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4865) rdds exposed to sql context via registerTempTable are not listed via thrift jdbc show tables

2014-12-19 Thread Misha Chernetsov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253994#comment-14253994
 ] 

Misha Chernetsov commented on SPARK-4865:
-

> Or are you creating a JDBC server with an existing HiveContext and then not 
> seeing the tables (a separate issue that I do want to fix).
I am reporting that one.

> rdds exposed to sql context via registerTempTable are not listed via thrift 
> jdbc show tables
> 
>
> Key: SPARK-4865
> URL: https://issues.apache.org/jira/browse/SPARK-4865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Misha Chernetsov
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4636) Cluster By & Distribute By output different with Hive

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4636:

Target Version/s: 1.3.0

> Cluster By & Distribute By output different with Hive
> -
>
> Key: SPARK-4636
> URL: https://issues.apache.org/jira/browse/SPARK-4636
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> This is a very interesting bug.
> Semantically, Cluster By & Distribute By will not cause a global ordering, as 
> described in Hive wiki:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
> However, the partition keys are sorted in MapReduce after the shuffle, so from 
> the user's point of view the partition key itself appears globally ordered, and 
> it may look like this:
> http://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4589) ML add-ons to SchemaRDD

2014-12-19 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254002#comment-14254002
 ] 

Michael Armbrust commented on SPARK-4589:
-

Can you elaborate on what you are thinking about?  Is this something like:

{code}
def transformColumn[A,B](columnName: String, f: A => B)
{code}

Is there anything else?
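
For example, usage of such a helper might look like the following; both 
{{transformColumn}} and the {{dataset}} SchemaRDD are hypothetical here, not 
existing API:

{code}
// Hypothetical: rescale a numeric column in place, keeping the rest of the schema.
val rescaled = dataset.transformColumn("age", (a: Double) => a / 100.0)
{code}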

> ML add-ons to SchemaRDD
> ---
>
> Key: SPARK-4589
> URL: https://issues.apache.org/jira/browse/SPARK-4589
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib, SQL
>Reporter: Xiangrui Meng
>
> One piece of feedback we received on the Pipeline API (SPARK-3530) is about the 
> boilerplate code in the implementation. We can add more Scala DSL to simplify 
> the code for the operations we need in ML. Those operations could live under 
> spark.ml via implicits, or be added to SchemaRDD directly if they are also 
> generally useful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2973) Add a way to show tables without executing a job

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2973:

Target Version/s: 1.3.0  (was: 1.2.0)

> Add a way to show tables without executing a job
> 
>
> Key: SPARK-2973
> URL: https://issues.apache.org/jira/browse/SPARK-2973
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Aaron Davidson
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.2.0
>
>
> Right now, sql("show tables").collect() will start a Spark job which shows up 
> in the UI. There should be a way to get these without this step.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2973) Add a way to show tables without executing a job

2014-12-19 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254006#comment-14254006
 ] 

Michael Armbrust commented on SPARK-2973:
-

I think the solution here is to also special-case take in SparkPlan and use 
that from SchemaRDD.

> Add a way to show tables without executing a job
> 
>
> Key: SPARK-2973
> URL: https://issues.apache.org/jira/browse/SPARK-2973
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Aaron Davidson
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.2.0
>
>
> Right now, sql("show tables").collect() will start a Spark job which shows up 
> in the UI. There should be a way to get these without this step.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4865) Include temporary tables in SHOW TABLES

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4865:

Summary: Include temporary tables in SHOW TABLES  (was: rdds exposed to sql 
context via registerTempTable are not listed via thrift jdbc show tables)

> Include temporary tables in SHOW TABLES
> ---
>
> Key: SPARK-4865
> URL: https://issues.apache.org/jira/browse/SPARK-4865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Misha Chernetsov
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4865) Include temporary tables in SHOW TABLES

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4865:

Priority: Critical  (was: Major)

> Include temporary tables in SHOW TABLES
> ---
>
> Key: SPARK-4865
> URL: https://issues.apache.org/jira/browse/SPARK-4865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Misha Chernetsov
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4865) Include temporary tables in SHOW TABLES

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4865:

Target Version/s: 1.3.0

> Include temporary tables in SHOW TABLES
> ---
>
> Key: SPARK-4865
> URL: https://issues.apache.org/jira/browse/SPARK-4865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Misha Chernetsov
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4629) Spark SQL uses Hadoop Configuration in a thread-unsafe manner when writing Parquet files

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4629:

Target Version/s: 1.3.0

> Spark SQL uses Hadoop Configuration in a thread-unsafe manner when writing 
> Parquet files
> 
>
> Key: SPARK-4629
> URL: https://issues.apache.org/jira/browse/SPARK-4629
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Michael Allman
>
> The method {{ParquetRelation.createEmpty}} mutates its given Hadoop 
> {{Configuration}} instance to set the Parquet writer compression level (cf. 
> https://github.com/apache/spark/blob/v1.1.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala#L149).
>  This can lead to a {{ConcurrentModificationException}} when running 
> concurrent jobs sharing a single {{SparkContext}} which involve saving 
> Parquet files.
> Our "fix" was to simply remove the line in question and set the compression 
> level in the hadoop configuration before starting our jobs.
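
A sketch of that workaround (assumes an existing SparkContext {{sc}}; 
"parquet.compression" is the Parquet output-format key):

{code}
// Set the Parquet compression codec once, up front, on the shared Hadoop
// Configuration, instead of letting each concurrent write mutate it.
sc.hadoopConfiguration.set("parquet.compression", "SNAPPY")
{code}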



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4760) "ANALYZE TABLE table COMPUTE STATISTICS noscan" failed estimating table size for tables created from Parquet files

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4760:

 Target Version/s: 1.3.0
Affects Version/s: (was: 1.3.0)

> "ANALYZE TABLE table COMPUTE STATISTICS noscan" failed estimating table size 
> for tables created from Parquet files
> --
>
> Key: SPARK-4760
> URL: https://issues.apache.org/jira/browse/SPARK-4760
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Jianshi Huang
>
> In an older Spark version built around Oct. 12, I was able to use 
>   ANALYZE TABLE table COMPUTE STATISTICS noscan
> to get estimated table size, which is important for optimizing joins. (I'm 
> joining 15 small dimension tables, and this is crucial to me).
> In the more recent Spark builds, it fails to estimate the table size unless I 
> remove "noscan".
> Here are the statistics I got using DESC EXTENDED:
> old:
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}
> new:
> parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, 
> COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
> And I've tried turning off spark.sql.hive.convertMetastoreParquet in my 
> spark-defaults.conf and the result is unaffected (in both versions).
> Looks like the Parquet support in new Hive (0.13.1) is broken?
> Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4689) Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4689:

Labels: starter  (was: 1.0.3)

> Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java
> --
>
> Key: SPARK-4689
> URL: https://issues.apache.org/jira/browse/SPARK-4689
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Chris Fregly
>Priority: Minor
>  Labels: starter
>
> Currently, you need to use unionAll() in Scala.  
> Python does not expose this functionality at the moment.
> The current workaround is to use the UNION ALL HiveQL functionality detailed 
> here:  https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union
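
A short sketch of both paths (assumes a HiveContext bound to {{sqlContext}} and 
two registered tables {{table_a}} and {{table_b}} with the same schema):

{code}
val a = sqlContext.sql("SELECT * FROM table_a")
val b = sqlContext.sql("SELECT * FROM table_b")

// Scala: unionAll on SchemaRDD preserves the schema.
val viaApi = a.unionAll(b)

// HiveQL workaround, also reachable from Python via sql().
val viaSql = sqlContext.sql("SELECT * FROM table_a UNION ALL SELECT * FROM table_b")
{code}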



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4760) "ANALYZE TABLE table COMPUTE STATISTICS noscan" failed estimating table size for tables created from Parquet files

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4760:

Priority: Critical  (was: Major)

> "ANALYZE TABLE table COMPUTE STATISTICS noscan" failed estimating table size 
> for tables created from Parquet files
> --
>
> Key: SPARK-4760
> URL: https://issues.apache.org/jira/browse/SPARK-4760
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Jianshi Huang
>Priority: Critical
>
> In an older Spark version built around Oct. 12, I was able to use 
>   ANALYZE TABLE table COMPUTE STATISTICS noscan
> to get estimated table size, which is important for optimizing joins. (I'm 
> joining 15 small dimension tables, and this is crucial to me).
> In the more recent Spark builds, it fails to estimate the table size unless I 
> remove "noscan".
> Here are the statistics I got using DESC EXTENDED:
> old:
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}
> new:
> parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, 
> COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
> And I've tried turning off spark.sql.hive.convertMetastoreParquet in my 
> spark-defaults.conf and the result is unaffected (in both versions).
> Looks like the Parquet support in new Hive (0.13.1) is broken?
> Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4689) Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4689:

Target Version/s: 1.3.0

> Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java
> --
>
> Key: SPARK-4689
> URL: https://issues.apache.org/jira/browse/SPARK-4689
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Chris Fregly
>Priority: Minor
>  Labels: starter
>
> Currently, you need to use unionAll() in Scala.  
> Python does not expose this functionality at the moment.
> The current workaround is to use the UNION ALL HiveQL functionality detailed 
> here:  https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4648) Support COALESCE function in Spark SQL and HiveQL

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4648:

Target Version/s: 1.3.0
Assignee: Ravindra Pesala

> Support COALESCE function in Spark SQL and HiveQL
> -
>
> Key: SPARK-4648
> URL: https://issues.apache.org/jira/browse/SPARK-4648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Ravindra Pesala
>Assignee: Ravindra Pesala
>
> Support Coalesce function in Spark SQL.
> Support type widening in Coalesce function.
> Also, replace the Coalesce UDF in Spark Hive with the local Coalesce 
> function, since it is more memory efficient and faster.
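
An illustrative query of the target behaviour (assumes a SQLContext 
{{sqlContext}} and a registered table {{t}} with a nullable INT column {{a}} and 
a DOUBLE column {{b}}; mixing INT and DOUBLE arguments exercises type widening):

{code}
// COALESCE returns the first non-null argument per row.
val firstNonNull = sqlContext.sql("SELECT COALESCE(a, b, 0.0) FROM t")
{code}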



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4564) SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema

2014-12-19 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254016#comment-14254016
 ] 

Michael Armbrust commented on SPARK-4564:
-

It is, however, consistent with SQL, where GROUP BY expressions are only included 
if they are part of the SELECT clause.  Since the goal here is to provide 
programmatic SQL, I'm inclined to stick with the current semantics.  Changing 
this would also be a fairly major breaking change to the API if people 
depend on the position of columns in the result.
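
Concretely, with the example from the issue description below, listing the 
grouping expression among the aggregate expressions yields both columns (an 
untested sketch against the 1.1-era SchemaRDD DSL):

{code}
// Mirrors SELECT name, COUNT(n) FROM records GROUP BY name.
val grouped = recs.select('name, 'n).groupBy('name)('name, Count('n) as 'count)
grouped.printSchema
// root
//  |-- name: string (nullable = true)
//  |-- count: long (nullable = false)
{code}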

> SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the 
> groupingExprs as part of the output schema
> --
>
> Key: SPARK-4564
> URL: https://issues.apache.org/jira/browse/SPARK-4564
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
> Environment: Mac OSX, local mode, but should hold true for all 
> environments
>Reporter: Dean Wampler
>
> In the following example, I would expect the "grouped" schema to contain two 
> fields, the String name and the Long count, but it only contains the Long 
> count.
> {code}
> // Assumes val sc = new SparkContext(...), e.g., in Spark Shell
> import org.apache.spark.sql.{SQLContext, SchemaRDD}
> import org.apache.spark.sql.catalyst.expressions._
> val sqlc = new SQLContext(sc)
> import sqlc._
> case class Record(name: String, n: Int)
> val records = List(
>   Record("three",   1),
>   Record("three",   2),
>   Record("two", 3),
>   Record("three",   4),
>   Record("two", 5))
> val recs = sc.parallelize(records)
> recs.registerTempTable("records")
> val grouped = recs.select('name, 'n).groupBy('name)(Count('n) as 'count)
> grouped.printSchema
> // root
> //  |-- count: long (nullable = false)
> grouped foreach println
> // [2]
> // [3]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4564) SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4564.
-
Resolution: Won't Fix

I'm going to close this as Won't Fix unless there is a major objection.  Happy to 
accept PRs to clarify the documentation though :)

> SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the 
> groupingExprs as part of the output schema
> --
>
> Key: SPARK-4564
> URL: https://issues.apache.org/jira/browse/SPARK-4564
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
> Environment: Mac OSX, local mode, but should hold true for all 
> environments
>Reporter: Dean Wampler
>
> In the following example, I would expect the "grouped" schema to contain two 
> fields, the String name and the Long count, but it only contains the Long 
> count.
> {code}
> // Assumes val sc = new SparkContext(...), e.g., in Spark Shell
> import org.apache.spark.sql.{SQLContext, SchemaRDD}
> import org.apache.spark.sql.catalyst.expressions._
> val sqlc = new SQLContext(sc)
> import sqlc._
> case class Record(name: String, n: Int)
> val records = List(
>   Record("three",   1),
>   Record("three",   2),
>   Record("two", 3),
>   Record("three",   4),
>   Record("two", 5))
> val recs = sc.parallelize(records)
> recs.registerTempTable("records")
> val grouped = recs.select('name, 'n).groupBy('name)(Count('n) as 'count)
> grouped.printSchema
> // root
> //  |-- count: long (nullable = false)
> grouped foreach println
> // [2]
> // [3]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4502) Spark SQL reads unneccesary fields from Parquet

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4502:

Priority: Critical  (was: Major)
Target Version/s: 1.3.0

> Spark SQL reads unneccesary fields from Parquet
> ---
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Priority: Critical
>
> When reading a field of a nested column from Parquet, Spark SQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4476) Use MapType for dict in json which has unique keys in each row.

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4476:

Target Version/s: 1.3.0

> Use MapType for dict in json which has unique keys in each row.
> ---
>
> Key: SPARK-4476
> URL: https://issues.apache.org/jira/browse/SPARK-4476
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Priority: Critical
>
> For the jsonRDD like this: 
> {code}
> """ {a: 1} """
> """ {b: 2} """
> """ {c: 3} """
> """ {d: 4} """
> """ {e: 5} """
> {code}
> It will create a StructType with 5 fields in it, each field coming from a 
> different row. This is a problem if the RDD is large: a StructType with 
> thousands or millions of fields is hard to work with (it will cause a stack 
> overflow during serialization).
> It should be MapType for this case. We need a clear rule to decide whether 
> StructType or MapType will be used for dicts in JSON data. 
> cc [~yhuai] [~marmbrus]
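
One possible shape for such a rule, sketched in plain Scala (illustrative only; 
both the heuristic and the 0.5 cutoff are placeholders, not a proposal from the 
issue):

{code}
// If the keys seen per row rarely repeat across rows, a MapType column is a
// better fit than a StructType with one field per distinct key.
def preferMapType(rowKeySets: Seq[Set[String]]): Boolean = {
  val rows = rowKeySets.size
  val distinctKeys = rowKeySets.flatten.toSet.size
  rows > 0 && distinctKeys.toDouble / rows > 0.5  // arbitrary cutoff
}

preferMapType(Seq(Set("a"), Set("b"), Set("c"), Set("d"), Set("e")))  // true
{code}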



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4367) Process the "distinct" value before shuffling for aggregation

2014-12-19 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254020#comment-14254020
 ] 

Michael Armbrust commented on SPARK-4367:
-

So we already do this for SUM and COUNT, and I don't think there is an AVG 
DISTINCT currently.  Should we close this or is there more to it?
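
As a rough RDD-level illustration of the partial pre-aggregation described below 
(assumes an existing SparkContext {{sc}}; not the actual Catalyst operators):

{code}
// Deduplicate (group, value) pairs inside each partition before the shuffle,
// so only locally-distinct values cross the network; the global distinct
// count still happens after the shuffle.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3), ("b", 3)))
val locallyDistinct = pairs.mapPartitions(_.toSet.iterator)
val distinctCounts = locallyDistinct.distinct().countByKey()
// Map(a -> 2, b -> 1)
{code}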

> Process the "distinct" value before shuffling for aggregation
> -
>
> Key: SPARK-4367
> URL: https://issues.apache.org/jira/browse/SPARK-4367
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> Most aggregate functions (e.g. average) with "distinct" values require 
> all of the records in the same group to be shuffled onto a single node. 
> However, as part of the optimization, those records can be partially 
> aggregated before shuffling, which probably reduces the overhead of the 
> shuffle significantly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4469) Move the SemanticAnalyzer from Physical Execution to Analysis

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4469.
-
Resolution: Fixed
  Assignee: Cheng Hao

> Move the SemanticAnalyzer from Physical Execution to Analysis
> -
>
> Key: SPARK-4469
> URL: https://issues.apache.org/jira/browse/SPARK-4469
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Minor
>
> This is the code refactoring and follow-up work for 
> https://github.com/apache/spark/pull/2570



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4657) Support storing decimals in Parquet that don't fit in a LONG

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4657:

Summary: Support storing decimals in Parquet that don't fit in a LONG  (was: 
RuntimeException: Unsupported datatype DecimalType())

> Support storing decimals in Parquet that don't fit in a LONG
> ---
>
> Key: SPARK-4657
> URL: https://issues.apache.org/jira/browse/SPARK-4657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: pengyanhong
>
> Executing a query on a Hive table that contains a decimal-typed field, then 
> saving the result into Tachyon as a Parquet file, produces the error below:
> {quote}
> java.lang.RuntimeException: Unsupported datatype DecimalType()
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407)
> at 
> org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:151)
> at 
> org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:130)
> at 
> org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:424)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:424)
> at 
> org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76)
> at 
> org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:103)
> at com.jd.jddp.spark.hive.Cache$.cacheTable(Cache.scala:33)
> at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:61)
> at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:59)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at com.jd.jddp.spark.hive.Cache$.main(Cache.scala:59)
> at com.jd.jddp.spark.hive.Cache.main(Cache.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$a

[jira] [Updated] (SPARK-4657) Support storing decimals in Parquet that don't fit in a LONG

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4657:

Target Version/s: 1.3.0
  Issue Type: Improvement  (was: Bug)

> Support storing decimals in Parquet that don't fit in a LONG
> ---
>
> Key: SPARK-4657
> URL: https://issues.apache.org/jira/browse/SPARK-4657
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: pengyanhong
>
> Executing a query on a Hive table that contains a decimal-typed field, then 
> saving the result into Tachyon as a Parquet file, produces the error below:
> {quote}
> java.lang.RuntimeException: Unsupported datatype DecimalType()
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407)
> at 
> org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:151)
> at 
> org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:130)
> at 
> org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:424)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:424)
> at 
> org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76)
> at 
> org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:103)
> at com.jd.jddp.spark.hive.Cache$.cacheTable(Cache.scala:33)
> at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:61)
> at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:59)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at com.jd.jddp.spark.hive.Cache$.main(Cache.scala:59)
> at com.jd.jddp.spark.hive.Cache.main(Cache.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:459)
> {quote}



-

[jira] [Updated] (SPARK-4176) Support decimals with precision > 18 in Parquet

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4176:

Target Version/s: 1.3.0

> Support decimals with precision > 18 in Parquet
> ---
>
> Key: SPARK-4176
> URL: https://issues.apache.org/jira/browse/SPARK-4176
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Matei Zaharia
>
> After https://issues.apache.org/jira/browse/SPARK-3929, only decimals with 
> precisions <= 18 (that can be read into a Long) will be readable from 
> Parquet, so we still need more work to support these larger ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4657) Support storing decimals in Parquet that don't fit in a LONG

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4657.
-
Resolution: Duplicate

> Support storing decimals in Parquet that don't fit in a LONG
> ---
>
> Key: SPARK-4657
> URL: https://issues.apache.org/jira/browse/SPARK-4657
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: pengyanhong
>
> Executing a query on a Hive table that contains a decimal-typed field, then 
> saving the result into Tachyon as a Parquet file, produces the error below:
> {quote}
> java.lang.RuntimeException: Unsupported datatype DecimalType()
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407)
> at 
> org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:151)
> at 
> org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:130)
> at 
> org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:424)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:424)
> at 
> org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76)
> at 
> org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:103)
> at com.jd.jddp.spark.hive.Cache$.cacheTable(Cache.scala:33)
> at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:61)
> at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:59)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at com.jd.jddp.spark.hive.Cache$.main(Cache.scala:59)
> at com.jd.jddp.spark.hive.Cache.main(Cache.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:459)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.

[jira] [Updated] (SPARK-4512) Unresolved Attribute Exception for sort by

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4512:

Target Version/s: 1.3.0

> Unresolved Attribute Exception for sort by
> --
>
> Key: SPARK-4512
> URL: https://issues.apache.org/jira/browse/SPARK-4512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> It will cause an exception for a query like:
> SELECT key+key FROM src sort by value;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4302) Make jsonRDD/jsonFile support more field data types

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4302:

Target Version/s: 1.3.0

> Make jsonRDD/jsonFile support more field data types
> ---
>
> Key: SPARK-4302
> URL: https://issues.apache.org/jira/browse/SPARK-4302
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Yin Huai
>
> Since we allow users to specify schemas, jsonRDD/jsonFile should support all 
> Spark SQL data types in the provided schema.
> A related post in mailing list: 
> http://apache-spark-user-list.1001560.n3.nabble.com/jsonRdd-and-MapType-td18376.html
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4296:

Priority: Critical  (was: Major)

> Throw "Expression not in GROUP BY" when using same expression in group by 
> clause and  select clause
> ---
>
> Key: SPARK-4296
> URL: https://issues.apache.org/jira/browse/SPARK-4296
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Shixiong Zhu
>Assignee: Cheng Lian
>Priority: Critical
>
> When the input data has a complex structure, using the same expression in the 
> group by clause and the select clause will throw "Expression not in GROUP BY".
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Birthday(date: String)
> case class Person(name: String, birthday: Birthday)
> val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), 
> Person("Jim", Birthday("1980-02-28"
> people.registerTempTable("people")
> val year = sqlContext.sql("select count(*), upper(birthday.date) from people 
> group by upper(birthday.date)")
> year.collect
> {code}
> Here is the plan of year:
> {code:java}
> SchemaRDD[3] at RDD at SchemaRDD.scala:105
> == Query Plan ==
> == Physical Plan ==
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
> not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
> Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date 
> AS date#9) AS c1#3]
>  Subquery people
>   LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at 
> ExistingRDD.scala:36
> {code}
> The bug is the equality test for `Upper(birthday#1.date)` and 
> `Upper(birthday#1.date AS date#9)`.
> Maybe Spark SQL needs a mechanism to compare Alias expressions and non-Alias 
> expressions.
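
A possible workaround sketch until the alias comparison is fixed (untested; the 
idea is to compute the expression once in a subquery so the outer GROUP BY and 
SELECT reference the same attribute):

{code}
val year = sqlContext.sql(
  """SELECT COUNT(*), d
    |FROM (SELECT upper(birthday.date) AS d FROM people) tmp
    |GROUP BY d""".stripMargin)
year.collect
{code}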



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4296:

Target Version/s: 1.3.0

> Throw "Expression not in GROUP BY" when using same expression in group by 
> clause and  select clause
> ---
>
> Key: SPARK-4296
> URL: https://issues.apache.org/jira/browse/SPARK-4296
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Shixiong Zhu
>Assignee: Cheng Lian
>
> When the input data has a complex structure, using the same expression in the 
> group by clause and the select clause will throw "Expression not in GROUP BY".
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Birthday(date: String)
> case class Person(name: String, birthday: Birthday)
> val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), 
> Person("Jim", Birthday("1980-02-28"
> people.registerTempTable("people")
> val year = sqlContext.sql("select count(*), upper(birthday.date) from people 
> group by upper(birthday.date)")
> year.collect
> {code}
> Here is the plan of year:
> {code:java}
> SchemaRDD[3] at RDD at SchemaRDD.scala:105
> == Query Plan ==
> == Physical Plan ==
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
> not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
> Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date 
> AS date#9) AS c1#3]
>  Subquery people
>   LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at 
> ExistingRDD.scala:36
> {code}
> The bug is the equality test for `Upper(birthday#1.date)` and 
> `Upper(birthday#1.date AS date#9)`.
> Maybe Spark SQL needs a mechanism to compare Alias expressions and non-Alias 
> expressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4209) Support UDT in UDF

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4209.
-
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Michael Armbrust

Fixed here: 
https://github.com/apache/spark/commit/15b58a2234ab7ba30c9c0cbb536177a3c725e350

> Support UDT in UDF
> --
>
> Key: SPARK-4209
> URL: https://issues.apache.org/jira/browse/SPARK-4209
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Assignee: Michael Armbrust
> Fix For: 1.2.0
>
>
> UDF doesn't recognize functions defined with UDTs. Before execution, an SQL 
> internal datum should be converted to Scala types, and after execution, the 
> result should be converted back to internal format (maybe this part is 
> already done).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4201) Can't use concat() on partition column in where condition (Hive compatibility problem)

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4201.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Since this was reported working in master I'm going to close.  Please reopen if 
you are still having problems.

> Can't use concat() on partition column in where condition (Hive compatibility 
> problem)
> --
>
> Key: SPARK-4201
> URL: https://issues.apache.org/jira/browse/SPARK-4201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0, 1.1.0
> Environment: Hive 0.12+hadoop 2.4/hadoop 2.2 +spark 1.1
>Reporter: dongxu
>Priority: Minor
>  Labels: com
> Fix For: 1.2.0
>
>
> The team used Hive to query, and we are trying to move to Spark SQL.
> When I run a query like this: 
> select count(1) from  gulfstream_day_driver_base_2 where  
> concat(year,month,day) = '20140929';
> it doesn't work, but it works well in Hive.
> I have to rewrite the SQL to "select count(1) from  
> gulfstream_day_driver_base_2 where  year = 2014 and  month = 09 and day = 29".
> Here is the error log.
> 14/11/03 15:05:03 ERROR SparkSQLDriver: Failed in [select count(1) from  
> gulfstream_day_driver_base_2 where  concat(year,month,day) = '20140929']
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Aggregate false, [], [SUM(PartialCount#1390L) AS c_0#1337L]
>  Exchange SinglePartition
>   Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
>HiveTableScan [], (MetastoreRelation default, 
> gulfstream_day_driver_base_2, None), 
> Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341)
>  = 20140929))
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
>   at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
>   at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
>   at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
>   at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:415)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> execute, tree:
> Exchange SinglePartition
>  Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
>   HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, 
> None), 
> Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341)
>  = 20140929))
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
>   at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
>   ... 16 more
> Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> execute, tree:
> Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
>  HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, 
> None), 
> Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341)
>  = 20140929))
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
>   at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
>   at 
> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:86)
>   at 
> or

[jira] [Resolved] (SPARK-4135) Error reading Parquet file generated with SparkSQL

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4135.
-
Resolution: Won't Fix
  Assignee: Michael Armbrust

> Error reading Parquet file generated with SparkSQL
> --
>
> Key: SPARK-4135
> URL: https://issues.apache.org/jira/browse/SPARK-4135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Hossein Falaki
>Assignee: Michael Armbrust
> Attachments: _metadata, part-r-1.parquet
>
>
> I read a TSV version of the one million songs dataset (available here: 
> http://tbmmsd.s3.amazonaws.com/).
> After reading it, I create a SchemaRDD with the following schema:
> {code}
> root
>  |-- track_id: string (nullable = true)
>  |-- analysis_sample_rate: string (nullable = true)
>  |-- artist_7digitalid: string (nullable = true)
>  |-- artist_familiarity: double (nullable = true)
>  |-- artist_hotness: double (nullable = true)
>  |-- artist_id: string (nullable = true)
>  |-- artist_latitude: string (nullable = true)
>  |-- artist_location: string (nullable = true)
>  |-- artist_longitude: string (nullable = true)
>  |-- artist_mbid: string (nullable = true)
>  |-- artist_mbtags: array (nullable = true)
>  ||-- element: string (containsNull = true)
>  |-- artist_mbtags_count: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- artist_name: string (nullable = true)
>  |-- artist_playmeid: string (nullable = true)
>  |-- artist_terms: array (nullable = true)
>  ||-- element: string (containsNull = true)
>  |-- artist_terms_freq: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- artist_terms_weight: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- audio_md5: string (nullable = true)
>  |-- bars_confidence: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- bars_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- beats_confidence: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- beats_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- danceability: double (nullable = true)
>  |-- duration: double (nullable = true)
>  |-- end_of_fade_in: double (nullable = true)
>  |-- energy: double (nullable = true)
>  |-- key: string (nullable = true)
>  |-- key_confidence: double (nullable = true)
>  |-- loudness: double (nullable = true)
>  |-- mode: double (nullable = true)
>  |-- mode_confidence: double (nullable = true)
>  |-- release: string (nullable = true)
>  |-- release_7digitalid: string (nullable = true)
>  |-- sections_confidence: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- sections_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_confidence: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_loudness_max: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_loudness_max_time: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_loudness_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_pitches: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_timbre: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- similar_artists: array (nullable = true)
>  ||-- element: string (containsNull = true)
>  |-- song_hotness: double (nullable = true)
>  |-- song_id: string (nullable = true)
>  |-- start_of_fade_out: double (nullable = true)
>  |-- tatums_confidence: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- tatums_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- tempo: double (nullable = true)
>  |-- time_signature: double (nullable = true)
>  |-- time_signature_confidence: double (nullable = true)
>  |-- title: string (nullable = true)
>  |-- track_7digitalid: string (nullable = true)
>  |-- year: double (nullable = true)
> {code}
> I select a single record from it and save it using saveAsParquetFile(). 
> When I read it back later and try to query it, I get the following exception:
> {code}
> Error in SQL statement: java.lang.RuntimeException: 
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.GeneratedMethodAccessor208.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Meth

[jira] [Commented] (SPARK-4135) Error reading Parquet file generated with SparkSQL

2014-12-19 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254035#comment-14254035
 ] 

Michael Armbrust commented on SPARK-4135:
-

The problem here is that you have two columns with the same name, "beats_start".  The 
new version of Parquet gives you a better error message.
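
For reference, a minimal sketch of a pre-write check is below. It assumes the 1.1+ SchemaRDD API; {{songs}} is a hypothetical SchemaRDD holding the dataset above, and the lower-casing mirrors Hive/Parquet's case-insensitive handling of column names, which is an assumption here rather than something taken from the report.

{code}
// Sketch: surface duplicate column names before calling saveAsParquetFile(),
// since Parquet requires unique field names.
import org.apache.spark.sql.SchemaRDD

def duplicateColumns(srdd: SchemaRDD): Seq[String] =
  srdd.schema.fields
    .map(_.name.toLowerCase)   // assumes case-insensitive matching, as Hive/Parquet do
    .groupBy(identity)
    .collect { case (name, occurrences) if occurrences.size > 1 => name }
    .toSeq

// Fail fast at write time instead of hitting the exception above at read time:
// val dups = duplicateColumns(songs)
// require(dups.isEmpty, "Duplicate column names: " + dups.mkString(", "))
// songs.saveAsParquetFile("/tmp/songs.parquet")
{code}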

> Error reading Parquet file generated with SparkSQL
> --
>
> Key: SPARK-4135
> URL: https://issues.apache.org/jira/browse/SPARK-4135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Hossein Falaki
> Attachments: _metadata, part-r-1.parquet
>
>
> I read a TSV version of the one million songs dataset (available here: 
> http://tbmmsd.s3.amazonaws.com/).
> After reading it, I create a SchemaRDD with the following schema:
> {code}
> root
>  |-- track_id: string (nullable = true)
>  |-- analysis_sample_rate: string (nullable = true)
>  |-- artist_7digitalid: string (nullable = true)
>  |-- artist_familiarity: double (nullable = true)
>  |-- artist_hotness: double (nullable = true)
>  |-- artist_id: string (nullable = true)
>  |-- artist_latitude: string (nullable = true)
>  |-- artist_location: string (nullable = true)
>  |-- artist_longitude: string (nullable = true)
>  |-- artist_mbid: string (nullable = true)
>  |-- artist_mbtags: array (nullable = true)
>  ||-- element: string (containsNull = true)
>  |-- artist_mbtags_count: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- artist_name: string (nullable = true)
>  |-- artist_playmeid: string (nullable = true)
>  |-- artist_terms: array (nullable = true)
>  ||-- element: string (containsNull = true)
>  |-- artist_terms_freq: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- artist_terms_weight: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- audio_md5: string (nullable = true)
>  |-- bars_confidence: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- bars_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- beats_confidence: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- beats_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- danceability: double (nullable = true)
>  |-- duration: double (nullable = true)
>  |-- end_of_fade_in: double (nullable = true)
>  |-- energy: double (nullable = true)
>  |-- key: string (nullable = true)
>  |-- key_confidence: double (nullable = true)
>  |-- loudness: double (nullable = true)
>  |-- mode: double (nullable = true)
>  |-- mode_confidence: double (nullable = true)
>  |-- release: string (nullable = true)
>  |-- release_7digitalid: string (nullable = true)
>  |-- sections_confidence: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- sections_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_confidence: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_loudness_max: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_loudness_max_time: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_loudness_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_pitches: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_timbre: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- similar_artists: array (nullable = true)
>  ||-- element: string (containsNull = true)
>  |-- song_hotness: double (nullable = true)
>  |-- song_id: string (nullable = true)
>  |-- start_of_fade_out: double (nullable = true)
>  |-- tatums_confidence: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- tatums_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- tempo: double (nullable = true)
>  |-- time_signature: double (nullable = true)
>  |-- time_signature_confidence: double (nullable = true)
>  |-- title: string (nullable = true)
>  |-- track_7digitalid: string (nullable = true)
>  |-- year: double (nullable = true)
> {code}
> I select a single record from it and save it using saveAsParquetFile(). 
> When I read it back later and try to query it, I get the following exception:
> {code}
> Error in SQL statement: java.lang.RuntimeException: 
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.GeneratedMethodAccessor208.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorIm

[jira] [Resolved] (SPARK-4248) [SQL] spark sql not support add jar

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4248.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

> [SQL] spark sql not support add jar 
> 
>
> Key: SPARK-4248
> URL: https://issues.apache.org/jira/browse/SPARK-4248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.1
> Environment: java:1.7
> hadoop:2.3.0-cdh5.0.0
> spark:1.1.1
> thriftserver-with-hive:0.12
> hive metaserver:0.13.1
>Reporter: qiaohaijun
> Fix For: 1.2.0
>
>
> ADD JAR is not supported; the UDF jar has to be uploaded with --jars instead.
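
For reference, a sketch of that {{--jars}} workaround (the jar path, class name, function name, and table are placeholders, and this assumes the usual Hive UDF registration flow on Spark 1.1.x rather than anything verified against this report):

{code}
// Ship the UDF jar at launch time instead of using ADD JAR inside the session:
//   bin/spark-shell --jars /path/to/my-udfs.jar
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // assumes an existing SparkContext `sc`

// Register the Hive UDF through HiveQL and use it in a query.
hiveContext.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf'")
hiveContext.sql("SELECT my_udf(some_column) FROM some_table").collect()
{code}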



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4317) Error querying Avro files imported by Sqoop: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes

2014-12-19 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254039#comment-14254039
 ] 

Michael Armbrust commented on SPARK-4317:
-

Is this still a problem in recent versions?  There has been quite a bit of work 
in this part of the code.
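
One quick way to narrow this down (a diagnostic sketch, not taken from the report; it assumes Spark 1.1+, a spark-shell with an existing {{sc}}, and the table name used above):

{code}
// Print the schema Spark SQL derives for the Avro-backed table and compare the
// attribute names it resolves against `city`.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val place = hiveContext.sql("SELECT * FROM place")
place.printSchema()
println(place.schema.fields.map(_.name).mkString(", "))
{code}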

> Error querying Avro files imported by Sqoop: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
> attributes
> --
>
> Key: SPARK-4317
> URL: https://issues.apache.org/jira/browse/SPARK-4317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
> Environment: Spark 1.1.0, Sqoop 1.4.5, PostgreSQL 9.3
>Reporter: Hendy Irawan
>
> After importing table from PostgreSQL 9.3 to Avro file using Sqoop 1.4.5, 
> Spark SQL 1.1.0 is unable to process it:
> (note that Hive 0.13 can process the Avro file just fine)
> {code}
> spark-sql> select city from place;
> 14/11/10 10:15:08 INFO ParseDriver: Parsing command: select city from place
> 14/11/10 10:15:08 INFO ParseDriver: Parse Completed
> 14/11/10 10:15:08 INFO HiveMetaStore: 0: get_table : db=default tbl=place
> 14/11/10 10:15:08 INFO audit: ugi=ceefour   ip=unknown-ip-addr  
> cmd=get_table : db=default tbl=place
> 14/11/10 10:15:08 ERROR SparkSQLDriver: Failed in [select city from place]
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
> attributes: 'city, tree:
> Project ['city]
>  LowerCaseSchema 
>   MetastoreRelation default, place, None
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:68)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
> at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
> at 
> scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:397)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:397)
> at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan$lzycompute(HiveContext.scala:358)
> at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan(HiveContext.scala:357)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
> at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:406)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMeth

[jira] [Updated] (SPARK-3851) Support for reading parquet files with different but compatible schema

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3851:

Priority: Critical  (was: Major)
Target Version/s: 1.3.0
  Issue Type: Improvement  (was: Bug)

> Support for reading parquet files with different but compatible schema
> --
>
> Key: SPARK-3851
> URL: https://issues.apache.org/jira/browse/SPARK-3851
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Critical
>
> Right now it is required that all of the Parquet files have the same schema.  
> It would be nice to support some safe subset of cases where the schemas of the 
> files are different.  For example:
>  - Adding and removing nullable columns.
>  - Widening types (a column that is Int in some files and Long in others); a 
> sketch of such a merge follows below.
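
For illustration, a minimal sketch of what such a "safe" merge could look like, written against the StructType/StructField API (located at {{org.apache.spark.sql.types}} in Spark 1.3+); the merge policy is an assumption for the sake of example, not the implementation that eventually shipped:

{code}
// Union the fields of two Parquet schemas: columns missing on one side become
// nullable, and Int is widened to Long when the two files disagree.
import org.apache.spark.sql.types._

def widen(a: DataType, b: DataType): DataType = (a, b) match {
  case (x, y) if x == y => x
  case (IntegerType, LongType) | (LongType, IntegerType) => LongType
  case _ => sys.error(s"Incompatible types: $a vs $b")
}

def mergeSchemas(left: StructType, right: StructType): StructType = {
  val rightByName = right.fields.map(f => f.name -> f).toMap
  val leftNames = left.fields.map(_.name).toSet
  val merged = left.fields.map { lf =>
    rightByName.get(lf.name) match {
      case Some(rf) => StructField(lf.name, widen(lf.dataType, rf.dataType),
                                   nullable = lf.nullable || rf.nullable)
      case None => lf.copy(nullable = true)              // column only in the left file
    }
  }
  val added = right.fields.filterNot(f => leftNames.contains(f.name))
                          .map(_.copy(nullable = true))  // column only in the right file
  StructType(merged ++ added)
}
{code}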



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3295) [Spark SQL] schemaRdd1 ++ schemaRdd2 does not return another SchemaRdd

2014-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3295.
-
Resolution: Won't Fix

These are actually different operations.  UnionAll is similar to the SQL 
command and will fail if the two schemas are different; union and ++ will not.
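
For reference, a minimal sketch of the distinction against the 1.x SchemaRDD API (assuming a spark-shell with an existing {{sc}}; the case class and data are illustrative):

{code}
import org.apache.spark.sql.{SQLContext, SchemaRDD}

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD conversion

val schemaRdd1: SchemaRDD = sc.parallelize(Seq(Person("a", 1)))
val schemaRdd2: SchemaRDD = sc.parallelize(Seq(Person("b", 2)))

val keepsSchema = schemaRdd1.unionAll(schemaRdd2)  // SchemaRDD: SQL-style union, schemas must match
val plainRows   = schemaRdd1 ++ schemaRdd2         // RDD[Row]: generic RDD operator, schema lost
val alsoRows    = schemaRdd1.union(schemaRdd2)     // RDD[Row]: same, inherited from RDD
{code}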

> [Spark SQL] schemaRdd1 ++ schemaRdd2  does not return another SchemaRdd
> ---
>
> Key: SPARK-3295
> URL: https://issues.apache.org/jira/browse/SPARK-3295
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.2
>Reporter: Evan Chan
>Priority: Minor
>
> Right now, 
> schemaRdd1.unionAll(schemaRdd2) returns a SchemaRdd.
> However,
> schemaRdd1 ++ schemaRdd2 returns an RDD[Row].
> Similarly,
> schemaRdd1.union(schemaRdd2) returns an RDD[Row].
> This is inconsistent.  Let's make ++ and union have the same behavior as 
> unionAll.
> Actually, I'm not sure there needs to be both union and unionAll.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


