[jira] [Resolved] (SPARK-10446) Support to specify join type when calling join with usingColumns

2015-09-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10446.
-
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 1.6.0

> Support to specify join type when calling join with usingColumns
> 
>
> Key: SPARK-10446
> URL: https://issues.apache.org/jira/browse/SPARK-10446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 1.6.0
>
>
> Currently the method join(right: DataFrame, usingColumns: Seq[String]) only 
> supports inner join. It would be more convenient if it also supported other 
> join types.
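
For illustration, a minimal Scala sketch of how the improvement is meant to be used, assuming a three-argument overload join(right, usingColumns, joinType) in 1.6; the exact signature is inferred from the issue description and may differ slightly from the merged patch.

{code}
// Sketch assuming a spark-shell session where sqlContext is available.
val left  = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "l")
val right = sqlContext.createDataFrame(Seq((1, "x"))).toDF("id", "r")

// Previously only an inner join was possible when joining by column names:
left.join(right, Seq("id")).show()

// With the improvement, the join type can be passed explicitly:
left.join(right, Seq("id"), "left_outer").show()
{code}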






[jira] [Resolved] (SPARK-10419) Add JDBC dialect for Microsoft SQL Server

2015-09-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10419.
-
   Resolution: Fixed
 Assignee: Ewan Leith
Fix Version/s: 1.6.0

> Add JDBC dialect for Microsoft SQL Server
> -
>
> Key: SPARK-10419
> URL: https://issues.apache.org/jira/browse/SPARK-10419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ewan Leith
>Assignee: Ewan Leith
>Priority: Minor
> Fix For: 1.6.0
>
>
> Running JDBC connections against Microsoft SQL Server database tables, when a 
> table contains a datetimeoffset column type, the following error is received:
> {code}
> sqlContext.read.jdbc("jdbc:sqlserver://127.0.0.1:1433;DatabaseName=testdb", 
> "sampletable", prop)
> java.sql.SQLException: Unsupported type -155
> at 
> org.apache.spark.sql.jdbc.JDBCRDD$.org$apache$spark$sql$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:100)
> at 
> org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:137)
> at 
> org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:136)
> at 
> org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:128)
> at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200)
> at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130)
> {code}
> Based on the JdbcDialect code for DB2 and the Microsoft SQL Server 
> documentation, we should probably treat datetimeoffset types as Strings: 
> https://technet.microsoft.com/en-us/library/bb630289%28v=sql.105%29.aspx
> We've created a small addition to JdbcDialects.scala to do this conversion; 
> I'll create a pull request for it.
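
For reference, a minimal sketch of the kind of dialect the description calls for, using the public JdbcDialect/JdbcDialects API; the object name and the exact type check are illustrative and not necessarily identical to the merged patch.

{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

// Illustrative sketch only.
case object SqlServerDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    // datetimeoffset surfaces as a vendor-specific JDBC type code (-155 in the
    // stack trace above); map it to a string, per the linked SQL Server docs.
    if (typeName.contains("datetimeoffset")) Some(StringType) else None
  }
}

JdbcDialects.registerDialect(SqlServerDialect)
{code}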






[jira] [Commented] (SPARK-10577) [PySpark] DataFrame hint for broadcast join

2015-09-21 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902049#comment-14902049
 ] 

Reynold Xin commented on SPARK-10577:
-

[~maver1ck] The patch is now merged. To use it in Spark 1.5, you can create a 
broadcast function yourself, similar to the ones here: 
https://github.com/apache/spark/pull/8801/files


> [PySpark] DataFrame hint for broadcast join
> ---
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>  Labels: starter
> Fix For: 1.6.0
>
>
> As in https://issues.apache.org/jira/browse/SPARK-8300,
> there should be a way to add a broadcast join hint in:
> - PySpark






[jira] [Resolved] (SPARK-10577) [PySpark] DataFrame hint for broadcast join

2015-09-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10577.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> [PySpark] DataFrame hint for broadcast join
> ---
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>  Labels: starter
> Fix For: 1.6.0
>
>
> As in https://issues.apache.org/jira/browse/SPARK-8300,
> there should be a way to add a broadcast join hint in:
> - PySpark






[jira] [Assigned] (SPARK-10750) ML Param validate should print better error information

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10750:


Assignee: Apache Spark

> ML Param validate should print better error information
> ---
>
> Key: SPARK-10750
> URL: https://issues.apache.org/jira/browse/SPARK-10750
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, when you set an illegal value for a param of array type (such as 
> IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an 
> IllegalArgumentException with incomprehensible error information.
> For example:
> val vectorSlicer = new 
> VectorSlicer().setInputCol("features").setOutputCol("result")
> vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
> It throws an IllegalArgumentException such as:
> java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
> given invalid value [Ljava.lang.String;@798256c5.
> Users cannot tell which params were set incorrectly.






[jira] [Assigned] (SPARK-10750) ML Param validate should print better error information

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10750:


Assignee: (was: Apache Spark)

> ML Param validate should print better error information
> ---
>
> Key: SPARK-10750
> URL: https://issues.apache.org/jira/browse/SPARK-10750
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Currently, when you set an illegal value for a param of array type (such as 
> IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an 
> IllegalArgumentException with incomprehensible error information.
> For example:
> val vectorSlicer = new 
> VectorSlicer().setInputCol("features").setOutputCol("result")
> vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
> It throws an IllegalArgumentException such as:
> java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
> given invalid value [Ljava.lang.String;@798256c5.
> Users cannot tell which params were set incorrectly.






[jira] [Commented] (SPARK-10750) ML Param validate should print better error information

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902044#comment-14902044
 ] 

Apache Spark commented on SPARK-10750:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8863

> ML Param validate should print better error information
> ---
>
> Key: SPARK-10750
> URL: https://issues.apache.org/jira/browse/SPARK-10750
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Currently, when you set an illegal value for a param of array type (such as 
> IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an 
> IllegalArgumentException with incomprehensible error information.
> For example:
> val vectorSlicer = new 
> VectorSlicer().setInputCol("features").setOutputCol("result")
> vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
> It throws an IllegalArgumentException such as:
> java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
> given invalid value [Ljava.lang.String;@798256c5.
> Users cannot tell which params were set incorrectly.






[jira] [Commented] (SPARK-8386) DataFrame and JDBC regression

2015-09-21 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902042#comment-14902042
 ] 

Reynold Xin commented on SPARK-8386:


[~viirya] do you have time to take a look?

> DataFrame and JDBC regression
> -
>
> Key: SPARK-8386
> URL: https://issues.apache.org/jira/browse/SPARK-8386
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: RHEL 7.1
>Reporter: Peter Haumer
>Priority: Critical
>
> I have an ETL app that appends new results to a JDBC table at each run. 
> In 1.3.1 I did this:
> testResultsDF.insertIntoJDBC(CONNECTION_URL, TABLE_NAME, false);
> When I do this now in 1.4, it complains that the "object" 'TABLE_NAME' already 
> exists. I get this even if I switch the overwrite flag to true. I also tried:
> testResultsDF.write().mode(SaveMode.Append).jdbc(CONNECTION_URL, TABLE_NAME, 
> connectionProperties);
> and get the same error. The first run works, creating the new table and adding 
> data successfully. But running it a second time, it (the JDBC driver) tells me 
> that the table already exists. Even SaveMode.Overwrite gives me the same error.






[jira] [Resolved] (SPARK-10716) spark-1.5.0-bin-hadoop2.6.tgz file doesn't uncompress on OS X due to hidden file

2015-09-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10716.
-
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 1.5.1
   1.6.0

> spark-1.5.0-bin-hadoop2.6.tgz file doesn't uncompress on OS X due to hidden 
> file
> 
>
> Key: SPARK-10716
> URL: https://issues.apache.org/jira/browse/SPARK-10716
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Deploy
>Affects Versions: 1.5.0
> Environment: Yosemite 10.10.5
>Reporter: Jack Jack
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.6.0, 1.5.1
>
>
> Directly downloaded the prebuilt binaries from 
> http://d3kbcqa49mib13.cloudfront.net/spark-1.5.0-bin-hadoop2.6.tgz 
> and got an error when running tar xvzf on it. Tried downloading and extracting twice.
> error log:
> ..
> x spark-1.5.0-bin-hadoop2.6/lib/
> x spark-1.5.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar
> x spark-1.5.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar
> x spark-1.5.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar
> x spark-1.5.0-bin-hadoop2.6/lib/spark-examples-1.5.0-hadoop2.6.0.jar
> x spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar
> x spark-1.5.0-bin-hadoop2.6/lib/spark-1.5.0-yarn-shuffle.jar
> x spark-1.5.0-bin-hadoop2.6/README.md
> tar: copyfile unpack 
> (spark-1.5.0-bin-hadoop2.6/python/test_support/sql/orc_partitioned/SUCCESS.crc)
>  failed: No such file or directory
> ~ :>






[jira] [Commented] (SPARK-10750) ML Param validate should print better error information

2015-09-21 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902029#comment-14902029
 ] 

Yanbo Liang commented on SPARK-10750:
-

This is because Param.validate(value: T) uses value.toString in the error 
message, which does not render array-typed values in a readable way.
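
A plain-Scala illustration of the point (not the actual Param.validate code): calling toString on an array gives the opaque JVM form, while formatting the elements yields a readable message.

{code}
val names = Array("f1", "f4", "f1")

// Default Array.toString: prints something like [Ljava.lang.String;@798256c5
println(s"parameter names given invalid value ${names.toString}.")

// Formatting the elements makes the offending value obvious:
println(s"parameter names given invalid value ${names.mkString("[", ",", "]")}.")
{code}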

> ML Param validate should print better error information
> ---
>
> Key: SPARK-10750
> URL: https://issues.apache.org/jira/browse/SPARK-10750
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Currently, when you set an illegal value for a param of array type (such as 
> IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an 
> IllegalArgumentException with incomprehensible error information.
> For example:
> val vectorSlicer = new 
> VectorSlicer().setInputCol("features").setOutputCol("result")
> vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
> It throws an IllegalArgumentException such as:
> java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
> given invalid value [Ljava.lang.String;@798256c5.
> Users cannot tell which params were set incorrectly.






[jira] [Closed] (SPARK-10751) ML Param validate should print better error information

2015-09-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang closed SPARK-10751.
---
Resolution: Duplicate

> ML Param validate should print better error information
> ---
>
> Key: SPARK-10751
> URL: https://issues.apache.org/jira/browse/SPARK-10751
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Currently, when you set an illegal value for a param of array type (such as 
> IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an 
> IllegalArgumentException with incomprehensible error information.
> For example:
> val vectorSlicer = new 
> VectorSlicer().setInputCol("features").setOutputCol("result")
> vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
> It throws an IllegalArgumentException such as:
> java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
> given invalid value [Ljava.lang.String;@798256c5.
> Users cannot tell which params were set incorrectly.






[jira] [Resolved] (SPARK-9821) pyspark reduceByKey should allow a custom partitioner

2015-09-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9821.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8569
[https://github.com/apache/spark/pull/8569]

> pyspark reduceByKey should allow a custom partitioner
> -
>
> Key: SPARK-9821
> URL: https://issues.apache.org/jira/browse/SPARK-9821
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Diana Carroll
>Priority: Minor
> Fix For: 1.6.0
>
>
> In Scala, I can supply a custom partitioner to reduceByKey (and other 
> aggregation/repartitioning methods like aggregateByKey and combineByKey), 
> but as far as I can tell from the PySpark API, there's no way to do the same 
> in Python.
> Here's an example of my code in Scala:
> {code}weblogs.map(s => (getFileType(s), 1)).reduceByKey(new 
> FileTypePartitioner(),_+_){code}
> But I can't figure out how to do the same in Python. The closest I can get 
> is to call partitionBy before reduceByKey like so:
> {code}weblogs.map(lambda s: (getFileType(s), 
> 1)).partitionBy(3,hash_filetype).reduceByKey(lambda v1,v2: 
> v1+v2).collect(){code}
> But that defeats the purpose, because I'm shuffling twice instead of once, so 
> my performance is worse instead of better.






[jira] [Created] (SPARK-10751) ML Param validate should print better error information

2015-09-21 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10751:
---

 Summary: ML Param validate should print better error information
 Key: SPARK-10751
 URL: https://issues.apache.org/jira/browse/SPARK-10751
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Yanbo Liang
Priority: Minor


Currently, when you set an illegal value for a param of array type (such as 
IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an 
IllegalArgumentException with incomprehensible error information.
For example:

val vectorSlicer = new 
VectorSlicer().setInputCol("features").setOutputCol("result")
vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))

It throws an IllegalArgumentException such as:
java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
given invalid value [Ljava.lang.String;@798256c5.

Users cannot tell which params were set incorrectly.






[jira] [Created] (SPARK-10750) ML Param validate should print better error information

2015-09-21 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10750:
---

 Summary: ML Param validate should print better error information
 Key: SPARK-10750
 URL: https://issues.apache.org/jira/browse/SPARK-10750
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Yanbo Liang
Priority: Minor


Currently, when you set an illegal value for a param of array type (such as 
IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an 
IllegalArgumentException with incomprehensible error information.
For example:

val vectorSlicer = new 
VectorSlicer().setInputCol("features").setOutputCol("result")
vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))

It throws an IllegalArgumentException such as:
java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
given invalid value [Ljava.lang.String;@798256c5.

Users cannot tell which params were set incorrectly.






[jira] [Created] (SPARK-10749) Support multiple roles with Spark Mesos dispatcher

2015-09-21 Thread Timothy Chen (JIRA)
Timothy Chen created SPARK-10749:


 Summary: Support multiple roles with Spark Mesos dispatcher
 Key: SPARK-10749
 URL: https://issues.apache.org/jira/browse/SPARK-10749
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen


Although you can currently set the framework role of the Mesos dispatcher, it 
doesn't correctly use the offers given to it.

It should follow how the coarse-grained/fine-grained schedulers work and use 
offers from multiple roles.






[jira] [Created] (SPARK-10748) Log error instead of crashing Spark Mesos dispatcher when a job is misconfigured

2015-09-21 Thread Timothy Chen (JIRA)
Timothy Chen created SPARK-10748:


 Summary: Log error instead of crashing Spark Mesos dispatcher when 
a job is misconfigured
 Key: SPARK-10748
 URL: https://issues.apache.org/jira/browse/SPARK-10748
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Timothy Chen


Currently, when the dispatcher is submitting a new driver, it simply throws a 
SparkException when a necessary configuration is not set. We should log the 
error and keep the dispatcher running instead of crashing.






[jira] [Assigned] (SPARK-10731) The head() implementation of dataframe is very slow

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10731:


Assignee: Apache Spark

> The head() implementation of dataframe is very slow
> ---
>
> Key: SPARK-10731
> URL: https://issues.apache.org/jira/browse/SPARK-10731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Jerry Lam
>Assignee: Apache Spark
>  Labels: pyspark
>
> {code}
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
> {code}
> The above lines take over 15 minutes. It seems the DataFrame requires 3 
> stages to return the first row. It reads all the data (about 1 billion 
> rows) and runs Limit twice. The take(1) implementation on the RDD performs 
> much better.






[jira] [Assigned] (SPARK-10731) The head() implementation of dataframe is very slow

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10731:


Assignee: (was: Apache Spark)

> The head() implementation of dataframe is very slow
> ---
>
> Key: SPARK-10731
> URL: https://issues.apache.org/jira/browse/SPARK-10731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Jerry Lam
>  Labels: pyspark
>
> {code}
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
> {code}
> The above lines take over 15 minutes. It seems the DataFrame requires 3 
> stages to return the first row. It reads all the data (about 1 billion 
> rows) and runs Limit twice. The take(1) implementation on the RDD performs 
> much better.






[jira] [Commented] (SPARK-10731) The head() implementation of dataframe is very slow

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901983#comment-14901983
 ] 

Apache Spark commented on SPARK-10731:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8852

> The head() implementation of dataframe is very slow
> ---
>
> Key: SPARK-10731
> URL: https://issues.apache.org/jira/browse/SPARK-10731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Jerry Lam
>  Labels: pyspark
>
> {code}
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
> {code}
> The above lines take over 15 minutes. It seems the DataFrame requires 3 
> stages to return the first row. It reads all the data (about 1 billion 
> rows) and runs Limit twice. The take(1) implementation on the RDD performs 
> much better.






[jira] [Commented] (SPARK-9595) Adding API to SparkConf for kryo serializers registration

2015-09-21 Thread John Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901979#comment-14901979
 ] 

John Chen commented on SPARK-9595:
--

Yes, thanks a lot.

> Adding API to SparkConf for kryo serializers registration
> -
>
> Key: SPARK-9595
> URL: https://issues.apache.org/jira/browse/SPARK-9595
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1
>Reporter: John Chen
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Currently SparkConf has a registerKryoClasses API for Kryo registration. 
> However, this only works for registering classes. If you want to register 
> customized Kryo serializers, you'll have to extend the KryoSerializer class 
> and write some code.
> This is not only very inconvenient, but also requires the registration to be 
> done at compile time, which is not always possible. Thus, I suggest adding 
> another API to SparkConf for registering customized Kryo serializers. It could 
> look like this:
> def registerKryoSerializers(serializers: Map[Class[_], Serializer]): SparkConf
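
For context, a minimal sketch of the kind of compile-time wiring the reporter wants to avoid, using the existing KryoRegistrator hook; the event class, serializer choice, and registrator name are illustrative.

{code}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyEvent(val id: Long)  // hypothetical user class

// Registering a class together with a custom serializer currently has to
// happen in a registrator like this, known at compile time.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyEvent], new JavaSerializer())
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
{code}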






[jira] [Comment Edited] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-21 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901959#comment-14901959
 ] 

Yi Zhou edited comment on SPARK-10733 at 9/22/15 5:18 AM:
--

1) I have never set `spark.task.cpus` and only use the default.
2) scale factor=1000 (1TB data set)
3) Spark conf like below
spark.shuffle.manager=sort
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/\*:/usr/lib/hadoop/client/\*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.serializer=org.apache.spark.serializer.KryoSerializer




was (Author: jameszhouyi):
1) I have never set `spark.task.cpus` and only use the default.
2) scale factor=1000 (1TB data set)
3) Spark conf like below
spark.shuffle.manager=sort
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.serializer=org.apache.spark.serializer.KryoSerializer



> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-21 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901959#comment-14901959
 ] 

Yi Zhou commented on SPARK-10733:
-

1) I have never set `spark.task.cpus` and only use the default.
2) scale factor=1000 (1TB data set)
3) Spark conf like below
spark.shuffle.manager=sort
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.serializer=org.apache.spark.serializer.KryoSerializer



> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-21 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901951#comment-14901951
 ] 

Yi Zhou commented on SPARK-10733:
-

Key SQL Query:
INSERT INTO TABLE test_table
SELECT
  ss.ss_customer_sk AS cid,
  count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
  count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
  count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
  count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
  count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
  count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
  count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
  count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
  count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
  count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
  count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
  count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
  count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
  count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
  count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
FROM store_sales ss
INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
WHERE i.i_category IN ('Books')
AND ss.ss_customer_sk IS NOT NULL
GROUP BY ss.ss_customer_sk
HAVING count(ss.ss_item_sk) > 5

Note:
the store_sales is a big fact table and item is a small dimension table.

> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-21 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901944#comment-14901944
 ] 

Saisai Shao commented on SPARK-10739:
-

[~srowen] I'm not sure the previous JIRA you mentioned is exactly the same 
thing as what I describe here. 

This JIRA is about letting YARN know that this is a long-running service and 
ignore out-of-window failures (YARN-611), which is probably different from the 
one you mentioned.

> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on YARN uses a max-attempts setting to control the failure 
> count: if an application's failure count reaches the max attempts, the 
> application will not be recovered by the RM. This is not very effective for 
> long-running applications, since they can easily exceed the max number over a 
> long period, and setting a very large max attempts will hide the real problem.
> So this introduces an attempt window to control the application attempt count, 
> ignoring attempts that fall outside the window. The mechanism was introduced in 
> Hadoop 2.6+ to support long-running applications, and it is quite useful for 
> Spark Streaming and Spark shell-like applications.
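
For reference, a hedged sketch of the underlying Hadoop 2.6+ mechanism (YARN-611) that this proposal would expose: the application submission context carries a failure-validity interval, so only recent attempt failures count against the max. The Spark-side configuration name is left to the eventual patch.

{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext
import org.apache.hadoop.yarn.util.Records

val appContext = Records.newRecord(classOf[ApplicationSubmissionContext])
appContext.setMaxAppAttempts(4)
// Only attempt failures within the last hour count against the 4 attempts above.
appContext.setAttemptFailuresValidityInterval(60L * 60 * 1000)
{code}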






[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-21 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901901#comment-14901901
 ] 

Sandy Ryza commented on SPARK-10739:


I recall there was a JIRA similar to this that avoided killing the application 
when we reached a certain number of executor failures.  However, IIUC, this is 
about something different: deciding whether to have YARN restart the 
application when it fails.

> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on YARN uses a max-attempts setting to control the failure 
> count: if an application's failure count reaches the max attempts, the 
> application will not be recovered by the RM. This is not very effective for 
> long-running applications, since they can easily exceed the max number over a 
> long period, and setting a very large max attempts will hide the real problem.
> So this introduces an attempt window to control the application attempt count, 
> ignoring attempts that fall outside the window. The mechanism was introduced in 
> Hadoop 2.6+ to support long-running applications, and it is quite useful for 
> Spark Streaming and Spark shell-like applications.






[jira] [Resolved] (SPARK-10711) Do not assume spark.submit.deployMode is always set

2015-09-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10711.
-
   Resolution: Fixed
 Assignee: Hossein Falaki
Fix Version/s: 1.6.0

> Do not assume spark.submit.deployMode is always set
> ---
>
> Key: SPARK-10711
> URL: https://issues.apache.org/jira/browse/SPARK-10711
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.5.0
>Reporter: Hossein Falaki
>Assignee: Hossein Falaki
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>
> in RRDD.createRProcess() we call RUtils.sparkRPackagePath(), which assumes 
> "... that Spark properties `spark.master` and `spark.submit.deployMode` are 
> set."
> It is better to assume safe defaults if they are not set.






[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901866#comment-14901866
 ] 

Sean Owen commented on SPARK-10739:
---

I'm sure we've discussed this one before and that there's a JIRA for it ... but 
I can't for the life of me find it. I feel like [~sandyr] or [~vanzin] commented 
on it. The question was how far back you looked when considering whether "a lot" 
of failures had occurred, etc.

> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on YARN uses a max-attempts setting to control the failure 
> count: if an application's failure count reaches the max attempts, the 
> application will not be recovered by the RM. This is not very effective for 
> long-running applications, since they can easily exceed the max number over a 
> long period, and setting a very large max attempts will hide the real problem.
> So this introduces an attempt window to control the application attempt count, 
> ignoring attempts that fall outside the window. The mechanism was introduced in 
> Hadoop 2.6+ to support long-running applications, and it is quite useful for 
> Spark Streaming and Spark shell-like applications.






[jira] [Commented] (SPARK-10495) For json data source, date values are saved as int strings

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901847#comment-14901847
 ] 

Apache Spark commented on SPARK-10495:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/8861

> For json data source, date values are saved as int strings
> --
>
> Key: SPARK-10495
> URL: https://issues.apache.org/jira/browse/SPARK-10495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> {code}
> val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j")
> df.write.format("json").save("/tmp/testJson")
> sc.textFile("/tmp/testJson").collect.foreach(println)
> {"i":1,"j":"-25567"}
> {code}






[jira] [Created] (SPARK-10747) add support for window specification to include how NULLS are ordered

2015-09-21 Thread N Campbell (JIRA)
N Campbell created SPARK-10747:
--

 Summary: add support for window specification to include how NULLS 
are ordered
 Key: SPARK-10747
 URL: https://issues.apache.org/jira/browse/SPARK-10747
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0
Reporter: N Campbell


You cannot express how NULLS are to be sorted in the window order specification 
and have to use a compensating expression to simulate it.

Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' 
near 'nulls'
line 1:82 missing EOF at 'last' near 'nulls';
SQLState:  null

Same limitation as Hive, reported in Apache JIRA HIVE-9535.

This fails:
select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc 
nulls last) from tolap

Compensating workaround:
select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when 
c3 is null then 1 else 0 end) from tolap






[jira] [Created] (SPARK-10746) count ( distinct columnref) over () returns wrong result set

2015-09-21 Thread N Campbell (JIRA)
N Campbell created SPARK-10746:
--

 Summary: count ( distinct columnref) over () returns wrong result 
set
 Key: SPARK-10746
 URL: https://issues.apache.org/jira/browse/SPARK-10746
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: N Campbell


Same issue as reported against Hive (HIVE-9534).

The result set was expected to contain 5 rows instead of 1 row, as other 
vendors (Oracle, Netezza, etc.) return.

select count( distinct column) over () from t1






[jira] [Created] (SPARK-10745) Separate configs between shuffle and RPC

2015-09-21 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-10745:


 Summary: Separate configs between shuffle and RPC
 Key: SPARK-10745
 URL: https://issues.apache.org/jira/browse/SPARK-10745
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shixiong Zhu


SPARK-6028 uses the network module to implement RPC. However, some 
configurations in the network module are named with the `spark.shuffle` prefix. 
We should refactor them and make sure the user can control shuffle and RPC 
settings separately.






[jira] [Created] (SPARK-10744) parser error (constant * column is null interpreted as constant * boolean)

2015-09-21 Thread N Campbell (JIRA)
N Campbell created SPARK-10744:
--

 Summary: parser error (constant * column is null interpreted as 
constant * boolean)
 Key: SPARK-10744
 URL: https://issues.apache.org/jira/browse/SPARK-10744
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: N Campbell
Priority: Minor


Spark SQL inherits the same defect as Hive: this statement will not 
parse/execute. See HIVE-9530.

 select c1 from t1 where 1 * cnnull is null 
-vs- 
 select c1 from t1 where (1 * cnnull) is null 







[jira] [Commented] (SPARK-10310) [Spark SQL] All result records will be populated into ONE line during the script transform due to missing the correct line/field delimiter

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901774#comment-14901774
 ] 

Apache Spark commented on SPARK-10310:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8860

> [Spark SQL] All result records will be populated into ONE line during the 
> script transform due to missing the correct line/field delimiter
> --
>
> Key: SPARK-10310
> URL: https://issues.apache.org/jira/browse/SPARK-10310
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yi Zhou
>Assignee: zhichao-li
>Priority: Critical
>
> There is a real case using a Python stream script in a Spark SQL query. We 
> found that all result records were written on ONE line when fed from the 
> "select" pipeline into the Python script, so the script cannot identify each 
> record. Also, the field separator in Spark SQL is '^A' ('\001'), which is 
> inconsistent/incompatible with the '\t' used in the Hive implementation.
> Key query:
> {code:sql}
> CREATE VIEW temp1 AS
> SELECT *
> FROM
> (
>   FROM
>   (
> SELECT
>   c.wcs_user_sk,
>   w.wp_type,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams c, web_page w
> WHERE c.wcs_web_page_sk = w.wp_web_page_sk
> AND   c.wcs_web_page_sk IS NOT NULL
> AND   c.wcs_user_sk IS NOT NULL
> AND   c.wcs_sales_skIS NULL --abandoned implies: no sale
> DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wp_type
>   USING 'python sessionize.py 3600'
>   AS (
> wp_type STRING,
> tstamp BIGINT, 
> sessionid STRING)
> ) sessionized
> {code}
> Key Python script:
> {noformat}
> for line in sys.stdin:
>  user_sk,  tstamp_str, value  = line.strip().split("\t")
> {noformat}
> Sample SELECT result:
> {noformat}
> ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
> {noformat}
> Expected result:
> {noformat}
> 31   3237764860   feedback
> 31   3237769106   dynamic
> 31   3237779027   review
> {noformat}






[jira] [Assigned] (SPARK-10743) keep the name of expression if possible when do cast

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10743:


Assignee: (was: Apache Spark)

> keep the name of expression if possible when do cast
> 
>
> Key: SPARK-10743
> URL: https://issues.apache.org/jira/browse/SPARK-10743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Commented] (SPARK-10743) keep the name of expression if possible when do cast

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901768#comment-14901768
 ] 

Apache Spark commented on SPARK-10743:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8859

> keep the name of expression if possible when do cast
> 
>
> Key: SPARK-10743
> URL: https://issues.apache.org/jira/browse/SPARK-10743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Assigned] (SPARK-10743) keep the name of expression if possible when do cast

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10743:


Assignee: Apache Spark

> keep the name of expression if possible when do cast
> 
>
> Key: SPARK-10743
> URL: https://issues.apache.org/jira/browse/SPARK-10743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Created] (SPARK-10743) keep the name of expression if possible when do cast

2015-09-21 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-10743:
---

 Summary: keep the name of expression if possible when do cast
 Key: SPARK-10743
 URL: https://issues.apache.org/jira/browse/SPARK-10743
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan









[jira] [Assigned] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10742:


Assignee: Apache Spark  (was: Tathagata Das)

> Add the ability to embed HTML relative links in job descriptions
> 
>
> Key: SPARK-10742
> URL: https://issues.apache.org/jira/browse/SPARK-10742
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Minor
>
> This is to allow embedding links to other Spark UI tabs within the job 
> description. For example, streaming jobs could set descriptions with links 
> pointing to the corresponding details page of the batch that the job belongs 
> to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10742:


Assignee: Tathagata Das  (was: Apache Spark)

> Add the ability to embed HTML relative links in job descriptions
> 
>
> Key: SPARK-10742
> URL: https://issues.apache.org/jira/browse/SPARK-10742
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>
> This is to allow embedding links to other Spark UI tabs within the job 
> description. For example, streaming jobs could set descriptions with links 
> pointing to the corresponding details page of the batch that the job belongs 
> to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901763#comment-14901763
 ] 

Apache Spark commented on SPARK-10742:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/8791

> Add the ability to embed HTML relative links in job descriptions
> 
>
> Key: SPARK-10742
> URL: https://issues.apache.org/jira/browse/SPARK-10742
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>
> This is to allow embedding links to other Spark UI tabs within the job 
> description. For example, streaming jobs could set descriptions with links 
> pointing to the corresponding details page of the batch that the job belongs 
> to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions

2015-09-21 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-10742:
-

 Summary: Add the ability to embed HTML relative links in job 
descriptions
 Key: SPARK-10742
 URL: https://issues.apache.org/jira/browse/SPARK-10742
 Project: Spark
  Issue Type: Improvement
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Minor


This is to allow embedding links to other Spark UI tabs within the job 
description. For example, streaming jobs could set descriptions with links 
pointing to the corresponding details page of the batch that the job belongs to.
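
For illustration only, a minimal sketch of how a description carrying such a relative link might be set from application code (assuming the plain SparkContext.setJobDescription API and an existing RDD named rdd; the batch id and URL below are hypothetical):
{code}
// Hypothetical sketch: tag the jobs of one batch with an HTML description that
// links back to that batch's details page, then clear the property afterwards.
sc.setJobDescription(
  """Streaming job for <a href="/streaming/batch/?id=1442915880000">batch 1442915880000</a>""")
rdd.count()                 // jobs triggered here would carry the linked description
sc.setJobDescription(null)  // unset so unrelated jobs do not inherit it
{code}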



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-21 Thread Ian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian updated SPARK-10741:

Description: 
Failed Query with Having Clause
{code}
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)

{code}

Failed Query with OrderBy
{code}
  def testParquetOrderBy() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedOrderBy =
  """ SELECT c1, avg ( c2 ) c_avg
| FROM test
| GROUP BY c1
| ORDER BY avg ( c2 )""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedOrderBy).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS 
aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
{code}

  was:
Failed Query with Having Clause
{code}
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)

{code}

Failed Query with OrderBy
{code}
  @Test
  def testParquetOrderBy() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedOrderBy =
  """ SELECT c1, avg ( c2 ) c_avg
| FROM test
| GROUP BY c1
| ORDER BY avg ( c2 )""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedOrderBy).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS 
aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
{code}


> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(ca

[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-21 Thread Ian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian updated SPARK-10741:

Description: 
Failed Query with Having Clause
{code}
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)

{code}

Failed Query with OrderBy
{code}
  @Test
  def testParquetOrderBy() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedOrderBy =
  """ SELECT c1, avg ( c2 ) c_avg
| FROM test
| GROUP BY c1
| ORDER BY avg ( c2 )""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedOrderBy).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS 
aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
{code}

  was:
Failed Query with Having Clause
{code}
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)

{code}

Failed Query with OrderBy
{code}
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val filedOrderBy =
  """ SELECT c1, avg ( c2 ) c_avg
| FROM test
| GROUP BY c1
| ORDER BY avg ( c2 )""".stripMargin
TestHive.sql(ddl)
TestHive.sql(filedOrderBy).collect

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS 
aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
{code}


> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as

[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-21 Thread Ian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian updated SPARK-10741:

Description: 
Failed Query with Having Clause
{code}
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)

{code}

Failed Query with OrderBy
{code}
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val filedOrderBy =
  """ SELECT c1, avg ( c2 ) c_avg
| FROM test
| GROUP BY c1
| ORDER BY avg ( c2 )""".stripMargin
TestHive.sql(ddl)
TestHive.sql(filedOrderBy).collect

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS 
aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
{code}

  was:
Failed Query with Having Clause
{code}
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }
{code}
org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)




> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val filedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> 

[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-21 Thread Ian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian updated SPARK-10741:

Description: 
Failed Query with Having Clause
{code}
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }
{code}
org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)



  was:
Failed Query with Having Clause
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)




> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-21 Thread Ian (JIRA)
Ian created SPARK-10741:
---

 Summary: Hive Query Having/OrderBy against Parquet table is not 
working 
 Key: SPARK-10741
 URL: https://issues.apache.org/jira/browse/SPARK-10741
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Ian


Failed Query with Having Clause
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10723) Add RDD.reduceOption method

2015-09-21 Thread Tatsuya Atsumi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901723#comment-14901723
 ] 

Tatsuya Atsumi commented on SPARK-10723:


Thanks for the comment and advice.
RDD.fold seems to be suitable for my case.
Thank you.

> Add RDD.reduceOption method
> ---
>
> Key: SPARK-10723
> URL: https://issues.apache.org/jira/browse/SPARK-10723
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Tatsuya Atsumi
>Priority: Minor
>
> h2. Problem
> RDD.reduce throws an exception if the RDD is empty.
> That is appropriate behavior if the RDD is expected to be non-empty, but when it 
> is not known until runtime whether the RDD is empty, the reduce call needs to be 
> wrapped in try-catch to be safe. 
> Example Code
> {code}
> // This RDD may be empty or not
> val rdd: RDD[Int] = originalRdd.filter(_ > 10)
> val reduced: Option[Int] = try {
>   Some(rdd.reduce(_ + _))
> } catch {
>   // if rdd is empty return None.
>   case e:UnsupportedOperationException => None
> }
> {code}
> h2. Improvement idea
> Scala’s List has a reduceOption method, which returns None if the List is empty.
> If RDD had a reduceOption API like Scala’s List, it would become easy to handle 
> the above case.
> Example Code
> {code}
> val reduced: Option[Int] = originalRdd.filter(_ > 10).reduceOption(_ + _)
> {code}
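
As a side note to the comment above about RDD.fold, a minimal sketch of the workaround that already works today (assuming an RDD[Int] named originalRdd; 0 is the neutral element of +):
{code}
// fold never throws on an empty RDD because it starts from the zero value,
// so an empty result simply yields 0.
val rdd = originalRdd.filter(_ > 10)
val sum: Int = rdd.fold(0)(_ + _)

// If "empty" must be distinguished from "sums to 0", check emptiness explicitly:
val reduced: Option[Int] = if (rdd.isEmpty()) None else Some(rdd.reduce(_ + _))
{code}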



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10652) Set meaningful job descriptions for streaming related jobs

2015-09-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10652:
--
Target Version/s: 1.6.0, 1.5.1  (was: 1.6.0)

> Set meaningful job descriptions for streaming related jobs
> --
>
> Key: SPARK-10652
> URL: https://issues.apache.org/jira/browse/SPARK-10652
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Job descriptions will help distinguish the jobs of one batch from those of 
> another in the Jobs and Stages pages of the Spark UI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10652) Set meaningful job descriptions for streaming related jobs

2015-09-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10652:
--
Summary: Set meaningful job descriptions for streaming related jobs  (was: 
Set good job descriptions for streaming related jobs)

> Set meaningful job descriptions for streaming related jobs
> --
>
> Key: SPARK-10652
> URL: https://issues.apache.org/jira/browse/SPARK-10652
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Job descriptions will help distinguish the jobs of one batch from those of 
> another in the Jobs and Stages pages of the Spark UI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10495) For json data source, date values are saved as int strings

2015-09-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-10495.

   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8806
[https://github.com/apache/spark/pull/8806]

> For json data source, date values are saved as int strings
> --
>
> Key: SPARK-10495
> URL: https://issues.apache.org/jira/browse/SPARK-10495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> {code}
> val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j")
> df.write.format("json").save("/tmp/testJson")
> sc.textFile("/tmp/testJson").collect.foreach(println)
> {"i":1,"j":"-25567"}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10692) Failed batches are never reported through the StreamingListener interface

2015-09-21 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901695#comment-14901695
 ] 

Tathagata Das commented on SPARK-10692:
---

Yes, that is an independent problem; we need to think about it separately. 
For now, at the least, we need to expose failed batches through StreamingListener 
so that apps that do not immediately exit on one failure can generate the streaming 
UI correctly.

> Failed batches are never reported through the StreamingListener interface
> -
>
> Key: SPARK-10692
> URL: https://issues.apache.org/jira/browse/SPARK-10692
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Shixiong Zhu
>Priority: Blocker
>
> If an output operation fails, then the corresponding batch is never marked as 
> completed, as the data structures are not updated properly.
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L183
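
For reference, a minimal sketch of the listener side that this affects (standard StreamingListener API, assuming an existing StreamingContext named ssc):
{code}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// A listener that logs batch completions. Because of this bug, a batch whose
// output operation fails never reaches onBatchCompleted, so a long-running
// app cannot observe the failure through this interface.
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch ${info.batchTime} completed, processing delay = ${info.processingDelay}")
  }
})
{code}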



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-21 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901694#comment-14901694
 ] 

Andrew Or commented on SPARK-10474:
---

Also, would you mind testing out this commit to see if it fixes your problem?
https://github.com/andrewor14/spark/commits/aggregate-test

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In an aggregation case, a lost task occurred with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA

[jira] [Updated] (SPARK-10310) [Spark SQL] All result records will be populated into ONE line during the script transform due to missing the correct line/field delimiter

2015-09-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-10310:
---
Assignee: zhichao-li

> [Spark SQL] All result records will be populated into ONE line during the 
> script transform due to missing the correct line/field delimiter
> --
>
> Key: SPARK-10310
> URL: https://issues.apache.org/jira/browse/SPARK-10310
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yi Zhou
>Assignee: zhichao-li
>Priority: Critical
>
> There is a real case using a Python streaming script in a Spark SQL query. We found 
> that all result records were written on ONE line as input from the "select" 
> pipeline to the Python script, so the script cannot identify each 
> record. Also, the field separator in Spark SQL is '^A' ('\001'), which is 
> inconsistent/incompatible with the '\t' used in the Hive implementation.
> Key query:
> {code:sql}
> CREATE VIEW temp1 AS
> SELECT *
> FROM
> (
>   FROM
>   (
> SELECT
>   c.wcs_user_sk,
>   w.wp_type,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams c, web_page w
> WHERE c.wcs_web_page_sk = w.wp_web_page_sk
> AND   c.wcs_web_page_sk IS NOT NULL
> AND   c.wcs_user_sk IS NOT NULL
> AND   c.wcs_sales_skIS NULL --abandoned implies: no sale
> DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wp_type
>   USING 'python sessionize.py 3600'
>   AS (
> wp_type STRING,
> tstamp BIGINT, 
> sessionid STRING)
> ) sessionized
> {code}
> Key Python script:
> {noformat}
> for line in sys.stdin:
>  user_sk,  tstamp_str, value  = line.strip().split("\t")
> {noformat}
> Sample SELECT result:
> {noformat}
> ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
> {noformat}
> Expected result:
> {noformat}
> 31   3237764860   feedback
> 31   3237769106   dynamic
> 31   3237779027   review
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10740:


Assignee: (was: Apache Spark)

> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> We should only push down deterministic filter conditions to set operators.
> For Union, let's say we apply a non-deterministic filter on 1...5 union 1...5, 
> and we may get 1,3 for the left side and 2,4 for the right side; then the 
> result should be 1,3,2,4. If we push down this filter, we get 1,3 for both 
> sides (we create a new random object with the same seed on each side) and the 
> result would be 1,3,1,3.
> For Intersect, let's say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and a row is present on both sides of 
> an Intersect. Once we push down this condition, the probability of accepting 
> this row becomes 0.25.
> For Except, let's say there is a row that is present on both sides of an 
> Except. This row should not be in the final output. However, if we push down a 
> non-deterministic condition, it is possible that this row is rejected from 
> one side and then we output a row that should not be part of the result.
>  We should only push down deterministic projections to Union.
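
To illustrate the Union case described above, a minimal sketch (assuming a Spark 1.5-era SQLContext named sqlContext; this only demonstrates the semantics, not the fix):
{code}
import org.apache.spark.sql.functions.rand

// A seeded non-deterministic filter applied above the union: each input row
// gets one draw from a single random stream, as intended.
val left  = sqlContext.range(1, 6)
val right = sqlContext.range(1, 6)
val correct = left.unionAll(right).filter(rand(42) > 0.5)

// If the optimizer pushed filter(rand(42) > 0.5) below the union, each side
// would rebuild the generator with the same seed and keep exactly the same
// subset twice, e.g. 1,3,1,3 instead of something like 1,3,2,4.
{code}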



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901676#comment-14901676
 ] 

Apache Spark commented on SPARK-10740:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8858

> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> We should only push down deterministic filter conditions to set operators.
> For Union, let's say we apply a non-deterministic filter on 1...5 union 1...5, 
> and we may get 1,3 for the left side and 2,4 for the right side; then the 
> result should be 1,3,2,4. If we push down this filter, we get 1,3 for both 
> sides (we create a new random object with the same seed on each side) and the 
> result would be 1,3,1,3.
> For Intersect, let's say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and a row is present on both sides of 
> an Intersect. Once we push down this condition, the probability of accepting 
> this row becomes 0.25.
> For Except, let's say there is a row that is present on both sides of an 
> Except. This row should not be in the final output. However, if we push down a 
> non-deterministic condition, it is possible that this row is rejected from 
> one side and then we output a row that should not be part of the result.
>  We should only push down deterministic projections to Union.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10740:


Assignee: Apache Spark

> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> We should only push down deterministic filter conditions to set operators.
> For Union, let's say we apply a non-deterministic filter on 1...5 union 1...5, 
> and we may get 1,3 for the left side and 2,4 for the right side; then the 
> result should be 1,3,2,4. If we push down this filter, we get 1,3 for both 
> sides (we create a new random object with the same seed on each side) and the 
> result would be 1,3,1,3.
> For Intersect, let's say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and a row is present on both sides of 
> an Intersect. Once we push down this condition, the probability of accepting 
> this row becomes 0.25.
> For Except, let's say there is a row that is present on both sides of an 
> Except. This row should not be in the final output. However, if we push down a 
> non-deterministic condition, it is possible that this row is rejected from 
> one side and then we output a row that should not be part of the result.
>  We should only push down deterministic projections to Union.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10740:

Description: 
We should only push down deterministic filter conditions to set operators.
For Union, let's say we apply a non-deterministic filter on 1...5 union 1...5, and 
we may get 1,3 for the left side and 2,4 for the right side; then the result 
should be 1,3,2,4. If we push down this filter, we get 1,3 for both sides (we 
create a new random object with the same seed on each side) and the result would be 
1,3,1,3.
For Intersect, let's say there is a non-deterministic condition with a 0.5 
probability of accepting a row, and a row is present on both sides of an 
Intersect. Once we push down this condition, the probability of accepting this row 
becomes 0.25.
For Except, let's say there is a row that is present on both sides of an Except. 
This row should not be in the final output. However, if we push down a 
non-deterministic condition, it is possible that this row is rejected from one 
side and then we output a row that should not be part of the result.

 We should only push down deterministic projections to Union.


  was:
We should only push down deterministic filter conditions to set operators.
For Union, let's say we apply a non-deterministic filter on 1...5 union 1...5, and 
we may get 1,3 for the left side and 2,4 for the right side; then the result 
should be 1,3,2,4. If we push down this filter, we get 1,3 for both sides and 
the result would be 1,3,1,3.
For Intersect, let's say there is a non-deterministic condition with a 0.5 
probability of accepting a row, and a row is present on both sides of an 
Intersect. Once we push down this condition, the probability of accepting this row 
becomes 0.25.
For Except, let's say there is a row that is present on both sides of an Except. 
This row should not be in the final output. However, if we push down a 
non-deterministic condition, it is possible that this row is rejected from one 
side and then we output a row that should not be part of the result.

 We should only push down deterministic projections to Union.



> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> We should only push down deterministic filter conditions to set operators.
> For Union, let's say we apply a non-deterministic filter on 1...5 union 1...5, 
> and we may get 1,3 for the left side and 2,4 for the right side; then the 
> result should be 1,3,2,4. If we push down this filter, we get 1,3 for both 
> sides (we create a new random object with the same seed on each side) and the 
> result would be 1,3,1,3.
> For Intersect, let's say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and a row is present on both sides of 
> an Intersect. Once we push down this condition, the probability of accepting 
> this row becomes 0.25.
> For Except, let's say there is a row that is present on both sides of an 
> Except. This row should not be in the final output. However, if we push down a 
> non-deterministic condition, it is possible that this row is rejected from 
> one side and then we output a row that should not be part of the result.
>  We should only push down deterministic projections to Union.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10740:

Description: 
We should only push down deterministic filter conditions to set operators.
For Union, let's say we apply a non-deterministic filter on 1...5 union 1...5, and 
we may get 1,3 for the left side and 2,4 for the right side; then the result 
should be 1,3,2,4. If we push down this filter, we get 1,3 for both sides and 
the result would be 1,3,1,3.
For Intersect, let's say there is a non-deterministic condition with a 0.5 
probability of accepting a row, and a row is present on both sides of an 
Intersect. Once we push down this condition, the probability of accepting this row 
becomes 0.25.
For Except, let's say there is a row that is present on both sides of an Except. 
This row should not be in the final output. However, if we push down a 
non-deterministic condition, it is possible that this row is rejected from one 
side and then we output a row that should not be part of the result.

 We should only push down deterministic projections to Union.


> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> We should only push down deterministic filter conditions to set operators.
> For Union, let's say we apply a non-deterministic filter on 1...5 union 1...5, 
> and we may get 1,3 for the left side and 2,4 for the right side; then the 
> result should be 1,3,2,4. If we push down this filter, we get 1,3 for both 
> sides and the result would be 1,3,1,3.
> For Intersect, let's say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and a row is present on both sides of 
> an Intersect. Once we push down this condition, the probability of accepting 
> this row becomes 0.25.
> For Except, let's say there is a row that is present on both sides of an 
> Except. This row should not be in the final output. However, if we push down a 
> non-deterministic condition, it is possible that this row is rejected from 
> one side and then we output a row that should not be part of the result.
>  We should only push down deterministic projections to Union.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10739) Add attempt window for long-running Spark applications on YARN

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901656#comment-14901656
 ] 

Apache Spark commented on SPARK-10739:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/8857

> Add attempt window for long-running Spark applications on YARN
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on YARN uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will 
> not be recovered by the RM. This is not very effective for long-running 
> applications, since they can easily exceed the max over a long period of time, 
> and setting a very large max attempts only hides the real problem.
> So this introduces an attempt window to bound the application attempt count, 
> ignoring attempts that fall outside the window. The mechanism is available in 
> Hadoop 2.6+ to support long-running applications, and it is quite useful for 
> Spark Streaming and Spark shell-like applications.
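
For context, a hedged sketch of how such a window might be configured once the change lands (the configuration key spark.yarn.am.attemptFailuresValidityInterval is an assumption borrowed from the Hadoop 2.6 feature, not a confirmed name from this patch):
{code}
// Sketch only: allow several AM attempts, but let old failures "expire" after
// an hour so a long-running streaming app is not killed by failures that are
// spread out over days. The interval key below is an assumption.
val conf = new org.apache.spark.SparkConf()
  .set("spark.yarn.maxAppAttempts", "4")
  .set("spark.yarn.am.attemptFailuresValidityInterval", "1h")
{code}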



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10739) Add attempt window for long-running Spark applications on YARN

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10739:


Assignee: (was: Apache Spark)

> Add attempt window for long-running Spark applications on YARN
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on YARN uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will 
> not be recovered by the RM. This is not very effective for long-running 
> applications, since they can easily exceed the max over a long period of time, 
> and setting a very large max attempts only hides the real problem.
> So this introduces an attempt window to bound the application attempt count, 
> ignoring attempts that fall outside the window. The mechanism is available in 
> Hadoop 2.6+ to support long-running applications, and it is quite useful for 
> Spark Streaming and Spark shell-like applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10739) Add attempt window for long-running Spark applications on YARN

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10739:


Assignee: Apache Spark

> Add attempt window for long-running Spark applications on YARN
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> Currently Spark on YARN uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will 
> not be recovered by the RM. This is not very effective for long-running 
> applications, since they can easily exceed the max over a long period of time, 
> and setting a very large max attempts only hides the real problem.
> So this introduces an attempt window to bound the application attempt count, 
> ignoring attempts that fall outside the window. The mechanism is available in 
> Hadoop 2.6+ to support long-running applications, and it is quite useful for 
> Spark Streaming and Spark shell-like applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10649) Streaming jobs unexpectedly inherit job group, job descriptions from context starting thread

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901645#comment-14901645
 ] 

Apache Spark commented on SPARK-10649:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/8856

> Streaming jobs unexpectedly inherit job group, job descriptions from context 
> starting thread
> -
>
> Key: SPARK-10649
> URL: https://issues.apache.org/jira/browse/SPARK-10649
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 1.6.0
>
>
> The job group and job description information are passed through thread-local 
> properties and get inherited by child threads. In the case of Spark Streaming, 
> the streaming jobs inherit these properties from the thread that called 
> streamingContext.start(). This may not make sense. 
> 1. Job group: This is mainly used for cancelling a group of jobs together. It 
> does not make sense to cancel streaming jobs like this, as the effect would be 
> unpredictable, and it is not a valid use case anyway; to stop a streaming 
> context, call streamingContext.stop().
> 2. Job description: This is used to pass on nice text descriptions for jobs 
> to show up in the UI. The job description of the thread that calls 
> streamingContext.start() is not useful for the streaming jobs, as it does 
> not make sense for all of the streaming jobs to have the same description, 
> and the description may or may not be related to streaming.
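
For illustration, a minimal sketch of how the inheritance described above shows up (assuming an existing SparkContext sc and StreamingContext ssc; this demonstrates the symptom, not the fix):
{code}
// Thread-local properties set on the driver thread before start() are
// inherited by the scheduler threads that launch the streaming jobs.
sc.setJobGroup("setup-group", "one-off setup jobs")
sc.setJobDescription("loading reference data")
ssc.start()   // streaming jobs end up tagged with the group/description above
// Calling sc.cancelJobGroup("setup-group") later could then also affect
// streaming jobs, which is the unpredictable behavior described here.
{code}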



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-21 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-10740:
---

 Summary: handle nondeterministic expressions correctly for set 
operations
 Key: SPARK-10740
 URL: https://issues.apache.org/jira/browse/SPARK-10740
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10649) Streaming jobs unexpectedly inherit job group, job descriptions from context starting thread

2015-09-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-10649.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> Streaming jobs unexpectedly inherit job group, job descriptions from context 
> starting thread
> -
>
> Key: SPARK-10649
> URL: https://issues.apache.org/jira/browse/SPARK-10649
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 1.6.0
>
>
> The job group and job description information are passed through thread-local 
> properties and get inherited by child threads. In the case of Spark Streaming, 
> the streaming jobs inherit these properties from the thread that called 
> streamingContext.start(). This may not make sense. 
> 1. Job group: This is mainly used for cancelling a group of jobs together. It 
> does not make sense to cancel streaming jobs like this, as the effect would be 
> unpredictable, and it is not a valid use case anyway; to stop a streaming 
> context, call streamingContext.stop().
> 2. Job description: This is used to pass on nice text descriptions for jobs 
> to show up in the UI. The job description of the thread that calls 
> streamingContext.start() is not useful for the streaming jobs, as it does 
> not make sense for all of the streaming jobs to have the same description, 
> and the description may or may not be related to streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10649) Streaming jobs unexpectedly inherit job group, job descriptions from context starting thread

2015-09-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10649:
--
Target Version/s: 1.6.0, 1.5.1  (was: 1.6.0)

> Streaming jobs unexpectedly inherits job group, job descriptions from context 
> starting thread
> -
>
> Key: SPARK-10649
> URL: https://issues.apache.org/jira/browse/SPARK-10649
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 1.6.0
>
>
> The job group and job description information is passed through thread-local 
> properties and gets inherited by child threads. In the case of Spark 
> Streaming, the streaming jobs inherit these properties from the thread that 
> called streamingContext.start(). This may not make sense.
> 1. Job group: This is mainly used for cancelling a group of jobs together. It 
> does not make sense to cancel streaming jobs like this, as the effect would be 
> unpredictable. It is not a valid use case anyway; to cancel a streaming 
> context, call streamingContext.stop().
> 2. Job description: This is used to pass on nice text descriptions for jobs 
> to show up in the UI. The job description of the thread that calls 
> streamingContext.start() is not useful for all the streaming jobs, as it does 
> not make sense for all of the streaming jobs to have the same description, 
> and the description may or may not be related to streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-21 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901632#comment-14901632
 ] 

Yin Huai commented on SPARK-10474:
--

[~jameszhouyi] [~xjrk] What are your workloads? How large is the data (if the 
data is generated, can you share the script for data generation)? What is your 
query? And did you set any Spark conf?

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small di

[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-09-21 Thread Jon Buffington (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901609#comment-14901609
 ] 

Jon Buffington commented on SPARK-5569:
---

Cody,

We were able to work around the issue by destructuring the OffsetRange type 
into its parts (i.e., string, long, ints). Below is our flow for reference. Our 
only concern is whether the transform operation runs at the driver, as we use a 
singleton on the driver to persist the last captured offsets. Can you confirm 
that the stream transform runs at the driver?

{noformat}
// Recover or create the stream.
val ssc = StreamingContext.getOrCreate(checkpointPath, () => {
  createStreamingContext(checkpointPath)
})
...
def createStreamingContext(checkpointPath: String): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf(), Minutes(1))
  ssc.checkpoint(checkpointPath)
  createStream(ssc)
  ssc
}
...

def createStream(ssc: StreamingContext): Unit = {
...
  KafkaUtils.createDirectStream[K, V, KD, VD, R](ssc, kafkaParams, fromOffsets, 
messageHandler)
// "Note that the typecast to HasOffsetRanges will only succeed if it is
// done in the first method called on the directKafkaStream, not later down
// a chain of methods. You can use transform() instead of foreachRDD() as
// your first method call in order to access offsets, then call further
// Spark methods. However, be aware that the one-to-one mapping between RDD
// partition and Kafka partition does not remain after any methods that
// shuffle or repartition, e.g. reduceByKey() or window()."
.transform { rdd =>
  ...
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  Ranger.captureOffsetRanges(offsetRanges) // De-structure the OffsetRange 
type into its parts and save in a singleton.
  ...
  rdd
}
...
}
{noformat}
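For reference, a compact sketch of the destructuring idea described above. The {{Ranger}} object below stands in for the driver-side holder mentioned in the snippet; its exact shape is assumed, not taken from the ticket. As for the question: the function passed to {{transform}} is evaluated on the driver once per batch (it maps an RDD to an RDD), so a driver-side singleton like this can observe the offsets, while only the operations on the returned RDD run on the executors.

{code}
import org.apache.spark.streaming.kafka.OffsetRange

// Keep only plain values (topic, partition, fromOffset, untilOffset) so that no
// Kafka-specific classes end up referenced from the checkpointed DStream graph.
object Ranger {
  @volatile private var last: Array[(String, Int, Long, Long)] = Array.empty

  def captureOffsetRanges(ranges: Array[OffsetRange]): Unit = {
    last = ranges.map(r => (r.topic, r.partition, r.fromOffset, r.untilOffset))
  }

  def lastOffsets: Array[(String, Int, Long, Long)] = last
}
{code}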


> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180)
> at org.apache.spark.util.Utils$.tryOrIOExcepti

[jira] [Updated] (SPARK-10738) Refactoring `Instance` out from LOR and LIR, and also cleaning up some code

2015-09-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10738:
--
Shepherd: Xiangrui Meng

> Refactoring `Instance` out from LOR and LIR, and also cleaning up some code
> ---
>
> Key: SPARK-10738
> URL: https://issues.apache.org/jira/browse/SPARK-10738
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: DB Tsai
>Assignee: DB Tsai
> Fix For: 1.6.0
>
>
> Refactoring `Instance` case class out from LOR and LIR, and also cleaning up 
> some code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-21 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-10739:

Description: 
Currently Spark on Yarn uses max attempts to control the failure count; if an 
application's failure count reaches the max attempts, the application will not 
be recovered by the RM. This is not very effective for long running 
applications, since they can easily exceed the max count over a long time 
period, and setting a very large max attempts will hide the real problem.

So here we introduce an attempt window to control the application attempt 
count: attempts that fall outside the window are ignored. This window was 
introduced in Hadoop 2.6+ to support long running applications, and it is quite 
useful for long running applications such as Spark Streaming and the Spark 
shell.


  was:
Currently Spark on Yarn uses max attempts to control the failure number, if 
application's failure number reaches to the max attempts, application will not 
be recovered by RM, it is not very for long running applications, since it will 
easily exceed the max number, also setting a very large max attempts will hide 
the real problem.

So here introduce an attempt window to control the application attempt times, 
this will ignore the out of window attempts, it is introduced in Hadoop 2.6+ to 
support long running application, it is quite useful for Spark Streaming, Spark 
shell like applications.



> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on Yarn uses max attempts to control the failure count; if an 
> application's failure count reaches the max attempts, the application will not 
> be recovered by the RM. This is not very effective for long running 
> applications, since they can easily exceed the max count over a long time 
> period, and setting a very large max attempts will hide the real problem.
> So here we introduce an attempt window to control the application attempt 
> count: attempts that fall outside the window are ignored. This window was 
> introduced in Hadoop 2.6+ to support long running applications, and it is 
> quite useful for long running applications such as Spark Streaming and the 
> Spark shell.
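For reference, a hedged sketch of the Hadoop 2.6+ hook this proposal would build on (requires the Hadoop 2.6+ client classes on the classpath). The Spark configuration key used below is an invented placeholder; the ticket does not define one:

{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext
import org.apache.spark.SparkConf

// Called while building the YARN application submission context.
def applyAttemptWindow(appContext: ApplicationSubmissionContext, sparkConf: SparkConf): Unit = {
  // Illustrative config name only; not something defined by this ticket.
  sparkConf.getOption("spark.yarn.am.attemptFailuresValidityInterval").foreach { ms =>
    // Attempt failures older than this interval (milliseconds) no longer count
    // against the configured max AM attempts.
    appContext.setAttemptFailuresValidityInterval(ms.toLong)
  }
}
{code}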



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-21 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-10739:
---

 Summary: Add attempt window for long running Spark application on 
Yarn
 Key: SPARK-10739
 URL: https://issues.apache.org/jira/browse/SPARK-10739
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Saisai Shao
Priority: Minor


Currently Spark on Yarn uses max attempts to control the failure count; if an 
application's failure count reaches the max attempts, the application will not 
be recovered by the RM. This is not very effective for long running 
applications, since they can easily exceed the max count over a long time 
period, and setting a very large max attempts will hide the real problem.

So here we introduce an attempt window to control the application attempt 
count: attempts that fall outside the window are ignored. This window was 
introduced in Hadoop 2.6+ to support long running applications, and it is quite 
useful for long running applications such as Spark Streaming and the Spark 
shell.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid

2015-09-21 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901548#comment-14901548
 ] 

Yin Huai commented on SPARK-10304:
--

I dropped 1.5.1 as the target version since this issue does not prevent users 
from creating a df from a dataset with a valid dir structure (the issue here 
is that users can create a df from a dir with an invalid structure).

> Partition discovery does not throw an exception if the dir structure is 
> invalid
> ---
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Zhan Zhang
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, there will be the following NPE. But, if it is Parquet, 
> we can even return rows. We should complain to users about the dir structure 
> because {{table1}} does not match the expected format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>   at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}
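A minimal sketch of the reported scenario, assuming ORC data has already been written under {{/path/table1/partition_column=1/}} (paths are illustrative):

{code}
// The actual table directory sits one level below the path that gets loaded:
//   /path/table1/partition_column=1/part-00000.orc
// Loading the grandparent directory currently "succeeds" and only blows up later.
val df = sqlContext.read.format("orc").load("/path/")
df.show()   // NPE as in the stack trace above; with Parquet, rows may even come back
{code}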



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid

2015-09-21 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10304:
-
Target Version/s: 1.6.0  (was: 1.6.0, 1.5.1)

> Partition discovery does not throw an exception if the dir structure is 
> invalid
> ---
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Zhan Zhang
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, there will be the following NPE. But, if it is Parquet, 
> we can even return rows. We should complain to users about the dir structure 
> because {{table1}} does not match the expected format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>   at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid

2015-09-21 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10304:
-
Priority: Major  (was: Critical)

> Partition discovery does not throw an exception if the dir structure is 
> invalid
> ---
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Zhan Zhang
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, there will be the following NPE. But, if it is Parquet, 
> we can even return rows. We should complain to users about the dir structure 
> because {{table1}} does not match the expected format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>   at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10718) Update License on conf files and corresponding excludes file

2015-09-21 Thread Rekha Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rekha Joshi updated SPARK-10718:

Description: 
Update the license header on conf files and the corresponding excludes file.
{code}
Apache license header missing from multiple script and required files
Could not find Apache license headers in the following files:
 !? <>spark/conf/spark-defaults.conf
[error] running <>spark/dev/check-license ; received return code 1
{code}

  was:
Check License should not verify conf files for license
{code}
Apache license header missing from multiple script and required files
Could not find Apache license headers in the following files:
 !? <>spark/conf/spark-defaults.conf
[error] running <>spark/dev/check-license ; received return code 1
{code}


> Update License on conf files and corresponding excludes file
> 
>
> Key: SPARK-10718
> URL: https://issues.apache.org/jira/browse/SPARK-10718
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Rekha Joshi
>Priority: Minor
>
> Update the license header on conf files and the corresponding excludes file.
> {code}
> Apache license header missing from multiple script and required files
> Could not find Apache license headers in the following files:
>  !? <>spark/conf/spark-defaults.conf
> [error] running <>spark/dev/check-license ; received return code 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10718) Update License on conf files and corresponding excludes file

2015-09-21 Thread Rekha Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rekha Joshi updated SPARK-10718:

Summary: Update License on conf files and corresponding excludes file  
(was: Check License should not verify conf files for license)

> Update License on conf files and corresponding excludes file
> 
>
> Key: SPARK-10718
> URL: https://issues.apache.org/jira/browse/SPARK-10718
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Rekha Joshi
>Priority: Minor
>
> Check License should not verify conf files for license
> {code}
> Apache license header missing from multiple script and required files
> Could not find Apache license headers in the following files:
>  !? <>spark/conf/spark-defaults.conf
> [error] running <>spark/dev/check-license ; received return code 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10737:


Assignee: Yin Huai  (was: Apache Spark)

> When using UnsafeRows, SortMergeJoin may return wrong results
> -
>
> Key: SPARK-10737
> URL: https://issues.apache.org/jira/browse/SPARK-10737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>
> {code}
> val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j")
> val df2 =
>   df1
>   .join(df1.select(df1("i")), "i")
>   .select(df1("i"), df1("j"))
> val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1")
> val df4 =
>   df2
>   .join(df3, df2("i") === df3("i1"))
>   .withColumn("diff", $"j" - $"j1")
> df4.show(100, false)
> +------+---+------+---+----+
> |i     |j  |i1    |j1 |diff|
> +------+---+------+---+----+
> |str_2 |2  |str_2 |2  |0   |
> |str_7 |7  |str_2 |2  |5   |
> |str_10|10 |str_10|10 |0   |
> |str_3 |3  |str_3 |3  |0   |
> |str_8 |8  |str_3 |3  |5   |
> |str_4 |4  |str_4 |4  |0   |
> |str_9 |9  |str_4 |4  |5   |
> |str_5 |5  |str_5 |5  |0   |
> |str_1 |1  |str_1 |1  |0   |
> |str_6 |6  |str_1 |1  |5   |
> +------+---+------+---+----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901499#comment-14901499
 ] 

Apache Spark commented on SPARK-10737:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/8854

> When using UnsafeRows, SortMergeJoin may return wrong results
> -
>
> Key: SPARK-10737
> URL: https://issues.apache.org/jira/browse/SPARK-10737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>
> {code}
> val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j")
> val df2 =
>   df1
>   .join(df1.select(df1("i")), "i")
>   .select(df1("i"), df1("j"))
> val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1")
> val df4 =
>   df2
>   .join(df3, df2("i") === df3("i1"))
>   .withColumn("diff", $"j" - $"j1")
> df4.show(100, false)
> +------+---+------+---+----+
> |i     |j  |i1    |j1 |diff|
> +------+---+------+---+----+
> |str_2 |2  |str_2 |2  |0   |
> |str_7 |7  |str_2 |2  |5   |
> |str_10|10 |str_10|10 |0   |
> |str_3 |3  |str_3 |3  |0   |
> |str_8 |8  |str_3 |3  |5   |
> |str_4 |4  |str_4 |4  |0   |
> |str_9 |9  |str_4 |4  |5   |
> |str_5 |5  |str_5 |5  |0   |
> |str_1 |1  |str_1 |1  |0   |
> |str_6 |6  |str_1 |1  |5   |
> +------+---+------+---+----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10737:


Assignee: Apache Spark  (was: Yin Huai)

> When using UnsafeRows, SortMergeJoin may return wrong results
> -
>
> Key: SPARK-10737
> URL: https://issues.apache.org/jira/browse/SPARK-10737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Blocker
>
> {code}
> val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j")
> val df2 =
>   df1
>   .join(df1.select(df1("i")), "i")
>   .select(df1("i"), df1("j"))
> val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1")
> val df4 =
>   df2
>   .join(df3, df2("i") === df3("i1"))
>   .withColumn("diff", $"j" - $"j1")
> df4.show(100, false)
> +------+---+------+---+----+
> |i     |j  |i1    |j1 |diff|
> +------+---+------+---+----+
> |str_2 |2  |str_2 |2  |0   |
> |str_7 |7  |str_2 |2  |5   |
> |str_10|10 |str_10|10 |0   |
> |str_3 |3  |str_3 |3  |0   |
> |str_8 |8  |str_3 |3  |5   |
> |str_4 |4  |str_4 |4  |0   |
> |str_9 |9  |str_4 |4  |5   |
> |str_5 |5  |str_5 |5  |0   |
> |str_1 |1  |str_1 |1  |0   |
> |str_6 |6  |str_1 |1  |5   |
> +------+---+------+---+----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results

2015-09-21 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10737:
-
Summary: When using UnsafeRows, SortMergeJoin may return wrong results  
(was: When using UnsafeRow, SortMergeJoin may return wrong results)

> When using UnsafeRows, SortMergeJoin may return wrong results
> -
>
> Key: SPARK-10737
> URL: https://issues.apache.org/jira/browse/SPARK-10737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>
> {code}
> val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j")
> val df2 =
>   df1
>   .join(df1.select(df1("i")), "i")
>   .select(df1("i"), df1("j"))
> val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1")
> val df4 =
>   df2
>   .join(df3, df2("i") === df3("i1"))
>   .withColumn("diff", $"j" - $"j1")
> df4.show(100, false)
> +------+---+------+---+----+
> |i     |j  |i1    |j1 |diff|
> +------+---+------+---+----+
> |str_2 |2  |str_2 |2  |0   |
> |str_7 |7  |str_2 |2  |5   |
> |str_10|10 |str_10|10 |0   |
> |str_3 |3  |str_3 |3  |0   |
> |str_8 |8  |str_3 |3  |5   |
> |str_4 |4  |str_4 |4  |0   |
> |str_9 |9  |str_4 |4  |5   |
> |str_5 |5  |str_5 |5  |0   |
> |str_1 |1  |str_1 |1  |0   |
> |str_6 |6  |str_1 |1  |5   |
> +------+---+------+---+----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10738) Refactoring `Instance` out from LOR and LIR, and also cleaning up some code

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10738:


Assignee: DB Tsai  (was: Apache Spark)

> Refactoring `Instance` out from LOR and LIR, and also cleaning up some code
> ---
>
> Key: SPARK-10738
> URL: https://issues.apache.org/jira/browse/SPARK-10738
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: DB Tsai
>Assignee: DB Tsai
> Fix For: 1.6.0
>
>
> Refactoring `Instance` case class out from LOR and LIR, and also cleaning up 
> some code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10738) Refactoring `Instance` out from LOR and LIR, and also cleaning up some code

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10738:


Assignee: Apache Spark  (was: DB Tsai)

> Refactoring `Instance` out from LOR and LIR, and also cleaning up some code
> ---
>
> Key: SPARK-10738
> URL: https://issues.apache.org/jira/browse/SPARK-10738
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: DB Tsai
>Assignee: Apache Spark
> Fix For: 1.6.0
>
>
> Refactoring `Instance` case class out from LOR and LIR, and also cleaning up 
> some code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10738) Refactoring `Instance` out from LOR and LIR, and also cleaning up some code

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901479#comment-14901479
 ] 

Apache Spark commented on SPARK-10738:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/8853

> Refactoring `Instance` out from LOR and LIR, and also cleaning up some code
> ---
>
> Key: SPARK-10738
> URL: https://issues.apache.org/jira/browse/SPARK-10738
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: DB Tsai
>Assignee: DB Tsai
> Fix For: 1.6.0
>
>
> Refactoring `Instance` case class out from LOR and LIR, and also cleaning up 
> some code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10737) When using UnsafeRow, SortMergeJoin may return wrong results

2015-09-21 Thread Yin Huai (JIRA)
Yin Huai created SPARK-10737:


 Summary: When using UnsafeRow, SortMergeJoin may return wrong 
results
 Key: SPARK-10737
 URL: https://issues.apache.org/jira/browse/SPARK-10737
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker


{code}
val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j")
val df2 =
  df1
  .join(df1.select(df1("i")), "i")
  .select(df1("i"), df1("j"))

val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1")
val df4 =
  df2
  .join(df3, df2("i") === df3("i1"))
  .withColumn("diff", $"j" - $"j1")

df4.show(100, false)

+------+---+------+---+----+
|i     |j  |i1    |j1 |diff|
+------+---+------+---+----+
|str_2 |2  |str_2 |2  |0   |
|str_7 |7  |str_2 |2  |5   |
|str_10|10 |str_10|10 |0   |
|str_3 |3  |str_3 |3  |0   |
|str_8 |8  |str_3 |3  |5   |
|str_4 |4  |str_4 |4  |0   |
|str_9 |9  |str_4 |4  |5   |
|str_5 |5  |str_5 |5  |0   |
|str_1 |1  |str_1 |1  |0   |
|str_6 |6  |str_1 |1  |5   |
+------+---+------+---+----+

{code}
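A hedged way to cross-check the result while this is being investigated: {{spark.sql.tungsten.enabled}} is, to the best of my knowledge, the 1.5.x switch for the unsafe-row code path, and with correct results {{i}} equals {{i1}} on every output row, so {{diff}} should be 0 everywhere. Treat this as a diagnostic, not a recommended workaround.

{code}
sqlContext.setConf("spark.sql.tungsten.enabled", "false")  // disable the unsafe path
df4.show(100, false)
df4.filter($"diff" !== 0).count()   // expected: 0 when the join is correct
{code}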



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10738) Refactoring `Instance` out from LOR and LIR, and also cleaning up some code

2015-09-21 Thread DB Tsai (JIRA)
DB Tsai created SPARK-10738:
---

 Summary: Refactoring `Instance` out from LOR and LIR, and also 
cleaning up some code
 Key: SPARK-10738
 URL: https://issues.apache.org/jira/browse/SPARK-10738
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.0
Reporter: DB Tsai
Assignee: DB Tsai
 Fix For: 1.6.0


Refactoring `Instance` case class out from LOR and LIR, and also cleaning up 
some code.
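For context, roughly what the shared case class looks like once it is pulled out of the two estimators; this is a sketch of the intent, not necessarily the exact code in the patch:

{code}
import org.apache.spark.mllib.linalg.Vector

// One weighted training point, shared by LogisticRegression (LOR) and
// LinearRegression (LIR) instead of each algorithm keeping a private copy.
// In Spark itself this would live in the ml package (and likely be private[ml]).
case class Instance(label: Double, weight: Double, features: Vector)
{code}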



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10662) Code snippets are not properly formatted in tables

2015-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10662:
--
Assignee: Jacek Laskowski  (was: Jacek Lewandowski)

Oops, auto-complete error there

> Code snippets are not properly formatted in tables
> --
>
> Key: SPARK-10662
> URL: https://issues.apache.org/jira/browse/SPARK-10662
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Jacek Laskowski
>Assignee: Jacek Laskowski
>Priority: Trivial
> Fix For: 1.6.0
>
> Attachments: spark-docs-backticks-tables.png
>
>
> Backticks (markdown) in tables are not processed and hence not formatted 
> properly. See 
> http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html
>  and search for {{`yarn-client`}}.
> As per [Sean's 
> suggestion|https://github.com/apache/spark/pull/8795#issuecomment-141019047] 
> I'm creating the JIRA task.
> {quote}
> This is a good fix, but this is another instance where I suspect the same 
> issue exists in several markup files, like configuration.html. It's worth a 
> JIRA since I think catching and fixing all of these is one non-trivial 
> logical change.
> If you can, avoid whitespace changes like stripping or adding space at the 
> end of lines. It just adds to the diff and makes for a tiny extra chance of 
> merge conflicts.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10662) Code snippets are not properly formatted in tables

2015-09-21 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901460#comment-14901460
 ] 

Jacek Laskowski edited comment on SPARK-10662 at 9/21/15 9:39 PM:
--

It's [~jlaskowski] not [~jlewandowski] in Assignee. We've got very similar 
names, but he's much better at Spark, Cassandra, and all the things (a 
DataStax'er, after all). Mind fixing it, [~sowen]? Thanks!

Many thanks for merging the changes to master!


was (Author: jlaskowski):
It's [~jlaskowski] not [~jlewandowski] in Assignee. We've got very similar 
names, but he's much better at Spark, Cassandra, and all the things (a 
DataStax'er, after all). Mind fixing it, [~sowen]? Thanks!

> Code snippets are not properly formatted in tables
> --
>
> Key: SPARK-10662
> URL: https://issues.apache.org/jira/browse/SPARK-10662
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Jacek Laskowski
>Assignee: Jacek Lewandowski
>Priority: Trivial
> Fix For: 1.6.0
>
> Attachments: spark-docs-backticks-tables.png
>
>
> Backticks (markdown) in tables are not processed and hence not formatted 
> properly. See 
> http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html
>  and search for {{`yarn-client`}}.
> As per [Sean's 
> suggestion|https://github.com/apache/spark/pull/8795#issuecomment-141019047] 
> I'm creating the JIRA task.
> {quote}
> This is a good fix, but this is another instance where I suspect the same 
> issue exists in several markup files, like configuration.html. It's worth a 
> JIRA since I think catching and fixing all of these is one non-trivial 
> logical change.
> If you can, avoid whitespace changes like stripping or adding space at the 
> end of lines. It just adds to the diff and makes for a tiny extra chance of 
> merge conflicts.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10662) Code snippets are not properly formatted in tables

2015-09-21 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901460#comment-14901460
 ] 

Jacek Laskowski commented on SPARK-10662:
-

It's [~jlaskowski] not [~jlewandowski] in Assignee. We've got very similar 
names, but he's much better at Spark, Cassandra, and all the things (a 
DataStax'er, after all). Mind fixing it, [~sowen]? Thanks!

> Code snippets are not properly formatted in tables
> --
>
> Key: SPARK-10662
> URL: https://issues.apache.org/jira/browse/SPARK-10662
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Jacek Laskowski
>Assignee: Jacek Lewandowski
>Priority: Trivial
> Fix For: 1.6.0
>
> Attachments: spark-docs-backticks-tables.png
>
>
> Backticks (markdown) in tables are not processed and hence not formatted 
> properly. See 
> http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html
>  and search for {{`yarn-client`}}.
> As per [Sean's 
> suggestion|https://github.com/apache/spark/pull/8795#issuecomment-141019047] 
> I'm creating the JIRA task.
> {quote}
> This is a good fix, but this is another instance where I suspect the same 
> issue exists in several markup files, like configuration.html. It's worth a 
> JIRA since I think catching and fixing all of these is one non-trivial 
> logical change.
> If you can, avoid whitespace changes like stripping or adding space at the 
> end of lines. It just adds to the diff and makes for a tiny extra chance of 
> merge conflicts.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-21 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901440#comment-14901440
 ] 

Andrew Or commented on SPARK-10474:
---

[~Yi Zhou] [~jrk] did you set `spark.task.cpus` by any chance?
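For readers unfamiliar with the setting being asked about: {{spark.task.cpus}} reserves that many cores per task, so fewer tasks run concurrently on each executor. Whether such a setting was active when the 65536-byte acquisition failed is exactly what the question is trying to establish. Illustrative values only:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.task.cpus", "2")  // default is "1"
// It can be read back at runtime via sc.getConf.get("spark.task.cpus", "1")
{code}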

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-

[jira] [Comment Edited] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-21 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901440#comment-14901440
 ] 

Andrew Or edited comment on SPARK-10474 at 9/21/15 9:23 PM:


[~jameszhouyi] [~jrk] did you set `spark.task.cpus` by any chance?


was (Author: andrewor14):
[~Yi Zhou] [~jrk] did you set `spark.task.cpus` by any chance?

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.

[jira] [Updated] (SPARK-10704) Rename HashShufflereader to BlockStoreShuffleReader

2015-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10704:
---
Description: 
The current shuffle code has an interface named ShuffleReader with only one 
implementation, HashShuffleReader. This naming is confusing, since the same 
read path code is used for both sort- and hash-based shuffle. -We should 
consolidate these classes.- We should rename HashShuffleReader.

--In addition, there are aspects of ShuffleManager.getReader()'s API which 
don't make a lot of sense: it exposes the ability to request a contiguous range 
of shuffle partitions, but this feature isn't supported by any ShuffleReader 
implementations and isn't used anywhere in the existing code. We should clean 
this up, too.--

  was:
The current shuffle code has an interface named ShuffleReader with only one 
implementation, HashShuffleReader. This naming is confusing, since the same 
read path code is used for both sort- and hash-based shuffle. -We should 
consolidate these classes.-

In addition, there are aspects of ShuffleManager.getReader()'s API which don't 
make a lot of sense: it exposes the ability to request a contiguous range of 
shuffle partitions, but this feature isn't supported by any ShuffleReader 
implementations and isn't used anywhere in the existing code. We should clean 
this up, too.


> Rename HashShuffleReader to BlockStoreShuffleReader
> ---
>
> Key: SPARK-10704
> URL: https://issues.apache.org/jira/browse/SPARK-10704
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The current shuffle code has an interface named ShuffleReader with only one 
> implementation, HashShuffleReader. This naming is confusing, since the same 
> read path code is used for both sort- and hash-based shuffle. -We should 
> consolidate these classes.- We should rename HashShuffleReader.
> --In addition, there are aspects of ShuffleManager.getReader()'s API which 
> don't make a lot of sense: it exposes the ability to request a contiguous 
> range of shuffle partitions, but this feature isn't supported by any 
> ShuffleReader implementations and isn't used anywhere in the existing code. 
> We should clean this up, too.--



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10704) Rename HashShuffleReader to BlockStoreShuffleReader

2015-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10704:
---
Description: 
The current shuffle code has an interface named ShuffleReader with only one 
implementation, HashShuffleReader. This naming is confusing, since the same 
read path code is used for both sort- and hash-based shuffle. ~~We should 
consolidate these classes.~~

In addition, there are aspects of ShuffleManager.getReader()'s API which don't 
make a lot of sense: it exposes the ability to request a contiguous range of 
shuffle partitions, but this feature isn't supported by any ShuffleReader 
implementations and isn't used anywhere in the existing code. We should clean 
this up, too.

  was:
The current shuffle code has an interface named ShuffleReader with only one 
implementation, HashShuffleReader. This naming is confusing, since the same 
read path code is used for both sort- and hash-based shuffle. We 
should consolidate these classes.

In addition, there are aspects of ShuffleManager.getReader()'s API which don't 
make a lot of sense: it exposes the ability to request a contiguous range of 
shuffle partitions, but this feature isn't supported by any ShuffleReader 
implementations and isn't used anywhere in the existing code. We should clean 
this up, too.


> Rename HashShuffleReader to BlockStoreShuffleReader
> ---
>
> Key: SPARK-10704
> URL: https://issues.apache.org/jira/browse/SPARK-10704
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The current shuffle code has an interface named ShuffleReader with only one 
> implementation, HashShuffleReader. This naming is confusing, since the same 
> read path code is used for both sort- and hash-based shuffle. ~~We should 
> consolidate these classes.~~
> In addition, there are aspects of ShuffleManager.getReader()'s API which 
> don't make a lot of sense: it exposes the ability to request a contiguous 
> range of shuffle partitions, but this feature isn't supported by any 
> ShuffleReader implementations and isn't used anywhere in the existing code. 
> We should clean this up, too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10704) Rename HashShuffleReader to BlockStoreShuffleReader

2015-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10704:
---
Description: 
The current shuffle code has an interface named ShuffleReader with only one 
implementation, HashShuffleReader. This naming is confusing, since the same 
read path code is used for both sort- and hash-based shuffle. -We should 
consolidate these classes.-

In addition, there are aspects of ShuffleManager.getReader()'s API which don't 
make a lot of sense: it exposes the ability to request a contiguous range of 
shuffle partitions, but this feature isn't supported by any ShuffleReader 
implementations and isn't used anywhere in the existing code. We should clean 
this up, too.

  was:
The current shuffle code has an interface named ShuffleReader with only one 
implementation, HashShuffleReader. This naming is confusing, since the same 
read path code is used for both sort- and hash-based shuffle. ~~We should 
consolidate these classes.~~

In addition, there are aspects of ShuffleManager.getReader()'s API which don't 
make a lot of sense: it exposes the ability to request a contiguous range of 
shuffle partitions, but this feature isn't supported by any ShuffleReader 
implementations and isn't used anywhere in the existing code. We should clean 
this up, too.


> Rename HashShuffleReader to BlockStoreShuffleReader
> ---
>
> Key: SPARK-10704
> URL: https://issues.apache.org/jira/browse/SPARK-10704
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The current shuffle code has an interface named ShuffleReader with only one 
> implementation, HashShuffleReader. This naming is confusing, since the same 
> read path code is used for both sort- and hash-based shuffle. -We should 
> consolidate these classes.-
> In addition, there are aspects of ShuffleManager.getReader()'s API which 
> don't make a lot of sense: it exposes the ability to request a contiguous 
> range of shuffle partitions, but this feature isn't supported by any 
> ShuffleReader implementations and isn't used anywhere in the existing code. 
> We should clean this up, too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10704) Rename HashShuffleReader to BlockStoreShuffleReader

2015-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10704:
---
Description: 
The current shuffle code has an interface named ShuffleReader with only one 
implementation, HashShuffleReader. This naming is confusing, since the same 
read path code is used for both sort- and hash-based shuffle. We 
should consolidate these classes.

In addition, there are aspects of ShuffleManager.getReader()'s API which don't 
make a lot of sense: it exposes the ability to request a contiguous range of 
shuffle partitions, but this feature isn't supported by any ShuffleReader 
implementations and isn't used anywhere in the existing code. We should clean 
this up, too.

  was:
The current shuffle code has an interface named ShuffleReader with only one 
implementation, HashShuffleReader. This naming is confusing, since the same 
read path code is used for both sort- and hash-based shuffle. We should 
consolidate these classes.

In addition, there are aspects of ShuffleManager.getReader()'s API which don't 
make a lot of sense: it exposes the ability to request a contiguous range of 
shuffle partitions, but this feature isn't supported by any ShuffleReader 
implementations and isn't used anywhere in the existing code. We should clean 
this up, too.


> Rename HashShuffleReader to BlockStoreShuffleReader
> ---
>
> Key: SPARK-10704
> URL: https://issues.apache.org/jira/browse/SPARK-10704
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The current shuffle code has an interface named ShuffleReader with only one 
> implementation, HashShuffleReader. This naming is confusing, since the same 
> read path code is used for both sort- and hash-based shuffle. 
> We should consolidate these classes.
> In addition, there are aspects of ShuffleManager.getReader()'s API which 
> don't make a lot of sense: it exposes the ability to request a contiguous 
> range of shuffle partitions, but this feature isn't supported by any 
> ShuffleReader implementations and isn't used anywhere in the existing code. 
> We should clean this up, too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10704) Rename HashShuffleReader to BlockStoreShuffleReader

2015-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10704:
---
Summary: Rename HashShuffleReader to BlockStoreShuffleReader  (was: 
Consolidate HashShuffleReader and ShuffleReader and refactor 
ShuffleManager.getReader())

> Rename HashShuffleReader to BlockStoreShuffleReader
> ---
>
> Key: SPARK-10704
> URL: https://issues.apache.org/jira/browse/SPARK-10704
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The current shuffle code has an interface named ShuffleReader with only one 
> implementation, HashShuffleReader. This naming is confusing, since the same 
> read path code is used for both sort- and hash-based shuffle. We should 
> consolidate these classes.
> In addition, there are aspects of ShuffleManager.getReader()'s API which 
> don't make a lot of sense: it exposes the ability to request a contiguous 
> range of shuffle partitions, but this feature isn't supported by any 
> ShuffleReader implementations and isn't used anywhere in the existing code. 
> We should clean this up, too.
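For context, a self-contained paraphrase of the reader API under discussion (a sketch 
from memory of the 1.5-era interfaces, not the actual Spark source; the "Sketch" names 
are made up here):

{code}
// The point to notice: getReader() exposes a contiguous (startPartition,
// endPartition) range, even though every call site only ever requests a
// single partition.
trait SketchShuffleReader[K, C] {
  def read(): Iterator[Product2[K, C]]
}

trait SketchShuffleManager {
  def getReader[K, C](
      handle: AnyRef,   // stands in for ShuffleHandle
      startPartition: Int,
      endPartition: Int,
      context: AnyRef   // stands in for TaskContext
  ): SketchShuffleReader[K, C]
}
{code}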



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10736) Use 1 for all ratings if $(ratingCol) = ""

2015-09-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10736:
-

 Summary: Use 1 for all ratings if $(ratingCol) = ""
 Key: SPARK-10736
 URL: https://issues.apache.org/jira/browse/SPARK-10736
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.6.0
Reporter: Xiangrui Meng
Priority: Minor


For some implicit datasets, ratings may not exist in the training data. In this 
case, we can assume all observed pairs are positive and treat their ratings 
as 1. This should happen when users set ratingCol to an empty string.
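As a rough illustration only (a sketch of the proposed behaviour, not the eventual 
implementation; the DataFrame, the column names, and the float cast are assumptions):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Sketch: when ratingCol is "", substitute a constant rating of 1 for every
// observed (user, item) pair instead of reading a rating column.
def selectRatings(df: DataFrame, ratingCol: String): DataFrame = {
  val rating = if (ratingCol.isEmpty) lit(1.0f) else col(ratingCol).cast("float")
  df.select(col("user"), col("item"), rating.as("rating"))
}
{code}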



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-09-21 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901435#comment-14901435
 ] 

Cody Koeninger commented on SPARK-5569:
---

Yeah, spark-streaming can be marked as provided.

Try to get a small reproducible case that demonstrates the issue.
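A minimal sbt sketch of that layout, with illustrative versions (this assumes an 
assembly plugin builds the uber jar): spark-streaming is marked provided so it comes 
from the cluster, while spark-streaming-kafka is bundled with the application.

{code}
// build.sbt (sketch): only spark-streaming-kafka ends up in the assembly jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1"
)
{code}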




> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so I wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040)
> at 
> org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at 
> org.apache.spark.streaming.CheckpointReader$$anonfun$read$2.apply(Checkpoint.scala:251)
> at 
> org.apache.spark.streaming.CheckpointReader$$anonfun$read$2.apply(Checkpoint.scala:239)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

[jira] [Commented] (SPARK-10485) IF expression is not correctly resolved when one of the options have NullType

2015-09-21 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901394#comment-14901394
 ] 

Wenchen Fan commented on SPARK-10485:
-

Hi [~ajnavarro], can you reproduce it in 1.5? We do have a data type check for IF, 
but we also have type coercion to cast the null type if needed.

> IF expression is not correctly resolved when one of the options have NullType
> -
>
> Key: SPARK-10485
> URL: https://issues.apache.org/jira/browse/SPARK-10485
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Antonio Jesus Navarro
>
> If we have this query:
> {code}
> SELECT IF(column > 1, 1, NULL) FROM T1
> {code}
> On Spark 1.4.1 we have this:
> {code}
> override lazy val resolved = childrenResolved && trueValue.dataType == 
> falseValue.dataType
> {code}
> So if one of the types is NullType, the if expression is not resolved.
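A minimal reproduction sketch along those lines (assuming a SQLContext named 
sqlContext; the table and column names are stand-ins mirroring the reported query):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Build a tiny table T1 with one integer column and run the reported query shape.
val rows = sqlContext.sparkContext.parallelize(Seq(Row(0), Row(2)))
val schema = StructType(Seq(StructField("x", IntegerType)))
sqlContext.createDataFrame(rows, schema).registerTempTable("T1")

// On 1.4.1 this reportedly fails to resolve because the NULL branch has NullType;
// with type coercion the NULL branch should be cast to match the other branch.
sqlContext.sql("SELECT IF(x > 1, 1, NULL) FROM T1").show()
{code}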



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10568) Error thrown in stopping one component in SparkContext.stop() doesn't allow other components to be stopped

2015-09-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901396#comment-14901396
 ] 

Sean Owen commented on SPARK-10568:
---

Sure, if you have suggestions along those lines. I had imagined it was about 
calling some code to stop each component in a way that exceptions from one 
would not prevent the rest from executing.
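A minimal sketch of that idea (not the actual SparkContext.stop() code; the component 
names in the commented usage are hypothetical):

{code}
import scala.util.control.NonFatal

// Give every component its own try/catch so a failure in one stop() call
// cannot prevent the remaining components from being stopped.
def stopAll(components: Seq[(String, () => Unit)]): Unit = {
  components.foreach { case (name, stop) =>
    try {
      stop()
    } catch {
      case NonFatal(e) =>
        System.err.println(s"Exception while stopping $name: $e")
    }
  }
}

// Hypothetical usage:
// stopAll(Seq(
//   "DiskBlockManager" -> (() => diskBlockManager.stop()),
//   "SchedulerBackend" -> (() => schedulerBackend.stop())
// ))
{code}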

> Error thrown in stopping one component in SparkContext.stop() doesn't allow 
> other components to be stopped
> --
>
> Key: SPARK-10568
> URL: https://issues.apache.org/jira/browse/SPARK-10568
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Matt Cheah
>Priority: Minor
>
> When I shut down a Java process that is running a SparkContext, it invokes a 
> shutdown hook that eventually calls SparkContext.stop(), and inside 
> SparkContext.stop() each individual component (DiskBlockManager, Scheduler 
> Backend) is stopped. If an exception is thrown in stopping one of these 
> components, none of the other components will be stopped cleanly either. This 
> caused problems when I stopped a Java process running a Spark context in 
> yarn-client mode, because YarnSchedulerBackend was never stopped properly.
> The steps I ran are as follows:
> 1. Create one job which fills the cluster
> 2. Kick off another job which creates a Spark Context
> 3. Kill the Java process with the Spark Context in #2
> 4. The job remains in the YARN UI as ACCEPTED
> Looking in the logs we see the following:
> {code}
> 2015-09-07 10:32:43,446 ERROR [Thread-3] o.a.s.u.Utils - Uncaught exception 
> in thread Thread-3
> java.lang.NullPointerException: null
> at 
> org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:162)
>  ~[spark-core_2.10-1.4.1.jar:1.4.1]
> at 
> org.apache.spark.storage.DiskBlockManager$$anonfun$addShutdownHook$1.apply$mcV$sp(DiskBlockManager.scala:144)
>  ~[spark-core_2.10-1.4.1.jar:1.4.1]
> at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2308) 
> ~[spark-core_2.10-1.4.1.jar:1.4.1]
> {code}
> I think what's going on is that when we kill the application in the queued 
> state, it tries to run the SparkContext.stop() method on the driver and stop 
> each component. It dies trying to stop the DiskBlockManager because it hasn't 
> been initialized yet - the application is still waiting to be scheduled by 
> the YARN RM - but YarnClient.stop() is not invoked as a result, leaving the 
> application sticking around in the ACCEPTED state.
> Because of what appears to be bugs in the YARN scheduler, entering this state 
> makes it so that the YARN scheduler is unable to schedule any more jobs 
> unless we manually remove this application via the YARN CLI. We can tackle 
> the YARN stuck state separately, but ensuring that all components get at 
> least some chance to stop when a SparkContext stops seems like a good idea. 
> Of course we can still throw some exception and/or log exceptions for 
> everything that goes wrong at the end of stopping the context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-09-21 Thread Jon Buffington (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901387#comment-14901387
 ] 

Jon Buffington commented on SPARK-5569:
---

The only substantive difference in the build file you reference is that 
spark-streaming is included in the uber jar (not marked as excluded). We are 
under the impression that the spark-streaming dependency is provided, while we 
need to package spark-streaming-kafka. Are we mistaken?

> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so I wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040)
> at 
> org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at 
> org.apache.spark.streaming.CheckpointReader$$anonfun$read$2.apply(Checkpoint.scala:251)
> at 
> org.apache.spark.streaming.CheckpointReader$$anonfun$read$2.apply(Checkpoin

[jira] [Comment Edited] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-09-21 Thread Jon Buffington (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901387#comment-14901387
 ] 

Jon Buffington edited comment on SPARK-5569 at 9/21/15 8:57 PM:


The only substantive difference in the build file you reference is that 
spark-streaming is included in the uber jar (not marked as provided). We are 
under the impression that the spark-streaming dependency is provided, while we 
need to package spark-streaming-kafka. Are we mistaken?


was (Author: jon_fuseelements):
The only substantive difference in the build file you reference is that 
spark-streaming is included in the uber jar (not marked as excluded). We are 
under the impression that the spark-streaming dependency is provided, while we 
need to package spark-streaming-kafka. Are we mistaken?

> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so I wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040)
> at 
> org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStre
