[jira] [Resolved] (SPARK-10446) Support to specify join type when calling join with usingColumns
[ https://issues.apache.org/jira/browse/SPARK-10446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-10446. - Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 1.6.0 > Support to specify join type when calling join with usingColumns > > > Key: SPARK-10446 > URL: https://issues.apache.org/jira/browse/SPARK-10446 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 1.6.0 > > > Currently the method join(right: DataFrame, usingColumns: Seq[String]) only > supports inner join. It is more convenient to have it support other join > types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
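The convenience this ticket adds is easiest to see on plain data. Below is a pure-Python sketch of what joining on shared "using" columns means under "inner" versus "left_outer" semantics; the helper `join_using` is hypothetical and is not Spark code.

```python
# Hypothetical pure-Python sketch of a "using columns" join with a join
# type, illustrating the semantics SPARK-10446 adds to DataFrame.join.

def join_using(left, right, using, how="inner"):
    """Join two lists of dicts on the shared `using` columns.

    how="inner" keeps only matching rows; how="left_outer" keeps every
    left row, filling right-side columns with None when there is no match.
    """
    def key(row):
        return tuple(row[c] for c in using)

    index = {}
    for row in right:
        index.setdefault(key(row), []).append(row)

    right_cols = [c for c in (right[0] if right else {}) if c not in using]
    out = []
    for row in left:
        matches = index.get(key(row), [])
        if matches:
            for m in matches:
                merged = dict(row)
                merged.update({c: m[c] for c in right_cols})
                out.append(merged)
        elif how == "left_outer":
            merged = dict(row)
            merged.update({c: None for c in right_cols})
            out.append(merged)
    return out

left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 1, "b": "p"}]
inner = join_using(left, right, ["id"])                    # 1 row
outer = join_using(left, right, ["id"], how="left_outer")  # 2 rows
```

After this change, the Scala call should look like `df1.join(df2, Seq("id"), "left_outer")`; before 1.6 the usingColumns form always produced an inner join.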
[jira] [Resolved] (SPARK-10419) Add JDBC dialect for Microsoft SQL Server
[ https://issues.apache.org/jira/browse/SPARK-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-10419. - Resolution: Fixed Assignee: Ewan Leith Fix Version/s: 1.6.0 > Add JDBC dialect for Microsoft SQL Server > - > > Key: SPARK-10419 > URL: https://issues.apache.org/jira/browse/SPARK-10419 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Ewan Leith >Assignee: Ewan Leith >Priority: Minor > Fix For: 1.6.0 > > > When running JDBC connections against Microsoft SQL Server database tables, if a > table contains a datetimeoffset column type, the following error is received: > {code} > sqlContext.read.jdbc("jdbc:sqlserver://127.0.0.1:1433;DatabaseName=testdb", > "sampletable", prop) > java.sql.SQLException: Unsupported type -155 > at > org.apache.spark.sql.jdbc.JDBCRDD$.org$apache$spark$sql$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:100) > at > org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:137) > at > org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:137) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:136) > at > org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:128) > at > org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200) > at > org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130) > {code} > Based on the JdbcDialect code for DB2 and the Microsoft SQL Server > documentation, we should probably treat datetimeoffset types as Strings: > https://technet.microsoft.com/en-us/library/bb630289%28v=sql.105%29.aspx > We've created a small addition to JdbcDialects.scala to do this conversion; > I'll create a pull request for it.
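The reported failure comes from getCatalystType having no mapping for vendor type code -155. The fix follows Spark's JdbcDialect pattern: a per-vendor dialect gets first chance to map a JDBC type code before the generic mapping gives up. Here is a pure-Python sketch of that fallback pattern; the class names, method names, and the tiny type-code table are illustrative, not Spark's actual code.

```python
# Illustrative sketch of the JdbcDialect fallback pattern: a vendor
# dialect may override the type mapping, and only codes unmapped by
# both the dialect and the base table raise "Unsupported type".

DATETIMEOFFSET = -155  # SQL Server's vendor-specific java.sql type code

# A few java.sql.Types codes, purely for illustration.
BASE_TYPES = {4: "IntegerType", 12: "StringType", 93: "TimestampType"}

class MsSqlServerDialect:
    def can_handle(self, url):
        return url.startswith("jdbc:sqlserver")

    def get_catalyst_type(self, sql_type):
        # Per the MS docs linked above, datetimeoffset is best
        # surfaced as a string.
        if sql_type == DATETIMEOFFSET:
            return "StringType"
        return None  # defer to the generic mapping

def resolve_type(url, sql_type, dialects):
    for d in dialects:
        if d.can_handle(url):
            mapped = d.get_catalyst_type(sql_type)
            if mapped is not None:
                return mapped
    if sql_type in BASE_TYPES:
        return BASE_TYPES[sql_type]
    raise ValueError("Unsupported type %d" % sql_type)

dialects = [MsSqlServerDialect()]
t = resolve_type("jdbc:sqlserver://127.0.0.1:1433", DATETIMEOFFSET, dialects)
# t == "StringType"; without the dialect this code path would raise
```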
[jira] [Commented] (SPARK-10577) [PySpark] DataFrame hint for broadcast join
[ https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902049#comment-14902049 ] Reynold Xin commented on SPARK-10577: - [~maver1ck] the patch is now merged - you can just create a broadcast function yourself similar to the ones here in order to use it in Spark 1.5: https://github.com/apache/spark/pull/8801/files > [PySpark] DataFrame hint for broadcast join > --- > > Key: SPARK-10577 > URL: https://issues.apache.org/jira/browse/SPARK-10577 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.0 >Reporter: Maciej Bryński > Labels: starter > Fix For: 1.6.0 > > > As in https://issues.apache.org/jira/browse/SPARK-8300, > there should be a way to add a hint for broadcast join in: > - PySpark
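For readers on 1.5 who want the stopgap Reynold describes, here is a sketch of such a user-space broadcast function, modeled loosely on the linked patch. It reaches into PySpark's private attributes (`_sc`, `_jvm`, `_jdf`), so treat it as a temporary workaround rather than a stable API; `type(df)` is used instead of importing `DataFrame` only to keep the sketch self-contained.

```python
# Stopgap broadcast hint for PySpark 1.5, in the spirit of the patch
# linked above: delegate to the JVM-side functions.broadcast via py4j
# and rewrap the result. Relies on private attributes - not a stable API.

def broadcast(df):
    # df is a pyspark.sql.DataFrame; type(df) rebuilds one from the
    # hinted JVM DataFrame without importing pyspark here.
    jdf = df._sc._jvm.functions.broadcast(df._jdf)
    return type(df)(jdf, df.sql_ctx)

# usage: big_df.join(broadcast(small_df), "key")
```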
[jira] [Resolved] (SPARK-10577) [PySpark] DataFrame hint for broadcast join
[ https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-10577. - Resolution: Fixed Fix Version/s: 1.6.0 > [PySpark] DataFrame hint for broadcast join > --- > > Key: SPARK-10577 > URL: https://issues.apache.org/jira/browse/SPARK-10577 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.0 >Reporter: Maciej Bryński > Labels: starter > Fix For: 1.6.0 > > > As in https://issues.apache.org/jira/browse/SPARK-8300, > there should be a way to add a hint for broadcast join in: > - PySpark
[jira] [Assigned] (SPARK-10750) ML Param validate should print better error information
[ https://issues.apache.org/jira/browse/SPARK-10750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10750: Assignee: Apache Spark > ML Param validate should print better error information > --- > > Key: SPARK-10750 > URL: https://issues.apache.org/jira/browse/SPARK-10750 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > Currently, when you set an illegal value for a param of array type (such as > IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an > IllegalArgumentException with incomprehensible error information. > For example: > val vectorSlicer = new > VectorSlicer().setInputCol("features").setOutputCol("result") > vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1")) > It throws an IllegalArgumentException such as: > vectorSlicer_b3b4d1a10f43 parameter names given invalid value > [Ljava.lang.String;@798256c5. > java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names > given invalid value [Ljava.lang.String;@798256c5. > Users cannot tell which params were set incorrectly.
[jira] [Assigned] (SPARK-10750) ML Param validate should print better error information
[ https://issues.apache.org/jira/browse/SPARK-10750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10750: Assignee: (was: Apache Spark) > ML Param validate should print better error information > --- > > Key: SPARK-10750 > URL: https://issues.apache.org/jira/browse/SPARK-10750 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Priority: Minor > > Currently, when you set an illegal value for a param of array type (such as > IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an > IllegalArgumentException with incomprehensible error information. > For example: > val vectorSlicer = new > VectorSlicer().setInputCol("features").setOutputCol("result") > vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1")) > It throws an IllegalArgumentException such as: > vectorSlicer_b3b4d1a10f43 parameter names given invalid value > [Ljava.lang.String;@798256c5. > java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names > given invalid value [Ljava.lang.String;@798256c5. > Users cannot tell which params were set incorrectly.
[jira] [Commented] (SPARK-10750) ML Param validate should print better error information
[ https://issues.apache.org/jira/browse/SPARK-10750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902044#comment-14902044 ] Apache Spark commented on SPARK-10750: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/8863 > ML Param validate should print better error information > --- > > Key: SPARK-10750 > URL: https://issues.apache.org/jira/browse/SPARK-10750 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Priority: Minor > > Currently, when you set an illegal value for a param of array type (such as > IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an > IllegalArgumentException with incomprehensible error information. > For example: > val vectorSlicer = new > VectorSlicer().setInputCol("features").setOutputCol("result") > vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1")) > It throws an IllegalArgumentException such as: > vectorSlicer_b3b4d1a10f43 parameter names given invalid value > [Ljava.lang.String;@798256c5. > java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names > given invalid value [Ljava.lang.String;@798256c5. > Users cannot tell which params were set incorrectly.
[jira] [Commented] (SPARK-8386) DataFrame and JDBC regression
[ https://issues.apache.org/jira/browse/SPARK-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902042#comment-14902042 ] Reynold Xin commented on SPARK-8386: [~viirya] do you have time to take a look? > DataFrame and JDBC regression > - > > Key: SPARK-8386 > URL: https://issues.apache.org/jira/browse/SPARK-8386 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: RHEL 7.1 >Reporter: Peter Haumer >Priority: Critical > > I have an ETL app that appends new results to a JDBC table at each run. > In 1.3.1 I did this: > testResultsDF.insertIntoJDBC(CONNECTION_URL, TABLE_NAME, false); > When I do this now in 1.4, it complains that the "object" 'TABLE_NAME' already > exists. I get this even if I switch the overwrite flag to true. I also tried: > testResultsDF.write().mode(SaveMode.Append).jdbc(CONNECTION_URL, TABLE_NAME, > connectionProperties); > and got the same error. It works the first time, creating the new > table and adding data successfully. But running it a second time, the > JDBC driver tells me that the table already exists. Even > SaveMode.Overwrite gives me the same error.
[jira] [Resolved] (SPARK-10716) spark-1.5.0-bin-hadoop2.6.tgz file doesn't uncompress on OS X due to hidden file
[ https://issues.apache.org/jira/browse/SPARK-10716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-10716. - Resolution: Fixed Assignee: Sean Owen Fix Version/s: 1.5.1 1.6.0 > spark-1.5.0-bin-hadoop2.6.tgz file doesn't uncompress on OS X due to hidden > file > > > Key: SPARK-10716 > URL: https://issues.apache.org/jira/browse/SPARK-10716 > Project: Spark > Issue Type: Bug > Components: Build, Deploy >Affects Versions: 1.5.0 > Environment: Yosemite 10.10.5 >Reporter: Jack Jack >Assignee: Sean Owen >Priority: Minor > Fix For: 1.6.0, 1.5.1 > > > I directly downloaded the prebuilt binaries from > http://d3kbcqa49mib13.cloudfront.net/spark-1.5.0-bin-hadoop2.6.tgz > and got an error when running tar xvzf on the file. I tried downloading and extracting twice. > error log: > .. > x spark-1.5.0-bin-hadoop2.6/lib/ > x spark-1.5.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar > x spark-1.5.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar > x spark-1.5.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar > x spark-1.5.0-bin-hadoop2.6/lib/spark-examples-1.5.0-hadoop2.6.0.jar > x spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar > x spark-1.5.0-bin-hadoop2.6/lib/spark-1.5.0-yarn-shuffle.jar > x spark-1.5.0-bin-hadoop2.6/README.md > tar: copyfile unpack > (spark-1.5.0-bin-hadoop2.6/python/test_support/sql/orc_partitioned/SUCCESS.crc) > failed: No such file or directory > ~ :>
[jira] [Commented] (SPARK-10750) ML Param validate should print better error information
[ https://issues.apache.org/jira/browse/SPARK-10750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902029#comment-14902029 ] Yanbo Liang commented on SPARK-10750: - This is because Param.validate(value: T) uses value.toString in the error message, which does not render values of array type meaningfully. > ML Param validate should print better error information > --- > > Key: SPARK-10750 > URL: https://issues.apache.org/jira/browse/SPARK-10750 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Priority: Minor > > Currently, when you set an illegal value for a param of array type (such as > IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an > IllegalArgumentException with incomprehensible error information. > For example: > val vectorSlicer = new > VectorSlicer().setInputCol("features").setOutputCol("result") > vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1")) > It throws an IllegalArgumentException such as: > vectorSlicer_b3b4d1a10f43 parameter names given invalid value > [Ljava.lang.String;@798256c5. > java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names > given invalid value [Ljava.lang.String;@798256c5. > Users cannot tell which params were set incorrectly.
[jira] [Closed] (SPARK-10751) ML Param validate should print better error information
[ https://issues.apache.org/jira/browse/SPARK-10751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang closed SPARK-10751. --- Resolution: Duplicate > ML Param validate should print better error information > --- > > Key: SPARK-10751 > URL: https://issues.apache.org/jira/browse/SPARK-10751 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Priority: Minor > > Currently, when you set an illegal value for a param of array type (such as > IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an > IllegalArgumentException with incomprehensible error information. > For example: > val vectorSlicer = new > VectorSlicer().setInputCol("features").setOutputCol("result") > vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1")) > It throws an IllegalArgumentException such as: > vectorSlicer_b3b4d1a10f43 parameter names given invalid value > [Ljava.lang.String;@798256c5. > java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names > given invalid value [Ljava.lang.String;@798256c5. > Users cannot tell which params were set incorrectly.
[jira] [Resolved] (SPARK-9821) pyspark reduceByKey should allow a custom partitioner
[ https://issues.apache.org/jira/browse/SPARK-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9821. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8569 [https://github.com/apache/spark/pull/8569] > pyspark reduceByKey should allow a custom partitioner > - > > Key: SPARK-9821 > URL: https://issues.apache.org/jira/browse/SPARK-9821 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.3.0 >Reporter: Diana Carroll >Priority: Minor > Fix For: 1.6.0 > > > In Scala, I can supply a custom partitioner to reduceByKey (and other > aggregation/repartitioning methods like aggregateByKey and combineByKey), > but as far as I can tell from the PySpark API, there's no way to do the same > in Python. > Here's an example of my code in Scala: > {code}weblogs.map(s => (getFileType(s), 1)).reduceByKey(new > FileTypePartitioner(),_+_){code} > But I can't figure out how to do the same in Python. The closest I can get > is to call repartition before reduceByKey like so: > {code}weblogs.map(lambda s: (getFileType(s), > 1)).partitionBy(3,hash_filetype).reduceByKey(lambda v1,v2: > v1+v2).collect(){code} > But that defeats the purpose, because I'm shuffling twice instead of once, so > my performance is worse instead of better.
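What the ticket asks for can be stated precisely: route each key with a caller-supplied partition function and aggregate within the resulting partitions, so only one shuffle is needed. A pure-Python sketch of those semantics follows; it is illustrative only (the eventual PySpark change adds a partition-function argument to reduceByKey itself), and `file_type_partitioner` is a hypothetical stand-in for the FileTypePartitioner in Diana's example.

```python
# Pure-Python sketch of a partition-aware reduceByKey: keys are routed
# by a custom partition function and reduced inside each partition,
# modeling a single shuffle instead of partitionBy + reduceByKey (two).

def reduce_by_key(pairs, reduce_fn, num_partitions, partition_fn):
    partitions = [dict() for _ in range(num_partitions)]
    for k, v in pairs:
        # The custom partition function decides where each key lives.
        p = partitions[partition_fn(k) % num_partitions]
        p[k] = reduce_fn(p[k], v) if k in p else v
    return partitions

def file_type_partitioner(file_type):
    # Hypothetical: pin well-known file types to fixed partitions.
    return {"html": 0, "jpg": 1}.get(file_type, 2)

logs = [("html", 1), ("jpg", 1), ("html", 1), ("css", 1)]
parts = reduce_by_key(logs, lambda a, b: a + b, 3, file_type_partitioner)
# parts == [{"html": 2}, {"jpg": 1}, {"css": 1}]
```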
[jira] [Created] (SPARK-10751) ML Param validate should print better error information
Yanbo Liang created SPARK-10751: --- Summary: ML Param validate should print better error information Key: SPARK-10751 URL: https://issues.apache.org/jira/browse/SPARK-10751 Project: Spark Issue Type: Bug Components: ML Reporter: Yanbo Liang Priority: Minor Currently, when you set an illegal value for a param of array type (such as IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an IllegalArgumentException with incomprehensible error information. For example: val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result") vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1")) It throws an IllegalArgumentException such as: vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;@798256c5. java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;@798256c5. Users cannot tell which params were set incorrectly.
[jira] [Created] (SPARK-10750) ML Param validate should print better error information
Yanbo Liang created SPARK-10750: --- Summary: ML Param validate should print better error information Key: SPARK-10750 URL: https://issues.apache.org/jira/browse/SPARK-10750 Project: Spark Issue Type: Bug Components: ML Reporter: Yanbo Liang Priority: Minor Currently, when you set an illegal value for a param of array type (such as IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an IllegalArgumentException with incomprehensible error information. For example: val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result") vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1")) It throws an IllegalArgumentException such as: vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;@798256c5. java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;@798256c5. Users cannot tell which params were set incorrectly.
[jira] [Created] (SPARK-10749) Support multiple roles with Spark Mesos dispatcher
Timothy Chen created SPARK-10749: Summary: Support multiple roles with Spark Mesos dispatcher Key: SPARK-10749 URL: https://issues.apache.org/jira/browse/SPARK-10749 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Although you can currently set the framework role of the Mesos dispatcher, it doesn't correctly use the offers given to it. It should follow how the coarse-grained and fine-grained schedulers work and use offers from multiple roles.
[jira] [Created] (SPARK-10748) Log error instead of crashing Spark Mesos dispatcher when a job is misconfigured
Timothy Chen created SPARK-10748: Summary: Log error instead of crashing Spark Mesos dispatcher when a job is misconfigured Key: SPARK-10748 URL: https://issues.apache.org/jira/browse/SPARK-10748 Project: Spark Issue Type: Bug Components: Mesos Reporter: Timothy Chen Currently, when the dispatcher is submitting a new driver, it simply throws a SparkException if a necessary configuration is not set. We should log the error and keep the dispatcher running instead of crashing.
[jira] [Assigned] (SPARK-10731) The head() implementation of dataframe is very slow
[ https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10731: Assignee: Apache Spark > The head() implementation of dataframe is very slow > --- > > Key: SPARK-10731 > URL: https://issues.apache.org/jira/browse/SPARK-10731 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1, 1.5.0 >Reporter: Jerry Lam >Assignee: Apache Spark > Labels: pyspark > > {code} > df=sqlContext.read.parquet("someparquetfiles") > df.head() > {code} > The above lines take over 15 minutes. It seems the dataframe requires 3 > stages to return the first row. It reads all data (which is about 1 billion > rows) and runs Limit twice. The take(1) implementation in the RDD performs > much better.
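The performance gap Jerry describes is essentially a missing short circuit: head() should stop after the first row rather than materializing the whole dataset. A pure-Python analogy (not Spark code) makes the difference measurable by counting how many rows each strategy consumes:

```python
# Analogy for SPARK-10731: a "full scan" head versus a short-circuiting
# head over a lazy row source, with a counter tracking rows consumed.

def scan(n, counter):
    # Lazy row source; counter[0] records how many rows were produced.
    for i in range(n):
        counter[0] += 1
        yield i

def head_full_scan(rows):
    data = list(rows)        # materializes everything, like the slow plan
    return data[0] if data else None

def head_short_circuit(rows):
    for row in rows:         # stops at the first row, like RDD.take(1)
        return row
    return None

c_full, c_lazy = [0], [0]
r1 = head_full_scan(scan(1000, c_full))      # consumes 1000 rows
r2 = head_short_circuit(scan(1000, c_lazy))  # consumes 1 row
```

Both return the same first row; only the amount of work differs, which is why rewriting head() on top of a take(1)-style path helps.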
[jira] [Assigned] (SPARK-10731) The head() implementation of dataframe is very slow
[ https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10731: Assignee: (was: Apache Spark) > The head() implementation of dataframe is very slow > --- > > Key: SPARK-10731 > URL: https://issues.apache.org/jira/browse/SPARK-10731 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1, 1.5.0 >Reporter: Jerry Lam > Labels: pyspark > > {code} > df=sqlContext.read.parquet("someparquetfiles") > df.head() > {code} > The above lines take over 15 minutes. It seems the dataframe requires 3 > stages to return the first row. It reads all data (which is about 1 billion > rows) and runs Limit twice. The take(1) implementation in the RDD performs > much better.
[jira] [Commented] (SPARK-10731) The head() implementation of dataframe is very slow
[ https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901983#comment-14901983 ] Apache Spark commented on SPARK-10731: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/8852 > The head() implementation of dataframe is very slow > --- > > Key: SPARK-10731 > URL: https://issues.apache.org/jira/browse/SPARK-10731 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1, 1.5.0 >Reporter: Jerry Lam > Labels: pyspark > > {code} > df=sqlContext.read.parquet("someparquetfiles") > df.head() > {code} > The above lines take over 15 minutes. It seems the dataframe requires 3 > stages to return the first row. It reads all data (which is about 1 billion > rows) and runs Limit twice. The take(1) implementation in the RDD performs > much better.
[jira] [Commented] (SPARK-9595) Adding API to SparkConf for kryo serializers registration
[ https://issues.apache.org/jira/browse/SPARK-9595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901979#comment-14901979 ] John Chen commented on SPARK-9595: -- Yes, thanks a lot. > Adding API to SparkConf for kryo serializers registration > - > > Key: SPARK-9595 > URL: https://issues.apache.org/jira/browse/SPARK-9595 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1 >Reporter: John Chen >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > Currently SparkConf has a registerKryoClasses API for kryo registration. > However, this only works when you register classes. If you want to register > customized kryo serializers, you'll have to extend the KryoSerializer class > and write some code. > This is not only very inconvenient, but also requires the registration to be > done at compile time, which is not always possible. Thus, I suggest adding another > API to SparkConf for registering customized kryo serializers. It could be > like this: > def registerKryoSerializers(serializers: Map[Class[_], Serializer]): SparkConf
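The proposed API boils down to a runtime registry from class to serializer, filled in without subclassing anything at compile time. A pure-Python sketch of that shape (illustrative only; Spark's Kryo integration is Scala and differs in detail, and the registry class here is hypothetical):

```python
# Sketch of the registerKryoSerializers(Map[Class[_], Serializer]) idea:
# a runtime registry mapping a class to an (encode, decode) pair, with a
# generic fallback for unregistered types. Not Spark code.
import json

class SerializerRegistry:
    def __init__(self):
        self._by_class = {}

    def register_serializers(self, serializers):
        # serializers: {cls: (encode, decode)} - registered at runtime,
        # so no compile-time subclassing is needed.
        self._by_class.update(serializers)

    def dumps(self, obj):
        encode, _ = self._by_class.get(type(obj), (json.dumps, json.loads))
        return encode(obj)

    def loads(self, cls, data):
        _, decode = self._by_class.get(cls, (json.dumps, json.loads))
        return decode(data)

reg = SerializerRegistry()
reg.register_serializers({
    complex: (lambda z: [z.real, z.imag],          # custom encoder
              lambda p: complex(p[0], p[1])),      # custom decoder
})
encoded = reg.dumps(3 + 4j)            # [3.0, 4.0] via the custom encoder
decoded = reg.loads(complex, encoded)  # back to (3+4j)
```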
[jira] [Comment Edited] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901959#comment-14901959 ] Yi Zhou edited comment on SPARK-10733 at 9/22/15 5:18 AM: -- 1) I have never set `spark.task.cpus` and only use the default. 2) scale factor=1000 (1TB data set) 3) Spark conf like below spark.shuffle.manager=sort spark.sql.hive.metastore.version=1.1.0 spark.sql.hive.metastore.jars=/usr/lib/hive/lib/\*:/usr/lib/hadoop/client/\* spark.executor.extraClassPath=/etc/hive/conf spark.driver.extraClassPath=/etc/hive/conf spark.serializer=org.apache.spark.serializer.KryoSerializer was (Author: jameszhouyi): 1) I have never set `spark.task.cpus` and only use the default. 2) scale factor=1000 (1TB data set) 3) Spark conf like below spark.shuffle.manager=sort spark.sql.hive.metastore.version=1.1.0 spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/* spark.executor.extraClassPath=/etc/hive/conf spark.driver.extraClassPath=/etc/hive/conf spark.serializer=org.apache.spark.serializer.KryoSerializer > TungstenAggregation cannot acquire page after switching to sort-based > - > > Key: SPARK-10733 > URL: https://issues.apache.org/jira/browse/SPARK-10733 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > > This is uncovered after fixing SPARK-10474.
Stack trace: > {code} > 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage > 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 > bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code}
[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901959#comment-14901959 ] Yi Zhou commented on SPARK-10733: - 1) I have never set `spark.task.cpus` and only use the default. 2) scale factor=1000 (1TB data set) 3) Spark conf like below spark.shuffle.manager=sort spark.sql.hive.metastore.version=1.1.0 spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/* spark.executor.extraClassPath=/etc/hive/conf spark.driver.extraClassPath=/etc/hive/conf spark.serializer=org.apache.spark.serializer.KryoSerializer > TungstenAggregation cannot acquire page after switching to sort-based > - > > Key: SPARK-10733 > URL: https://issues.apache.org/jira/browse/SPARK-10733 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > > This is uncovered after fixing SPARK-10474. Stack trace: > {code} > 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage > 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 > bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code}
[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901951#comment-14901951 ] Yi Zhou commented on SPARK-10733: - Key SQL Query: INSERT INTO TABLE test_table SELECT ss.ss_customer_sk AS cid, count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6, count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 FROM store_sales ss INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk WHERE i.i_category IN ('Books') AND ss.ss_customer_sk IS NOT NULL GROUP BY ss.ss_customer_sk HAVING count(ss.ss_item_sk) > 5 Note: the store_sales is a big fact table and item is a small dimension table. > TungstenAggregation cannot acquire page after switching to sort-based > - > > Key: SPARK-10733 > URL: https://issues.apache.org/jira/browse/SPARK-10733 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > > This is uncovered after fixing SPARK-10474. 
Stack trace: > {code} > 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage > 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 > bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
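The query quoted above is essentially a conditional-count pivot: each `count(CASE WHEN i.i_class_id=k THEN 1 ELSE NULL END)` column counts only the rows matching one class id, because COUNT ignores NULLs. A minimal Python sketch of that per-customer pivot, using made-up sample rows (not the benchmark data):

```python
from collections import defaultdict

# Sample (customer_sk, class_id) rows standing in for the store_sales/item join.
rows = [(31, 1), (31, 1), (31, 3), (42, 3), (42, 5)]

# count(CASE WHEN i_class_id = k THEN 1 ELSE NULL END) per customer:
# COUNT skips NULLs, so each branch counts only rows with that class id.
pivot = defaultdict(lambda: defaultdict(int))
for cust, class_id in rows:
    pivot[cust][class_id] += 1

assert pivot[31][1] == 2
assert pivot[31][3] == 1
assert pivot[42][5] == 1
```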
[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901944#comment-14901944 ] Saisai Shao commented on SPARK-10739: - [~srowen] I'm not sure the previous JIRA you mentioned is exactly the same thing as what I describe here. This JIRA is about letting YARN know that this is a long-running service and ignoring out-of-window failures (YARN-611), which is probably different from what you mentioned. > Add attempt window for long running Spark application on Yarn > - > > Key: SPARK-10739 > URL: https://issues.apache.org/jira/browse/SPARK-10739 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Saisai Shao >Priority: Minor > > Currently Spark on Yarn uses max attempts to control the failure number, if > application's failure number reaches to the max attempts, application will > not be recovered by RM, it is not very effective for long running > applications, since it will easily exceed the max number at a long time > period, also setting a very large max attempts will hide the real problem. > So here introduce an attempt window to control the application attempt times, > this will ignore the out of window attempts, it is introduced in Hadoop 2.6+ > to support long running application, it is quite useful for Spark Streaming, > Spark shell like applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
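The attempt-window semantics referenced in this thread (YARN-611) amount to a trailing-window failure counter: only failures inside the window count against the application, so a long-running app is not killed by failures accumulated over months. A toy sketch of that bookkeeping; the window length and timestamps are illustrative, not Spark or YARN code:

```python
# Count only attempt failures inside a trailing time window (YARN-611 style),
# so failures that have aged out of the window no longer count against the app.
WINDOW_SECONDS = 3600
failure_times = [100, 200, 5000, 5100]  # hypothetical failure timestamps (s)

def failures_in_window(now: int) -> int:
    """Number of past failures within WINDOW_SECONDS of `now`."""
    return sum(1 for t in failure_times if 0 <= now - t <= WINDOW_SECONDS)

# At t=6000 the early failures (t=100, 200) have aged out; only two remain,
# whereas an unbounded counter would still report all four.
assert failures_in_window(6000) == 2
assert failures_in_window(9999) == 0
```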
[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901901#comment-14901901 ] Sandy Ryza commented on SPARK-10739: I recall there was a JIRA similar to this that avoided killing the application when we reached a certain number of executor failures. However, IIUC, this is about something different: deciding whether to have YARN restart the application when it fails. > Add attempt window for long running Spark application on Yarn > - > > Key: SPARK-10739 > URL: https://issues.apache.org/jira/browse/SPARK-10739 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Saisai Shao >Priority: Minor > > Currently Spark on Yarn uses max attempts to control the failure number, if > application's failure number reaches to the max attempts, application will > not be recovered by RM, it is not very effective for long running > applications, since it will easily exceed the max number at a long time > period, also setting a very large max attempts will hide the real problem. > So here introduce an attempt window to control the application attempt times, > this will ignore the out of window attempts, it is introduced in Hadoop 2.6+ > to support long running application, it is quite useful for Spark Streaming, > Spark shell like applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10711) Do not assume spark.submit.deployMode is always set
[ https://issues.apache.org/jira/browse/SPARK-10711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-10711. - Resolution: Fixed Assignee: Hossein Falaki Fix Version/s: 1.6.0 > Do not assume spark.submit.deployMode is always set > --- > > Key: SPARK-10711 > URL: https://issues.apache.org/jira/browse/SPARK-10711 > Project: Spark > Issue Type: Bug > Components: R >Affects Versions: 1.5.0 >Reporter: Hossein Falaki >Assignee: Hossein Falaki >Priority: Critical > Fix For: 1.6.0, 1.5.1 > > > in RRDD.createRProcess() we call RUtils.sparkRPackagePath(), which assumes > "... that Spark properties `spark.master` and `spark.submit.deployMode` are > set." > It is better to assume safe defaults if they are not set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
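The suggested fix amounts to falling back to a safe default when a property is unset instead of failing; a minimal sketch of that lookup pattern (the default values here are assumptions for illustration, not necessarily what Spark chooses):

```python
def spark_prop(conf: dict, key: str, default: str) -> str:
    """Return the configured value if set, else a safe default
    instead of assuming the property is always present."""
    return conf.get(key, default)

conf = {"spark.master": "local[*]"}  # spark.submit.deployMode deliberately unset

assert spark_prop(conf, "spark.submit.deployMode", "client") == "client"
assert spark_prop(conf, "spark.master", "local[*]") == "local[*]"
```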
[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901866#comment-14901866 ] Sean Owen commented on SPARK-10739: --- I'm sure we've discussed this one before and that there's a JIRA for it ... but can't for the life of me find it. I feel like [~sandyr] or [~vanzin] commented on it. the question was how long back you looked when considering if "a lot" of failures had occurred, etc. > Add attempt window for long running Spark application on Yarn > - > > Key: SPARK-10739 > URL: https://issues.apache.org/jira/browse/SPARK-10739 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Saisai Shao >Priority: Minor > > Currently Spark on Yarn uses max attempts to control the failure number, if > application's failure number reaches to the max attempts, application will > not be recovered by RM, it is not very effective for long running > applications, since it will easily exceed the max number at a long time > period, also setting a very large max attempts will hide the real problem. > So here introduce an attempt window to control the application attempt times, > this will ignore the out of window attempts, it is introduced in Hadoop 2.6+ > to support long running application, it is quite useful for Spark Streaming, > Spark shell like applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10495) For json data source, date values are saved as int strings
[ https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901847#comment-14901847 ] Apache Spark commented on SPARK-10495: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/8861 > For json data source, date values are saved as int strings > -- > > Key: SPARK-10495 > URL: https://issues.apache.org/jira/browse/SPARK-10495 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > {code} > val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j") > df.write.format("json").save("/tmp/testJson") > sc.textFile("/tmp/testJson").collect.foreach(println) > {"i":1,"j":"-25567"} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
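The `"-25567"` in the buggy output above is the date serialized as its internal representation, days relative to the Unix epoch, rather than as a date string; the arithmetic can be checked directly:

```python
from datetime import date

# 1900-01-01 is 25567 days *before* the 1970-01-01 epoch, which is why the
# buggy JSON output contains "-25567" instead of "1900-01-01".
days = (date(1900, 1, 1) - date(1970, 1, 1)).days
assert days == -25567
```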
[jira] [Created] (SPARK-10747) add support for window specification to include how NULLS are ordered
N Campbell created SPARK-10747: -- Summary: add support for window specification to include how NULLS are ordered Key: SPARK-10747 URL: https://issues.apache.org/jira/browse/SPARK-10747 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: N Campbell You cannot express how NULLs are to be sorted in the window order specification and have to simulate it with a compensating expression. Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' near 'nulls' line 1:82 missing EOF at 'last' near 'nulls'; SQLState: null Same limitation as Hive, reported in Apache JIRA HIVE-9535. This fails: select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc nulls last) from tolap The compensating workaround: select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when c3 is null then 1 else 0 end) from tolap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
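The compensating expression in the report maps each NULL to a sort key that forces it after all non-NULL values; the same trick in plain Python, with illustrative data:

```python
vals = [3, None, 1, None, 2]

# Emulate ORDER BY c3 DESC NULLS LAST with a (is_null, -value) key:
# non-NULLs sort first (False < True), descending via the negated value;
# NULLs share the (True, 0) key and land at the end.
ordered = sorted(vals, key=lambda v: (v is None, -(v or 0)))
assert ordered == [3, 2, 1, None, None]
```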
[jira] [Created] (SPARK-10746) count ( distinct columnref) over () returns wrong result set
N Campbell created SPARK-10746: -- Summary: count ( distinct columnref) over () returns wrong result set Key: SPARK-10746 URL: https://issues.apache.org/jira/browse/SPARK-10746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: N Campbell Same issue as reported against Hive (HIVE-9534). The result set was expected to contain 5 rows instead of 1 row, as other vendors (Oracle, Netezza, etc.) return. select count( distinct column) over () from t1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
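The expectation stated in the report is that an unbounded `OVER ()` window produces one output row per input row, each carrying the global distinct count, rather than collapsing to a single row; a small sketch of those semantics with assumed data:

```python
rows = ["a", "b", "a", "c", "b"]  # 5 input rows, 3 distinct values

# count(DISTINCT col) OVER (): a window function, not an aggregate query,
# so every input row survives and carries the same global distinct count.
distinct_count = len(set(rows))
result = [distinct_count for _ in rows]

assert len(result) == 5          # expected: 5 rows, not 1
assert result == [3, 3, 3, 3, 3]
```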
[jira] [Created] (SPARK-10745) Separate configs between shuffle and RPC
Shixiong Zhu created SPARK-10745: Summary: Separate configs between shuffle and RPC Key: SPARK-10745 URL: https://issues.apache.org/jira/browse/SPARK-10745 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu SPARK-6028 uses network module to implement RPC. However, there are some configurations named with `spark.shuffle` prefix in the network module. We should refactor them and make sure the user can control them in shuffle and RPC separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
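One way to read the proposed refactoring is as parallel key namespaces: look up a module-specific key first and fall back to the legacy shared `spark.shuffle.*` key. A sketch of that resolution order; the key names below are illustrative, not Spark's actual configuration:

```python
def module_conf(conf: dict, module: str, setting: str, default: str) -> str:
    """Prefer a module-specific key (e.g. a hypothetical
    spark.rpc.io.serverThreads) over the legacy shared spark.shuffle.* key,
    then fall back to a default. Key names are illustrative only."""
    return conf.get(f"spark.{module}.{setting}",
                    conf.get(f"spark.shuffle.{setting}", default))

conf = {"spark.shuffle.io.serverThreads": "8",
        "spark.rpc.io.serverThreads": "2"}

assert module_conf(conf, "rpc", "io.serverThreads", "1") == "2"      # own key wins
assert module_conf(conf, "network", "io.serverThreads", "1") == "8"  # legacy fallback
```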
[jira] [Created] (SPARK-10744) parser error (constant * column is null interpreted as constant * boolean)
N Campbell created SPARK-10744: -- Summary: parser error (constant * column is null interpreted as constant * boolean) Key: SPARK-10744 URL: https://issues.apache.org/jira/browse/SPARK-10744 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: N Campbell Priority: Minor Spark SQL inherits the same defect as Hive, where this statement will not parse/execute. See HIVE-9530. select c1 from t1 where 1 * cnnull is null -vs- select c1 from t1 where (1 * cnnull) is null -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10310) [Spark SQL] All result records will be populated into ONE line during the script transform due to missing the correct line/field delimiter
[ https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901774#comment-14901774 ] Apache Spark commented on SPARK-10310: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8860 > [Spark SQL] All result records will be populated into ONE line during the > script transform due to missing the correct line/field delimiter > -- > > Key: SPARK-10310 > URL: https://issues.apache.org/jira/browse/SPARK-10310 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yi Zhou >Assignee: zhichao-li >Priority: Critical > > There is a real case using a Python stream script in a Spark SQL query. We found > that all result records were written in ONE line as input from the "select" > pipeline to the Python script, which caused the script to fail to identify each > record. Also, the field separator in Spark SQL will be '^A' or '\001', which is > inconsistent/incompatible with the '\t' used in the Hive implementation. 
> Key query: > {code:sql} > CREATE VIEW temp1 AS > SELECT * > FROM > ( > FROM > ( > SELECT > c.wcs_user_sk, > w.wp_type, > (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec > FROM web_clickstreams c, web_page w > WHERE c.wcs_web_page_sk = w.wp_web_page_sk > AND c.wcs_web_page_sk IS NOT NULL > AND c.wcs_user_sk IS NOT NULL > AND c.wcs_sales_skIS NULL --abandoned implies: no sale > DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec > ) clicksAnWebPageType > REDUCE > wcs_user_sk, > tstamp_inSec, > wp_type > USING 'python sessionize.py 3600' > AS ( > wp_type STRING, > tstamp BIGINT, > sessionid STRING) > ) sessionized > {code} > Key Python script: > {noformat} > for line in sys.stdin: > user_sk, tstamp_str, value = line.strip().split("\t") > {noformat} > Sample SELECT result: > {noformat} > ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview > {noformat} > Expected result: > {noformat} > 31 3237764860 feedback > 31 3237769106 dynamic > 31 3237779027 review > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
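The root cause described above is a delimiter mismatch: the sessionize script splits on `'\t'`, while Spark SQL joins fields with `'\x01'` (shown as `^A` in the sample output), so every record collapses into one field. The mismatch is easy to reproduce outside Spark:

```python
# Fields as Spark SQL emitted them, joined by '\x01' (^A in the report),
# versus the '\t' separator the Hive-style reducer script expects.
spark_line = "31\x013237764860\x01feedback"

assert len(spark_line.split("\t")) == 1   # tab split finds no field boundaries
assert spark_line.split("\x01") == ["31", "3237764860", "feedback"]
```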
[jira] [Assigned] (SPARK-10743) keep the name of expression if possible when do cast
[ https://issues.apache.org/jira/browse/SPARK-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10743: Assignee: (was: Apache Spark) > keep the name of expression if possible when do cast > > > Key: SPARK-10743 > URL: https://issues.apache.org/jira/browse/SPARK-10743 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10743) keep the name of expression if possible when do cast
[ https://issues.apache.org/jira/browse/SPARK-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901768#comment-14901768 ] Apache Spark commented on SPARK-10743: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/8859 > keep the name of expression if possible when do cast > > > Key: SPARK-10743 > URL: https://issues.apache.org/jira/browse/SPARK-10743 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10743) keep the name of expression if possible when do cast
[ https://issues.apache.org/jira/browse/SPARK-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10743: Assignee: Apache Spark > keep the name of expression if possible when do cast > > > Key: SPARK-10743 > URL: https://issues.apache.org/jira/browse/SPARK-10743 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10743) keep the name of expression if possible when do cast
Wenchen Fan created SPARK-10743: --- Summary: keep the name of expression if possible when do cast Key: SPARK-10743 URL: https://issues.apache.org/jira/browse/SPARK-10743 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions
[ https://issues.apache.org/jira/browse/SPARK-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10742: Assignee: Apache Spark (was: Tathagata Das) > Add the ability to embed HTML relative links in job descriptions > > > Key: SPARK-10742 > URL: https://issues.apache.org/jira/browse/SPARK-10742 > Project: Spark > Issue Type: Improvement >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Minor > > This is to allow embedding links to other Spark UI tabs within the job > description. For example, streaming jobs could set descriptions with links > pointing to the corresponding details page of the batch that the job belongs > to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions
[ https://issues.apache.org/jira/browse/SPARK-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10742: Assignee: Tathagata Das (was: Apache Spark) > Add the ability to embed HTML relative links in job descriptions > > > Key: SPARK-10742 > URL: https://issues.apache.org/jira/browse/SPARK-10742 > Project: Spark > Issue Type: Improvement >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Minor > > This is to allow embedding links to other Spark UI tabs within the job > description. For example, streaming jobs could set descriptions with links > pointing to the corresponding details page of the batch that the job belongs > to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions
[ https://issues.apache.org/jira/browse/SPARK-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901763#comment-14901763 ] Apache Spark commented on SPARK-10742: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/8791 > Add the ability to embed HTML relative links in job descriptions > > > Key: SPARK-10742 > URL: https://issues.apache.org/jira/browse/SPARK-10742 > Project: Spark > Issue Type: Improvement >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Minor > > This is to allow embedding links to other Spark UI tabs within the job > description. For example, streaming jobs could set descriptions with links > pointing to the corresponding details page of the batch that the job belongs > to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions
Tathagata Das created SPARK-10742: - Summary: Add the ability to embed HTML relative links in job descriptions Key: SPARK-10742 URL: https://issues.apache.org/jira/browse/SPARK-10742 Project: Spark Issue Type: Improvement Reporter: Tathagata Das Assignee: Tathagata Das Priority: Minor This is to allow embedding links to other Spark UI tabs within the job description. For example, streaming jobs could set descriptions with links pointing to the corresponding details page of the batch that the job belongs to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working
[ https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian updated SPARK-10741: Description: Failed Query with Having Clause {code} def testParquetHaving() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) {code} Failed Query with OrderBy {code} def testParquetOrderBy() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedOrderBy = """ SELECT c1, avg ( c2 ) c_avg | FROM test | GROUP BY c1 | ORDER BY avg ( c2 )""".stripMargin TestHive.sql(ddl) TestHive.sql(failedOrderBy).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) {code} was: Failed Query with Having Clause {code} def testParquetHaving() { val ddl = """CREATE TABLE IF NOT 
EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) {code} Failed Query with OrderBy {code} @Test def testParquetOrderBy() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedOrderBy = """ SELECT c1, avg ( c2 ) c_avg | FROM test | GROUP BY c1 | ORDER BY avg ( c2 )""".stripMargin TestHive.sql(ddl) TestHive.sql(failedOrderBy).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) {code} > Hive Query Having/OrderBy against Parquet table is not working > --- > > Key: SPARK-10741 > URL: https://issues.apache.org/jira/browse/SPARK-10741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Ian > > Failed Query with Having Clause > {code} > def testParquetHaving() { > val ddl = > """CREATE TABLE IF NOT 
EXISTS test ( c1 string, c2 int ) STORED AS > PARQUET""" > val failedHaving = > """ SELECT c1, avg ( c2 ) as c_avg > | FROM test > | GROUP BY c1 > | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin > TestHive.sql(ddl) > TestHive.sql(failedHaving).collect > } > org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing > from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(ca
[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working
[ https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian updated SPARK-10741: Description: Failed Query with Having Clause {code} def testParquetHaving() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) {code} Failed Query with OrderBy {code} @Test def testParquetOrderBy() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedOrderBy = """ SELECT c1, avg ( c2 ) c_avg | FROM test | GROUP BY c1 | ORDER BY avg ( c2 )""".stripMargin TestHive.sql(ddl) TestHive.sql(failedOrderBy).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) {code} was: Failed Query with Having Clause {code} def testParquetHaving() { val ddl = """CREATE TABLE IF 
NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) {code} Failed Query with OrderBy {code} val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val filedOrderBy = """ SELECT c1, avg ( c2 ) c_avg | FROM test | GROUP BY c1 | ORDER BY avg ( c2 )""".stripMargin TestHive.sql(ddl) TestHive.sql(filedOrderBy).collect org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) {code} > Hive Query Having/OrderBy against Parquet table is not working > --- > > Key: SPARK-10741 > URL: https://issues.apache.org/jira/browse/SPARK-10741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Ian > > Failed Query with Having Clause > {code} > def testParquetHaving() { > val ddl = > """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) 
STORED AS > PARQUET""" > val failedHaving = > """ SELECT c1, avg ( c2 ) as c_avg > | FROM test > | GROUP BY c1 > | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin > TestHive.sql(ddl) > TestHive.sql(failedHaving).collect > } > org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing > from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as > bigint)) > cast(5 as
[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working
[ https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian updated SPARK-10741: Description: Failed Query with Having Clause {code} def testParquetHaving() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) {code} Failed Query with OrderBy {code} val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val filedOrderBy = """ SELECT c1, avg ( c2 ) c_avg | FROM test | GROUP BY c1 | ORDER BY avg ( c2 )""".stripMargin TestHive.sql(ddl) TestHive.sql(filedOrderBy).collect org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) {code} was: Failed Query with Having Clause {code} def testParquetHaving() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int 
) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } {code} org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > Hive Query Having/OrderBy against Parquet table is not working > --- > > Key: SPARK-10741 > URL: https://issues.apache.org/jira/browse/SPARK-10741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Ian > > Failed Query with Having Clause > {code} > def testParquetHaving() { > val ddl = > """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS > PARQUET""" > val failedHaving = > """ SELECT c1, avg ( c2 ) as c_avg > | FROM test > | GROUP BY c1 > | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin > TestHive.sql(ddl) > TestHive.sql(failedHaving).collect > } > org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing > from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as > bigint)) > cast(5 as double)) as boolean) AS > havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > {code} > Failed Query with OrderBy > {code} > val ddl = > """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS > PARQUET""" > val filedOrderBy = > """ SELECT c1, avg ( c2 ) c_avg > | FROM test > | GROUP BY c1 >
[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working
[ https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian updated SPARK-10741: Description: Failed Query with Having Clause {code} def testParquetHaving() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } {code} org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) was: Failed Query with Having Clause def testParquetHaving() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > Hive Query Having/OrderBy against Parquet table is not working > --- > > Key: SPARK-10741 > URL: https://issues.apache.org/jira/browse/SPARK-10741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Ian > > Failed Query with Having Clause > {code} > def testParquetHaving() { > val ddl = > """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS > PARQUET""" > val failedHaving = > """ SELECT c1, avg ( c2 ) as c_avg > | FROM test > | GROUP BY c1 > | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin > TestHive.sql(ddl) > TestHive.sql(failedHaving).collect > } > {code} > org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing > from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as > bigint)) > cast(5 as double)) as boolean) AS > havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working
Ian created SPARK-10741: --- Summary: Hive Query Having/OrderBy against Parquet table is not working Key: SPARK-10741 URL: https://issues.apache.org/jira/browse/SPARK-10741 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Ian Failed Query with Having Clause def testParquetHaving() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10723) Add RDD.reduceOption method
[ https://issues.apache.org/jira/browse/SPARK-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901723#comment-14901723 ] Tatsuya Atsumi commented on SPARK-10723: Thanks for the comment and advice. RDD.fold seems suitable for my case. Thank you. > Add RDD.reduceOption method > --- > > Key: SPARK-10723 > URL: https://issues.apache.org/jira/browse/SPARK-10723 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Reporter: Tatsuya Atsumi >Priority: Minor > > h2. Problem > RDD.reduce throws an exception if the RDD is empty. > This is appropriate behavior when the RDD is expected to be non-empty, but when it is > unknown until runtime whether the RDD is empty, the call to reduce must be > wrapped in try-catch to be safe. > Example Code > {code} > // This RDD may be empty or not > val rdd: RDD[Int] = originalRdd.filter(_ > 10) > val reduced: Option[Int] = try { > Some(rdd.reduce(_ + _)) > } catch { > // if rdd is empty return None. > case e:UnsupportedOperationException => None > } > {code} > h2. Improvement idea > Scala’s List has a reduceOption method, which returns None if the List is empty. > If RDD had a reduceOption API like Scala’s List, the case above would become easy > to handle. > Example Code > {code} > val reduced: Option[Int] = originalRdd.filter(_ > 10).reduceOption(_ + _) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
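The reduceOption contract requested above can be sketched outside Spark. A minimal plain-Python illustration of the desired semantics (the function name and shape are hypothetical, not a PySpark API):

```python
from functools import reduce

def reduce_option(items, f):
    """Mirror Scala's List.reduceOption: return None for an empty
    collection instead of raising, which is the safe variant the
    ticket asks RDD.reduce to offer."""
    it = iter(items)
    try:
        first = next(it)
    except StopIteration:
        return None  # empty input: no exception, just None
    return reduce(f, it, first)

# The filtered collection may or may not be empty, as in the ticket's example.
print(reduce_option([x for x in [1, 5, 12, 20] if x > 10], lambda a, b: a + b))  # 32
print(reduce_option([x for x in [1, 5] if x > 10], lambda a, b: a + b))          # None
```

The caller gets an Option-like value (a result or None) without the try-catch boilerplate shown in the issue description.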
[jira] [Updated] (SPARK-10652) Set meaningful job descriptions for streaming related jobs
[ https://issues.apache.org/jira/browse/SPARK-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10652: -- Target Version/s: 1.6.0, 1.5.1 (was: 1.6.0) > Set meaningful job descriptions for streaming related jobs > -- > > Key: SPARK-10652 > URL: https://issues.apache.org/jira/browse/SPARK-10652 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.4.1, 1.5.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > > Job descriptions will help distinguish jobs of one batch from the other in > the Jobs and Stages pages in the Spark UI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10652) Set meaningful job descriptions for streaming related jobs
[ https://issues.apache.org/jira/browse/SPARK-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10652: -- Summary: Set meaningful job descriptions for streaming related jobs (was: Set good job descriptions for streaming related jobs) > Set meaningful job descriptions for streaming related jobs > -- > > Key: SPARK-10652 > URL: https://issues.apache.org/jira/browse/SPARK-10652 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.4.1, 1.5.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > > Job descriptions will help distinguish jobs of one batch from the other in > the Jobs and Stages pages in the Spark UI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10495) For json data source, date values are saved as int strings
[ https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10495. Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Issue resolved by pull request 8806 [https://github.com/apache/spark/pull/8806] > For json data source, date values are saved as int strings > -- > > Key: SPARK-10495 > URL: https://issues.apache.org/jira/browse/SPARK-10495 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > {code} > val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j") > df.write.format("json").save("/tmp/testJson") > sc.textFile("/tmp/testJson").collect.foreach(println) > {"i":1,"j":"-25567"} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
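The "-25567" in the sample JSON output is consistent with the date being serialized as its internal days-since-epoch integer rather than as a date string. A quick plain-Python check of that arithmetic (illustration only, not Spark code):

```python
from datetime import date

# Spark SQL's DateType is internally a count of days since the Unix epoch;
# if that raw integer leaks into the JSON output, 1900-01-01 appears as "-25567".
days_since_epoch = (date(1900, 1, 1) - date(1970, 1, 1)).days
print(days_since_epoch)  # -25567
```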
[jira] [Commented] (SPARK-10692) Failed batches are never reported through the StreamingListener interface
[ https://issues.apache.org/jira/browse/SPARK-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901695#comment-14901695 ] Tathagata Das commented on SPARK-10692: --- Yes, that is an independent problem, we need to think about that separately. Now at the least we need to expose them through StreamingListener so that apps that do not immediately exit on one failure, can generate the streaming UI correctly. > Failed batches are never reported through the StreamingListener interface > - > > Key: SPARK-10692 > URL: https://issues.apache.org/jira/browse/SPARK-10692 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.1, 1.4.1, 1.5.0 >Reporter: Tathagata Das >Assignee: Shixiong Zhu >Priority: Blocker > > If an output operation fails, then corresponding batch is never marked as > completed, as the data structure are not updated properly. > https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L183 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901694#comment-14901694 ] Andrew Or commented on SPARK-10474: --- Also, would you mind testing out this commit to see if it fixes your problem? https://github.com/andrewor14/spark/commits/aggregate-test > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE 
NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small dimension table. -- This message was sent by Atlassian JIR
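The count(CASE WHEN ... THEN 1 ELSE NULL END) columns in the query above are conditional counts that pivot i_class_id values into one column per class for each customer. A compact sketch of that aggregation shape in plain Python (toy rows, hypothetical values):

```python
from collections import defaultdict

# Toy (ss_customer_sk, i_class_id) pairs standing in for the
# store_sales/item join output after the WHERE clause.
rows = [(7, 1), (7, 3), (7, 1), (9, 3), (9, 5), (9, 5)]

# counts[cid][class_id] plays the role of one
# "count(CASE WHEN i.i_class_id=k THEN 1 ELSE NULL END)" column:
# NULL contributions are never counted, so only matching rows add up.
counts = defaultdict(lambda: defaultdict(int))
for cid, class_id in rows:
    counts[cid][class_id] += 1

print(dict(counts[7]))  # {1: 2, 3: 1}
```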
[jira] [Updated] (SPARK-10310) [Spark SQL] All result records will be populated into ONE line during the script transform due to missing the correct line/field delimiter
[ https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10310: --- Assignee: zhichao-li > [Spark SQL] All result records will be populated into ONE line during the > script transform due to missing the correct line/field delimiter > -- > > Key: SPARK-10310 > URL: https://issues.apache.org/jira/browse/SPARK-10310 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yi Zhou >Assignee: zhichao-li >Priority: Critical > > There is a real case using a Python stream script in a Spark SQL query. We found > that all result records were written in ONE line as input from the "select" > pipeline to the Python script, so the script cannot identify each > record. Also, the field separator in Spark SQL is '^A' or '\001', which is > inconsistent with the '\t' used in the Hive implementation. > Key query: > {code:sql} > CREATE VIEW temp1 AS > SELECT * > FROM > ( > FROM > ( > SELECT > c.wcs_user_sk, > w.wp_type, > (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec > FROM web_clickstreams c, web_page w > WHERE c.wcs_web_page_sk = w.wp_web_page_sk > AND c.wcs_web_page_sk IS NOT NULL > AND c.wcs_user_sk IS NOT NULL > AND c.wcs_sales_sk IS NULL --abandoned implies: no sale > DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec > ) clicksAnWebPageType > REDUCE > wcs_user_sk, > tstamp_inSec, > wp_type > USING 'python sessionize.py 3600' > AS ( > wp_type STRING, > tstamp BIGINT, > sessionid STRING) > ) sessionized > {code} > Key Python script: > {noformat} > for line in sys.stdin: > user_sk, tstamp_str, value = line.strip().split("\t") > {noformat} > Sample SELECT result: > {noformat} > ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview > {noformat} > Expected result: > {noformat} > 31 3237764860 feedback > 31 3237769106 dynamic > 31 3237779027 review > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
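The delimiter mismatch described above is easy to reproduce outside Spark: a reducer script that splits on '\t' sees a '\001'-separated record as a single field. A minimal plain-Python illustration (sample field values taken from the report):

```python
# Spark SQL's script transform emits fields separated by '\001' (^A),
# while the report's sessionize.py splits on '\t' as Hive would provide.
line = "31\x013237764860\x01feedback"

# Splitting on '\t' leaves the whole record as one field, so the
# three-way unpack in the script fails.
assert len(line.split("\t")) == 1

# Splitting on '\001' recovers the intended fields.
user_sk, tstamp_str, value = line.split("\x01")
print(user_sk, tstamp_str, value)  # 31 3237764860 feedback
```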
[jira] [Assigned] (SPARK-10740) handle nondeterministic expressions correctly for set operations
[ https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10740: Assignee: (was: Apache Spark) > handle nondeterministic expressions correctly for set operations > > > Key: SPARK-10740 > URL: https://issues.apache.org/jira/browse/SPARK-10740 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > > We should only push down deterministic filter conditions to set operators. > For Union, say we apply a non-deterministic filter to 1...5 union 1...5: > we may get 1,3 for the left side and 2,4 for the right side, so the > result should be 1,3,2,4. If we push down this filter, we get 1,3 for both > sides (a new random object with the same seed is created on each side) and the > result would be 1,3,1,3. > For Intersect, say there is a non-deterministic condition with a 0.5 > probability of accepting a row, and a row is present on both sides of > an Intersect. Once we push down this condition, the probability of accepting > this row drops to 0.25. > For Except, say a row is present on both sides of an > Except. This row should not be in the final output. However, if we push down a > non-deterministic condition, it is possible that the row is rejected from > one side, and then we output a row that should not be part of the result. > We should only push down deterministic projections to Union. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
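The Union scenario in the description can be mimicked with ordinary collections: re-seeding the random source on each side of the union makes both sides keep the same subset. A plain-Python sketch of the effect (the seed value is arbitrary and this is not Spark's implementation):

```python
import random

def nondet_filter(rows, rng):
    # A non-deterministic predicate: keep each row with probability 0.5.
    return [r for r in rows if rng.random() < 0.5]

side = [1, 2, 3, 4, 5]

# Filter applied above the union: one stream of random draws over all rows.
above = nondet_filter(side + side, random.Random(42))

# Filter pushed below the union: each side rebuilds the generator with
# the same seed, so both sides keep an identical subset and the union
# just duplicates it, as in the 1,3,1,3 example from the description.
left = nondet_filter(side, random.Random(42))
right = nondet_filter(side, random.Random(42))

print(left == right)  # True: the pushed-down plan duplicates one subset
print(above, left + right)
```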
[jira] [Commented] (SPARK-10740) handle nondeterministic expressions correctly for set operations
[ https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901676#comment-14901676 ] Apache Spark commented on SPARK-10740: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/8858 > handle nondeterministic expressions correctly for set operations > > > Key: SPARK-10740 > URL: https://issues.apache.org/jira/browse/SPARK-10740 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > > We should only push down deterministic filter condition to set operator. > For Union, let's say we do a non-deterministic filter on 1...5 union 1...5, > and we may get 1,3 for the left side and 2,4 for the right side, then the > result should be 1,3,2,4. If we push down this filter, we get 1,3 for both > side(we create a new random object with same seed in each side) and the > result would be 1,3,1,3. > For Intersect, let's say there is a non-deterministic condition with a 0.5 > possibility to accept a row and we have a row that presents in both sides of > an Intersect. Once we push down this condition, the possibility to accept > this row will be 0.25. > For Except, let's say there is a row that presents in both sides of an > Except. This row should not be in the final output. However, if we pushdown a > non-deterministic condition, it is possible that this row is rejected from > one side and then we output a row that should not be a part of the result. > We should only push down deterministic projection to Union. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10740) handle nondeterministic expressions correctly for set operations
[ https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10740: Assignee: Apache Spark > handle nondeterministic expressions correctly for set operations > > > Key: SPARK-10740 > URL: https://issues.apache.org/jira/browse/SPARK-10740 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > > We should only push down deterministic filter condition to set operator. > For Union, let's say we do a non-deterministic filter on 1...5 union 1...5, > and we may get 1,3 for the left side and 2,4 for the right side, then the > result should be 1,3,2,4. If we push down this filter, we get 1,3 for both > side(we create a new random object with same seed in each side) and the > result would be 1,3,1,3. > For Intersect, let's say there is a non-deterministic condition with a 0.5 > possibility to accept a row and we have a row that presents in both sides of > an Intersect. Once we push down this condition, the possibility to accept > this row will be 0.25. > For Except, let's say there is a row that presents in both sides of an > Except. This row should not be in the final output. However, if we pushdown a > non-deterministic condition, it is possible that this row is rejected from > one side and then we output a row that should not be a part of the result. > We should only push down deterministic projection to Union. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10740) handle nondeterministic expressions correctly for set operations
[ https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-10740: Description: We should only push down deterministic filter condition to set operator. For Union, let's say we do a non-deterministic filter on 1...5 union 1...5, and we may get 1,3 for the left side and 2,4 for the right side, then the result should be 1,3,2,4. If we push down this filter, we get 1,3 for both side(we create a new random object with same seed in each side) and the result would be 1,3,1,3. For Intersect, let's say there is a non-deterministic condition with a 0.5 possibility to accept a row and we have a row that presents in both sides of an Intersect. Once we push down this condition, the possibility to accept this row will be 0.25. For Except, let's say there is a row that presents in both sides of an Except. This row should not be in the final output. However, if we pushdown a non-deterministic condition, it is possible that this row is rejected from one side and then we output a row that should not be a part of the result. We should only push down deterministic projection to Union. was: We should only push down deterministic filter condition to set operator. For Union, let's say we do a non-deterministic filter on 1...5 union 1...5, and we may get 1,3 for the left side and 2,4 for the right side, then the result should be 1,3,2,4. If we push down this filter, we get 1,3 for both side and the result would be 1,3,1,3. For Intersect, let's say there is a non-deterministic condition with a 0.5 possibility to accept a row and we have a row that presents in both sides of an Intersect. Once we push down this condition, the possibility to accept this row will be 0.25. For Except, let's say there is a row that presents in both sides of an Except. This row should not be in the final output. 
However, if we pushdown a non-deterministic condition, it is possible that this row is rejected from one side and then we output a row that should not be a part of the result. We should only push down deterministic projection to Union. > handle nondeterministic expressions correctly for set operations > > > Key: SPARK-10740 > URL: https://issues.apache.org/jira/browse/SPARK-10740 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > > We should only push down deterministic filter condition to set operator. > For Union, let's say we do a non-deterministic filter on 1...5 union 1...5, > and we may get 1,3 for the left side and 2,4 for the right side, then the > result should be 1,3,2,4. If we push down this filter, we get 1,3 for both > side(we create a new random object with same seed in each side) and the > result would be 1,3,1,3. > For Intersect, let's say there is a non-deterministic condition with a 0.5 > possibility to accept a row and we have a row that presents in both sides of > an Intersect. Once we push down this condition, the possibility to accept > this row will be 0.25. > For Except, let's say there is a row that presents in both sides of an > Except. This row should not be in the final output. However, if we pushdown a > non-deterministic condition, it is possible that this row is rejected from > one side and then we output a row that should not be a part of the result. > We should only push down deterministic projection to Union. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10740) handle nondeterministic expressions correctly for set operations
[ https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-10740:
Description:
We should only push down deterministic filter conditions to set operators.
For Union, say we apply a non-deterministic filter to 1...5 union 1...5: we may keep 1,3 from the left side and 2,4 from the right side, so the result should be 1,3,2,4. If we push this filter down, both sides keep 1,3 and the result becomes 1,3,1,3.
For Intersect, say a non-deterministic condition accepts a row with probability 0.5, and a row is present on both sides of an Intersect. Once we push this condition down, the probability of accepting that row drops to 0.25.
For Except, say a row is present on both sides of an Except. This row should not be in the final output. However, if we push down a non-deterministic condition, it is possible that the row is rejected from one side only, and we then output a row that should not be part of the result.
We should likewise only push down deterministic projections to Union.

> handle nondeterministic expressions correctly for set operations
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Wenchen Fan
>
> We should only push down deterministic filter conditions to set operators.
> For Union, say we apply a non-deterministic filter to 1...5 union 1...5: we may keep 1,3 from the left side and 2,4 from the right side, so the result should be 1,3,2,4. If we push this filter down, both sides keep 1,3 and the result becomes 1,3,1,3.
> For Intersect, say a non-deterministic condition accepts a row with probability 0.5, and a row is present on both sides of an Intersect. Once we push this condition down, the probability of accepting that row drops to 0.25.
> For Except, say a row is present on both sides of an Except. This row should not be in the final output. However, if we push down a non-deterministic condition, it is possible that the row is rejected from one side only, and we then output a row that should not be part of the result.
> We should likewise only push down deterministic projections to Union.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
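The Union case above can be sketched outside Spark with plain Python: if the pushed-down filter re-creates its random source with the same seed on each side, both sides keep exactly the same subset, producing the duplicated 1,3,1,3-style output that a single post-union filter would not. The seed and data here are illustrative, not taken from Spark's optimizer.

```python
import random

DATA = [1, 2, 3, 4, 5]

def filter_after_union(seed):
    """Correct plan: one non-deterministic filter over the union's output."""
    rng = random.Random(seed)
    return [x for x in DATA + DATA if rng.random() < 0.5]

def filter_pushed_down(seed):
    """Buggy pushdown: each side re-creates the RNG with the same seed."""
    def side():
        rng = random.Random(seed)  # same seed on both sides -> same draws
        return [x for x in DATA if rng.random() < 0.5]
    return side() + side()

pushed = filter_pushed_down(42)
half = len(pushed) // 2
# The two halves are always identical: each side kept the same subset.
assert pushed[:half] == pushed[half:]
```

The same reasoning gives the Intersect numbers in the description: an independent 0.5-probability filter evaluated once per side accepts a shared row with probability 0.5 × 0.5 = 0.25 instead of 0.5.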
[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901656#comment-14901656 ] Apache Spark commented on SPARK-10739: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/8857

> Add attempt window for long running Spark application on Yarn
> -
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
> Issue Type: Improvement
> Components: YARN
> Reporter: Saisai Shao
> Priority: Minor
>
> Currently Spark on YARN uses max attempts to control the failure count: if an application's failure count reaches the max attempts, the application will not be recovered by the RM. This is not very effective for long-running applications, which can easily exceed the max over a long period, while setting a very large max attempts hides the real problem.
> So this introduces an attempt window to control how application attempts are counted: attempts that fall outside the window are ignored. The window was introduced in Hadoop 2.6+ to support long-running applications and is quite useful for Spark Streaming and Spark shell-like applications.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10739: Assignee: (was: Apache Spark)

> Add attempt window for long running Spark application on Yarn
> -
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
> Issue Type: Improvement
> Components: YARN
> Reporter: Saisai Shao
> Priority: Minor
>
> Currently Spark on YARN uses max attempts to control the failure count: if an application's failure count reaches the max attempts, the application will not be recovered by the RM. This is not very effective for long-running applications, which can easily exceed the max over a long period, while setting a very large max attempts hides the real problem.
> So this introduces an attempt window to control how application attempts are counted: attempts that fall outside the window are ignored. The window was introduced in Hadoop 2.6+ to support long-running applications and is quite useful for Spark Streaming and Spark shell-like applications.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10739: Assignee: Apache Spark

> Add attempt window for long running Spark application on Yarn
> -
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
> Issue Type: Improvement
> Components: YARN
> Reporter: Saisai Shao
> Assignee: Apache Spark
> Priority: Minor
>
> Currently Spark on YARN uses max attempts to control the failure count: if an application's failure count reaches the max attempts, the application will not be recovered by the RM. This is not very effective for long-running applications, which can easily exceed the max over a long period, while setting a very large max attempts hides the real problem.
> So this introduces an attempt window to control how application attempts are counted: attempts that fall outside the window are ignored. The window was introduced in Hadoop 2.6+ to support long-running applications and is quite useful for Spark Streaming and Spark shell-like applications.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10649) Streaming jobs unexpectedly inherit job group, job descriptions from context starting thread
[ https://issues.apache.org/jira/browse/SPARK-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901645#comment-14901645 ] Apache Spark commented on SPARK-10649: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/8856

> Streaming jobs unexpectedly inherit job group, job descriptions from context starting thread
> -
> Key: SPARK-10649
> URL: https://issues.apache.org/jira/browse/SPARK-10649
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.3.1, 1.4.1, 1.5.0
> Reporter: Tathagata Das
> Assignee: Tathagata Das
> Fix For: 1.6.0
>
> The job group and job description information is passed through thread-local properties and is inherited by child threads. In the case of Spark Streaming, the streaming jobs inherit these properties from the thread that called streamingContext.start(). This may not make sense.
> 1. Job group: this is mainly used for cancelling a group of jobs together. It does not make sense to cancel streaming jobs like this, as the effect would be unpredictable; it is not a valid use case anyway. To cancel a streaming context, call streamingContext.stop().
> 2. Job description: this is used to pass nice text descriptions for jobs to show up in the UI. The job description of the thread that calls streamingContext.start() is not useful for the streaming jobs, as it does not make sense for all of them to have the same description, and that description may or may not be related to streaming.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10740) handle nondeterministic expressions correctly for set operations
Wenchen Fan created SPARK-10740: --- Summary: handle nondeterministic expressions correctly for set operations Key: SPARK-10740 URL: https://issues.apache.org/jira/browse/SPARK-10740 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10649) Streaming jobs unexpectedly inherit job group, job descriptions from context starting thread
[ https://issues.apache.org/jira/browse/SPARK-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-10649. --- Resolution: Fixed Fix Version/s: 1.6.0

> Streaming jobs unexpectedly inherit job group, job descriptions from context starting thread
> -
> Key: SPARK-10649
> URL: https://issues.apache.org/jira/browse/SPARK-10649
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.3.1, 1.4.1, 1.5.0
> Reporter: Tathagata Das
> Assignee: Tathagata Das
> Fix For: 1.6.0
>
> The job group and job description information is passed through thread-local properties and is inherited by child threads. In the case of Spark Streaming, the streaming jobs inherit these properties from the thread that called streamingContext.start(). This may not make sense.
> 1. Job group: this is mainly used for cancelling a group of jobs together. It does not make sense to cancel streaming jobs like this, as the effect would be unpredictable; it is not a valid use case anyway. To cancel a streaming context, call streamingContext.stop().
> 2. Job description: this is used to pass nice text descriptions for jobs to show up in the UI. The job description of the thread that calls streamingContext.start() is not useful for the streaming jobs, as it does not make sense for all of them to have the same description, and that description may or may not be related to streaming.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10649) Streaming jobs unexpectedly inherit job group, job descriptions from context starting thread
[ https://issues.apache.org/jira/browse/SPARK-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10649: -- Target Version/s: 1.6.0, 1.5.1 (was: 1.6.0)

> Streaming jobs unexpectedly inherit job group, job descriptions from context starting thread
> -
> Key: SPARK-10649
> URL: https://issues.apache.org/jira/browse/SPARK-10649
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.3.1, 1.4.1, 1.5.0
> Reporter: Tathagata Das
> Assignee: Tathagata Das
> Fix For: 1.6.0
>
> The job group and job description information is passed through thread-local properties and is inherited by child threads. In the case of Spark Streaming, the streaming jobs inherit these properties from the thread that called streamingContext.start(). This may not make sense.
> 1. Job group: this is mainly used for cancelling a group of jobs together. It does not make sense to cancel streaming jobs like this, as the effect would be unpredictable; it is not a valid use case anyway. To cancel a streaming context, call streamingContext.stop().
> 2. Job description: this is used to pass nice text descriptions for jobs to show up in the UI. The job description of the thread that calls streamingContext.start() is not useful for the streaming jobs, as it does not make sense for all of them to have the same description, and that description may or may not be related to streaming.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
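The inheritance SPARK-10649 describes comes from the JVM: Spark's local properties are stored in an inheritable thread-local, so a thread created from the one that called streamingContext.start() sees the caller's job group and description. Spark's actual mechanism is Java's InheritableThreadLocal; the Python class below is only a simulation of that behavior, and all names in it are illustrative.

```python
import threading

class LocalProperties:
    """Parent-to-child property inheritance, like the JVM's InheritableThreadLocal."""
    def __init__(self):
        self._per_thread = {}

    def set(self, key, value):
        self._per_thread.setdefault(threading.get_ident(), {})[key] = value

    def get(self, key):
        return self._per_thread.get(threading.get_ident(), {}).get(key)

    def spawn(self, target):
        # The child snapshots the parent's properties at creation time --
        # this is the inheritance the streaming jobs pick up unexpectedly.
        parent = dict(self._per_thread.get(threading.get_ident(), {}))
        def run():
            self._per_thread[threading.get_ident()] = parent
            target()
        t = threading.Thread(target=run)
        t.start()
        t.join()

props = LocalProperties()
props.set("spark.jobGroup.id", "my-group")  # set by the "starting" thread
seen = {}
props.spawn(lambda: seen.update(group=props.get("spark.jobGroup.id")))
assert seen["group"] == "my-group"  # the child thread inherited the job group
```

The fix described in the thread amounts to the streaming scheduler resetting (rather than trusting) these inherited properties before running its own jobs.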
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901632#comment-14901632 ] Yin Huai commented on SPARK-10474: -- [~jameszhouyi] [~xjrk] What is your workloads? How large is the data (if the data is generated, can you share the script for data generation)? What is your query? And, did you set any spark conf? > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE 
NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small di
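The stack trace above fires while TungstenAggregation is switching from hash-based to sort-based aggregation and cannot acquire memory for the sorter. A toy sketch of that fallback shape (entirely illustrative, not Spark's implementation, which spills sorted runs to disk and merges them):

```python
def aggregate(rows, key_fn, max_hash_entries):
    """Toy hash aggregation that defers to a sort-based pass when the map is 'full'."""
    counts = {}
    overflow = []
    for row in rows:
        k = key_fn(row)
        if k in counts or len(counts) < max_hash_entries:
            counts[k] = counts.get(k, 0) + 1          # fits in the hash map
        else:
            overflow.append(k)                         # map "full": defer this key
    # Sort-based pass over the overflow (Spark instead spills and merges sorted runs)
    for k in sorted(overflow):
        counts[k] = counts.get(k, 0) + 1
    return counts

# Grouping 0..9 by x % 3 with room for only 2 hash entries still yields exact counts.
assert aggregate(range(10), lambda x: x % 3, 2) == {0: 4, 1: 3, 2: 3}
```

In the reported failure, the equivalent of the second phase could not even be set up because the pointer-array allocation failed, which is why the task dies at the switch-over rather than during aggregation itself.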
[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly
[ https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901609#comment-14901609 ] Jon Buffington commented on SPARK-5569: --- Cody, We were able to work around the issue by destructuring the OffsetRange type into its parts (i.e., string, long, ints). Below is our flow for reference. Our only concern is whether the transform operation runs at the driver, as we use a singleton on the driver to persist the last captured offsets. Can you confirm that the stream transform runs at the driver?
{noformat}
// Recover or create the stream.
val ssc = StreamingContext.getOrCreate(checkpointPath, () => {
  createStreamingContext(checkpointPath)
})
...
def createStreamingContext(checkpointPath: String): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf(), Minutes(1))
  ssc.checkpoint(checkpointPath)
  createStream(ssc)
  ssc
}
...
def createStream(ssc: StreamingContext): Unit = {
  ...
  KafkaUtils.createDirectStream[K, V, KD, VD, R](ssc, kafkaParams, fromOffsets, messageHandler)
    // "Note that the typecast to HasOffsetRanges will only succeed if it is
    // done in the first method called on the directKafkaStream, not later down
    // a chain of methods. You can use transform() instead of foreachRDD() as
    // your first method call in order to access offsets, then call further
    // Spark methods. However, be aware that the one-to-one mapping between RDD
    // partition and Kafka partition does not remain after any methods that
    // shuffle or repartition, e.g. reduceByKey() or window()."
    .transform { rdd =>
      ...
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // De-structure the OffsetRange type into its parts and save in a singleton.
      Ranger.captureOffsetRanges(offsetRanges)
      ...
      rdd
    }
  ...
} {noformat} > Checkpoints cannot reference classes defined outside of Spark's assembly > > > Key: SPARK-5569 > URL: https://issues.apache.org/jira/browse/SPARK-5569 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Patrick Wendell > > Not sure if this is a bug or a feature, but it's not obvious, so wanted to > create a JIRA to make sure we document this behavior. > First documented by Cody Koeninger: > https://gist.github.com/koeninger/561a61482cd1b5b3600c > {code} > 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from > file file:/var/tmp/cp/checkpoint-142110041.bk > 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file > file:/var/tmp/cp/checkpoint-142110041.bk > java.io.IOException: java.lang.ClassNotFoundException: > org.apache.spark.rdd.kafka.KafkaRDDPartition > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at 
java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180) > at org.apache.spark.util.Utils$.tryOrIOExcepti
[jira] [Updated] (SPARK-10738) Refactoring `Instance` out from LOR and LIR, and also cleaning up some code
[ https://issues.apache.org/jira/browse/SPARK-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10738: -- Shepherd: Xiangrui Meng > Refactoring `Instance` out from LOR and LIR, and also cleaning up some code > --- > > Key: SPARK-10738 > URL: https://issues.apache.org/jira/browse/SPARK-10738 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.5.0 >Reporter: DB Tsai >Assignee: DB Tsai > Fix For: 1.6.0 > > > Refactoring `Instance` case class out from LOR and LIR, and also cleaning up > some code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-10739:
Description:
Currently Spark on YARN uses max attempts to control the failure count: if an application's failure count reaches the max attempts, the application will not be recovered by the RM. This is not very effective for long-running applications, which can easily exceed the max over a long period, while setting a very large max attempts hides the real problem.
So this introduces an attempt window to control how application attempts are counted: attempts that fall outside the window are ignored. The window was introduced in Hadoop 2.6+ to support long-running applications and is quite useful for Spark Streaming and Spark shell-like applications.
was:
Currently Spark on YARN uses max attempts to control the failure count: if an application's failure count reaches the max attempts, the application will not be recovered by the RM. This is not workable for long-running applications, since they will easily exceed the max; also, setting a very large max attempts hides the real problem.
So this introduces an attempt window to control how application attempts are counted: attempts that fall outside the window are ignored. The window was introduced in Hadoop 2.6+ to support long-running applications and is quite useful for Spark Streaming and Spark shell-like applications.

> Add attempt window for long running Spark application on Yarn
> -
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
> Issue Type: Improvement
> Components: YARN
> Reporter: Saisai Shao
> Priority: Minor
>
> Currently Spark on YARN uses max attempts to control the failure count: if an application's failure count reaches the max attempts, the application will not be recovered by the RM. This is not very effective for long-running applications, which can easily exceed the max over a long period, while setting a very large max attempts hides the real problem.
> So this introduces an attempt window to control how application attempts are counted: attempts that fall outside the window are ignored. The window was introduced in Hadoop 2.6+ to support long-running applications and is quite useful for Spark Streaming and Spark shell-like applications.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
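As a concrete illustration of how such a window would be used, a long-running Streaming submission might pair a modest attempt limit with a validity window so that old failures age out. The exact property name is taken from the associated pull request rather than from this thread, so treat it as an assumption; the application jar is a placeholder.

```
# Illustrative spark-submit for a long-running Streaming app on YARN
# (property name assumed from the linked PR, not stated in this thread):
spark-submit \
  --master yarn-cluster \
  --conf spark.yarn.maxAppAttempts=4 \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  my-streaming-app.jar
```

With this shape of configuration, only failures within the last hour count toward the four-attempt limit, so an application that fails once a day is never permanently given up on.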
[jira] [Created] (SPARK-10739) Add attempt window for long running Spark application on Yarn
Saisai Shao created SPARK-10739: --- Summary: Add attempt window for long running Spark application on Yarn Key: SPARK-10739 URL: https://issues.apache.org/jira/browse/SPARK-10739 Project: Spark Issue Type: Improvement Components: YARN Reporter: Saisai Shao Priority: Minor
Currently Spark on YARN uses max attempts to control the failure count: if an application's failure count reaches the max attempts, the application will not be recovered by the RM. This is not workable for long-running applications, since they will easily exceed the max; also, setting a very large max attempts hides the real problem. So this introduces an attempt window to control how application attempts are counted: attempts that fall outside the window are ignored. The window was introduced in Hadoop 2.6+ to support long-running applications and is quite useful for Spark Streaming and Spark shell-like applications.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901548#comment-14901548 ] Yin Huai commented on SPARK-10304: -- I dropped 1.5.1 as the target version since this issue does not prevent users from creating a df from a dataset with a valid dir structure (the issue here is that users can create a df from a dir with an invalid structure). > Partition discovery does not throw an exception if the dir structure is > invalid > --- > > Key: SPARK-10304 > URL: https://issues.apache.org/jira/browse/SPARK-10304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Zhan Zhang > > I have a dir structure like {{/path/table1/partition_column=1/}}. When I try > to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if > it is stored as ORC, there will be the following NPE. But, if it is Parquet, > we can even return rows. We should complain to users about the directory structure > because {{table1}} does not match our format. 
> {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in > stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 > (TID 3504, 10.0.195.227): java.lang.NullPointerException > at > org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466) > at > org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316) > at > org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
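A concrete sketch of the layouts involved (paths are illustrative, following the example in the issue): partition discovery expects every directory level under the load path to follow the column=value convention, so loading one level too high leaves a plain directory name where a partition directory is expected.

```
# Valid: load("/path/table1/") -- each level below the load path is column=value
/path/table1/partition_column=1/part-00000

# Invalid: load("/path/") -- "table1" is not of the form column=value, so
# discovery should fail loudly instead of returning rows (Parquet) or an NPE (ORC)
/path/table1/partition_column=1/part-00000
```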
[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10304: - Target Version/s: 1.6.0 (was: 1.6.0, 1.5.1) > Partition discovery does not throw an exception if the dir structure is > invalid > --- > > Key: SPARK-10304 > URL: https://issues.apache.org/jira/browse/SPARK-10304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Zhan Zhang > > I have a dir structure like {{/path/table1/partition_column=1/}}. When I try > to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if > it is stored as ORC, there will be the following NPE. But, if it is Parquet, > we even can return rows. We should complain to users about the dir struct > because {{table1}} does not meet our format. > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in > stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 > (TID 3504, 10.0.195.227): java.lang.NullPointerException > at > org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466) > at > org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316) > at > org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
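The failure above stems from partition discovery expecting every directory level under the load path to follow the Hive-style {{column=value}} convention; a plain directory such as {{table1}} breaks that assumption, and the issue asks that this be reported instead of producing an NPE. A simplified Python sketch of that validation (not Spark's actual implementation; names are illustrative):

```python
import re

# Hive-style partition discovery: every directory segment between the
# base path and the data files must look like "column=value". A segment
# such as "table1" does not, so discovery should fail fast with a clear
# error instead of silently misreading the layout.
_PARTITION_RE = re.compile(r"^([^=/]+)=([^/]*)$")

def discover_partitions(relative_dir):
    """Parse 'a=1/b=2' into [('a', '1'), ('b', '2')].

    Raises ValueError on any segment that is not of the form column=value.
    """
    pairs = []
    for segment in filter(None, relative_dir.split("/")):
        m = _PARTITION_RE.match(segment)
        if m is None:
            raise ValueError(
                f"invalid partition directory {segment!r}: "
                "expected the form column=value")
        pairs.append((m.group(1), m.group(2)))
    return pairs
```

Under this sketch, {{discover_partitions("partition_column=1")}} succeeds, while {{discover_partitions("table1/partition_column=1")}} raises a descriptive error, which is the behavior the issue requests.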
[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10304: - Priority: Major (was: Critical) > Partition discovery does not throw an exception if the dir structure is > invalid > --- > > Key: SPARK-10304 > URL: https://issues.apache.org/jira/browse/SPARK-10304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Zhan Zhang > > I have a dir structure like {{/path/table1/partition_column=1/}}. When I try > to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if > it is stored as ORC, there will be the following NPE. But, if it is Parquet, > we even can return rows. We should complain to users about the dir struct > because {{table1}} does not meet our format. > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in > stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 > (TID 3504, 10.0.195.227): java.lang.NullPointerException > at > org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466) > at > org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316) > at > org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10718) Update License on conf files and corresponding excludes file
[ https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rekha Joshi updated SPARK-10718: Description: Update License on conf files and corresponding excludes file update. {code} Apache license header missing from multiple script and required files Could not find Apache license headers in the following files: !? <>spark/conf/spark-defaults.conf [error] running <>spark/dev/check-license ; received return code 1 {code} was: Check License should not verify conf files for license {code} Apache license header missing from multiple script and required files Could not find Apache license headers in the following files: !? <>spark/conf/spark-defaults.conf [error] running <>spark/dev/check-license ; received return code 1 {code} > Update License on conf files and corresponding excludes file > > > Key: SPARK-10718 > URL: https://issues.apache.org/jira/browse/SPARK-10718 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.0 >Reporter: Rekha Joshi >Priority: Minor > > Update License on conf files and corresponding excludes file update. > {code} > Apache license header missing from multiple script and required files > Could not find Apache license headers in the following files: > !? <>spark/conf/spark-defaults.conf > [error] running <>spark/dev/check-license ; received return code 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10718) Update License on conf files and corresponding excludes file
[ https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rekha Joshi updated SPARK-10718: Summary: Update License on conf files and corresponding excludes file (was: Check License should not verify conf files for license) > Update License on conf files and corresponding excludes file > > > Key: SPARK-10718 > URL: https://issues.apache.org/jira/browse/SPARK-10718 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.0 >Reporter: Rekha Joshi >Priority: Minor > > Check License should not verify conf files for license > {code} > Apache license header missing from multiple script and required files > Could not find Apache license headers in the following files: > !? <>spark/conf/spark-defaults.conf > [error] running <>spark/dev/check-license ; received return code 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results
[ https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10737: Assignee: Yin Huai (was: Apache Spark) > When using UnsafeRows, SortMergeJoin may return wrong results > - > > Key: SPARK-10737 > URL: https://issues.apache.org/jira/browse/SPARK-10737 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > {code} > val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j") > val df2 = > df1 > .join(df1.select(df1("i")), "i") > .select(df1("i"), df1("j")) > val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1") > val df4 = > df2 > .join(df3, df2("i") === df3("i1")) > .withColumn("diff", $"j" - $"j1") > df4.show(100, false) > +--+---+--+---++ > |i |j |i1|j1 |diff| > +--+---+--+---++ > |str_2 |2 |str_2 |2 |0 | > |str_7 |7 |str_2 |2 |5 | > |str_10|10 |str_10|10 |0 | > |str_3 |3 |str_3 |3 |0 | > |str_8 |8 |str_3 |3 |5 | > |str_4 |4 |str_4 |4 |0 | > |str_9 |9 |str_4 |4 |5 | > |str_5 |5 |str_5 |5 |0 | > |str_1 |1 |str_1 |1 |0 | > |str_6 |6 |str_1 |1 |5 | > +--+---+--+---++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results
[ https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901499#comment-14901499 ] Apache Spark commented on SPARK-10737: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/8854 > When using UnsafeRows, SortMergeJoin may return wrong results > - > > Key: SPARK-10737 > URL: https://issues.apache.org/jira/browse/SPARK-10737 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > {code} > val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j") > val df2 = > df1 > .join(df1.select(df1("i")), "i") > .select(df1("i"), df1("j")) > val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1") > val df4 = > df2 > .join(df3, df2("i") === df3("i1")) > .withColumn("diff", $"j" - $"j1") > df4.show(100, false) > +--+---+--+---++ > |i |j |i1|j1 |diff| > +--+---+--+---++ > |str_2 |2 |str_2 |2 |0 | > |str_7 |7 |str_2 |2 |5 | > |str_10|10 |str_10|10 |0 | > |str_3 |3 |str_3 |3 |0 | > |str_8 |8 |str_3 |3 |5 | > |str_4 |4 |str_4 |4 |0 | > |str_9 |9 |str_4 |4 |5 | > |str_5 |5 |str_5 |5 |0 | > |str_1 |1 |str_1 |1 |0 | > |str_6 |6 |str_1 |1 |5 | > +--+---+--+---++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results
[ https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10737: Assignee: Apache Spark (was: Yin Huai) > When using UnsafeRows, SortMergeJoin may return wrong results > - > > Key: SPARK-10737 > URL: https://issues.apache.org/jira/browse/SPARK-10737 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Blocker > > {code} > val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j") > val df2 = > df1 > .join(df1.select(df1("i")), "i") > .select(df1("i"), df1("j")) > val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1") > val df4 = > df2 > .join(df3, df2("i") === df3("i1")) > .withColumn("diff", $"j" - $"j1") > df4.show(100, false) > +--+---+--+---++ > |i |j |i1|j1 |diff| > +--+---+--+---++ > |str_2 |2 |str_2 |2 |0 | > |str_7 |7 |str_2 |2 |5 | > |str_10|10 |str_10|10 |0 | > |str_3 |3 |str_3 |3 |0 | > |str_8 |8 |str_3 |3 |5 | > |str_4 |4 |str_4 |4 |0 | > |str_9 |9 |str_4 |4 |5 | > |str_5 |5 |str_5 |5 |0 | > |str_1 |1 |str_1 |1 |0 | > |str_6 |6 |str_1 |1 |5 | > +--+---+--+---++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results
[ https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10737: - Summary: When using UnsafeRows, SortMergeJoin may return wrong results (was: When using UnsafeRow, SortMergeJoin may return wrong results) > When using UnsafeRows, SortMergeJoin may return wrong results > - > > Key: SPARK-10737 > URL: https://issues.apache.org/jira/browse/SPARK-10737 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > {code} > val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j") > val df2 = > df1 > .join(df1.select(df1("i")), "i") > .select(df1("i"), df1("j")) > val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1") > val df4 = > df2 > .join(df3, df2("i") === df3("i1")) > .withColumn("diff", $"j" - $"j1") > df4.show(100, false) > +--+---+--+---++ > |i |j |i1|j1 |diff| > +--+---+--+---++ > |str_2 |2 |str_2 |2 |0 | > |str_7 |7 |str_2 |2 |5 | > |str_10|10 |str_10|10 |0 | > |str_3 |3 |str_3 |3 |0 | > |str_8 |8 |str_3 |3 |5 | > |str_4 |4 |str_4 |4 |0 | > |str_9 |9 |str_4 |4 |5 | > |str_5 |5 |str_5 |5 |0 | > |str_1 |1 |str_1 |1 |0 | > |str_6 |6 |str_1 |1 |5 | > +--+---+--+---++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
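The wrong-results pattern in the table above (duplicate keys joined against stale values) is characteristic of buffering references to a mutable row object that the producing iterator reuses in place, as an UnsafeRow-backed iterator does. A toy Python illustration of the hazard (not Spark's code; names are illustrative) showing why a join must copy rows before buffering them:

```python
class ReusingIterator:
    """Mimics an iterator that yields the SAME mutable list object for
    every row, overwriting it in place (as a reused row buffer would)."""
    def __init__(self, rows):
        self._rows = rows
        self._buf = [None, None]

    def __iter__(self):
        for key, value in self._rows:
            self._buf[0], self._buf[1] = key, value
            yield self._buf  # same object every time

def buffer_rows(it, copy):
    """Buffer rows for a hypothetical merge join. With copy=False every
    buffered entry aliases the shared buffer and ends up holding only
    the last row written; copy=True preserves each row."""
    out = []
    for row in it:
        out.append(list(row) if copy else row)
    return out

rows = [("str_1", 1), ("str_2", 2)]
buggy = buffer_rows(ReusingIterator(rows), copy=False)  # all entries alias the last row
fixed = buffer_rows(ReusingIterator(rows), copy=True)   # each row preserved
```

In the buggy variant both buffered entries end up equal to the last row produced, which is the same class of corruption as the mismatched `j`/`j1` pairs shown in the issue's output.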
[jira] [Assigned] (SPARK-10738) Refactoring `Instance` out from LOR and LIR, and also cleaning up some code
[ https://issues.apache.org/jira/browse/SPARK-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10738: Assignee: DB Tsai (was: Apache Spark) > Refactoring `Instance` out from LOR and LIR, and also cleaning up some code > --- > > Key: SPARK-10738 > URL: https://issues.apache.org/jira/browse/SPARK-10738 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.5.0 >Reporter: DB Tsai >Assignee: DB Tsai > Fix For: 1.6.0 > > > Refactoring `Instance` case class out from LOR and LIR, and also cleaning up > some code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10738) Refactoring `Instance` out from LOR and LIR, and also cleaning up some code
[ https://issues.apache.org/jira/browse/SPARK-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10738: Assignee: Apache Spark (was: DB Tsai) > Refactoring `Instance` out from LOR and LIR, and also cleaning up some code > --- > > Key: SPARK-10738 > URL: https://issues.apache.org/jira/browse/SPARK-10738 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.5.0 >Reporter: DB Tsai >Assignee: Apache Spark > Fix For: 1.6.0 > > > Refactoring `Instance` case class out from LOR and LIR, and also cleaning up > some code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10738) Refactoring `Instance` out from LOR and LIR, and also cleaning up some code
[ https://issues.apache.org/jira/browse/SPARK-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901479#comment-14901479 ] Apache Spark commented on SPARK-10738: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/8853 > Refactoring `Instance` out from LOR and LIR, and also cleaning up some code > --- > > Key: SPARK-10738 > URL: https://issues.apache.org/jira/browse/SPARK-10738 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.5.0 >Reporter: DB Tsai >Assignee: DB Tsai > Fix For: 1.6.0 > > > Refactoring `Instance` case class out from LOR and LIR, and also cleaning up > some code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10737) When using UnsafeRow, SortMergeJoin may return wrong results
Yin Huai created SPARK-10737: Summary: When using UnsafeRow, SortMergeJoin may return wrong results Key: SPARK-10737 URL: https://issues.apache.org/jira/browse/SPARK-10737 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker {code} val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j") val df2 = df1 .join(df1.select(df1("i")), "i") .select(df1("i"), df1("j")) val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1") val df4 = df2 .join(df3, df2("i") === df3("i1")) .withColumn("diff", $"j" - $"j1") df4.show(100, false) +--+---+--+---++ |i |j |i1|j1 |diff| +--+---+--+---++ |str_2 |2 |str_2 |2 |0 | |str_7 |7 |str_2 |2 |5 | |str_10|10 |str_10|10 |0 | |str_3 |3 |str_3 |3 |0 | |str_8 |8 |str_3 |3 |5 | |str_4 |4 |str_4 |4 |0 | |str_9 |9 |str_4 |4 |5 | |str_5 |5 |str_5 |5 |0 | |str_1 |1 |str_1 |1 |0 | |str_6 |6 |str_1 |1 |5 | +--+---+--+---++ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10738) Refactoring `Instance` out from LOR and LIR, and also cleaning up some code
DB Tsai created SPARK-10738: --- Summary: Refactoring `Instance` out from LOR and LIR, and also cleaning up some code Key: SPARK-10738 URL: https://issues.apache.org/jira/browse/SPARK-10738 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: DB Tsai Assignee: DB Tsai Fix For: 1.6.0 Refactoring `Instance` case class out from LOR and LIR, and also cleaning up some code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10662) Code snippets are not properly formatted in tables
[ https://issues.apache.org/jira/browse/SPARK-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10662: -- Assignee: Jacek Laskowski (was: Jacek Lewandowski) Oops, auto-complete error there > Code snippets are not properly formatted in tables > -- > > Key: SPARK-10662 > URL: https://issues.apache.org/jira/browse/SPARK-10662 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.5.0 >Reporter: Jacek Laskowski >Assignee: Jacek Laskowski >Priority: Trivial > Fix For: 1.6.0 > > Attachments: spark-docs-backticks-tables.png > > > Backticks (markdown) in tables are not processed and hence not formatted > properly. See > http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html > and search for {{`yarn-client`}}. > As per [Sean's > suggestion|https://github.com/apache/spark/pull/8795#issuecomment-141019047] > I'm creating the JIRA task. > {quote} > This is a good fix, but this is another instance where I suspect the same > issue exists in several markup files, like configuration.html. It's worth a > JIRA since I think catching and fixing all of these is one non-trivial > logical change. > If you can, avoid whitespace changes like stripping or adding space at the > end of lines. It just adds to the diff and makes for a tiny extra chance of > merge conflicts. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10662) Code snippets are not properly formatted in tables
[ https://issues.apache.org/jira/browse/SPARK-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901460#comment-14901460 ] Jacek Laskowski edited comment on SPARK-10662 at 9/21/15 9:39 PM: -- It's [~jlaskowski] not [~jlewandowski] in Assignee. We've got very similar names, but he's much better at Spark, Cassandra, and all the things (a DataStax'er, after all) . Mind fixing it, [~sowen]? Thanks! Many thanks for merging the changes to master! was (Author: jlaskowski): It's [~jlaskowski] not [~jlewandowski] in Assignee. We've got very similar names, but he's much better at Spark, Cassandra, and all the things (a DataStax'er, after all) . Mind fixing it, [~sowen]? Thanks! > Code snippets are not properly formatted in tables > -- > > Key: SPARK-10662 > URL: https://issues.apache.org/jira/browse/SPARK-10662 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.5.0 >Reporter: Jacek Laskowski >Assignee: Jacek Lewandowski >Priority: Trivial > Fix For: 1.6.0 > > Attachments: spark-docs-backticks-tables.png > > > Backticks (markdown) in tables are not processed and hence not formatted > properly. See > http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html > and search for {{`yarn-client`}}. > As per [Sean's > suggestion|https://github.com/apache/spark/pull/8795#issuecomment-141019047] > I'm creating the JIRA task. > {quote} > This is a good fix, but this is another instance where I suspect the same > issue exists in several markup files, like configuration.html. It's worth a > JIRA since I think catching and fixing all of these is one non-trivial > logical change. > If you can, avoid whitespace changes like stripping or adding space at the > end of lines. It just adds to the diff and makes for a tiny extra chance of > merge conflicts. 
> {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10662) Code snippets are not properly formatted in tables
[ https://issues.apache.org/jira/browse/SPARK-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901460#comment-14901460 ] Jacek Laskowski commented on SPARK-10662: - It's [~jlaskowski] not [~jlewandowski] in Assignee. We've got very similar names, but he's much better at Spark, Cassandra, and all the things (a DataStax'er, after all) . Mind fixing it, [~sowen]? Thanks! > Code snippets are not properly formatted in tables > -- > > Key: SPARK-10662 > URL: https://issues.apache.org/jira/browse/SPARK-10662 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.5.0 >Reporter: Jacek Laskowski >Assignee: Jacek Lewandowski >Priority: Trivial > Fix For: 1.6.0 > > Attachments: spark-docs-backticks-tables.png > > > Backticks (markdown) in tables are not processed and hence not formatted > properly. See > http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html > and search for {{`yarn-client`}}. > As per [Sean's > suggestion|https://github.com/apache/spark/pull/8795#issuecomment-141019047] > I'm creating the JIRA task. > {quote} > This is a good fix, but this is another instance where I suspect the same > issue exists in several markup files, like configuration.html. It's worth a > JIRA since I think catching and fixing all of these is one non-trivial > logical change. > If you can, avoid whitespace changes like stripping or adding space at the > end of lines. It just adds to the diff and makes for a tiny extra chance of > merge conflicts. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901440#comment-14901440 ] Andrew Or commented on SPARK-10474: --- [~Yi Zhou] [~jrk] did you set `spark.task.cpus` by any chance? > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) 
AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small dimension table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) -
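The "Could not acquire 65536 bytes of memory" failure above means a task's share of the shuffle memory pool was exhausted, and that share depends on how many tasks run concurrently per executor, which is why `spark.task.cpus` is asked about. A back-of-the-envelope Python sketch of the 1.5-era policy that each active task is guaranteed at least 1/(2N) of the pool (the function and its arguments are illustrative assumptions, not Spark's API):

```python
def min_guaranteed_bytes(pool_bytes, executor_cores, task_cpus=1):
    """Rough floor on the shuffle memory one task can claim.

    Assumes the Spark 1.5 ShuffleMemoryManager policy: with N tasks
    active concurrently, each is guaranteed at least pool / (2 * N)
    bytes. N is approximated as executor cores / spark.task.cpus.
    """
    n_concurrent_tasks = max(1, executor_cores // task_cpus)
    return pool_bytes // (2 * n_concurrent_tasks)

# With a 256 MB pool and 8 cores, each task's floor is 16 MB; raising
# spark.task.cpus to 4 cuts concurrency to 2 tasks and raises the floor
# to 64 MB, which changes whether a 65536-byte request can fail.
floor_default = min_guaranteed_bytes(256 * 1024 * 1024, 8)
floor_task_cpus_4 = min_guaranteed_bytes(256 * 1024 * 1024, 8, task_cpus=4)
```

This is only arithmetic intuition for the question in the comment, not a statement of what the reporter's configuration actually was.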
[jira] [Comment Edited] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901440#comment-14901440 ] Andrew Or edited comment on SPARK-10474 at 9/21/15 9:23 PM: [~jameszhouyi] [~jrk] did you set `spark.task.cpus` by any chance? was (Author: andrewor14): [~Yi Zhou] [~jrk] did you set `spark.task.cpus` by any chance? > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE 
NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is
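As context for the `spark.task.cpus` question in the comment above: the number of concurrently running tasks per executor is roughly cores / `spark.task.cpus`, so raising it changes how large a slice of shuffle memory each task may try to acquire. A back-of-envelope sketch follows; this is an illustrative model with assumed names and formula, not Spark's actual memory-manager code:

```python
def shuffle_bytes_per_task(executor_mem_bytes, memory_fraction, cores, task_cpus):
    """Rough per-task shuffle memory estimate (illustrative, not Spark internals).

    executor_mem_bytes * memory_fraction is split among the tasks that can run
    at once; with task_cpus > 1 fewer tasks run concurrently, so each gets more.
    """
    concurrent_tasks = max(1, cores // task_cpus)
    return int(executor_mem_bytes * memory_fraction) // concurrent_tasks
```

For example, under this model an executor with 4 cores and `spark.task.cpus=2` gives each task twice the shuffle memory it would get with `spark.task.cpus=1`, which is why the assignee is asking whether that setting was changed.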
[jira] [Updated] (SPARK-10704) Rename HashShufflereader to BlockStoreShuffleReader
[ https://issues.apache.org/jira/browse/SPARK-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-10704: --- Description: The current shuffle code has an interface named ShuffleReader with only one implementation, HashShuffleReader. This naming is confusing, since the same read path code is used for both sort- and hash-based shuffle. -We should consolidate these classes.- We should rename HashShuffleReader. --In addition, there are aspects of ShuffleManager.getReader()'s API which don't make a lot of sense: it exposes the ability to request a contiguous range of shuffle partitions, but this feature isn't supported by any ShuffleReader implementations and isn't used anywhere in the existing code. We should clean this up, too.-- was: The current shuffle code has an interface named ShuffleReader with only one implementation, HashShuffleReader. This naming is confusing, since the same read path code is used for both sort- and hash-based shuffle. -We should consolidate these classes.- In addition, there are aspects of ShuffleManager.getReader()'s API which don't make a lot of sense: it exposes the ability to request a contiguous range of shuffle partitions, but this feature isn't supported by any ShuffleReader implementations and isn't used anywhere in the existing code. We should clean this up, too. > Rename HashShufflereader to BlockStoreShuffleReader > --- > > Key: SPARK-10704 > URL: https://issues.apache.org/jira/browse/SPARK-10704 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Reporter: Josh Rosen >Assignee: Josh Rosen > > The current shuffle code has an interface named ShuffleReader with only one > implementation, HashShuffleReader. This naming is confusing, since the same > read path code is used for both sort- and hash-based shuffle. -We should > consolidate these classes.- We should rename HashShuffleReader. 
> --In addition, there are aspects of ShuffleManager.getReader()'s API which > don't make a lot of sense: it exposes the ability to request a contiguous > range of shuffle partitions, but this feature isn't supported by any > ShuffleReader implementations and isn't used anywhere in the existing code. > We should clean this up, too.--
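The getReader() range API criticized in the description can be modeled as follows. This is a minimal Python sketch of the shape of the API, not Spark's actual Scala code; the class and function names here are illustrative:

```python
class BlockStoreShuffleReader:
    """Illustrative stand-in for the single ShuffleReader implementation."""

    def __init__(self, partition):
        self.partition = partition

    def read(self):
        # Yields the records of exactly one shuffle partition.
        return iter(())


def get_reader(start_partition, end_partition):
    # The API accepts a contiguous range [start_partition, end_partition),
    # but no implementation supports multi-partition reads and no caller
    # ever requests one -- which is the clean-up opportunity noted above.
    assert end_partition == start_partition + 1, "range reads are unused/unsupported"
    return BlockStoreShuffleReader(start_partition)
```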
[jira] [Updated] (SPARK-10704) Rename HashShufflereader to BlockStoreShuffleReader
[ https://issues.apache.org/jira/browse/SPARK-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-10704: --- Description: The current shuffle code has an interface named ShuffleReader with only one implementation, HashShuffleReader. This naming is confusing, since the same read path code is used for both sort- and hash-based shuffle. ~~We should consolidate these classes.~~ In addition, there are aspects of ShuffleManager.getReader()'s API which don't make a lot of sense: it exposes the ability to request a contiguous range of shuffle partitions, but this feature isn't supported by any ShuffleReader implementations and isn't used anywhere in the existing code. We should clean this up, too. was: The current shuffle code has an interface named ShuffleReader with only one implementation, HashShuffleReader. This naming is confusing, since the same read path code is used for both sort- and hash-based shuffle. We should consolidate these classes. In addition, there are aspects of ShuffleManager.getReader()'s API which don't make a lot of sense: it exposes the ability to request a contiguous range of shuffle partitions, but this feature isn't supported by any ShuffleReader implementations and isn't used anywhere in the existing code. We should clean this up, too. > Rename HashShufflereader to BlockStoreShuffleReader > --- > > Key: SPARK-10704 > URL: https://issues.apache.org/jira/browse/SPARK-10704 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Reporter: Josh Rosen >Assignee: Josh Rosen > > The current shuffle code has an interface named ShuffleReader with only one > implementation, HashShuffleReader. This naming is confusing, since the same > read path code is used for both sort- and hash-based shuffle. 
~~We should > consolidate these classes.~~ > In addition, there are aspects of ShuffleManager.getReader()'s API which > don't make a lot of sense: it exposes the ability to request a contiguous > range of shuffle partitions, but this feature isn't supported by any > ShuffleReader implementations and isn't used anywhere in the existing code. > We should clean this up, too.
[jira] [Updated] (SPARK-10704) Rename HashShufflereader to BlockStoreShuffleReader
[ https://issues.apache.org/jira/browse/SPARK-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-10704: --- Description: The current shuffle code has an interface named ShuffleReader with only one implementation, HashShuffleReader. This naming is confusing, since the same read path code is used for both sort- and hash-based shuffle. -We should consolidate these classes.- In addition, there are aspects of ShuffleManager.getReader()'s API which don't make a lot of sense: it exposes the ability to request a contiguous range of shuffle partitions, but this feature isn't supported by any ShuffleReader implementations and isn't used anywhere in the existing code. We should clean this up, too. was: The current shuffle code has an interface named ShuffleReader with only one implementation, HashShuffleReader. This naming is confusing, since the same read path code is used for both sort- and hash-based shuffle. ~~We should consolidate these classes.~~ In addition, there are aspects of ShuffleManager.getReader()'s API which don't make a lot of sense: it exposes the ability to request a contiguous range of shuffle partitions, but this feature isn't supported by any ShuffleReader implementations and isn't used anywhere in the existing code. We should clean this up, too. > Rename HashShufflereader to BlockStoreShuffleReader > --- > > Key: SPARK-10704 > URL: https://issues.apache.org/jira/browse/SPARK-10704 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Reporter: Josh Rosen >Assignee: Josh Rosen > > The current shuffle code has an interface named ShuffleReader with only one > implementation, HashShuffleReader. This naming is confusing, since the same > read path code is used for both sort- and hash-based shuffle. 
-We should > consolidate these classes.- > In addition, there are aspects of ShuffleManager.getReader()'s API which > don't make a lot of sense: it exposes the ability to request a contiguous > range of shuffle partitions, but this feature isn't supported by any > ShuffleReader implementations and isn't used anywhere in the existing code. > We should clean this up, too.
[jira] [Updated] (SPARK-10704) Rename HashShufflereader to BlockStoreShuffleReader
[ https://issues.apache.org/jira/browse/SPARK-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-10704: --- Description: The current shuffle code has an interface named ShuffleReader with only one implementation, HashShuffleReader. This naming is confusing, since the same read path code is used for both sort- and hash-based shuffle. We should consolidate these classes. In addition, there are aspects of ShuffleManager.getReader()'s API which don't make a lot of sense: it exposes the ability to request a contiguous range of shuffle partitions, but this feature isn't supported by any ShuffleReader implementations and isn't used anywhere in the existing code. We should clean this up, too. was: The current shuffle code has an interface named ShuffleReader with only one implementation, HashShuffleReader. This naming is confusing, since the same read path code is used for both sort- and hash-based shuffle. We should consolidate these classes. In addition, there are aspects of ShuffleManager.getReader()'s API which don't make a lot of sense: it exposes the ability to request a contiguous range of shuffle partitions, but this feature isn't supported by any ShuffleReader implementations and isn't used anywhere in the existing code. We should clean this up, too. > Rename HashShufflereader to BlockStoreShuffleReader > --- > > Key: SPARK-10704 > URL: https://issues.apache.org/jira/browse/SPARK-10704 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Reporter: Josh Rosen >Assignee: Josh Rosen > > The current shuffle code has an interface named ShuffleReader with only one > implementation, HashShuffleReader. This naming is confusing, since the same > read path code is used for both sort- and hash-based shuffle. > We should consolidate these classes. 
> In addition, there are aspects of ShuffleManager.getReader()'s API which > don't make a lot of sense: it exposes the ability to request a contiguous > range of shuffle partitions, but this feature isn't supported by any > ShuffleReader implementations and isn't used anywhere in the existing code. > We should clean this up, too.
[jira] [Updated] (SPARK-10704) Rename HashShufflereader to BlockStoreShuffleReader
[ https://issues.apache.org/jira/browse/SPARK-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-10704: --- Summary: Rename HashShufflereader to BlockStoreShuffleReader (was: Consolidate HashShuffleReader and ShuffleReader and refactor ShuffleManager.getReader()) > Rename HashShufflereader to BlockStoreShuffleReader > --- > > Key: SPARK-10704 > URL: https://issues.apache.org/jira/browse/SPARK-10704 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Reporter: Josh Rosen >Assignee: Josh Rosen > > The current shuffle code has an interface named ShuffleReader with only one > implementation, HashShuffleReader. This naming is confusing, since the same > read path code is used for both sort- and hash-based shuffle. We should > consolidate these classes. > In addition, there are aspects of ShuffleManager.getReader()'s API which > don't make a lot of sense: it exposes the ability to request a contiguous > range of shuffle partitions, but this feature isn't supported by any > ShuffleReader implementations and isn't used anywhere in the existing code. > We should clean this up, too.
[jira] [Created] (SPARK-10736) Use 1 for all ratings if $(ratingCol) = ""
Xiangrui Meng created SPARK-10736: - Summary: Use 1 for all ratings if $(ratingCol) = "" Key: SPARK-10736 URL: https://issues.apache.org/jira/browse/SPARK-10736 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.6.0 Reporter: Xiangrui Meng Priority: Minor For some implicit dataset, ratings may not exist in the training data. In this case, we can assume all observed pairs to be positive and treat their ratings as 1. This should happen when users set ratingCol to an empty string.
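The proposed behavior can be sketched as follows. This is illustrative Python, not the ML API; the tuple layout and function name are assumptions made for the example:

```python
def to_ratings(rows, rating_col):
    """Turn observed (user, item, rating) rows into ALS-style ratings.

    When rating_col is the empty string (the implicit-feedback case proposed
    above), every observed pair is treated as a positive with rating 1.0;
    otherwise the stored rating is used.
    """
    if rating_col == "":
        return [(user, item, 1.0) for (user, item, _rating) in rows]
    return [(user, item, float(rating)) for (user, item, rating) in rows]
```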
[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly
[ https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901435#comment-14901435 ] Cody Koeninger commented on SPARK-5569: --- Yeah, spark-streaming can be marked as provided. Try to get a small reproducible case that demonstrates the issue. On Mon, Sep 21, 2015 at 3:58 PM, Jon Buffington (JIRA) > Checkpoints cannot reference classes defined outside of Spark's assembly > > > Key: SPARK-5569 > URL: https://issues.apache.org/jira/browse/SPARK-5569 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Patrick Wendell > > Not sure if this is a bug or a feature, but it's not obvious, so wanted to > create a JIRA to make sure we document this behavior. > First documented by Cody Koeninger: > https://gist.github.com/koeninger/561a61482cd1b5b3600c > {code} > 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from > file file:/var/tmp/cp/checkpoint-142110041.bk > 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file > file:/var/tmp/cp/checkpoint-142110041.bk > java.io.IOException: java.lang.ClassNotFoundException: > org.apache.spark.rdd.kafka.KafkaRDDPartition > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040) > at > org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.streaming.CheckpointReader$$anonfun$read$2.apply(Checkpoint.scala:251) > at > org.apache.spark.streaming.CheckpointReader$$anonfun$read$2.apply(Checkpoint.scala:239) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
[jira] [Commented] (SPARK-10485) IF expression is not correctly resolved when one of the options have NullType
[ https://issues.apache.org/jira/browse/SPARK-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901394#comment-14901394 ] Wenchen Fan commented on SPARK-10485: - Hi [~ajnavarro], can you reproduce it in 1.5? We do have data type check for IF but we also have type coercions to cast null type if needed. > IF expression is not correctly resolved when one of the options have NullType > - > > Key: SPARK-10485 > URL: https://issues.apache.org/jira/browse/SPARK-10485 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Antonio Jesus Navarro > > If we have this query: > {code} > SELECT IF(column > 1, 1, NULL) FROM T1 > {code} > On Spark 1.4.1 we have this: > {code} > override lazy val resolved = childrenResolved && trueValue.dataType == > falseValue.dataType > {code} > So if one of the types is NullType, the if expression is not resolved.
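The resolution rule and the coercion step Wenchen mentions can be modeled with a toy example. This is not Catalyst code; representing types as strings is purely illustrative:

```python
NULL_TYPE = "null"


def if_resolved(true_type, false_type):
    # 1.4.1-style check: both branches of IF must already have the same type,
    # so IF(cond, 1, NULL) is unresolved on its own.
    return true_type == false_type


def coerce_null_branch(true_type, false_type):
    # Coercion step: widen a NullType branch to the other branch's type, so
    # the resolution check above can subsequently succeed.
    if true_type == NULL_TYPE:
        return false_type, false_type
    if false_type == NULL_TYPE:
        return true_type, true_type
    return true_type, false_type
```

Under this model, `IF(column > 1, 1, NULL)` fails the bare check but passes once the NULL branch is coerced to the integer branch's type, which is why the question of whether the coercion fires in 1.5 matters.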
[jira] [Commented] (SPARK-10568) Error thrown in stopping one component in SparkContext.stop() doesn't allow other components to be stopped
[ https://issues.apache.org/jira/browse/SPARK-10568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901396#comment-14901396 ] Sean Owen commented on SPARK-10568: --- Sure, if you have suggestions along those lines. I had imagined it was about calling some code to stop each component in a way that exceptions from one would not prevent the rest from executing. > Error thrown in stopping one component in SparkContext.stop() doesn't allow > other components to be stopped > -- > > Key: SPARK-10568 > URL: https://issues.apache.org/jira/browse/SPARK-10568 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Matt Cheah >Priority: Minor > > When I shut down a Java process that is running a SparkContext, it invokes a > shutdown hook that eventually calls SparkContext.stop(), and inside > SparkContext.stop() each individual component (DiskBlockManager, Scheduler > Backend) is stopped. If an exception is thrown in stopping one of these > components, none of the other components will be stopped cleanly either. This > caused problems when I stopped a Java process running a Spark context in > yarn-client mode, because not properly stopping YarnSchedulerBackend leads to > problems. > The steps I ran are as follows: > 1. Create one job which fills the cluster > 2. Kick off another job which creates a Spark Context > 3. Kill the Java process with the Spark Context in #2 > 4. 
The job remains in the YARN UI as ACCEPTED > Looking in the logs we see the following: > {code} > 2015-09-07 10:32:43,446 ERROR [Thread-3] o.a.s.u.Utils - Uncaught exception > in thread Thread-3 > java.lang.NullPointerException: null > at > org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:162) > ~[spark-core_2.10-1.4.1.jar:1.4.1] > at > org.apache.spark.storage.DiskBlockManager$$anonfun$addShutdownHook$1.apply$mcV$sp(DiskBlockManager.scala:144) > ~[spark-core_2.10-1.4.1.jar:1.4.1] > at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2308) > ~[spark-core_2.10-1.4.1.jar:1.4.1] > {code} > I think what's going on is that when we kill the application in the queued > state, it tries to run the SparkContext.stop() method on the driver and stop > each component. It dies trying to stop the DiskBlockManager because it hasn't > been initialized yet - the application is still waiting to be scheduled by > the Yarn RM - but YarnClient.stop() is not invoked as a result, leaving the > application sticking around in the accepted state. > Because of what appears to be bugs in the YARN scheduler, entering this state > makes it so that the YARN scheduler is unable to schedule any more jobs > unless we manually remove this application via the YARN CLI. We can tackle > the YARN stuck state separately, but ensuring that all components get at > least some chance to stop when a SparkContext stops seems like a good idea. > Of course we can still throw some exception and/or log exceptions for > everything that goes wrong at the end of stopping the context.
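The fix direction discussed above, giving every component a chance to stop even when an earlier one throws, can be sketched as follows. This is illustrative Python, not SparkContext's actual shutdown code; the component names are taken from the report purely as examples:

```python
def stop_all(components):
    """Stop each (name, stop_fn) pair, isolating failures.

    A failure in one component (e.g. an uninitialized DiskBlockManager) is
    recorded but does not prevent later components (e.g. the YARN client)
    from being stopped.
    """
    errors = []
    for name, stop in components:
        try:
            stop()
        except Exception as exc:
            errors.append((name, exc))  # record and keep stopping the rest
    return errors
```

The caller can then log or re-raise the collected errors after all components have had their chance to shut down, matching the last sentence of the report.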
[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly
[ https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901387#comment-14901387 ] Jon Buffington commented on SPARK-5569: --- The only substantive difference in the build file you reference is that spark-streaming is included in the uber jar (not mark as excluded). We are under the impression that the spark-streaming dependency is provided while we need to package spark-streaming-kafka. Are we mistaken? > Checkpoints cannot reference classes defined outside of Spark's assembly > > > Key: SPARK-5569 > URL: https://issues.apache.org/jira/browse/SPARK-5569 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Patrick Wendell > > Not sure if this is a bug or a feature, but it's not obvious, so wanted to > create a JIRA to make sure we document this behavior. > First documented by Cody Koeninger: > https://gist.github.com/koeninger/561a61482cd1b5b3600c > {code} > 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from > file file:/var/tmp/cp/checkpoint-142110041.bk > 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file > file:/var/tmp/cp/checkpoint-142110041.bk > java.io.IOException: java.lang.ClassNotFoundException: > org.apache.spark.rdd.kafka.KafkaRDDPartition > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040) > at > org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.streaming.CheckpointReader$$anonfun$read$2.apply(Checkpoint.scala:251) > at > org.apache.spark.streaming.CheckpointReader$$anonfun$read$2.apply(Checkpoin
[jira] [Comment Edited] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly
[ https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901387#comment-14901387 ] Jon Buffington edited comment on SPARK-5569 at 9/21/15 8:57 PM: The only substantive difference in the build file you reference is that spark-streaming is included in the uber jar (not mark as provided). We are under the impression that the spark-streaming dependency is provided while we need to package spark-streaming-kafka. Are we mistaken? was (Author: jon_fuseelements): The only substantive difference in the build file you reference is that spark-streaming is included in the uber jar (not mark as excluded). We are under the impression that the spark-streaming dependency is provided while we need to package spark-streaming-kafka. Are we mistaken? > Checkpoints cannot reference classes defined outside of Spark's assembly > > > Key: SPARK-5569 > URL: https://issues.apache.org/jira/browse/SPARK-5569 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Patrick Wendell > > Not sure if this is a bug or a feature, but it's not obvious, so wanted to > create a JIRA to make sure we document this behavior. 
> First documented by Cody Koeninger: > https://gist.github.com/koeninger/561a61482cd1b5b3600c > {code} > 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from > file file:/var/tmp/cp/checkpoint-142110041.bk > 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file > file:/var/tmp/cp/checkpoint-142110041.bk > java.io.IOException: java.lang.ClassNotFoundException: > org.apache.spark.rdd.kafka.KafkaRDDPartition > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) 
> at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040) > at > org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStre
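The dependency layout discussed in this thread (spark-streaming provided by the cluster, spark-streaming-kafka bundled into the application's uber jar so classes like KafkaRDDPartition are on the classpath at checkpoint recovery) can be sketched in sbt. The group and artifact coordinates are real, but the version and the build setup are illustrative assumptions, not a verified build file:

```scala
// Hedged sbt sketch: spark-core and spark-streaming are marked "provided"
// because the cluster supplies them at runtime, while spark-streaming-kafka
// must be packaged into the assembly jar. Version is illustrative.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.5.0"
)
```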