[jira] [Updated] (SPARK-9611) UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter will add an empty entry if the map is empty.
[ https://issues.apache.org/jira/browse/SPARK-9611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9611: -- Shepherd: Josh Rosen UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter will add an empty entry if the map is empty. -- Key: SPARK-9611 URL: https://issues.apache.org/jira/browse/SPARK-9611 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.5.0 There are two corner cases related to the destructAndCreateExternalSorter (class UnsafeKVExternalSorter) returned by UnsafeFixedWidthAggregationMap. 1. The constructor of UnsafeKVExternalSorter first tries to create an UnsafeInMemorySorter based on the BytesToBytesMap of UnsafeFixedWidthAggregationMap. However, when there is no entry in the map, UnsafeInMemorySorter throws an AssertionError, because we use the size of the map (0 in this case) as the initialSize of the UnsafeInMemorySorter, which is not allowed. 2. Once the first problem is fixed, when UnsafeKVExternalSorter's KVSorterIterator loads the data back, there is one extra record: an empty record. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
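To make the two corner cases concrete, here is a minimal, self-contained Scala sketch of the implied guards; all names are hypothetical stand-ins, not Spark's actual classes:
{code}
import scala.collection.mutable.ArrayBuffer

// `InMemorySorter` mirrors the AssertionError described above by rejecting a
// non-positive initial size; `sorterFor` applies the two implied guards.
class InMemorySorter(initialSize: Int) {
  require(initialSize > 0, "initialSize must be positive")
  private val records = ArrayBuffer.empty[Long]
  def insert(recordPointer: Long): Unit = records += recordPointer
  def sorted: Seq[Long] = records.sorted.toSeq
}

def sorterFor(entries: Seq[Long]): InMemorySorter = {
  // Guard 1: clamp the size so an empty map cannot trip the assertion.
  val sorter = new InMemorySorter(math.max(entries.size, 1))
  // Guard 2: copy only entries that exist, so an empty map contributes
  // no (empty) record to the sorted output.
  entries.foreach(sorter.insert)
  sorter
}

assert(sorterFor(Seq.empty[Long]).sorted.isEmpty) // no spurious empty record
{code}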
[jira] [Resolved] (SPARK-9119) In some cases, we may save wrong decimal values to parquet
[ https://issues.apache.org/jira/browse/SPARK-9119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9119. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7925 [https://github.com/apache/spark/pull/7925] In some cases, we may save wrong decimal values to parquet -- Key: SPARK-9119 URL: https://issues.apache.org/jira/browse/SPARK-9119 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Davies Liu Priority: Blocker Fix For: 1.5.0
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, DecimalType}
import org.apache.spark.sql.types.Decimal

val schema = StructType(Array(StructField("name", DecimalType(10, 5), false)))
val rowRDD = sc.parallelize(Array(Row(Decimal(67123.45))))
val df = sqlContext.createDataFrame(rowRDD, schema)
df.registerTempTable("test")
df.show()
// +--------+
// |    name|
// +--------+
// |67123.45|
// +--------+

sqlContext.sql("create table testDecimal as select * from test")
sqlContext.table("testDecimal").show()
// +--------+
// |    name|
// +--------+
// |67.12345|
// +--------+
{code}
The problem is that when we do conversions, we do not use the precision/scale info in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
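The wrong value above is exactly what a scale mix-up produces: the unscaled value is written as if the scale were 2, then read back with the schema's scale of 5. A plain java.math.BigDecimal check (no Spark involved) reproduces the arithmetic:
{code}
import java.math.{BigDecimal => JBigDecimal}

// 67123.45 has unscaled value 6712345 at scale 2.
val unscaled = new JBigDecimal("67123.45").unscaledValue()
// Re-interpreting that unscaled value at the schema's scale of 5 ...
val readBack = new JBigDecimal(unscaled, 5)
println(readBack) // 67.12345 -- exactly the wrong value in the table scan
{code}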
[jira] [Resolved] (SPARK-8359) Spark SQL Decimal type precision loss on multiplication
[ https://issues.apache.org/jira/browse/SPARK-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8359. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7925 [https://github.com/apache/spark/pull/7925] Spark SQL Decimal type precision loss on multiplication --- Key: SPARK-8359 URL: https://issues.apache.org/jira/browse/SPARK-8359 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Rene Treffer Assignee: Davies Liu Fix For: 1.5.0 It looks like the precision of a decimal cannot be raised beyond ~2^112 without causing full value truncation. The following code computes the powers of two up to a specific point:
{code}
import org.apache.spark.sql.types.Decimal

val one = Decimal(1)
val two = Decimal(2)

def pow(n: Int): Decimal = if (n <= 0) {
  one
} else {
  val a = pow(n - 1)
  a.changePrecision(n, 0)
  two.changePrecision(n, 0)
  a * two
}

(109 to 120).foreach(n => println(pow(n).toJavaBigDecimal.unscaledValue.toString))

649037107316853453566312041152512
1298074214633706907132624082305024
2596148429267413814265248164610048
5192296858534827628530496329220096
1038459371706965525706099265844019
2076918743413931051412198531688038
4153837486827862102824397063376076
8307674973655724205648794126752152
1661534994731144841129758825350430
3323069989462289682259517650700860
6646139978924579364519035301401720
1329227995784915872903807060280344
{code}
Beyond ~2^112 the value is truncated even though the precision was set to n, which should be able to hold 10^n without problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
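As a reference point (plain BigInteger, no Spark), the exact powers of two show what the snippet should have printed; compared against them, each value after 2^112 is missing its trailing digits:
{code}
import java.math.BigInteger

(109 to 120).foreach(n => println(BigInteger.valueOf(2).pow(n)))
// e.g. 2^113 is 10384593717069655257060992658440192 (35 digits), while the
// snippet above prints 1038459371706965525706099265844019 (34 digits).
{code}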
[jira] [Created] (SPARK-9627) SQL job failed if the dataframe is cached
Davies Liu created SPARK-9627: - Summary: SQL job failed if the dataframe is cached Key: SPARK-9627 URL: https://issues.apache.org/jira/browse/SPARK-9627 Project: Spark Issue Type: Bug Affects Versions: 1.5.0 Reporter: Davies Liu Priority: Critical
{code}
# Imports implied by the snippet (not spelled out in the original report):
import decimal
import random
from datetime import date, timedelta
from pyspark.sql.functions import sum
from pyspark.sql.types import (StructType, DateType, StringType, ShortType,
                               DecimalType)

r = random.Random()

def gen(i):
    d = date.today() - timedelta(r.randint(0, 5000))
    cat = str(r.randint(0, 20)) * 5
    c = r.randint(0, 1000)
    price = decimal.Decimal(r.randint(0, 10)) / 100
    return (d, cat, c, price)

schema = StructType().add('date', DateType()).add('cat', StringType()) \
    .add('count', ShortType()).add('price', DecimalType(5, 2))
#df = sqlContext.createDataFrame(sc.range(124).map(gen), schema)
#df.show()
#df.write.parquet('sales4')

df = sqlContext.read.parquet('sales4')
df.cache()
df.count()
df.show()
print df.schema
raw_input()

r = df.groupBy(df.date, df.cat).agg(sum(df['count'] * df.price))
print r.explain(True)
r.show()
{code}
{code}
StructType(List(StructField(date,DateType,true),StructField(cat,StringType,true),StructField(count,ShortType,true),StructField(price,DecimalType(5,2),true)))

== Parsed Logical Plan ==
'Aggregate [date#0,cat#1], [date#0,cat#1,sum((count#2 * price#3)) AS sum((count * price))#70]
 Relation[date#0,cat#1,count#2,price#3] org.apache.spark.sql.parquet.ParquetRelation@5ec8f315

== Analyzed Logical Plan ==
date: date, cat: string, sum((count * price)): decimal(21,2)
Aggregate [date#0,cat#1], [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, DecimalType(5,0)), DecimalType(11,2))) * change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * price))#70]
 Relation[date#0,cat#1,count#2,price#3] org.apache.spark.sql.parquet.ParquetRelation@5ec8f315

== Optimized Logical Plan ==
Aggregate [date#0,cat#1], [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, DecimalType(5,0)), DecimalType(11,2))) * change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * price))#70]
 InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None

== Physical Plan ==
NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, DecimalType(5,0)), DecimalType(11,2))) * change_decimal_precision(CAST(price#3, DecimalType(11,2)2,mode=Final,isDistinct=false))
 TungstenSort [date#0 ASC,cat#1 ASC], false, 0
  ConvertToUnsafe
   Exchange hashpartitioning(date#0,cat#1)
    NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, DecimalType(5,0)), DecimalType(11,2))) * change_decimal_precision(CAST(price#3, DecimalType(11,2)2,mode=Partial,isDistinct=false))
     TungstenSort [date#0 ASC,cat#1 ASC], false, 0
      ConvertToUnsafe
       InMemoryColumnarTableScan [date#0,cat#1,count#2,price#3], (InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None)

Code Generation: true
== RDD ==
None

15/08/04 23:21:53 ERROR TaskSetManager: Task 0 in stage 4.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "t.py", line 34, in <module>
    r.show()
  File "/Users/davies/work/spark/python/pyspark/sql/dataframe.py", line 258, in show
    print(self._jdf.showString(n, truncate))
  File "/Users/davies/work/spark/python/lib/py4j/java_gateway.py", line 538, in __call__
    self.target_id, self.name)
  File "/Users/davies/work/spark/python/pyspark/sql/utils.py", line 36, in deco
    return f(*a, **kw)
  File "/Users/davies/work/spark/python/lib/py4j/protocol.py", line 300, in get_return_value
    format(target_id, '.', name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 10, localhost): java.lang.UnsupportedOperationException: tail of empty list
	at scala.collection.immutable.Nil$.tail(List.scala:339)
	at scala.collection.immutable.Nil$.tail(List.scala:334)
	at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172)
	at scala.reflect.internal.Symbols$Symbol.typeParams(Symbols.scala:1491)
	at scala.reflect.internal.Types$NoArgsTypeRef.typeParams(Types.scala:2144)
	at scala.reflect.internal.Types$TypeRef.initializedTypeParams(Types.scala:2408)
	at scala.reflect.internal.Types$TypeRef.typeParamsMatchArgs(Types.scala:2409)
	at scala.reflect.internal.Types$AliasTypeRef$class.dealias(Types.scala:2232)
	at scala.reflect.internal.Types$TypeRef$$anon$3.dealias(Types.scala:2539)
{code}
[jira] [Resolved] (SPARK-9046) Decimal type support improvement and bug fix
[ https://issues.apache.org/jira/browse/SPARK-9046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9046. Resolution: Fixed Fix Version/s: 1.5.0 Decimal type support improvement and bug fix Key: SPARK-9046 URL: https://issues.apache.org/jira/browse/SPARK-9046 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Davies Liu Priority: Critical Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9628) Rename Int and Long to SQLDate SQLTimestamp In DateTimeUtils
Yijie Shen created SPARK-9628: - Summary: Rename Int and Long to SQLDate SQLTimestamp In DateTimeUtils Key: SPARK-9628 URL: https://issues.apache.org/jira/browse/SPARK-9628 Project: Spark Issue Type: Improvement Reporter: Yijie Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
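Presumably the rename introduces self-documenting type aliases rather than new runtime types. A sketch of the idea (simplified, time-zone handling omitted; hypothetical method bodies, not the actual patch):
{code}
object DateTimeUtils {
  // The aliases carry intent only; the runtime representation stays Int/Long.
  type SQLDate = Int        // days since epoch (1970-01-01)
  type SQLTimestamp = Long  // microseconds since epoch

  final val MILLIS_PER_DAY: Long = 24L * 60 * 60 * 1000

  // Before the rename, these signatures would read (Long => Int) and
  // (java.sql.Timestamp => Long), hiding what the numbers mean.
  def millisToDays(millisUtc: Long): SQLDate =
    math.floor(millisUtc.toDouble / MILLIS_PER_DAY).toInt

  def fromJavaTimestamp(t: java.sql.Timestamp): SQLTimestamp =
    t.getTime * 1000L + (t.getNanos / 1000) % 1000
}
{code}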
[jira] [Assigned] (SPARK-9628) Rename Int and Long to SQLDate SQLTimestamp In DateTimeUtils
[ https://issues.apache.org/jira/browse/SPARK-9628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9628: --- Assignee: Apache Spark Rename Int and Long to SQLDate SQLTimestamp In DateTimeUtils Key: SPARK-9628 URL: https://issues.apache.org/jira/browse/SPARK-9628 Project: Spark Issue Type: Improvement Reporter: Yijie Shen Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9628) Rename Int and Long to SQLDate SQLTimestamp In DateTimeUtils
[ https://issues.apache.org/jira/browse/SPARK-9628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654900#comment-14654900 ] Apache Spark commented on SPARK-9628: - User 'yjshen' has created a pull request for this issue: https://github.com/apache/spark/pull/7953 Rename Int and Long to SQLDate SQLTimestamp In DateTimeUtils Key: SPARK-9628 URL: https://issues.apache.org/jira/browse/SPARK-9628 Project: Spark Issue Type: Improvement Reporter: Yijie Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9628) Rename Int and Long to SQLDate SQLTimestamp In DateTimeUtils
[ https://issues.apache.org/jira/browse/SPARK-9628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9628: --- Assignee: (was: Apache Spark) Rename Int and Long to SQLDate SQLTimestamp In DateTimeUtils Key: SPARK-9628 URL: https://issues.apache.org/jira/browse/SPARK-9628 Project: Spark Issue Type: Improvement Reporter: Yijie Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6212) The EXPLAIN output of CTAS only shows the analyzed plan
[ https://issues.apache.org/jira/browse/SPARK-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6212: --- Assignee: Yijie Shen The EXPLAIN output of CTAS only shows the analyzed plan --- Key: SPARK-6212 URL: https://issues.apache.org/jira/browse/SPARK-6212 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Assignee: Yijie Shen When you try
{code}
sql("explain extended create table parquet2 as select * from parquet1").collect.foreach(println)
{code}
the output will be
{code}
[== Parsed Logical Plan ==]
['CreateTableAsSelect None, parquet2, false, Some(TOK_CREATETABLE)]
[ 'Project [*]]
[  'UnresolvedRelation [parquet1], None]
[]
[== Analyzed Logical Plan ==]
[CreateTableAsSelect [Database:default, TableName: parquet2, InsertIntoHiveTable]]
[Project [str#44]]
[ Subquery parquet1]
[  Relation[str#44] ParquetRelation2(List(file:/user/hive/warehouse/parquet1),Map(serialization.format -> 1, path -> file:/user/hive/warehouse/parquet1),Some(StructType(StructField(str,StringType,true))),None)]
[]
[]
[== Optimized Logical Plan ==]
[CreateTableAsSelect [Database:default, TableName: parquet2, InsertIntoHiveTable]]
[Project [str#44]]
[ Subquery parquet1]
[  Relation[str#44] ParquetRelation2(List(file:/user/hive/warehouse/parquet1),Map(serialization.format -> 1, path -> file:/user/hive/warehouse/parquet1),Some(StructType(StructField(str,StringType,true))),None)]
[]
[]
[== Physical Plan ==]
[ExecutedCommand (CreateTableAsSelect [Database:default, TableName: parquet2, InsertIntoHiveTable]]
[Project [str#44]]
[ Subquery parquet1]
[  Relation[str#44] ParquetRelation2(List(file:/user/hive/warehouse/parquet1),Map(serialization.format -> 1, path -> file:/user/hive/warehouse/parquet1),Some(StructType(StructField(str,StringType,true))),None)]
[)]
[]
[Code Generation: false]
[== RDD ==]
{code}
The query plans of the SELECT clause shown in the Optimized Logical Plan and the Physical Plan are actually the analyzed plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9629) Client session timed out, have not heard from server in
zengqiuyang created SPARK-9629: -- Summary: Client session timed out, have not heard from server in Key: SPARK-9629 URL: https://issues.apache.org/jira/browse/SPARK-9629 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.4.1, 1.4.0 Environment: spark 1.4.1, ./make-distribution.sh --tgz -Dhadoop.version=2.5.2 -Dyarn.version=2.5.2 -Phive -Phive-thriftserver -Pyarn; zookeeper-3.4.6.tar.gz Reporter: zengqiuyang Priority: Critical The Spark HA cluster runs for a few days, and then "Client session timed out" appears; the log shows a reconnect, but it never completes, and the master shuts down. Logs:
15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 37753ms for sessionid 0x34ee39684b70005, closing socket connection and attempting reconnect
15/08/05 05:32:57 INFO state.ConnectionStateManager: State change: SUSPENDED
15/08/05 05:32:57 WARN state.ConnectionStateManager: There are no ConnectionStateListeners registered.
15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Opening socket connection to server h5/192.168.0.18:2181. Will not attempt to authenticate using SASL (unknown error)
15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Socket connection established to h5/192.168.0.18:2181, initiating session
15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Session establishment complete on server h5/192.168.0.18:2181, sessionid = 0x34ee39684b70005, negotiated timeout = 4
15/08/05 05:32:57 INFO state.ConnectionStateManager: State change: RECONNECTED
15/08/05 05:32:57 WARN state.ConnectionStateManager: There are no ConnectionStateListeners registered.
15/08/05 05:32:58 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 37753ms for sessionid 0x34ee39684b70006, closing socket connection and attempting reconnect
15/08/05 05:32:58 INFO state.ConnectionStateManager: State change: SUSPENDED
15/08/05 05:32:58 INFO master.ZooKeeperLeaderElectionAgent: We have lost leadership
15/08/05 05:32:58 ERROR master.Master: Leadership has been revoked -- master shutting down.
15/08/05 05:32:58 INFO util.Utils: Shutdown hook called
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9617) Implement json_tuple
[ https://issues.apache.org/jira/browse/SPARK-9617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9617: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-9571 Implement json_tuple Key: SPARK-9617 URL: https://issues.apache.org/jira/browse/SPARK-9617 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Nathan Howell Priority: Minor Provide a native Spark implementation for {{json_tuple}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9618) SQLContext.read.schema().parquet() ignores the supplied schema
[ https://issues.apache.org/jira/browse/SPARK-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9618: --- Assignee: Nathan Howell Target Version/s: 1.5.0 SQLContext.read.schema().parquet() ignores the supplied schema -- Key: SPARK-9618 URL: https://issues.apache.org/jira/browse/SPARK-9618 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.1 Reporter: Nathan Howell Assignee: Nathan Howell Priority: Minor If a user supplies a schema when loading a Parquet file, it is ignored and the schema is read off disk instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
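For context, a sketch of the call path in question (field names and path invented for the example; sqlContext as in a Spark shell):
{code}
import org.apache.spark.sql.types._

val userSchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType)))

val df = sqlContext.read.schema(userSchema).parquet("/path/to/data")
// Expected: df.schema == userSchema (no footer read needed at all).
// Observed per this report: the schema is discovered from the files on
// disk and the supplied one is silently dropped.
{code}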
[jira] [Commented] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654906#comment-14654906 ] Mike Dusenberry commented on SPARK-6488: I'd like to work on this one as well. Support addition/multiplication in PySpark's BlockMatrix Key: SPARK-6488 URL: https://issues.apache.org/jira/browse/SPARK-6488 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We should reuse the Scala implementation instead of having a separate implementation in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
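For reference, the Scala API the Python wrappers would delegate to (a usage sketch with toy data; `sc` as in a Spark shell):
{code}
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// One 2x2 block at grid position (0, 0); values are column-major.
val blocks = sc.parallelize(Seq(
  ((0, 0), Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0)))))
val a = new BlockMatrix(blocks, 2, 2)

val sum = a.add(a)           // element-wise addition
val product = a.multiply(a)  // distributed block-wise multiply
{code}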
[jira] [Updated] (SPARK-8930) Throw an AnalysisException with meaningful messages when DataFrame#explode takes a star in expressions
[ https://issues.apache.org/jira/browse/SPARK-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-8930: Summary: Throw an AnalysisException with meaningful messages when DataFrame#explode takes a star in expressions (was: Support a star '*' in generator function arguments) Throw an AnalysisException with meaningful messages when DataFrame#explode takes a star in expressions - Key: SPARK-8930 URL: https://issues.apache.org/jira/browse/SPARK-8930 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro The current implementation throws an exception if generators contain a star '*', like the code below:
{code}
val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
checkAnswer(
  df.explode($"*") { case Row(prefix: String, csv: String) =>
    csv.split(",").map(v => Tuple1(prefix + ":" + v))
  },
  Row(1, "1,2", "1:1") :: Row(1, "1,2", "1:2") :: Row(2, "4", "2:4") ::
    Row(3, "7,8,9", "3:7") :: Row(3, "7,8,9", "3:8") :: Row(3, "7,8,9", "3:9") :: Nil
)
{code}
{code}
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns prefix, csv;
[info]   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8930) Throw an AnalysisException with meaningful messages when DataFrame#explode takes a star in expressions
[ https://issues.apache.org/jira/browse/SPARK-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-8930: Description: The current implementation throws an exception with meaningless messages if DataFrame#explode takes a star '*' in its expressions, like the code below:
{code}
val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
checkAnswer(
  df.explode($"*") { case Row(prefix: String, csv: String) =>
    csv.split(",").map(v => Tuple1(prefix + ":" + v))
  },
  Row(1, "1,2", "1:1") :: Row(1, "1,2", "1:2") :: Row(2, "4", "2:4") ::
    Row(3, "7,8,9", "3:7") :: Row(3, "7,8,9", "3:8") :: Row(3, "7,8,9", "3:9") :: Nil
)
{code}
{code}
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns prefix, csv;
[info]   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
{code}
was: The current implementation throws an exception if generators contain a star '*', like the code below:
{code}
val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
checkAnswer(
  df.explode($"*") { case Row(prefix: String, csv: String) =>
    csv.split(",").map(v => Tuple1(prefix + ":" + v))
  },
  Row(1, "1,2", "1:1") :: Row(1, "1,2", "1:2") :: Row(2, "4", "2:4") ::
    Row(3, "7,8,9", "3:7") :: Row(3, "7,8,9", "3:8") :: Row(3, "7,8,9", "3:9") :: Nil
)
{code}
{code}
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns prefix, csv;
[info]   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
{code}
Throw an AnalysisException with meaningful messages when DataFrame#explode takes a star in expressions
[jira] [Updated] (SPARK-8930) Throw an AnalysisException with meaningful messages if DataFrame#explode takes a star in expressions
[ https://issues.apache.org/jira/browse/SPARK-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-8930: Summary: Throw an AnalysisException with meaningful messages if DataFrame#explode takes a star in expressions (was: Throw an AnalysisException with meaningful messages when DataFrame#explode takes a star in expressions) Throw an AnalysisException with meaningful messages if DataFrame#explode takes a star in expressions --- Key: SPARK-8930 URL: https://issues.apache.org/jira/browse/SPARK-8930 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro The current implementation throws an exception with meaningless messages if DataFrame#explode takes a star '*' in its expressions, like the code below:
{code}
val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
checkAnswer(
  df.explode($"*") { case Row(prefix: String, csv: String) =>
    csv.split(",").map(v => Tuple1(prefix + ":" + v))
  },
  Row(1, "1,2", "1:1") :: Row(1, "1,2", "1:2") :: Row(2, "4", "2:4") ::
    Row(3, "7,8,9", "3:7") :: Row(3, "7,8,9", "3:8") :: Row(3, "7,8,9", "3:9") :: Nil
)
{code}
{code}
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns prefix, csv;
[info]   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8930) Throw an AnalysisException with meaningful messages if DataFrame#explode takes a star in expressions
[ https://issues.apache.org/jira/browse/SPARK-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-8930: Description: The current implementation throws an exception with meaningless messages if DataFrame#explode takes a star '*' in its expressions (ISTM that explode cannot take a star in expressions), like the code below:
{code}
val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
df.explode($"*") { case Row(prefix: String, csv: String) =>
  csv.split(",").map(v => Tuple1(prefix + ":" + v))
}
{code}
{code}
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns prefix, csv;
[info]   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
{code}
was: The current implementation throws an exception with meaningless messages if DataFrame#explode takes a star '*' in its expressions, like the code below:
{code}
val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
checkAnswer(
  df.explode($"*") { case Row(prefix: String, csv: String) =>
    csv.split(",").map(v => Tuple1(prefix + ":" + v))
  },
  Row(1, "1,2", "1:1") :: Row(1, "1,2", "1:2") :: Row(2, "4", "2:4") ::
    Row(3, "7,8,9", "3:7") :: Row(3, "7,8,9", "3:8") :: Row(3, "7,8,9", "3:9") :: Nil
)
{code}
{code}
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns prefix, csv;
[info]   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
{code}
Throw an AnalysisException with meaningful messages if DataFrame#explode takes a star in expressions --- Key: SPARK-8930 URL: https://issues.apache.org/jira/browse/SPARK-8930 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro The current implementation throws an exception with meaningless messages if DataFrame#explode takes a star '*' in its expressions (ISTM that explode cannot take a star in expressions), like the code below:
{code}
val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
df.explode($"*") { case Row(prefix: String, csv: String) =>
  csv.split(",").map(v => Tuple1(prefix + ":" + v))
}
{code}
{code}
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns prefix, csv;
[info]   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8930) Throw an AnalysisException with meaningful messages if DataFrame#explode takes a star in expressions
[ https://issues.apache.org/jira/browse/SPARK-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8930: --- Target Version/s: 1.5.0 Throw an AnalysisException with meaningful messages if DataFrame#explode takes a star in expressions --- Key: SPARK-8930 URL: https://issues.apache.org/jira/browse/SPARK-8930 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro The current implementation throws an exception with meaningless messages if DataFrame#explode takes a star '*' in its expressions (ISTM that explode cannot take a star in expressions), like the code below:
{code}
val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
df.explode($"*") { case Row(prefix: String, csv: String) =>
  csv.split(",").map(v => Tuple1(prefix + ":" + v))
}
{code}
{code}
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns prefix, csv;
[info]   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
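A self-contained Scala sketch of the requested behavior (stand-in types, not the actual patch): detect a star among the explode input expressions during analysis and fail with a message that names the real problem instead of the misleading "cannot resolve '_1'".
{code}
// ExprLike/StarExpr stand in for Catalyst's Expression/UnresolvedStar.
sealed trait ExprLike
case object StarExpr extends ExprLike
final case class ColumnExpr(name: String) extends ExprLike

def checkExplodeInput(input: Seq[ExprLike]): Unit =
  if (input.contains(StarExpr)) {
    throw new IllegalArgumentException(
      "explode cannot take a star ('*') in its expressions; " +
        "pass explicit columns instead")
  }

checkExplodeInput(Seq(ColumnExpr("prefix"), ColumnExpr("csv"))) // passes
// checkExplodeInput(Seq(StarExpr)) now fails with a meaningful message
{code}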
[jira] [Updated] (SPARK-9629) Client session timed out, have not heard from server in
[ https://issues.apache.org/jira/browse/SPARK-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zengqiuyang updated SPARK-9629: --- Environment: spark 1.4.1, ./make-distribution.sh --tgz -Dhadoop.version=2.5.2 -Dyarn.version=2.5.2 -Phive -Phive-thriftserver -Pyarn; zookeeper-3.4.6.tar.gz; standalone HA; Linux version 2.6.32-358.el6.x86_64 (mockbu...@c6b8.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Fri Feb 22 00:31:26 UTC 2013 was: spark 1.4.1, ./make-distribution.sh --tgz -Dhadoop.version=2.5.2 -Dyarn.version=2.5.2 -Phive -Phive-thriftserver -Pyarn; zookeeper-3.4.6.tar.gz; standalone HA Client session timed out, have not heard from server in Key: SPARK-9629 URL: https://issues.apache.org/jira/browse/SPARK-9629 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.4.0, 1.4.1 Environment: spark 1.4.1, ./make-distribution.sh --tgz -Dhadoop.version=2.5.2 -Dyarn.version=2.5.2 -Phive -Phive-thriftserver -Pyarn; zookeeper-3.4.6.tar.gz; standalone HA; Linux version 2.6.32-358.el6.x86_64 (mockbu...@c6b8.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Fri Feb 22 00:31:26 UTC 2013 Reporter: zengqiuyang Priority: Critical The Spark HA cluster runs for a few days, and then "Client session timed out" appears; the log shows a reconnect, but it never completes, and the master shuts down. Logs:
15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 37753ms for sessionid 0x34ee39684b70005, closing socket connection and attempting reconnect
15/08/05 05:32:57 INFO state.ConnectionStateManager: State change: SUSPENDED
15/08/05 05:32:57 WARN state.ConnectionStateManager: There are no ConnectionStateListeners registered.
15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Opening socket connection to server h5/192.168.0.18:2181. Will not attempt to authenticate using SASL (unknown error)
15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Socket connection established to h5/192.168.0.18:2181, initiating session
15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Session establishment complete on server h5/192.168.0.18:2181, sessionid = 0x34ee39684b70005, negotiated timeout = 4
15/08/05 05:32:57 INFO state.ConnectionStateManager: State change: RECONNECTED
15/08/05 05:32:57 WARN state.ConnectionStateManager: There are no ConnectionStateListeners registered.
15/08/05 05:32:58 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 37753ms for sessionid 0x34ee39684b70006, closing socket connection and attempting reconnect
15/08/05 05:32:58 INFO state.ConnectionStateManager: State change: SUSPENDED
15/08/05 05:32:58 INFO master.ZooKeeperLeaderElectionAgent: We have lost leadership
15/08/05 05:32:58 ERROR master.Master: Leadership has been revoked -- master shutting down.
15/08/05 05:32:58 INFO util.Utils: Shutdown hook called
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9630) Cleanup Hybrid Aggregate Operator.
Yin Huai created SPARK-9630: --- Summary: Cleanup Hybrid Aggregate Operator. Key: SPARK-9630 URL: https://issues.apache.org/jira/browse/SPARK-9630 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker This is the follow-up of SPARK-9240 to address review comments and clean up code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9630) Cleanup Hybrid Aggregate Operator.
[ https://issues.apache.org/jira/browse/SPARK-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9630: --- Assignee: Apache Spark (was: Yin Huai) Cleanup Hybrid Aggregate Operator. -- Key: SPARK-9630 URL: https://issues.apache.org/jira/browse/SPARK-9630 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Apache Spark Priority: Blocker This is the follow-up of SPARK-9240 to address review comments and clean up code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9630) Cleanup Hybrid Aggregate Operator.
[ https://issues.apache.org/jira/browse/SPARK-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9630: --- Assignee: Yin Huai (was: Apache Spark) Cleanup Hybrid Aggregate Operator. -- Key: SPARK-9630 URL: https://issues.apache.org/jira/browse/SPARK-9630 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker This is the follow-up of SPARK-9240 to address review comments and clean up code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9630) Cleanup Hybrid Aggregate Operator.
[ https://issues.apache.org/jira/browse/SPARK-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654924#comment-14654924 ] Apache Spark commented on SPARK-9630: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/7954 Cleanup Hybrid Aggregate Operator. -- Key: SPARK-9630 URL: https://issues.apache.org/jira/browse/SPARK-9630 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker This is the follow-up of SPARK-9240 to address review comments and clean up code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9240) Hybrid aggregate operator using unsafe row
[ https://issues.apache.org/jira/browse/SPARK-9240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654925#comment-14654925 ] Apache Spark commented on SPARK-9240: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/7954 Hybrid aggregate operator using unsafe row -- Key: SPARK-9240 URL: https://issues.apache.org/jira/browse/SPARK-9240 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.5.0 We need a hybrid aggregate operator, which first tries hash-based aggregations and gracefully switch to sort-based aggregations if the hash map's memory footprint exceeds a given threshold (how to track memory footprint and how to set the threshold?). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
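A self-contained sketch of that strategy on plain Scala collections (an entry-count threshold stands in for real memory-footprint tracking, and a grouped sum stands in for arbitrary aggregate functions; the real operator works on UnsafeRows and task memory pages):
{code}
import scala.collection.mutable

def hybridAggregate[K: Ordering](
    rows: Iterator[(K, Long)],
    maxMapEntries: Int): Seq[(K, Long)] = {
  val map = mutable.HashMap.empty[K, Long]
  val spills = mutable.ArrayBuffer.empty[Seq[(K, Long)]]

  // Phase 1: hash-based aggregation until the "memory" threshold is hit;
  // then spill the map as a sorted run and keep going.
  rows.foreach { case (k, v) =>
    map(k) = map.getOrElse(k, 0L) + v
    if (map.size > maxMapEntries) {
      spills += map.toSeq.sortBy(_._1)
      map.clear()
    }
  }
  spills += map.toSeq.sortBy(_._1)

  // Phase 2: sort-based finish -- merge the sorted runs, combining
  // partial sums for equal keys.
  spills.flatten.sortBy(_._1).foldLeft(List.empty[(K, Long)]) {
    case ((pk, pv) :: tail, (k, v)) if pk == k => (pk, pv + v) :: tail
    case (acc, kv) => kv :: acc
  }.reverse
}

// hybridAggregate(Iterator(("a", 1L), ("b", 2L), ("a", 3L)), maxMapEntries = 1)
// == List(("a", 4), ("b", 2))
{code}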
[jira] [Created] (SPARK-9631) Giant pile of parquet log when trying to read local data
Reynold Xin created SPARK-9631: -- Summary: Giant pile of parquet log when trying to read local data Key: SPARK-9631 URL: https://issues.apache.org/jira/browse/SPARK-9631 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin Assignee: Cheng Lian When I read a Parquet file, I got the following
{code}
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152
Aug 5, 2015 12:13:36 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:36 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152
Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152
Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152
Aug 5, 2015 12:13:53 AM INFO:
{code}
[jira] [Commented] (SPARK-9631) Giant pile of parquet log when trying to read local data
[ https://issues.apache.org/jira/browse/SPARK-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654933#comment-14654933 ] Reynold Xin commented on SPARK-9631: FYI this was running PySpark. Giant pile of parquet log when trying to read local data Key: SPARK-9631 URL: https://issues.apache.org/jira/browse/SPARK-9631 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin Assignee: Cheng Lian When I read a Parquet file, I got the following
{code}
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152
Aug 5, 2015 12:13:36 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:36 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152
Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152
Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152
Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at
{code}
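Until it is fixed, one plausible user-side mitigation (a workaround sketch, not the eventual Spark fix): Parquet logs through java.util.logging rather than log4j, so the flood can be silenced by raising the JUL level on the Parquet loggers.
{code}
import java.util.logging.{Level, Logger}

// Hold references: JUL loggers are weakly referenced, and the level setting
// would be lost if these logger objects were garbage-collected.
val parquetLogger = Logger.getLogger("org.apache.parquet")
val oldParquetLogger = Logger.getLogger("parquet") // pre-rename package
parquetLogger.setLevel(Level.SEVERE)
oldParquetLogger.setLevel(Level.SEVERE)
{code}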
[jira] [Resolved] (SPARK-9215) Implement WAL-free Kinesis receiver that gives an at-least-once guarantee
[ https://issues.apache.org/jira/browse/SPARK-9215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-9215. -- Resolution: Fixed Fix Version/s: 1.5.0 Implement WAL-free Kinesis receiver that gives an at-least-once guarantee - Key: SPARK-9215 URL: https://issues.apache.org/jira/browse/SPARK-9215 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Tathagata Das Assignee: Tathagata Das Fix For: 1.5.0 Currently, the KinesisReceiver can lose some data in the case of certain failures (receiver and driver failures). Using write ahead logs can mitigate some of the problem, but it is not ideal because WALs don't work with S3 (eventual consistency, etc.), which is the most likely file system to be used in the EC2 environment. Hence, we have to take a different approach to improving reliability for Kinesis. Detailed design doc - https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9217) Update Kinesis Receiver to record sequence numbers
[ https://issues.apache.org/jira/browse/SPARK-9217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-9217. -- Resolution: Fixed Fix Version/s: 1.5.0 Update Kinesis Receiver to record sequence numbers -- Key: SPARK-9217 URL: https://issues.apache.org/jira/browse/SPARK-9217 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
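A sketch of the idea (hypothetical types, not the Spark implementation): rather than writing received data to a WAL, the receiver remembers, for each stored block, the Kinesis sequence-number range it covers; after a failure the range can be re-read from Kinesis itself, and the KCL may only checkpoint past ranges whose blocks were reliably stored.
{code}
final case class SequenceNumberRange(
    streamName: String,
    shardId: String,
    fromSeqNumber: String,
    toSeqNumber: String)

class BlockRangeTracker {
  private var stored = Vector.empty[(Long, SequenceNumberRange)]

  // Called only after the block is safely stored in Spark; until then the
  // KCL checkpoint must not advance past range.toSeqNumber.
  def recordStoredBlock(blockId: Long, range: SequenceNumberRange): Unit =
    synchronized { stored :+= (blockId -> range) }

  // On recovery, these ranges are re-read directly from Kinesis.
  def rangesToReplay: Seq[SequenceNumberRange] =
    synchronized(stored.map(_._2))
}
{code}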
[jira] [Updated] (SPARK-6599) Improve usability and reliability of Kinesis stream
[ https://issues.apache.org/jira/browse/SPARK-6599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-6599: - Issue Type: Epic (was: Improvement) Improve usability and reliability of Kinesis stream --- Key: SPARK-6599 URL: https://issues.apache.org/jira/browse/SPARK-6599 Project: Spark Issue Type: Epic Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Usability improvements: API improvements, AWS SDK upgrades, etc. Reliability improvements: Currently, the KinesisReceiver can lose some data in the case of certain failures (receiver and driver failures). Using write ahead logs can mitigate some of the problem, but it is not ideal because WALs don't work with S3 (eventual consistency, etc.), which is the most likely file system to be used in the EC2 environment. Hence, we have to take a different approach to improving reliability for Kinesis. See https://issues.apache.org/jira/browse/SPARK-9215 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6599) Improve usability and reliability of Kinesis stream
[ https://issues.apache.org/jira/browse/SPARK-6599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-6599: - Issue Type: Umbrella (was: Epic) Improve usability and reliability of Kinesis stream --- Key: SPARK-6599 URL: https://issues.apache.org/jira/browse/SPARK-6599 Project: Spark Issue Type: Umbrella Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Usability improvements: API improvements, AWS SDK upgrades, etc. Reliability improvements: Currently, the KinesisReceiver can lose some data in the case of certain failures (receiver and driver failures). Using write ahead logs can mitigate some of the problem, but it is not ideal because WALs don't work with S3 (eventual consistency, etc.), which is the most likely file system to be used in the EC2 environment. Hence, we have to take a different approach to improving reliability for Kinesis. See https://issues.apache.org/jira/browse/SPARK-9215 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9632) update InternalRow.toSeq to make it accept data type info
Wenchen Fan created SPARK-9632: -- Summary: update InternalRow.toSeq to make it accept data type info Key: SPARK-9632 URL: https://issues.apache.org/jira/browse/SPARK-9632 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9632) update InternalRow.toSeq to make it accept data type info
[ https://issues.apache.org/jira/browse/SPARK-9632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9632: --- Assignee: (was: Apache Spark) update InternalRow.toSeq to make it accept data type info - Key: SPARK-9632 URL: https://issues.apache.org/jira/browse/SPARK-9632 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9632) update InternalRow.toSeq to make it accept data type info
[ https://issues.apache.org/jira/browse/SPARK-9632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654942#comment-14654942 ] Apache Spark commented on SPARK-9632: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7955 update InternalRow.toSeq to make it accept data type info - Key: SPARK-9632 URL: https://issues.apache.org/jira/browse/SPARK-9632 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9632) update InternalRow.toSeq to make it accept data type info
[ https://issues.apache.org/jira/browse/SPARK-9632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9632: --- Assignee: Apache Spark update InternalRow.toSeq to make it accept data type info - Key: SPARK-9632 URL: https://issues.apache.org/jira/browse/SPARK-9632 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
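For context on what is being proposed here: InternalRow.toSeq currently takes no type information, so generic consumers cannot know how each ordinal should be decoded. A hypothetical sketch of a type-aware variant follows; the names and signature are assumptions for illustration, not taken from PR 7955:

{code}
import org.apache.spark.sql.types.DataType

// Hypothetical sketch only: pass per-field type info so each ordinal can be
// decoded with the right DataType instead of guessing from the runtime value.
abstract class TypedRow {
  def numFields: Int
  def get(ordinal: Int, dataType: DataType): Any

  def toSeq(fieldTypes: Seq[DataType]): Seq[Any] = {
    require(fieldTypes.length == numFields, "one DataType per field")
    fieldTypes.zipWithIndex.map { case (dt, i) => get(i, dt) }
  }
}
{code}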
[jira] [Commented] (SPARK-9631) Giant pile of parquet log when trying to read local data
[ https://issues.apache.org/jira/browse/SPARK-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654953#comment-14654953 ] Sean Owen commented on SPARK-9631: -- Is this fixed by https://issues.apache.org/jira/browse/SPARK-8118 ? or supposed to be? might be the same report either way. Giant pile of parquet log when trying to read local data Key: SPARK-9631 URL: https://issues.apache.org/jira/browse/SPARK-9631 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin Assignee: Cheng Lian When I read a Parquet file, I got the following {code} Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records. Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152 Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records. Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152 Aug 5, 2015 12:13:36 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Aug 5, 2015 12:13:36 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records. Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records. Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152 Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152 Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records. Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records. Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. 
reading next block Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152 Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152 Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records. Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records.
[jira] [Resolved] (SPARK-9581) Add test for JSON UDTs
[ https://issues.apache.org/jira/browse/SPARK-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9581. Resolution: Fixed Fix Version/s: 1.5.0 Add test for JSON UDTs -- Key: SPARK-9581 URL: https://issues.apache.org/jira/browse/SPARK-9581 Project: Spark Issue Type: Test Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9629) Client session timed out, have not heard from server in
[ https://issues.apache.org/jira/browse/SPARK-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654959#comment-14654959 ] Sean Owen commented on SPARK-9629: -- This points to a problem with your ZK broker. Have you investigated that first? Client session timed out, have not heard from server in Key: SPARK-9629 URL: https://issues.apache.org/jira/browse/SPARK-9629 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.4.0, 1.4.1 Environment: spark1.4.1./make-distribution.sh --tgz -Dhadoop.version=2.5.2 -Dyarn.version=2.5.2 -Phive -Phive-thriftserver -Pyarn zookeeper-3.4.6.tar.gz standalone HA Linux version 2.6.32-358.el6.x86_64 (mockbu...@c6b8.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Fri Feb 22 00:31:26 UTC 2013 Reporter: zengqiuyang Priority: Critical With the Spark standalone HA setup running, every few days "Client session timed out" errors appear. The log shows a reconnect attempt, but it does not complete, and the master shuts down. Logs: 15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 37753ms for sessionid 0x34ee39684b70005, closing socket connection and attempting reconnect 15/08/05 05:32:57 INFO state.ConnectionStateManager: State change: SUSPENDED 15/08/05 05:32:57 WARN state.ConnectionStateManager: There are no ConnectionStateListeners registered. 15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Opening socket connection to server h5/192.168.0.18:2181. Will not attempt to authenticate using SASL (unknown error) 15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Socket connection established to h5/192.168.0.18:2181, initiating session 15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Session establishment complete on server h5/192.168.0.18:2181, sessionid = 0x34ee39684b70005, negotiated timeout = 4 15/08/05 05:32:57 INFO state.ConnectionStateManager: State change: RECONNECTED 15/08/05 05:32:57 WARN state.ConnectionStateManager: There are no ConnectionStateListeners registered. 15/08/05 05:32:58 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 37753ms for sessionid 0x34ee39684b70006, closing socket connection and attempting reconnect 15/08/05 05:32:58 INFO state.ConnectionStateManager: State change: SUSPENDED 15/08/05 05:32:58 INFO master.ZooKeeperLeaderElectionAgent: We have lost leadership 15/08/05 05:32:58 ERROR master.Master: Leadership has been revoked -- master shutting down. 15/08/05 05:32:58 INFO util.Utils: Shutdown hook called -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
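For reference, standalone HA of the kind described is driven by the ZooKeeper recovery properties below; the values here are placeholders inferred from the log (h5:2181), not the reporter's actual configuration. A session that stays silent for 37 seconds, as above, usually points at GC pauses on the master or an overloaded ZooKeeper quorum rather than a Spark bug, which is what the comment is probing at.

{code}
# Standalone HA recovery settings, normally passed via SPARK_DAEMON_JAVA_OPTS
# or spark-defaults.conf (values below are placeholders):
spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=h5:2181
spark.deploy.zookeeper.dir=/spark
{code}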
[jira] [Updated] (SPARK-9625) SparkILoop creates sql context continuously, thousands of times
[ https://issues.apache.org/jira/browse/SPARK-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9625: - Component/s: SQL Did you say this is reproducible -- how do you do it? SparkILoop creates sql context continuously, thousands of times --- Key: SPARK-9625 URL: https://issues.apache.org/jira/browse/SPARK-9625 Project: Spark Issue Type: Bug Components: Spark Shell, SQL Affects Versions: 1.4.1 Environment: Ubuntu on AWS Reporter: Simeon Simeonov Labels: sql Occasionally but repeatably, based on the Spark SQL operations being run, {{spark-shell}} gets into a funk where it attempts to create a sql context over and over again as it is doing its work. Example output below: {code} 15/08/05 03:04:12 INFO DAGScheduler: looking for newly runnable stages 15/08/05 03:04:12 INFO DAGScheduler: running: Set() 15/08/05 03:04:12 INFO DAGScheduler: waiting: Set(ShuffleMapStage 7, ResultStage 8) 15/08/05 03:04:12 INFO DAGScheduler: failed: Set() 15/08/05 03:04:12 INFO DAGScheduler: Missing parents for ShuffleMapStage 7: List() 15/08/05 03:04:12 INFO DAGScheduler: Missing parents for ResultStage 8: List(ShuffleMapStage 7) 15/08/05 03:04:12 INFO DAGScheduler: Submitting ShuffleMapStage 7 (MapPartitionsRDD[49] at map at console:474), which is now runnable 15/08/05 03:04:12 INFO MemoryStore: ensureFreeSpace(47840) called with curMem=685306, maxMem=26671746908 15/08/05 03:04:12 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 46.7 KB, free 24.8 GB) 15/08/05 03:04:12 INFO MemoryStore: ensureFreeSpace(15053) called with curMem=733146, maxMem=26671746908 15/08/05 03:04:12 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 14.7 KB, free 24.8 GB) 15/08/05 03:04:12 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on localhost:39451 (size: 14.7 KB, free: 24.8 GB) 15/08/05 03:04:12 INFO SparkContext: Created broadcast 12 from broadcast at DAGScheduler.scala:874 15/08/05 03:04:12 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 7 (MapPartitionsRDD[49] at map at console:474) 15/08/05 03:04:12 INFO TaskSchedulerImpl: Adding task set 7.0 with 1 tasks 15/08/05 03:04:12 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 684, localhost, PROCESS_LOCAL, 1461 bytes) 15/08/05 03:04:12 INFO Executor: Running task 0.0 in stage 7.0 (TID 684) 15/08/05 03:04:12 INFO ShuffleBlockFetcherIterator: Getting 214 non-empty blocks out of 214 blocks 15/08/05 03:04:12 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms 15/08/05 03:04:12 INFO HiveContext: Initializing execution hive, version 0.13.1 15/08/05 03:04:13 INFO HiveMetaStore: No user is added in admin role, since config is empty 15/08/05 03:04:13 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr. 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. 
15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support).. SQL
[jira] [Closed] (SPARK-9631) Giant pile of parquet log when trying to read local data
[ https://issues.apache.org/jira/browse/SPARK-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-9631. -- Resolution: Duplicate Giant pile of parquet log when trying to read local data Key: SPARK-9631 URL: https://issues.apache.org/jira/browse/SPARK-9631 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin Assignee: Cheng Lian When I read a Parquet file, I got the following {code} Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records. Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152 Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records. Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152 Aug 5, 2015 12:13:36 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Aug 5, 2015 12:13:36 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records. Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records. Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152 Aug 5, 2015 12:13:36 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152 Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records. Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records. Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. 
reading next block Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152 Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 2097152 Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records. Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2097152 records. Aug 5, 2015 12:13:53 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Aug 5, 2015 12:13:53 AM INFO:
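The issue was closed as a duplicate (per the comment above, likely of SPARK-8118). Until that fix is picked up, the noise can usually be muted from the application side: parquet-mr 1.7 logs through java.util.logging, so raising the level on its loggers is a plausible stopgap. A workaround sketch only, not the actual fix:

{code}
import java.util.logging.{Level, Logger}

// Silence InternalParquetRecordReader / ParquetRecordReader chatter by raising
// the java.util.logging level for the parquet packages.
Logger.getLogger("org.apache.parquet").setLevel(Level.SEVERE)
Logger.getLogger("parquet").setLevel(Level.SEVERE) // pre-1.7 package name
{code}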
[jira] [Updated] (SPARK-9627) SQL job failed if the dataframe is cached
[ https://issues.apache.org/jira/browse/SPARK-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9627: --- Target Version/s: 1.5.0 SQL job failed if the dataframe is cached - Key: SPARK-9627 URL: https://issues.apache.org/jira/browse/SPARK-9627 Project: Spark Issue Type: Bug Affects Versions: 1.5.0 Reporter: Davies Liu Priority: Critical {code} r = random.Random() def gen(i): d = date.today() - timedelta(r.randint(0, 5000)) cat = str(r.randint(0, 20)) * 5 c = r.randint(0, 1000) price = decimal.Decimal(r.randint(0, 10)) / 100 return (d, cat, c, price) schema = StructType().add('date', DateType()).add('cat', StringType()).add('count', ShortType()).add('price', DecimalType(5, 2)) #df = sqlContext.createDataFrame(sc.range(124).map(gen), schema) #df.show() #df.write.parquet('sales4') df = sqlContext.read.parquet('sales4') df.cache() df.count() df.show() print df.schema raw_input() r = df.groupBy(df.date, df.cat).agg(sum(df['count'] * df.price)) print r.explain(True) r.show() {code} {code} StructType(List(StructField(date,DateType,true),StructField(cat,StringType,true),StructField(count,ShortType,true),StructField(price,DecimalType(5,2),true))) == Parsed Logical Plan == 'Aggregate [date#0,cat#1], [date#0,cat#1,sum((count#2 * price#3)) AS sum((count * price))#70] Relation[date#0,cat#1,count#2,price#3] org.apache.spark.sql.parquet.ParquetRelation@5ec8f315 == Analyzed Logical Plan == date: date, cat: string, sum((count * price)): decimal(21,2) Aggregate [date#0,cat#1], [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, DecimalType(5,0)), DecimalType(11,2))) * change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * price))#70] Relation[date#0,cat#1,count#2,price#3] org.apache.spark.sql.parquet.ParquetRelation@5ec8f315 == Optimized Logical Plan == Aggregate [date#0,cat#1], [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, DecimalType(5,0)), DecimalType(11,2))) * change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * price))#70] InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None == Physical Plan == NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, DecimalType(5,0)), DecimalType(11,2))) * change_decimal_precision(CAST(price#3, DecimalType(11,2)2,mode=Final,isDistinct=false)) TungstenSort [date#0 ASC,cat#1 ASC], false, 0 ConvertToUnsafe Exchange hashpartitioning(date#0,cat#1) NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, DecimalType(5,0)), DecimalType(11,2))) * change_decimal_precision(CAST(price#3, DecimalType(11,2)2,mode=Partial,isDistinct=false)) TungstenSort [date#0 ASC,cat#1 ASC], false, 0 ConvertToUnsafe InMemoryColumnarTableScan [date#0,cat#1,count#2,price#3], (InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None) Code Generation: true == RDD == None 15/08/04 23:21:53 ERROR TaskSetManager: Task 0 in stage 4.0 failed 1 times; aborting job Traceback (most recent call last): File t.py, line 34, in module r.show() File /Users/davies/work/spark/python/pyspark/sql/dataframe.py, line 258, in show print(self._jdf.showString(n, truncate)) File 
/Users/davies/work/spark/python/lib/py4j/java_gateway.py, line 538, in __call__ self.target_id, self.name) File /Users/davies/work/spark/python/pyspark/sql/utils.py, line 36, in deco return f(*a, **kw) File /Users/davies/work/spark/python/lib/py4j/protocol.py, line 300, in get_return_value format(target_id, '.', name), value) py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 10, localhost): java.lang.UnsupportedOperationException: tail of empty list at scala.collection.immutable.Nil$.tail(List.scala:339) at scala.collection.immutable.Nil$.tail(List.scala:334) at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172) at scala.reflect.internal.Symbols$Symbol.typeParams(Symbols.scala:1491) at scala.reflect.internal.Types$NoArgsTypeRef.typeParams(Types.scala:2144) at scala.reflect.internal.Types$TypeRef.initializedTypeParams(Types.scala:2408)
[jira] [Resolved] (SPARK-9621) Closure inside RDD doesn't properly close over environment
[ https://issues.apache.org/jira/browse/SPARK-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9621. -- Resolution: Duplicate Pretty sure this is a subset of the general problem of using case classes in the shell. They don't end up being the same class when used this way. I don't know if it's a Scala shell thing or not, and I am not aware of a solution other than "don't use case classes in the shell". Closure inside RDD doesn't properly close over environment -- Key: SPARK-9621 URL: https://issues.apache.org/jira/browse/SPARK-9621 Project: Spark Issue Type: Bug Affects Versions: 1.4.1 Environment: Ubuntu 15.04, spark-1.4.1-bin-hadoop2.6 package Reporter: Joe Near I expect the following: case class MyTest(i: Int) val tv = MyTest(1) val res = sc.parallelize(Array((t: MyTest) => t == tv)).first()(tv) to be true. It is false when I type this into spark-shell. It seems the closure is changed somehow when it's serialized and deserialized. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
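The REPL-recompilation explanation in the resolution can be sanity-checked by taking the case class out of the closure. A sketch of what one would expect in spark-shell under that explanation, with expected results shown as comments:

{code}
// Same shape as the report, but closing over a plain Int instead of a
// shell-defined case class: the closure round-trips as expected.
val tv = 1
val res = sc.parallelize(Array((t: Int) => t == tv)).first()(tv)
// res: Boolean = true

// With a shell-defined case class, equality fails after deserialization
// because the REPL wrapper recompiles MyTest into a different class.
{code}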
[jira] [Resolved] (SPARK-9601) Join example fix in streaming-programming-guide.md
[ https://issues.apache.org/jira/browse/SPARK-9601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-9601. -- Resolution: Fixed Fix Version/s: 1.5.0 Join example fix in streaming-programming-guide.md -- Key: SPARK-9601 URL: https://issues.apache.org/jira/browse/SPARK-9601 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.1 Reporter: Jayant Shekhar Priority: Trivial Fix For: 1.5.0 Stream-Stream Join has the following signature for Java in the guide: JavaPairDStream<String, String> joinedStream = stream1.join(stream2); It should be: JavaPairDStream<String, Tuple2<String, String>> joinedStream = stream1.join(stream2); Same for windowed stream join. It should be: JavaPairDStream<String, Tuple2<String, String>> joinedStream = windowedStream1.join(windowedStream2); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9601) Join example fix in streaming-programming-guide.md
[ https://issues.apache.org/jira/browse/SPARK-9601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9601: - Assignee: Namit Katariya Join example fix in streaming-programming-guide.md -- Key: SPARK-9601 URL: https://issues.apache.org/jira/browse/SPARK-9601 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.1 Reporter: Jayant Shekhar Assignee: Namit Katariya Priority: Trivial Fix For: 1.5.0 Stream-Stream Join has the following signature for Java in the guide: JavaPairDStream<String, String> joinedStream = stream1.join(stream2); It should be: JavaPairDStream<String, Tuple2<String, String>> joinedStream = stream1.join(stream2); Same for windowed stream join. It should be: JavaPairDStream<String, Tuple2<String, String>> joinedStream = windowedStream1.join(windowedStream2); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
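The corrected Java signatures mirror the Scala API, where the pairing of joined values is visible directly in the result type. A small illustrative sketch (the helper name is invented; the stream names follow the guide's example):

{code}
import org.apache.spark.streaming.dstream.DStream

// Joining two keyed streams pairs the values per key, so the result is
// DStream[(K, (V, W))] -- the analogue of JavaPairDStream<String, Tuple2<String, String>>.
def joinExample(stream1: DStream[(String, String)],
                stream2: DStream[(String, String)]): DStream[(String, (String, String))] = {
  stream1.join(stream2)
}
{code}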
[jira] [Commented] (SPARK-6107) event log file ends with .inprogress should be able to display on webUI for standalone mode
[ https://issues.apache.org/jira/browse/SPARK-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655002#comment-14655002 ] kumar deepak commented on SPARK-6107: - Is there a plan to fix it in 1.3.1? event log file ends with .inprogress should be able to display on webUI for standalone mode --- Key: SPARK-6107 URL: https://issues.apache.org/jira/browse/SPARK-6107 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.1 Reporter: Zhang, Liye Assignee: Zhang, Liye Fix For: 1.4.0 When an application finishes abnormally (Ctrl + C, for example), the history event log file still ends with the *.inprogress* suffix, and the application state cannot be shown on the web UI. The user just sees *Application history not found, Application xxx is still in progress*. Users should also be able to see the status of abnormally finished applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9633) SBT download locations outdated; need an update
Sean Owen created SPARK-9633: Summary: SBT download locations outdated; need an update Key: SPARK-9633 URL: https://issues.apache.org/jira/browse/SPARK-9633 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.4.1, 1.3.1, 1.5.0 Reporter: Sean Owen Priority: Minor The SBT download script tries to download from two locations, typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; the latter redirects to dl.bintray.com now. In fact, bintray seems like the only place to download SBT at this point. We should update to reference bintray directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9633) SBT download locations outdated; need an update
[ https://issues.apache.org/jira/browse/SPARK-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9633: - Description: The SBT download script tries to download from two locations, typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; the latter redirects to dl.bintray.com now. In fact, bintray seems like the only place to download SBT at this point. We should update to reference bintray directly. PS: we should download SBT over HTTPS too, not HTTP was:The SBT download script tries to download from two locations, typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; the latter redirects to dl.bintray.com now. In fact, bintray seems like the only place to download SBT at this point. We should update to reference bintray directly. SBT download locations outdated; need an update --- Key: SPARK-9633 URL: https://issues.apache.org/jira/browse/SPARK-9633 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Sean Owen Priority: Minor The SBT download script tries to download from two locations, typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; the latter redirects to dl.bintray.com now. In fact, bintray seems like the only place to download SBT at this point. We should update to reference bintray directly. PS: we should download SBT over HTTPS too, not HTTP -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8862) Add a web UI page that visualizes physical plans (SparkPlan)
[ https://issues.apache.org/jira/browse/SPARK-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8862. Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 1.5.0 Add a web UI page that visualizes physical plans (SparkPlan) Key: SPARK-8862 URL: https://issues.apache.org/jira/browse/SPARK-8862 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Shixiong Zhu Fix For: 1.5.0 We currently have the ability to visualize part of the query plan using the Spark DAG viz. However, that does NOT work for one of the most important operators: broadcast join. The reason is that broadcast join launches multiple Spark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8861) Add basic instrumentation to each SparkPlan operator
[ https://issues.apache.org/jira/browse/SPARK-8861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8861. Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 1.5.0 Add basic instrumentation to each SparkPlan operator Key: SPARK-8861 URL: https://issues.apache.org/jira/browse/SPARK-8861 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Shixiong Zhu Fix For: 1.5.0 The basic metric can be the number of tuples flowing through. We can add more metrics later. In order for this to work, we can add a new accumulators method to SparkPlan that defines the list of accumulators, e.g. {code} def accumulators: Map[String, Accumulator[_]] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
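To make the proposal concrete, here is a self-contained sketch of the same idea using plain RDD code rather than a real SparkPlan. The metric name and map layout are invented for illustration, and the merged implementation may be structured differently:

{code}
import org.apache.spark.{SparkConf, SparkContext}

object MetricSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("metric-sketch").setMaster("local[2]"))

    // A named accumulator per metric, exposed through a map as in the proposal.
    val numOutputRows = sc.accumulator(0L, "number of output rows")
    val accumulators = Map("numOutputRows" -> numOutputRows)

    // Bump the metric as tuples flow through, as an operator's execute() would.
    sc.parallelize(1 to 1000).map { x => numOutputRows += 1; x }.count()

    println(accumulators("numOutputRows").value) // 1000
    sc.stop()
  }
}
{code}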
[jira] [Assigned] (SPARK-9633) SBT download locations outdated; need an update
[ https://issues.apache.org/jira/browse/SPARK-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9633: --- Assignee: (was: Apache Spark) SBT download locations outdated; need an update --- Key: SPARK-9633 URL: https://issues.apache.org/jira/browse/SPARK-9633 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Sean Owen Priority: Minor The SBT download script tries to download from two locations, typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; the latter redirects to dl.bintray.com now. In fact, bintray seems like the only place to download SBT at this point. We should update to reference bintray directly. PS: we should download SBT over HTTPS too, not HTTP -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9633) SBT download locations outdated; need an update
[ https://issues.apache.org/jira/browse/SPARK-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655027#comment-14655027 ] Apache Spark commented on SPARK-9633: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/7956 SBT download locations outdated; need an update --- Key: SPARK-9633 URL: https://issues.apache.org/jira/browse/SPARK-9633 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Sean Owen Priority: Minor The SBT download script tries to download from two locations, typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; the latter redirects to dl.bintray.com now. In fact, bintray seems like the only place to download SBT at this point. We should update to reference bintray directly. PS: we should download SBT over HTTPS too, not HTTP -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9633) SBT download locations outdated; need an update
[ https://issues.apache.org/jira/browse/SPARK-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9633: --- Assignee: Apache Spark SBT download locations outdated; need an update --- Key: SPARK-9633 URL: https://issues.apache.org/jira/browse/SPARK-9633 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Sean Owen Assignee: Apache Spark Priority: Minor The SBT download script tries to download from two locations, typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; the latter redirects to dl.bintray.com now. In fact, bintray seems like the only place to download SBT at this point. We should update to reference bintray directly. PS: we should download SBT over HTTPS too, not HTTP -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9633) SBT download locations outdated; need an update
[ https://issues.apache.org/jira/browse/SPARK-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9633: - Assignee: Sean Owen (Assigning to me as I don't yet see that nraychaudhuri has a JIRA username) SBT download locations outdated; need an update --- Key: SPARK-9633 URL: https://issues.apache.org/jira/browse/SPARK-9633 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor The SBT download script tries to download from two locations, typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; the latter redirects to dl.bintray.com now. In fact, bintray seems like the only place to download SBT at this point. We should update to reference bintray directly. PS: we should download SBT over HTTPS too, not HTTP -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9634) resolve UnresolvedAlias in DataFrame.resolve
Wenchen Fan created SPARK-9634: -- Summary: resolve UnresolvedAlias in DataFrame.resolve Key: SPARK-9634 URL: https://issues.apache.org/jira/browse/SPARK-9634 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9634) resolve UnresolvedAlias in DataFrame.resolve
[ https://issues.apache.org/jira/browse/SPARK-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655044#comment-14655044 ] Apache Spark commented on SPARK-9634: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7957 resolve UnresolvedAlias in DataFrame.resolve Key: SPARK-9634 URL: https://issues.apache.org/jira/browse/SPARK-9634 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9634) resolve UnresolvedAlias in DataFrame.resolve
[ https://issues.apache.org/jira/browse/SPARK-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9634: --- Assignee: Apache Spark resolve UnresolvedAlias in DataFrame.resolve Key: SPARK-9634 URL: https://issues.apache.org/jira/browse/SPARK-9634 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9634) resolve UnresolvedAlias in DataFrame.resolve
[ https://issues.apache.org/jira/browse/SPARK-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9634: --- Assignee: (was: Apache Spark) resolve UnresolvedAlias in DataFrame.resolve Key: SPARK-9634 URL: https://issues.apache.org/jira/browse/SPARK-9634 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9323) DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns
[ https://issues.apache.org/jira/browse/SPARK-9323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9323: --- Assignee: Apache Spark DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns --- Key: SPARK-9323 URL: https://issues.apache.org/jira/browse/SPARK-9323 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Josh Rosen Assignee: Apache Spark The following two queries should be equivalent, but the second crashes: {code} sqlContext.read.json(sqlContext.sparkContext.makeRDD( """{"a": {"b": 1, "a": {"a": 1}}, "c": [{"d": 1}]}""" :: Nil)) .registerTempTable("nestedOrder") checkAnswer(sql("SELECT a.b FROM nestedOrder ORDER BY a.b"), Row(1)) checkAnswer(sql("select * from nestedOrder").select("a.b").orderBy("a.b"), Row(1)) {code} Here's the stacktrace: {code} Cannot resolve column name "a.b" among (b); org.apache.spark.sql.AnalysisException: Cannot resolve column name "a.b" among (b); at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159) at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:651) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:640) at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593) at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.DataFrame.sort(DataFrame.scala:593) at org.apache.spark.sql.DataFrame.orderBy(DataFrame.scala:624) at org.apache.spark.sql.SQLQuerySuite$$anonfun$96.apply$mcV$sp(SQLQuerySuite.scala:1389) {code} Per [~marmbrus], the problem may be that {{DataFrame.resolve}} calls {{resolveQuoted}}, causing the nested field to be treated as a single field named {{a.b}}. UPDATE: here's a shorter one-liner reproduction: {code} val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": {"b": 1}}""" :: Nil)) checkAnswer(df.select("a.b").filter("a.b = a.b"), Row(1)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9323) DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns
[ https://issues.apache.org/jira/browse/SPARK-9323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9323: --- Assignee: (was: Apache Spark) DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns --- Key: SPARK-9323 URL: https://issues.apache.org/jira/browse/SPARK-9323 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Josh Rosen The following two queries should be equivalent, but the second crashes: {code} sqlContext.read.json(sqlContext.sparkContext.makeRDD( """{"a": {"b": 1, "a": {"a": 1}}, "c": [{"d": 1}]}""" :: Nil)) .registerTempTable("nestedOrder") checkAnswer(sql("SELECT a.b FROM nestedOrder ORDER BY a.b"), Row(1)) checkAnswer(sql("select * from nestedOrder").select("a.b").orderBy("a.b"), Row(1)) {code} Here's the stacktrace: {code} Cannot resolve column name "a.b" among (b); org.apache.spark.sql.AnalysisException: Cannot resolve column name "a.b" among (b); at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159) at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:651) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:640) at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593) at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.DataFrame.sort(DataFrame.scala:593) at org.apache.spark.sql.DataFrame.orderBy(DataFrame.scala:624) at org.apache.spark.sql.SQLQuerySuite$$anonfun$96.apply$mcV$sp(SQLQuerySuite.scala:1389) {code} Per [~marmbrus], the problem may be that {{DataFrame.resolve}} calls {{resolveQuoted}}, causing the nested field to be treated as a single field named {{a.b}}. UPDATE: here's a shorter one-liner reproduction: {code} val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": {"b": 1}}""" :: Nil)) checkAnswer(df.select("a.b").filter("a.b = a.b"), Row(1)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9607) Incorrect zinc check in build/mvn
[ https://issues.apache.org/jira/browse/SPARK-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9607: - Assignee: Ryan Williams Incorrect zinc check in build/mvn - Key: SPARK-9607 URL: https://issues.apache.org/jira/browse/SPARK-9607 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.1 Reporter: Ryan Williams Assignee: Ryan Williams Priority: Minor [This check|https://github.com/apache/spark/blob/5a23213c148bfe362514f9c71f5273ebda0a848a/build/mvn#L84-L85] in {{build/mvn}} attempts to determine whether {{zinc}} has been installed, but it fails to add the prefix {{build/}} to the path, so it always thinks that {{zinc}} is not installed, sets {{ZINC_INSTALL_FLAG}} to {{1}}, and attempts to install {{zinc}}. This error manifests later because [the {{zinc -shutdown}} and {{zinc -start}} commands|https://github.com/apache/spark/blob/5a23213c148bfe362514f9c71f5273ebda0a848a/build/mvn#L140-L143] are always run, even if zinc was not installed and is running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9607) Incorrect zinc check in build/mvn
[ https://issues.apache.org/jira/browse/SPARK-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9607. -- Resolution: Fixed Fix Version/s: 1.5.0 1.3.2 1.4.2 Issue resolved by pull request 7944 [https://github.com/apache/spark/pull/7944] Incorrect zinc check in build/mvn - Key: SPARK-9607 URL: https://issues.apache.org/jira/browse/SPARK-9607 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.1 Reporter: Ryan Williams Assignee: Ryan Williams Priority: Minor Fix For: 1.4.2, 1.3.2, 1.5.0 [This check|https://github.com/apache/spark/blob/5a23213c148bfe362514f9c71f5273ebda0a848a/build/mvn#L84-L85] in {{build/mvn}} attempts to determine whether {{zinc}} has been installed, but it fails to add the prefix {{build/}} to the path, so it always thinks that {{zinc}} is not installed, sets {{ZINC_INSTALL_FLAG}} to {{1}}, and attempts to install {{zinc}}. This error manifests later because [the {{zinc -shutdown}} and {{zinc -start}} commands|https://github.com/apache/spark/blob/5a23213c148bfe362514f9c71f5273ebda0a848a/build/mvn#L140-L143] are always run, even if zinc was not installed and is running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9608) Incorrect zinc -status check in build/mvn
[ https://issues.apache.org/jira/browse/SPARK-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9608. -- Resolution: Fixed Fix Version/s: 1.5.0 1.3.2 1.4.2 Issue resolved by pull request 7944 [https://github.com/apache/spark/pull/7944] Incorrect zinc -status check in build/mvn - Key: SPARK-9608 URL: https://issues.apache.org/jira/browse/SPARK-9608 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.1 Reporter: Ryan Williams Assignee: Ryan Williams Priority: Minor Fix For: 1.4.2, 1.3.2, 1.5.0 {{build/mvn}} [uses a {{-z `zinc -status`}} test|https://github.com/apache/spark/blob/5a23213c148bfe362514f9c71f5273ebda0a848a/build/mvn#L138] to determine whether a {{zinc}} process is running. However, {{zinc -status}} checks port {{3030}} by default. This means that if a {{$ZINC_PORT}} env var is set to some value besides {{3030}}, and an existing {{zinc}} process is running on port {{3030}}, {{build/mvn}} will skip starting a {{zinc}} process, thinking that a suitable one is running. Subsequent compilations will look for a {{zinc}} at port {{$ZINC_PORT}} and not find one. The {{zinc -status}} call should get the flag {{-port $ZINC_PORT}} added to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9636) Treat $SPARK_HOME as read-only
Philipp Angerer created SPARK-9636: -- Summary: Treat $SPARK_HOME as read-only Key: SPARK-9636 URL: https://issues.apache.org/jira/browse/SPARK-9636 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.4.1 Environment: Linux Reporter: Philipp Angerer When starting Spark scripts as a user while Spark is installed in a directory the user has no write permission on, most things work fine, except for the logs (e.g. for {{start-master.sh}}): logs are written by default to {{$SPARK_LOG_DIR}} or (if unset) to {{$SPARK_HOME/logs}}. When installed this way, Spark should, instead of throwing an error, write logs to {{/var/log/spark/}}. That's easy to fix by simply testing a few log dirs in sequence for writability before trying to use one. I suggest using {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} → {{$SPARK_HOME/logs/}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9636) Treat $SPARK_HOME as read-only
[ https://issues.apache.org/jira/browse/SPARK-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9636: - Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) Treat $SPARK_HOME as read-only --- Key: SPARK-9636 URL: https://issues.apache.org/jira/browse/SPARK-9636 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 1.4.1 Environment: Linux Reporter: Philipp Angerer Priority: Minor Labels: easyfix When starting Spark scripts as a user while Spark is installed in a directory the user has no write permission on, most things work fine, except for the logs (e.g. for {{start-master.sh}}): logs are written by default to {{$SPARK_LOG_DIR}} or (if unset) to {{$SPARK_HOME/logs}}. When installed this way, Spark should, instead of throwing an error, write logs to {{/var/log/spark/}}. That's easy to fix by simply testing a few log dirs in sequence for writability before trying to use one. I suggest using {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} → {{$SPARK_HOME/logs/}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9636) Treat $SPARK_HOME as read-only
[ https://issues.apache.org/jira/browse/SPARK-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655146#comment-14655146 ] Sean Owen commented on SPARK-9636: -- I'm not sure those are as obvious as defaults, or necessarily have write permission either. Isn't the solution that {{SPARK_LOG_DIR}} should be set if needed? Treat $SPARK_HOME as read-only --- Key: SPARK-9636 URL: https://issues.apache.org/jira/browse/SPARK-9636 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.4.1 Environment: Linux Reporter: Philipp Angerer Labels: easyfix When starting Spark scripts as a user while Spark is installed in a directory the user has no write permission on, most things work fine, except for the logs (e.g. for {{start-master.sh}}): logs are written by default to {{$SPARK_LOG_DIR}} or (if unset) to {{$SPARK_HOME/logs}}. When installed this way, Spark should, instead of throwing an error, write logs to {{/var/log/spark/}}. That's easy to fix by simply testing a few log dirs in sequence for writability before trying to use one. I suggest using {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} → {{$SPARK_HOME/logs/}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
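The probe-in-sequence idea from the report is simple to express. A sketch under the assumption that the candidates are exactly the reporter's suggested fallback chain; the real logic would live in the launcher scripts rather than Scala, this just illustrates "first writable directory wins":

{code}
import java.io.File

// The reporter's proposed probe order; these paths are suggestions from the
// report, not current Spark behavior.
val candidates: Seq[String] =
  sys.env.get("SPARK_LOG_DIR").toSeq ++
  Seq("/var/log/spark") ++
  sys.props.get("user.home").map(_ + "/.cache/spark-logs").toSeq ++
  sys.env.get("SPARK_HOME").map(_ + "/logs").toSeq

// First directory that exists (or can be created) and is writable wins.
val logDir: Option[File] =
  candidates.map(new File(_))
    .find(d => (d.isDirectory || d.mkdirs()) && d.canWrite)
{code}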
[jira] [Created] (SPARK-9637) Add interface for implementing scheduling algorithm for standalone deployment
Liang-Chi Hsieh created SPARK-9637: -- Summary: Add interface for implementing scheduling algorithm for standalone deployment Key: SPARK-9637 URL: https://issues.apache.org/jira/browse/SPARK-9637 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Liang-Chi Hsieh We want to abstract the interface of the scheduling algorithm for standalone deployment mode. This would make it easier to implement different scheduling algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9637) Add interface for implementing scheduling algorithm for standalone deployment
[ https://issues.apache.org/jira/browse/SPARK-9637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655166#comment-14655166 ] Apache Spark commented on SPARK-9637: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/7958 Add interface for implementing scheduling algorithm for standalone deployment - Key: SPARK-9637 URL: https://issues.apache.org/jira/browse/SPARK-9637 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Liang-Chi Hsieh We want to abstract the scheduling algorithm interface for standalone deployment mode, which would make it easier to implement different scheduling algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9637) Add interface for implementing scheduling algorithm for standalone deployment
[ https://issues.apache.org/jira/browse/SPARK-9637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9637: --- Assignee: (was: Apache Spark) Add interface for implementing scheduling algorithm for standalone deployment - Key: SPARK-9637 URL: https://issues.apache.org/jira/browse/SPARK-9637 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Liang-Chi Hsieh We want to abstract the scheduling algorithm interface for standalone deployment mode, which would make it easier to implement different scheduling algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9637) Add interface for implementing scheduling algorithm for standalone deployment
[ https://issues.apache.org/jira/browse/SPARK-9637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9637: --- Assignee: Apache Spark Add interface for implementing scheduling algorithm for standalone deployment - Key: SPARK-9637 URL: https://issues.apache.org/jira/browse/SPARK-9637 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Liang-Chi Hsieh Assignee: Apache Spark We want to abstract the scheduling algorithm interface for standalone deployment mode, which would make it easier to implement different scheduling algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9563) Remove repartition operators when they are the child of Exchange and shuffle=True
[ https://issues.apache.org/jira/browse/SPARK-9563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655180#comment-14655180 ] Apache Spark commented on SPARK-9563: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/7959 Remove repartition operators when they are the child of Exchange and shuffle=True - Key: SPARK-9563 URL: https://issues.apache.org/jira/browse/SPARK-9563 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Consider the following query: {code} val df1 = sqlContext.createDataFrame(sc.parallelize(1 to 100, 100).map(x => (x, x))) val df2 = sqlContext.createDataFrame(sc.parallelize(1 to 100, 100).map(x => (x, x))) df1.repartition(1000).join(df2, "_1").explain(true) {code} Here's the plan for this query as of Spark 1.4.1: {code} == Parsed Logical Plan == Project [_1#68991,_2#68992,_2#68994] Join Inner, Some((_1#68991 = _1#68993)) Repartition 1000, true LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame at <console>:29 LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame at <console>:30 == Analyzed Logical Plan == _1: int, _2: int, _2: int Project [_1#68991,_2#68992,_2#68994] Join Inner, Some((_1#68991 = _1#68993)) Repartition 1000, true LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame at <console>:29 LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame at <console>:30 == Optimized Logical Plan == Project [_1#68991,_2#68992,_2#68994] Join Inner, Some((_1#68991 = _1#68993)) Repartition 1000, true LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame at <console>:29 LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame at <console>:30 == Physical Plan == Project [_1#68991,_2#68992,_2#68994] ShuffledHashJoin [_1#68991], [_1#68993], BuildRight Exchange (HashPartitioning 200) Repartition 1000, true PhysicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame at <console>:29 Exchange (HashPartitioning 200) PhysicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame at <console>:30 {code} In this plan, we end up repartitioning {{df1}} to have 1000 partitions, which involves a shuffle, only to turn around and shuffle again as part of the exchange. To avoid this extra shuffle, I think that we should remove the Repartition when the following condition holds: - Exchange's child is a repartition operator where shuffle=True. We should not perform this collapsing when shuffle=False, since there might be a legitimate reason to coalesce before shuffling (reducing the number of map outputs that need to be tracked, for instance). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
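A minimal sketch of the collapsing rule being proposed, using simplified stand-in node classes rather than Spark's actual plan operators:
{code}
// Simplified plan nodes, for illustration only.
sealed trait PlanNode
case class Leaf(name: String) extends PlanNode
case class Repartition(numPartitions: Int, shuffle: Boolean, child: PlanNode) extends PlanNode
case class Exchange(numPartitions: Int, child: PlanNode) extends PlanNode

// Drop a shuffling Repartition that sits directly under an Exchange: the
// Exchange shuffles anyway, so the extra shuffle is pure overhead. A
// coalesce-style Repartition (shuffle = false) is deliberately kept.
def collapseRepartition(plan: PlanNode): PlanNode = plan match {
  case Exchange(n, Repartition(_, true, grandchild)) =>
    collapseRepartition(Exchange(n, grandchild))
  case Exchange(n, child) => Exchange(n, collapseRepartition(child))
  case Repartition(n, s, child) => Repartition(n, s, collapseRepartition(child))
  case leaf => leaf
}

// Exchange(200, Repartition(1000, shuffle = true, Leaf("df1"))) becomes
// Exchange(200, Leaf("df1")), matching the physical plan shown above.
{code}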
[jira] [Assigned] (SPARK-9563) Remove repartition operators when they are the child of Exchange and shuffle=True
[ https://issues.apache.org/jira/browse/SPARK-9563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9563: --- Assignee: Apache Spark Remove repartition operators when they are the child of Exchange and shuffle=True - Key: SPARK-9563 URL: https://issues.apache.org/jira/browse/SPARK-9563 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Apache Spark Consider the following query: {code} val df1 = sqlContext.createDataFrame(sc.parallelize(1 to 100, 100).map(x => (x, x))) val df2 = sqlContext.createDataFrame(sc.parallelize(1 to 100, 100).map(x => (x, x))) df1.repartition(1000).join(df2, "_1").explain(true) {code} Here's the plan for this query as of Spark 1.4.1: {code} == Parsed Logical Plan == Project [_1#68991,_2#68992,_2#68994] Join Inner, Some((_1#68991 = _1#68993)) Repartition 1000, true LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame at <console>:29 LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame at <console>:30 == Analyzed Logical Plan == _1: int, _2: int, _2: int Project [_1#68991,_2#68992,_2#68994] Join Inner, Some((_1#68991 = _1#68993)) Repartition 1000, true LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame at <console>:29 LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame at <console>:30 == Optimized Logical Plan == Project [_1#68991,_2#68992,_2#68994] Join Inner, Some((_1#68991 = _1#68993)) Repartition 1000, true LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame at <console>:29 LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame at <console>:30 == Physical Plan == Project [_1#68991,_2#68992,_2#68994] ShuffledHashJoin [_1#68991], [_1#68993], BuildRight Exchange (HashPartitioning 200) Repartition 1000, true PhysicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame at <console>:29 Exchange (HashPartitioning 200) PhysicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame at <console>:30 {code} In this plan, we end up repartitioning {{df1}} to have 1000 partitions, which involves a shuffle, only to turn around and shuffle again as part of the exchange. To avoid this extra shuffle, I think that we should remove the Repartition when the following condition holds: - Exchange's child is a repartition operator where shuffle=True. We should not perform this collapsing when shuffle=False, since there might be a legitimate reason to coalesce before shuffling (reducing the number of map outputs that need to be tracked, for instance). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9563) Remove repartition operators when they are the child of Exchange and shuffle=True
[ https://issues.apache.org/jira/browse/SPARK-9563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9563: --- Assignee: (was: Apache Spark) Remove repartition operators when they are the child of Exchange and shuffle=True - Key: SPARK-9563 URL: https://issues.apache.org/jira/browse/SPARK-9563 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Consider the following query: {code} val df1 = sqlContext.createDataFrame(sc.parallelize(1 to 100, 100).map(x => (x, x))) val df2 = sqlContext.createDataFrame(sc.parallelize(1 to 100, 100).map(x => (x, x))) df1.repartition(1000).join(df2, "_1").explain(true) {code} Here's the plan for this query as of Spark 1.4.1: {code} == Parsed Logical Plan == Project [_1#68991,_2#68992,_2#68994] Join Inner, Some((_1#68991 = _1#68993)) Repartition 1000, true LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame at <console>:29 LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame at <console>:30 == Analyzed Logical Plan == _1: int, _2: int, _2: int Project [_1#68991,_2#68992,_2#68994] Join Inner, Some((_1#68991 = _1#68993)) Repartition 1000, true LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame at <console>:29 LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame at <console>:30 == Optimized Logical Plan == Project [_1#68991,_2#68992,_2#68994] Join Inner, Some((_1#68991 = _1#68993)) Repartition 1000, true LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame at <console>:29 LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame at <console>:30 == Physical Plan == Project [_1#68991,_2#68992,_2#68994] ShuffledHashJoin [_1#68991], [_1#68993], BuildRight Exchange (HashPartitioning 200) Repartition 1000, true PhysicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame at <console>:29 Exchange (HashPartitioning 200) PhysicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame at <console>:30 {code} In this plan, we end up repartitioning {{df1}} to have 1000 partitions, which involves a shuffle, only to turn around and shuffle again as part of the exchange. To avoid this extra shuffle, I think that we should remove the Repartition when the following condition holds: - Exchange's child is a repartition operator where shuffle=True. We should not perform this collapsing when shuffle=False, since there might be a legitimate reason to coalesce before shuffling (reducing the number of map outputs that need to be tracked, for instance). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9638) .save() Procedure fails
Stijn Geuens created SPARK-9638: --- Summary: .save() Procedure fails Key: SPARK-9638 URL: https://issues.apache.org/jira/browse/SPARK-9638 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.4.1 Reporter: Stijn Geuens I am not able to save a MatrixFactorizationModel I created. Path ./Models exists. Working with pyspark in IPython notebook (spark version = 1.4.1, hadoop version = 2.6) Error message: --- Py4JJavaError Traceback (most recent call last) <ipython-input-14-28d4a0d852bb> in <module>() ----> 1 CFMFModel11.save(sc, "./Models/CFMFModel11") C:\Users\s.geuens\Spark\spark-1.4.1-bin-hadoop2.6\python\pyspark\mllib\util.pyc in save(self, sc, path) 202 203 def save(self, sc, path): --> 204 self._java_model.save(sc._jsc.sc(), path) 205 206 C:\Users\s.geuens\Spark\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py in __call__(self, *args) 536 answer = self.gateway_client.send_command(command) 537 return_value = get_return_value(answer, self.gateway_client, --> 538 self.target_id, self.name) 539 540 for temp_arg in temp_args: C:\Users\s.geuens\Spark\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. --> 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError( Py4JJavaError: An error occurred while calling o334.save. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1823.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1823.0 (TID 489, localhost): java.lang.NullPointerException at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010) at org.apache.hadoop.util.Shell.runCommand(Shell.java:482) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) at org.apache.hadoop.util.Shell.execCommand(Shell.java:808) at org.apache.hadoop.util.Shell.execCommand(Shell.java:791) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:656) at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:490) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:462) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801) at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123) at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1104) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730) at
[jira] [Commented] (SPARK-9638) .save() Procedure fails
[ https://issues.apache.org/jira/browse/SPARK-9638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655214#comment-14655214 ] Sean Owen commented on SPARK-9638: -- I think this is because you are on Windows and you may not have Hadoop installed and/or HADOOP_HOME set. It needs some support binaries on Windows to interact with the FS. That is, I think this is the same as SPARK-2356 underneath. .save() Procedure fails --- Key: SPARK-9638 URL: https://issues.apache.org/jira/browse/SPARK-9638 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.4.1 Reporter: Stijn Geuens I am not able to save a MatrixFactorizationModel I created. Path ./Models exists. Working with pyspark in IPython notebook (spark version = 1.4.1, hadoop version = 2.6) Error message: --- Py4JJavaError Traceback (most recent call last) <ipython-input-14-28d4a0d852bb> in <module>() ----> 1 CFMFModel11.save(sc, "./Models/CFMFModel11") C:\Users\s.geuens\Spark\spark-1.4.1-bin-hadoop2.6\python\pyspark\mllib\util.pyc in save(self, sc, path) 202 203 def save(self, sc, path): --> 204 self._java_model.save(sc._jsc.sc(), path) 205 206 C:\Users\s.geuens\Spark\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py in __call__(self, *args) 536 answer = self.gateway_client.send_command(command) 537 return_value = get_return_value(answer, self.gateway_client, --> 538 self.target_id, self.name) 539 540 for temp_arg in temp_args: C:\Users\s.geuens\Spark\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. --> 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError( Py4JJavaError: An error occurred while calling o334.save. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1823.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1823.0 (TID 489, localhost): java.lang.NullPointerException at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010) at org.apache.hadoop.util.Shell.runCommand(Shell.java:482) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) at org.apache.hadoop.util.Shell.execCommand(Shell.java:808) at org.apache.hadoop.util.Shell.execCommand(Shell.java:791) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:656) at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:490) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:462) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801) at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123) at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1104) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263) at
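If the diagnosis above is right, a quick pre-flight check can confirm it before calling {{save()}}. This is an assumed workaround sketch: on Windows, Hadoop's local filesystem code shells out to {{winutils.exe}} under {{%HADOOP_HOME%\bin}}, and a missing binary surfaces as exactly this kind of NullPointerException:
{code}
import java.io.File

// Returns true if HADOOP_HOME points at a directory containing bin/winutils.exe.
// The binary itself must be installed manually; this only detects its absence.
def winutilsPresent(): Boolean =
  sys.env.get("HADOOP_HOME")
    .map(home => new File(home, "bin/winutils.exe"))
    .exists(_.isFile)

if (!winutilsPresent()) {
  println("HADOOP_HOME is unset or missing bin/winutils.exe; " +
    "saving a model to a local path on Windows will likely fail with an NPE.")
}
{code}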
[jira] [Resolved] (SPARK-9593) Hive ShimLoader loads wrong Hadoop shims when Spark is compiled against Hadoop 2.0.0-mr1-cdh4.1.1
[ https://issues.apache.org/jira/browse/SPARK-9593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-9593. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7929 [https://github.com/apache/spark/pull/7929] Hive ShimLoader loads wrong Hadoop shims when Spark is compiled against Hadoop 2.0.0-mr1-cdh4.1.1 - Key: SPARK-9593 URL: https://issues.apache.org/jira/browse/SPARK-9593 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker Fix For: 1.5.0 Internally, Hive {{ShimLoader}} tries to load different versions of Hadoop shims by checking version information gathered from Hadoop jar files. If the major version number is 1, {{Hadoop20SShims}} will be loaded. Otherwise, if the major version number is 2, {{Hadoop23Shims}} will be chosen. However, CDH Hadoop versions like 2.0.0-mr1-cdh4.1.1 have 2 as the major version number, but contain Hadoop 1 code. This confuses Hive {{ShimLoader}} into loading the wrong version of the shims. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
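Schematically, the failure mode is a version check that keys on the major version alone; an illustrative (not Hive's actual) rendering of the logic:
{code}
// Illustrative sketch of the shim selection that goes wrong; Hive's real
// ShimLoader code differs, but the decision is driven by the major version.
def pickShims(hadoopVersion: String): String =
  hadoopVersion.takeWhile(_ != '.').toInt match {
    case 1 => "Hadoop20SShims"
    // "2.0.0-mr1-cdh4.1.1" lands here even though it ships Hadoop 1 code.
    case 2 => "Hadoop23Shims"
    case v => sys.error(s"Unrecognized Hadoop major version: $v")
  }

pickShims("2.0.0-mr1-cdh4.1.1")  // "Hadoop23Shims" -- the wrong choice
{code}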
[jira] [Commented] (SPARK-9636) Treat $SPARK_HOME as write-only
[ https://issues.apache.org/jira/browse/SPARK-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655310#comment-14655310 ] Philipp Angerer commented on SPARK-9636: everything is more obvious than picking a location relative to the binary ;) and the location is reported anyway since the {{start-master.sh}} script outputs {{starting org.apache.spark.deploy.master.Master, logging to /home/user/.cache/spark-logs/spark-user-org.apache.spark.deploy.master.Master-1-hostname.out}} about write permissions, mind that i suggest testing them sequentially until one is found that can be written to. that’s IMHO a more sensible default than failing, and having to {{grep -i 'log' $SPARK_HOME/sbin/*.sh}} to find that an environment variable exists, and then retrying with that variable set. Treat $SPARK_HOME as write-only --- Key: SPARK-9636 URL: https://issues.apache.org/jira/browse/SPARK-9636 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 1.4.1 Environment: Linux Reporter: Philipp Angerer Priority: Minor Labels: easyfix when starting spark scripts as user and it is installed in a directory the user has no write permissions on, many things work fine, except for the logs (e.g. for {{start-master.sh}}) logs are per default written to {{$SPARK_LOG_DIR}} or (if unset) to {{$SPARK_HOME/logs}}. if installed in this way, it should, instead of throwing an error, write logs to {{/var/log/spark/}}. that’s easy to fix by simply testing a few log dirs in sequence for writability before trying to use one. i suggest using {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} → {{$SPARK_HOME/logs/}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9636) Treat $SPARK_HOME as write-only
[ https://issues.apache.org/jira/browse/SPARK-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655310#comment-14655310 ] Philipp Angerer edited comment on SPARK-9636 at 8/5/15 1:04 PM: everything is more obvious than picking a location relative to the binary ;) and the location is reported anyway since the {{start-master.sh}} script outputs {{starting org.apache.spark.deploy.master.Master, logging to /home/user/.cache/spark-logs/spark-user-org.apache.spark.deploy.master.Master-1-hostname.out}} about write permissions, mind that i suggest testing them sequentially until one is found that can be written to. that’s IMHO a more sensible default than failing, and having to {{grep -i 'log' $SPARK_HOME/sbin/*.sh}} to find that an environment variable exists, and then retrying with that variable set. was (Author: angerer): everything is more obvious than picing a location relative to the binary ;) and the location is reported anyway since the {{start-master.sh}} script outputs {{starting org.apache.spark.deploy.master.Master, logging to /home/user/.cache/spark-logs/spark-user-org.apache.spark.deploy.master.Master-1-hostname.out}} about write permissions, mind that i suggest testing them sequentially until one is found that can be written to. that’s IMHO a more sensible default than failing, and having to {{grep -i 'log' $SPARK_HOME/sbin/*.sh}} to find that an environment variable exists, and then retrying with that variable set. Treat $SPARK_HOME as write-only --- Key: SPARK-9636 URL: https://issues.apache.org/jira/browse/SPARK-9636 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 1.4.1 Environment: Linux Reporter: Philipp Angerer Priority: Minor Labels: easyfix when starting spark scripts as user and it is installed in a directory the user has no write permissions on, many things work fine, except for the logs (e.g. for {{start-master.sh}}) logs are per default written to {{$SPARK_LOG_DIR}} or (if unset) to {{$SPARK_HOME/logs}}. if installed in this way, it should, instead of throwing an error, write logs to {{/var/log/spark/}}. that’s easy to fix by simply testing a few log dirs in sequence for writability before trying to use one. i suggest using {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} → {{$SPARK_HOME/logs/}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9639) JobHandler may throw NPE if JobScheduler has been stopped
Shixiong Zhu created SPARK-9639: --- Summary: JobHandler may throw NPE if JobScheduler has been stopped Key: SPARK-9639 URL: https://issues.apache.org/jira/browse/SPARK-9639 Project: Spark Issue Type: Bug Components: Streaming Reporter: Shixiong Zhu Because `JobScheduler.stop(false)` may set `eventLoop` to null while a `JobHandler` is running, it's possible that `eventLoop` is already null when `post` is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
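A common fix for this kind of race is to read the volatile field once into a local and null-check that; a simplified sketch (stand-in classes, not the actual streaming code):
{code}
// Stand-in types, for illustration of the race only.
class EventLoop { def post(event: String): Unit = println(s"posted $event") }

class JobScheduler {
  @volatile private var eventLoop: EventLoop = new EventLoop

  def stop(): Unit = { eventLoop = null }  // can run concurrently with handlers

  // Called from a JobHandler when a job finishes.
  def onJobCompleted(): Unit = {
    // Read the volatile once. Checking `eventLoop` and then calling
    // `eventLoop.post(...)` directly could still NPE if stop() runs in between.
    val loop = eventLoop
    if (loop != null) loop.post("JobCompleted")  // silently skip after stop()
  }
}
{code}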
[jira] [Assigned] (SPARK-9639) JobHandler may throw NPE if JobScheduler has been stopped
[ https://issues.apache.org/jira/browse/SPARK-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9639: --- Assignee: Apache Spark JobHandler may throw NPE if JobScheduler has been stopped - Key: SPARK-9639 URL: https://issues.apache.org/jira/browse/SPARK-9639 Project: Spark Issue Type: Bug Components: Streaming Reporter: Shixiong Zhu Assignee: Apache Spark Because `JobScheduler.stop(false)` may set `eventLoop` to null while a `JobHandler` is running, it's possible that `eventLoop` is already null when `post` is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9639) JobHandler may throw NPE if JobScheduler has been stopped
[ https://issues.apache.org/jira/browse/SPARK-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658195#comment-14658195 ] Apache Spark commented on SPARK-9639: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/7960 JobHandler may throw NPE if JobScheduler has been stopped - Key: SPARK-9639 URL: https://issues.apache.org/jira/browse/SPARK-9639 Project: Spark Issue Type: Bug Components: Streaming Reporter: Shixiong Zhu Because `JobScheduler.stop(false)` may set `eventLoop` to null while a `JobHandler` is running, it's possible that `eventLoop` is already null when `post` is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9639) JobHandler may throw NPE if JobScheduler has been stopped
[ https://issues.apache.org/jira/browse/SPARK-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9639: --- Assignee: (was: Apache Spark) JobHandler may throw NPE if JobScheduler has been stopped - Key: SPARK-9639 URL: https://issues.apache.org/jira/browse/SPARK-9639 Project: Spark Issue Type: Bug Components: Streaming Reporter: Shixiong Zhu Because `JobScheduler.stop(false)` may set `eventLoop` to null while a `JobHandler` is running, it's possible that `eventLoop` is already null when `post` is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9640) Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated
Tathagata Das created SPARK-9640: Summary: Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated Key: SPARK-9640 URL: https://issues.apache.org/jira/browse/SPARK-9640 Project: Spark Issue Type: Test Components: Streaming, Tests Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9640) Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated
[ https://issues.apache.org/jira/browse/SPARK-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9640: --- Assignee: Apache Spark (was: Tathagata Das) Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated Key: SPARK-9640 URL: https://issues.apache.org/jira/browse/SPARK-9640 Project: Spark Issue Type: Test Components: Streaming, Tests Reporter: Tathagata Das Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9640) Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated
[ https://issues.apache.org/jira/browse/SPARK-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9640: --- Assignee: Tathagata Das (was: Apache Spark) Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated Key: SPARK-9640 URL: https://issues.apache.org/jira/browse/SPARK-9640 Project: Spark Issue Type: Test Components: Streaming, Tests Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9640) Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated
[ https://issues.apache.org/jira/browse/SPARK-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658243#comment-14658243 ] Apache Spark commented on SPARK-9640: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/7961 Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated Key: SPARK-9640 URL: https://issues.apache.org/jira/browse/SPARK-9640 Project: Spark Issue Type: Test Components: Streaming, Tests Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9641) spark.shuffle.service.port is not documented
Thomas Graves created SPARK-9641: Summary: spark.shuffle.service.port is not documented Key: SPARK-9641 URL: https://issues.apache.org/jira/browse/SPARK-9641 Project: Spark Issue Type: Bug Components: Shuffle Reporter: Thomas Graves Looking at the code I see spark.shuffle.service.port being used but I can't find any documentation on it. I don't see a reason for this to be an internal config so we should document it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
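For reference, both settings are ordinary Spark conf entries, along the lines of the following (7337 is believed to be the compiled-in default port, so treat that value as an assumption):
{code}
import org.apache.spark.SparkConf

// Enable the external shuffle service and pin its port explicitly.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.port", "7337")
{code}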
[jira] [Resolved] (SPARK-9618) SQLContext.read.schema().parquet() ignores the supplied schema
[ https://issues.apache.org/jira/browse/SPARK-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-9618. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7947 [https://github.com/apache/spark/pull/7947] SQLContext.read.schema().parquet() ignores the supplied schema -- Key: SPARK-9618 URL: https://issues.apache.org/jira/browse/SPARK-9618 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.1 Reporter: Nathan Howell Assignee: Nathan Howell Priority: Minor Fix For: 1.5.0 If a user supplies a schema when loading a Parquet file it is ignored and the schema is read off disk instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8862) Add a web UI page that visualizes physical plans (SparkPlan)
[ https://issues.apache.org/jira/browse/SPARK-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658302#comment-14658302 ] Apache Spark commented on SPARK-8862: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/7962 Add a web UI page that visualizes physical plans (SparkPlan) Key: SPARK-8862 URL: https://issues.apache.org/jira/browse/SPARK-8862 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Shixiong Zhu Fix For: 1.5.0 We currently have the ability to visualize part of the query plan using the Spark DAG viz. However, that does NOT work for one of the most important operators: broadcast join. The reason is that broadcast join launches multiple Spark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6486) Add BlockMatrix in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6486: - Fix Version/s: (was: 1.6.0) 1.5.0 Add BlockMatrix in PySpark -- Key: SPARK-6486 URL: https://issues.apache.org/jira/browse/SPARK-6486 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Mike Dusenberry Fix For: 1.5.0 We should add BlockMatrix to PySpark. Internally, we can use DataFrames and MatrixUDT for serialization. This JIRA should cover conversions from IndexedRowMatrix/CoordinateMatrix to block matrices. But this does NOT cover linear algebra operations on block matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
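The Scala side already has the conversions this JIRA wants mirrored in Python; roughly:
{code}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Build a small CoordinateMatrix, then convert to a BlockMatrix --
// the Scala behavior the PySpark API should mirror (assumes an existing sc).
val entries = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 2.0), MatrixEntry(2, 0, 3.0)))
val coordMat = new CoordinateMatrix(entries)

val blockMat = coordMat.toBlockMatrix(rowsPerBlock = 2, colsPerBlock = 2)
blockMat.validate()  // sanity-check block dimensions

val indexedMat = coordMat.toIndexedRowMatrix()
{code}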
[jira] [Resolved] (SPARK-9381) Migrate JSON data source to the new partitioning data source
[ https://issues.apache.org/jira/browse/SPARK-9381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-9381. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7696 [https://github.com/apache/spark/pull/7696] Migrate JSON data source to the new partitioning data source Key: SPARK-9381 URL: https://issues.apache.org/jira/browse/SPARK-9381 Project: Spark Issue Type: New Feature Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9381) Migrate JSON data source to the new partitioning data source
[ https://issues.apache.org/jira/browse/SPARK-9381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9381: -- Assignee: Cheng Hao Migrate JSON data source to the new partitioning data source Key: SPARK-9381 URL: https://issues.apache.org/jira/browse/SPARK-9381 Project: Spark Issue Type: New Feature Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9641) spark.shuffle.service.port is not documented
[ https://issues.apache.org/jira/browse/SPARK-9641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658328#comment-14658328 ] Sean Owen commented on SPARK-9641: -- Agree. The .enabled flag is mentioned but not documented either. Want to make a PR, or should I? spark.shuffle.service.port is not documented Key: SPARK-9641 URL: https://issues.apache.org/jira/browse/SPARK-9641 Project: Spark Issue Type: Bug Components: Documentation, Shuffle Reporter: Thomas Graves Looking at the code I see spark.shuffle.service.port being used but I can't find any documentation on it. I don't see a reason for this to be an internal config so we should document it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9641) spark.shuffle.service.port is not documented
[ https://issues.apache.org/jira/browse/SPARK-9641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9641: - Priority: Minor (was: Major) Component/s: Documentation Issue Type: Improvement (was: Bug) spark.shuffle.service.port is not documented Key: SPARK-9641 URL: https://issues.apache.org/jira/browse/SPARK-9641 Project: Spark Issue Type: Improvement Components: Documentation, Shuffle Reporter: Thomas Graves Priority: Minor Looking at the code I see spark.shuffle.service.port being used but I can't find any documentation on it. I don't see a reason for this to be an internal config so we should document it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5544) wholeTextFiles should recognize multiple input paths delimited by ,
[ https://issues.apache.org/jira/browse/SPARK-5544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658368#comment-14658368 ] Perinkulam I Ganesh commented on SPARK-5544: It seems like this JIRA got resolved by SPARK-7155... please double-check. Thanks - P. I. wholeTextFiles should recognize multiple input paths delimited by , --- Key: SPARK-5544 URL: https://issues.apache.org/jira/browse/SPARK-5544 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Xiangrui Meng textFile takes delimited paths in a single path string. wholeTextFiles should behave the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
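For comparison, the behavior being requested (the paths here are hypothetical):
{code}
// textFile already accepts a comma-delimited list of paths in one string:
val lines = sc.textFile("/data/dir1,/data/dir2")

// The request is for wholeTextFiles to accept the same form, yielding
// (fileName, fileContent) pairs drawn from all of the listed paths:
val files = sc.wholeTextFiles("/data/dir1,/data/dir2")
{code}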
[jira] [Commented] (SPARK-8486) SIFT Feature Transformer
[ https://issues.apache.org/jira/browse/SPARK-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658377#comment-14658377 ] K S Sreenivasa Raghavan commented on SPARK-8486: How do I accept this issue? Should I use Scala or Python? SIFT Feature Transformer Key: SPARK-8486 URL: https://issues.apache.org/jira/browse/SPARK-8486 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Priority: Minor Scale invariant feature transform (SIFT) is a scale and rotation invariant method to transform images into matrices describing local features. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Array[Numeric]] of the SIFT features for the provided image. The implementation should support computation of SIFT at predefined interest points, every kth pixel, and densely (over all pixels). Furthermore, the implementation should support approximating the Laplacian of Gaussian using the Difference of Gaussians (as described by Lowe). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658376#comment-14658376 ] Seth Hendrickson commented on SPARK-8971: - [~mengxr] You mentioned that the solution should call {{sampleByKeyExact}}, which is a function that takes a stratified subsample of m < N elements from a dataset. One problem is that when doing things like train/test split and k-fold creation (which are fundamentally the same as far as sampling goes), we actually need to take random splits of the dataset. That is, we need not only the subsample, but its complement. For k-fold sampling, we need to split the dataset into k unique, non-overlapping subsamples, which isn't possible with {{sampleByKeyExact}} in its current state. I have a pretty coarse prototype which essentially uses the [efficient, parallel sampling routine|http://jmlr.org/proceedings/papers/v28/meng13a.html] to find the exact k thresholds needed to split the dataset into k subsamples. I had to modify the sampling function in {{org.apache.spark.util.random.StratifiedSamplingUtils}} to compare the random keys to a range (e.g. x > lb && x <= ub), rather than simply comparing to one number (x < threshold), which only allows for a bisection of the data. Once you know the exact k-1 thresholds that provide even splits for each stratum, and you have a sampling function that can compare the random key to a range, you have what you need for stratified k-fold and train/test split. Is there a way to implement this without touching the {{org.apache.spark.util.random}} package that I'm missing? Support balanced class labels when splitting train/cross validation sets Key: SPARK-8971 URL: https://issues.apache.org/jira/browse/SPARK-8971 Project: Spark Issue Type: New Feature Components: ML Reporter: Feynman Liang Assignee: Seth Hendrickson {{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are Spark classes which partition data into training and evaluation sets for performing hyperparameter selection via cross validation. Both methods currently perform the split by randomly sampling the datasets. However, when class probabilities are highly imbalanced (e.g. detection of extremely low-frequency events), random sampling may result in cross validation sets not representative of actual out-of-training performance (e.g. no positive training examples could be included). Mainstream R packages like [caret|http://topepo.github.io/caret/splitting.html] already support splitting the data based upon the class labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
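To make the range idea concrete, here is a toy local sketch of range-based fold assignment within a single stratum. It uses fixed thresholds at f/k rather than the exact per-stratum thresholds discussed above, which are the hard part and are omitted:
{code}
import scala.util.Random

// Each element draws a uniform key; fold f owns keys in [f/k, (f+1)/k).
// Because every key falls in exactly one range, the folds are unique and
// non-overlapping -- something a single threshold comparison cannot give.
def assignFolds[T](stratum: Seq[T], k: Int, seed: Long): Map[Int, Seq[T]] = {
  val rng = new Random(seed)
  stratum
    .map(x => (x, rng.nextDouble()))
    .groupBy { case (_, key) => math.min((key * k).toInt, k - 1) }
    .mapValues(_.map(_._1))
    .toMap
}

val folds = assignFolds(1 to 100, k = 5, seed = 42L)
{code}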
[jira] [Commented] (SPARK-4412) Parquet logger cannot be configured
[ https://issues.apache.org/jira/browse/SPARK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658378#comment-14658378 ] Stephen Carman commented on SPARK-4412: --- This also happens to me a lot in Spark 1.4.0; perhaps this could be tested on the 1.4 branch as well? Parquet logger cannot be configured --- Key: SPARK-4412 URL: https://issues.apache.org/jira/browse/SPARK-4412 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.1 Reporter: Jim Carroll The Spark ParquetRelation.scala code makes the assumption that the parquet.Log class has already been loaded. If ParquetRelation.enableLogForwarding executes prior to the parquet.Log class being loaded, then the code in enableLogForwarding has no effect. ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if your application simply reads a parquet file before it does anything else with Parquet), the parquet.Log class hasn't been loaded yet. Therefore the code in ParquetRelation.enableLogForwarding has no effect. If you look at the code in parquet.Log there's a static initializer that needs to be called prior to enableLogForwarding, or whatever enableLogForwarding does gets undone by this static initializer. The fix would be to force the static initializer to get called in parquet.Log as part of enableLogForwarding. PR will be forthcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
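Schematically, the fix described above amounts to forcing the class to initialize before touching its logger; a sketch of the idea only (the real change belongs in ParquetRelation.scala, and the handler reconfiguration below is illustrative):
{code}
import java.util.logging.Logger

def enableLogForwarding(): Unit = {
  // Force parquet.Log's static initializer to run *now*, so that whatever it
  // does to the "parquet" logger cannot undo the configuration applied below.
  Class.forName("parquet.Log", true, Thread.currentThread().getContextClassLoader)

  val logger = Logger.getLogger("parquet")
  // Example reconfiguration: strip parquet's own handlers and let records
  // propagate to the parent (e.g. an SLF4J bridge installed at the root).
  logger.getHandlers.foreach(logger.removeHandler)
  logger.setUseParentHandlers(true)
}
{code}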
[jira] [Resolved] (SPARK-5544) wholeTextFiles should recognize multiple input paths delimited by ,
[ https://issues.apache.org/jira/browse/SPARK-5544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5544. -- Resolution: Duplicate Target Version/s: (was: 1.5.0) wholeTextFiles should recognize multiple input paths delimited by , --- Key: SPARK-5544 URL: https://issues.apache.org/jira/browse/SPARK-5544 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Xiangrui Meng textFile takes delimited paths in a single path string. wholeTextFiles should behave the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6227: --- Assignee: Apache Spark PCA and SVD for PySpark --- Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.2.1 Reporter: Julien Amelot Assignee: Apache Spark The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658403#comment-14658403 ] Apache Spark commented on SPARK-6227: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/7963 PCA and SVD for PySpark --- Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.2.1 Reporter: Julien Amelot The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
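For reference, the existing Scala entry points that the Python bindings would wrap (assumes an existing sc):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)))
val mat = new RowMatrix(rows)

// PCA: top-2 principal components, returned as a local Matrix.
val pc = mat.computePrincipalComponents(2)

// SVD: top-2 singular values, with U computed as a distributed RowMatrix.
val svd = mat.computeSVD(2, computeU = true)
{code}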