[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696596#comment-14696596 ] Reynold Xin commented on SPARK-9213: We just need to handle it in the analyzer to rewrite it, and also pattern match it in the optimizer. No need to handle it everywhere else, since the analyzer will take care of the rewrite. > Improve regular expression performance (via joni) > - > > Key: SPARK-9213 > URL: https://issues.apache.org/jira/browse/SPARK-9213 > Project: Spark > Issue Type: Umbrella > Components: SQL >Reporter: Reynold Xin > > I'm creating an umbrella ticket to improve regular expression performance for > string expressions. Right now our use of regular expressions is inefficient > for two reasons: > 1. Java regex in general is slow. > 2. We have to convert everything from UTF8 encoded bytes into Java String, > and then run regex on it, and then convert it back. > There are libraries in Java that provide regex support directly on UTF8 > encoded bytes. One prominent example is joni, used in JRuby. > Note: all regex functions are in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
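For context on the second inefficiency above: joni matches directly against UTF8-encoded byte arrays, so no conversion to java.lang.String is needed. A minimal standalone sketch of that usage, based on joni's public API rather than any Spark code (the class names and calls below are assumptions about the library, not the eventual Spark integration):

{code}
import org.jcodings.specific.UTF8Encoding
import org.joni.{Matcher, Regex}

// Pattern and subject stay as UTF-8 bytes end to end.
val pattern: Array[Byte] = "a.c".getBytes("UTF-8")
val subject: Array[Byte] = "xxabcxx".getBytes("UTF-8")

val regex = new Regex(pattern, 0, pattern.length, org.joni.Option.NONE, UTF8Encoding.INSTANCE)
val matcher: Matcher = regex.matcher(subject)

// search returns the byte offset of the first match, or -1 if there is no match.
val pos = matcher.search(0, subject.length, org.joni.Option.DEFAULT)
{code}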
[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696630#comment-14696630 ] Yadong Qi commented on SPARK-9213: -- I know, thanks! > Improve regular expression performance (via joni) > - > > Key: SPARK-9213 > URL: https://issues.apache.org/jira/browse/SPARK-9213 > Project: Spark > Issue Type: Umbrella > Components: SQL >Reporter: Reynold Xin > > I'm creating an umbrella ticket to improve regular expression performance for > string expressions. Right now our use of regular expressions is inefficient > for two reasons: > 1. Java regex in general is slow. > 2. We have to convert everything from UTF8 encoded bytes into Java String, > and then run regex on it, and then convert it back. > There are libraries in Java that provide regex support directly on UTF8 > encoded bytes. One prominent example is joni, used in JRuby. > Note: all regex functions are in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9961: --- Assignee: Apache Spark > ML prediction abstractions should have defaultEvaluator fields > -- > > Key: SPARK-9961 > URL: https://issues.apache.org/jira/browse/SPARK-9961 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > Predictor and PredictionModel should have abstract defaultEvaluator methods > which return Evaluators. Subclasses like Regressor, Classifier, etc. should > all provide natural evaluators, set to use the correct input columns and > metrics. Concrete classes may later be modified to > The initial implementation should be marked as DeveloperApi since we may need > to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
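A rough sketch of the API shape this proposes, using simplified stand-in types rather than Spark's actual ML classes (the real Evaluator works on DataFrames and is configured via Params):

{code}
// Illustrative stand-ins only, not Spark's ML hierarchy.
trait Evaluator {
  def evaluate(predictionsAndLabels: Seq[(Double, Double)]): Double
}

abstract class Predictor {
  // Proposed: each prediction abstraction exposes a natural evaluator,
  // preconfigured with the right input columns and metric.
  def defaultEvaluator: Evaluator
}

class Regressor extends Predictor {
  override def defaultEvaluator: Evaluator = new Evaluator {
    // RMSE as a natural regression metric.
    def evaluate(predictionsAndLabels: Seq[(Double, Double)]): Double =
      math.sqrt(predictionsAndLabels.map { case (p, l) => (p - l) * (p - l) }.sum /
        predictionsAndLabels.size)
  }
}
{code}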
[jira] [Commented] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696640#comment-14696640 ] Apache Spark commented on SPARK-9961: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/8190 > ML prediction abstractions should have defaultEvaluator fields > -- > > Key: SPARK-9961 > URL: https://issues.apache.org/jira/browse/SPARK-9961 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > Predictor and PredictionModel should have abstract defaultEvaluator methods > which return Evaluators. Subclasses like Regressor, Classifier, etc. should > all provide natural evaluators, set to use the correct input columns and > metrics. Concrete classes may later be modified to > The initial implementation should be marked as DeveloperApi since we may need > to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9961: --- Assignee: (was: Apache Spark) > ML prediction abstractions should have defaultEvaluator fields > -- > > Key: SPARK-9961 > URL: https://issues.apache.org/jira/browse/SPARK-9961 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > Predictor and PredictionModel should have abstract defaultEvaluator methods > which return Evaluators. Subclasses like Regressor, Classifier, etc. should > all provide natural evaluators, set to use the correct input columns and > metrics. Concrete classes may later be modified to > The initial implementation should be marked as DeveloperApi since we may need > to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9661) ML 1.5 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-9661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696641#comment-14696641 ] Apache Spark commented on SPARK-9661: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/8190 > ML 1.5 QA: API: Java compatibility, docs > > > Key: SPARK-9661 > URL: https://issues.apache.org/jira/browse/SPARK-9661 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar > Fix For: 1.5.0 > > > Check Java compatibility for MLlib for this release. > Checking compatibility means: > * comparing with the Scala doc > * verifying that Java docs are not messed up by Scala type incompatibilities. > Some items to look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. > * If needed for complex issues, create small Java unit tests which execute > each method. (The correctness can be checked in Scala.) > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here. > Note that we should not break APIs from previous releases. So if you find a > problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-9661) ML 1.5 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-9661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-9661: -- re-open for some clean-ups > ML 1.5 QA: API: Java compatibility, docs > > > Key: SPARK-9661 > URL: https://issues.apache.org/jira/browse/SPARK-9661 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar > Fix For: 1.5.0 > > > Check Java compatibility for MLlib for this release. > Checking compatibility means: > * comparing with the Scala doc > * verifying that Java docs are not messed up by Scala type incompatibilities. > Some items to look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. > * If needed for complex issues, create small Java unit tests which execute > each method. (The correctness can be checked in Scala.) > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here. > Note that we should not break APIs from previous releases. So if you find a > problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-9961: - Comment: was deleted (was: User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/8190) > ML prediction abstractions should have defaultEvaluator fields > -- > > Key: SPARK-9961 > URL: https://issues.apache.org/jira/browse/SPARK-9961 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > Predictor and PredictionModel should have abstract defaultEvaluator methods > which return Evaluators. Subclasses like Regressor, Classifier, etc. should > all provide natural evaluators, set to use the correct input columns and > metrics. Concrete classes may later be modified to > The initial implementation should be marked as DeveloperApi since we may need > to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer
[ https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9899: --- Assignee: Apache Spark (was: Cheng Lian) > JSON/Parquet writing on retry or speculation broken with direct output > committer > > > Key: SPARK-9899 > URL: https://issues.apache.org/jira/browse/SPARK-9899 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Apache Spark >Priority: Blocker > > If the first task fails all subsequent tasks will. We probably need to set a > different boolean when calling create. > {code} > java.io.IOException: File already exists: ... > ... > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452) > at > org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128) > at > org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.(JSONRelation.scala:185) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
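Regarding the "different boolean when calling create" above: Hadoop's FileSystem.create takes an explicit overwrite flag, and passing overwrite = true lets a retried or speculative attempt replace the partial file left by an earlier attempt instead of failing. A minimal sketch of that call, assuming a configured FileSystem and a hypothetical output path (this is not the actual Spark output-writer code):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
val out = new Path("/tmp/example-output/part-00000")  // hypothetical path

// overwrite = true: an existing file from a failed attempt is replaced rather than
// triggering "File already exists".
val stream = fs.create(out, true)
stream.writeBytes("example record\n")
stream.close()
{code}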
[jira] [Assigned] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer
[ https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9899: --- Assignee: Cheng Lian (was: Apache Spark) > JSON/Parquet writing on retry or speculation broken with direct output > committer > > > Key: SPARK-9899 > URL: https://issues.apache.org/jira/browse/SPARK-9899 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian >Priority: Blocker > > If the first task fails all subsequent tasks will. We probably need to set a > different boolean when calling create. > {code} > java.io.IOException: File already exists: ... > ... > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452) > at > org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128) > at > org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.(JSONRelation.scala:185) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer
[ https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696672#comment-14696672 ] Apache Spark commented on SPARK-9899: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8191 > JSON/Parquet writing on retry or speculation broken with direct output > committer > > > Key: SPARK-9899 > URL: https://issues.apache.org/jira/browse/SPARK-9899 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian >Priority: Blocker > > If the first task fails all subsequent tasks will. We probably need to set a > different boolean when calling create. > {code} > java.io.IOException: File already exists: ... > ... > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452) > at > org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128) > at > org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.(JSONRelation.scala:185) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client
Saisai Shao created SPARK-9969: -- Summary: Remove old Yarn MR classpath api support for Spark Yarn client Key: SPARK-9969 URL: https://issues.apache.org/jira/browse/SPARK-9969 Project: Spark Issue Type: Bug Components: YARN Reporter: Saisai Shao Priority: Minor Since we now only support the stable YARN API, this proposes removing the old MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH API support in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696677#comment-14696677 ] ding commented on SPARK-5556: - The code can be found at https://github.com/intel-analytics/TopicModeling. There is an example in the package; you can try Gibbs sampling LDA or online LDA by setting --optimizer to "gibbs" or "online". > Latent Dirichlet Allocation (LDA) using Gibbs sampler > -- > > Key: SPARK-5556 > URL: https://issues.apache.org/jira/browse/SPARK-5556 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Pedro Rodriguez > Attachments: LDA_test.xlsx, spark-summit.pptx > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client
[ https://issues.apache.org/jira/browse/SPARK-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9969: --- Assignee: Apache Spark > Remove old Yarn MR classpath api support for Spark Yarn client > -- > > Key: SPARK-9969 > URL: https://issues.apache.org/jira/browse/SPARK-9969 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Minor > > Since now we only support Yarn stable API, so here propose to remove old > MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH api support in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client
[ https://issues.apache.org/jira/browse/SPARK-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696681#comment-14696681 ] Apache Spark commented on SPARK-9969: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/8192 > Remove old Yarn MR classpath api support for Spark Yarn client > -- > > Key: SPARK-9969 > URL: https://issues.apache.org/jira/browse/SPARK-9969 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Saisai Shao >Priority: Minor > > Since now we only support Yarn stable API, so here propose to remove old > MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH api support in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client
[ https://issues.apache.org/jira/browse/SPARK-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9969: --- Assignee: (was: Apache Spark) > Remove old Yarn MR classpath api support for Spark Yarn client > -- > > Key: SPARK-9969 > URL: https://issues.apache.org/jira/browse/SPARK-9969 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Saisai Shao >Priority: Minor > > Since now we only support Yarn stable API, so here propose to remove old > MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH api support in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names
Maciej Bryński created SPARK-9970: - Summary: SQLContext.createDataFrame failed to properly determine column names Key: SPARK-9970 URL: https://issues.apache.org/jira/browse/SPARK-9970 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Reporter: Maciej Bryński Priority: Minor Hi, I'm trying to do "nested join" of tables. After first join everything is ok, but second join made some of the column names lost. My code is following: def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = "left_outer"): print "Joining tables %s %s" % (namestr(tableLeft), namestr(tableRight)) sys.stdout.flush() tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable["joinColumn"], joinType).drop("joinColumn") user = sqlContext.read.json(path + "user.json") user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + "order.json") order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + "lines.json") lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) orders = joinTable(order, lines, "id", "orderid", "lines") orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, "id", "userid", "orders") clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||||-- element: struct (containsNull = true) |||||-- _1: long (nullable = true) |||||-- _2: long (nullable = true) |||||-- _3: string (nullable = true) I tried to check if groupBy isn't the root of the problem but it's looks right grouped = orders.rdd.groupBy(lambda r: r.userid) print grouped.map(lambda x: list(x[1])).collect() [[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9725) spark sql query string field return empty/garbled string
[ https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696684#comment-14696684 ] Zhongshuai Pei commented on SPARK-9725: --- In my environment, "offset" did not change according to BYTE_ARRAY_OFFSET. When spark.executor.memory=36g is set, "BYTE_ARRAY_OFFSET" is 24 and "offset" is 16, but when spark.executor.memory=30g is set, "BYTE_ARRAY_OFFSET" is 16 and "offset" is 16 too. So "if (offset == BYTE_ARRAY_OFFSET && base instanceof byte[] && ((byte[]) base).length == numBytes)" will give different results. [~rxin] > spark sql query string field return empty/garbled string > > > Key: SPARK-9725 > URL: https://issues.apache.org/jira/browse/SPARK-9725 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Fei Wang >Assignee: Davies Liu >Priority: Blocker > > to reproduce it: > 1 deploy spark cluster mode, i use standalone mode locally > 2 set executor memory >= 32g, set following config in spark-default.xml >spark.executor.memory 36g > 3 run spark-sql.sh with "show tables" it return empty/garbled string -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
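For reference, the byte[] base offset in question comes from the JVM and depends on compressed oops, which HotSpot disables for heaps of roughly 32g and above (consistent with the 24 vs 16 values reported above). A standalone way to inspect it, using sun.misc.Unsafe directly rather than Spark's own wrapper:

{code}
import sun.misc.Unsafe

// Obtain the Unsafe singleton reflectively (the standard trick outside java.* packages).
val field = classOf[Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[Unsafe]

// 16 with compressed oops (e.g. -Xmx30g), 24 without them (e.g. -Xmx36g).
val byteArrayOffset = unsafe.arrayBaseOffset(classOf[Array[Byte]])
println(s"byte[] base offset: $byteArrayOffset")
{code}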
[jira] [Updated] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names
[ https://issues.apache.org/jira/browse/SPARK-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-9970: -- Description: Hi, I'm trying to do "nested join" of tables. After first join everything is ok, but second join made some of the column names lost. My code is following: {code:java} def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = "left_outer"): tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable["joinColumn"], joinType).drop("joinColumn") user = sqlContext.read.json(path + "user.json") user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + "order.json") order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + "lines.json") lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) {code} And joining code: {code:java} orders = joinTable(order, lines, "id", "orderid", "lines") orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, "id", "userid", "orders") clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||||-- element: struct (containsNull = true) |||||-- _1: long (nullable = true) |||||-- _2: long (nullable = true) |||||-- _3: string (nullable = true) {code} I tried to check if groupBy isn't the root of the problem but it's looks right {code:java} grouped = orders.rdd.groupBy(lambda r: r.userid) print grouped.map(lambda x: list(x[1])).collect() [[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]] {code} So I assume that the problem is with createDataFrame. was: Hi, I'm trying to do "nested join" of tables. After first join everything is ok, but second join made some of the column names lost. 
My code is following: def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = "left_outer"): print "Joining tables %s %s" % (namestr(tableLeft), namestr(tableRight)) sys.stdout.flush() tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable["joinColumn"], joinType).drop("joinColumn") user = sqlContext.read.json(path + "user.json") user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + "order.json") order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + "lines.json") lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) orders = joinTable(order, lines, "id", "orderid", "lines") orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, "id", "userid", "orders") clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||
[jira] [Updated] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names
[ https://issues.apache.org/jira/browse/SPARK-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-9970: -- Description: Hi, I'm trying to do "nested join" of tables. After first join everything is ok, but second join made some of the column names lost. My code is following: {code:java} def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = "left_outer"): tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable["joinColumn"], joinType).drop("joinColumn") user = sqlContext.read.json(path + "user.json") user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + "order.json") order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + "lines.json") lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) {code} And joining code: Please look for the result of 2nd join. There are columns _1,_2,_3. Should be 'id', 'orderid', 'product' {code:java} orders = joinTable(order, lines, "id", "orderid", "lines") orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, "id", "userid", "orders") clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||||-- element: struct (containsNull = true) |||||-- _1: long (nullable = true) |||||-- _2: long (nullable = true) |||||-- _3: string (nullable = true) {code} I tried to check if groupBy isn't the root of the problem but it's looks right {code:java} grouped = orders.rdd.groupBy(lambda r: r.userid) print grouped.map(lambda x: list(x[1])).collect() [[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]] {code} So I assume that the problem is with createDataFrame. was: Hi, I'm trying to do "nested join" of tables. After first join everything is ok, but second join made some of the column names lost. 
My code is following: {code:java} def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = "left_outer"): tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable["joinColumn"], joinType).drop("joinColumn") user = sqlContext.read.json(path + "user.json") user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + "order.json") order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + "lines.json") lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) {code} And joining code: {code:java} orders = joinTable(order, lines, "id", "orderid", "lines") orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, "id", "userid", "orders") clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true)
[jira] [Commented] (SPARK-6624) Convert filters into CNF for data sources
[ https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696700#comment-14696700 ] Apache Spark commented on SPARK-6624: - User 'yjshen' has created a pull request for this issue: https://github.com/apache/spark/pull/8193 > Convert filters into CNF for data sources > - > > Key: SPARK-6624 > URL: https://issues.apache.org/jira/browse/SPARK-6624 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > We should turn filters into conjunctive normal form (CNF) before we pass them > to data sources. Otherwise, filters are not very useful if there is a single > filter with a bunch of ORs. > Note that we already try to do some of these in BooleanSimplification, but I > think we should just formalize it to use CNF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
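To make the transformation concrete, here is a small self-contained sketch of CNF conversion by distributing OR over AND, written against a toy predicate type rather than Catalyst's Expression classes (negation handling is omitted); it illustrates the rewrite the ticket describes:

{code}
sealed trait Pred
case class Leaf(name: String) extends Pred
case class And(left: Pred, right: Pred) extends Pred
case class Or(left: Pred, right: Pred) extends Pred

def toCNF(p: Pred): Pred = p match {
  case And(l, r) => And(toCNF(l), toCNF(r))
  case Or(l, r) =>
    (toCNF(l), toCNF(r)) match {
      // (a AND b) OR c  =>  (a OR c) AND (b OR c)
      case (And(a, b), c) => And(toCNF(Or(a, c)), toCNF(Or(b, c)))
      // a OR (b AND c)  =>  (a OR b) AND (a OR c)
      case (a, And(b, c)) => And(toCNF(Or(a, b)), toCNF(Or(a, c)))
      case (a, b)         => Or(a, b)
    }
  case leaf => leaf
}

// A single OR-of-ANDs filter becomes a conjunction of clauses that can be pushed down:
val filter = Or(And(Leaf("a"), Leaf("b")), Leaf("c"))
println(toCNF(filter))  // And(Or(Leaf(a),Leaf(c)),Or(Leaf(b),Leaf(c)))
{code}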
[jira] [Assigned] (SPARK-6624) Convert filters into CNF for data sources
[ https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6624: --- Assignee: (was: Apache Spark) > Convert filters into CNF for data sources > - > > Key: SPARK-6624 > URL: https://issues.apache.org/jira/browse/SPARK-6624 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > We should turn filters into conjunctive normal form (CNF) before we pass them > to data sources. Otherwise, filters are not very useful if there is a single > filter with a bunch of ORs. > Note that we already try to do some of these in BooleanSimplification, but I > think we should just formalize it to use CNF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6624) Convert filters into CNF for data sources
[ https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6624: --- Assignee: Apache Spark > Convert filters into CNF for data sources > - > > Key: SPARK-6624 > URL: https://issues.apache.org/jira/browse/SPARK-6624 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > We should turn filters into conjunctive normal form (CNF) before we pass them > to data sources. Otherwise, filters are not very useful if there is a single > filter with a bunch of ORs. > Note that we already try to do some of these in BooleanSimplification, but I > think we should just formalize it to use CNF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names
[ https://issues.apache.org/jira/browse/SPARK-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-9970: -- Description: Hi, I'm trying to do "nested join" of tables. After first join everything is ok, but second join made some of the column names lost. My code is following: {code:java} def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = "left_outer"): tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable["joinColumn"], joinType).drop("joinColumn") user = sqlContext.read.json(path + "user.json") user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + "order.json") order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + "lines.json") lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) {code} And joining code: Please look for the result of 2nd join. There are columns _1,_2,_3. Should be 'id', 'orderid', 'product' {code:java} orders = joinTable(order, lines, "id", "orderid", "lines") orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, "id", "userid", "orders") clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||||-- element: struct (containsNull = true) |||||-- _1: long (nullable = true) |||||-- _2: long (nullable = true) |||||-- _3: string (nullable = true) {code} I tried to check if groupBy isn't the cause of the problem but it looks right {code:java} grouped = orders.rdd.groupBy(lambda r: r.userid) print grouped.map(lambda x: list(x[1])).collect() [[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]] {code} So I assume that the problem is with createDataFrame. was: Hi, I'm trying to do "nested join" of tables. After first join everything is ok, but second join made some of the column names lost. 
My code is following: {code:java} def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = "left_outer"): tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable["joinColumn"], joinType).drop("joinColumn") user = sqlContext.read.json(path + "user.json") user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + "order.json") order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + "lines.json") lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) {code} And joining code: Please look for the result of 2nd join. There are columns _1,_2,_3. Should be 'id', 'orderid', 'product' {code:java} orders = joinTable(order, lines, "id", "orderid", "lines") orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, "id", "userid", "orders") clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (null
[jira] [Commented] (SPARK-9871) Add expression functions into SparkR which have a variable parameter
[ https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696706#comment-14696706 ] Yu Ishikawa commented on SPARK-9871: I think it would be better to deal with {{struct}} in another issue, since {{collect}} doesn't work with a DataFrame which has a column of Struct type. So we need to improve the {{collect}} method. When I tried to implement the {{struct}} function, a struct column converted by {{dfToCols}} consists of {{jobj}}: {noformat} List of 1 $ structed:List of 2 ..$ :Class 'jobj' ..$ :Class 'jobj' {noformat} > Add expression functions into SparkR which have a variable parameter > > > Key: SPARK-9871 > URL: https://issues.apache.org/jira/browse/SPARK-9871 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Yu Ishikawa > > Add expression functions into SparkR which has a variable parameter, like > {{concat}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
Frank Rosner created SPARK-9971: --- Summary: MaxFunction not working correctly with columns containing Double.NaN Key: SPARK-9971 URL: https://issues.apache.org/jira/browse/SPARK-9971 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Frank Rosner h5. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x < Double.NaN}} will always be false for all {{x: Double}}, as will {{x > Double.NaN}}. h5. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.types._ val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField("col", DoubleType, false) ))) dataFrame.select(max("col")).first // returns org.apache.spark.sql.Row = [NaN] {code} h5. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum are not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
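For illustration, the proposed NaN-ignoring semantics in plain Scala (this is not Spark's MaxFunction; on the DataFrame side the same effect can be obtained today by filtering NaN rows out before aggregating):

{code}
// NaN-ignoring maximum: None when every value is NaN, matching "not defined" above.
val values = Seq(Double.NaN, -10d, 0d)
val maxIgnoringNaN: Option[Double] =
  values.filterNot(_.isNaN).reduceOption((a, b) => math.max(a, b))
// maxIgnoringNaN == Some(0.0)
{code}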
[jira] [Created] (SPARK-9973) wrong buffer size
xukun created SPARK-9973: Summary: wrong buffer size Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Reporter: xukun -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9972) Add `struct` function in SparkR
Yu Ishikawa created SPARK-9972: -- Summary: Add `struct` function in SparkR Key: SPARK-9972 URL: https://issues.apache.org/jira/browse/SPARK-9972 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Support {{struct}} function on a DataFrame in SparkR. However, I think we need to improve {{collect}} function in SparkR in order to implement {{struct}} function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Rosner updated SPARK-9971: Priority: Minor (was: Major) > MaxFunction not working correctly with columns containing Double.NaN > > > Key: SPARK-9971 > URL: https://issues.apache.org/jira/browse/SPARK-9971 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Frank Rosner >Priority: Minor > > h5. Problem Description > When using the {{max}} function on a {{DoubleType}} column that contains > {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. > This is because it compares all values with the running maximum. However, {{x > < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > > Double.NaN}}. > h5. How to Reproduce > {code} > import org.apache.spark.sql.{SQLContext, Row} > import org.apache.spark.sql.types._ > val sql = new SQLContext(sc) > val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) > val dataFrame = sql.createDataFrame(rdd, StructType(List( > StructField("col", DoubleType, false) > ))) > dataFrame.select(max("col")).first > // returns org.apache.spark.sql.Row = [NaN] > {code} > h5. Solution > The {{max}} and {{min}} functions should ignore NaN values, as they are not > numbers. If a column contains only NaN values, then the maximum and minimum > is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Rosner updated SPARK-9971: Description: h4. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > Double.NaN}}. h4. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.types._ val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField("col", DoubleType, false) ))) dataFrame.select(max("col")).first // returns org.apache.spark.sql.Row = [NaN] {code} h4. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum is not defined. was: h5. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > Double.NaN}}. h5. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.types._ val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField("col", DoubleType, false) ))) dataFrame.select(max("col")).first // returns org.apache.spark.sql.Row = [NaN] {code} h5. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum is not defined. > MaxFunction not working correctly with columns containing Double.NaN > > > Key: SPARK-9971 > URL: https://issues.apache.org/jira/browse/SPARK-9971 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Frank Rosner >Priority: Minor > > h4. Problem Description > When using the {{max}} function on a {{DoubleType}} column that contains > {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. > This is because it compares all values with the running maximum. However, {{x > < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > > Double.NaN}}. > h4. How to Reproduce > {code} > import org.apache.spark.sql.{SQLContext, Row} > import org.apache.spark.sql.types._ > val sql = new SQLContext(sc) > val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) > val dataFrame = sql.createDataFrame(rdd, StructType(List( > StructField("col", DoubleType, false) > ))) > dataFrame.select(max("col")).first > // returns org.apache.spark.sql.Row = [NaN] > {code} > h4. Solution > The {{max}} and {{min}} functions should ignore NaN values, as they are not > numbers. If a column contains only NaN values, then the maximum and minimum > is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xukun updated SPARK-9973: - Description: allocate wrong buffer size (4+ size * columnType.defaultSize * columnType.defaultSize), right buffer size is 4+ size * columnType.defaultSize. > wrong buffer size > - > > Key: SPARK-9973 > URL: https://issues.apache.org/jira/browse/SPARK-9973 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: xukun > > allocate wrong buffer size (4+ size * columnType.defaultSize * > columnType.defaultSize), right buffer size is 4+ size * > columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
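A small sketch of the corrected allocation described in this update; the size and defaultSize values and the use of java.nio.ByteBuffer are illustrative stand-ins, not the actual Spark column-builder code:

{code}
import java.nio.ByteBuffer

val size = 1024       // number of values in the column batch (illustrative)
val defaultSize = 8   // columnType.defaultSize, e.g. 8 bytes for a LONG column (illustrative)

// Wrong: 4 + size * defaultSize * defaultSize  (defaultSize multiplied in twice)
// Right: a 4-byte length header plus one defaultSize slot per value.
val buffer = ByteBuffer.allocate(4 + size * defaultSize)
{code}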
[jira] [Assigned] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9973: --- Assignee: Apache Spark > wrong buffer size > - > > Key: SPARK-9973 > URL: https://issues.apache.org/jira/browse/SPARK-9973 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: xukun >Assignee: Apache Spark > > allocate wrong buffer size (4+ size * columnType.defaultSize * > columnType.defaultSize), right buffer size is 4+ size * > columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696714#comment-14696714 ] Apache Spark commented on SPARK-9973: - User 'viper-kun' has created a pull request for this issue: https://github.com/apache/spark/pull/8189 > wrong buffer size > - > > Key: SPARK-9973 > URL: https://issues.apache.org/jira/browse/SPARK-9973 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: xukun > > allocate wrong buffer size (4+ size * columnType.defaultSize * > columnType.defaultSize), right buffer size is 4+ size * > columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9973: --- Assignee: (was: Apache Spark) > wrong buffer size > - > > Key: SPARK-9973 > URL: https://issues.apache.org/jira/browse/SPARK-9973 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: xukun > > allocate wrong buffer size (4+ size * columnType.defaultSize * > columnType.defaultSize), right buffer size is 4+ size * > columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696716#comment-14696716 ] Sean Owen commented on SPARK-9971: -- My instinct is that this should in fact result in NaN; NaN is not in general ignored. For example {{math.max(1.0, Double.NaN)}} and {{math.min(1.0, Double.NaN)}} are both NaN. But are you ready for some weird? {code} scala> Seq(1.0, Double.NaN).max res23: Double = NaN scala> Seq(Double.NaN, 1.0).max res24: Double = 1.0 scala> Seq(5.0, Double.NaN, 1.0).max res25: Double = 1.0 scala> Seq(5.0, Double.NaN, 1.0, 6.0).max res26: Double = 6.0 scala> Seq(5.0, Double.NaN, 1.0, 6.0, Double.NaN).max res27: Double = NaN {code} Not sure what to make of that, other than the Scala collection library isn't a good reference for behavior. Java? {code} scala> java.util.Collections.max(java.util.Arrays.asList(new java.lang.Double(1.0), new java.lang.Double(Double.NaN))) res33: Double = NaN scala> java.util.Collections.max(java.util.Arrays.asList(new java.lang.Double(Double.NaN), new java.lang.Double(1.0))) res34: Double = NaN {code} Makes more sense at least. I think this is correct behavior and you would filter NaN if you want to ignore them, as it is generally not something the language ignores. > MaxFunction not working correctly with columns containing Double.NaN > > > Key: SPARK-9971 > URL: https://issues.apache.org/jira/browse/SPARK-9971 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Frank Rosner >Priority: Minor > > h4. Problem Description > When using the {{max}} function on a {{DoubleType}} column that contains > {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. > This is because it compares all values with the running maximum. However, {{x > < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > > Double.NaN}}. > h4. How to Reproduce > {code} > import org.apache.spark.sql.{SQLContext, Row} > import org.apache.spark.sql.types._ > val sql = new SQLContext(sc) > val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) > val dataFrame = sql.createDataFrame(rdd, StructType(List( > StructField("col", DoubleType, false) > ))) > dataFrame.select(max("col")).first > // returns org.apache.spark.sql.Row = [NaN] > {code} > h4. Solution > The {{max}} and {{min}} functions should ignore NaN values, as they are not > numbers. If a column contains only NaN values, then the maximum and minimum > is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696717#comment-14696717 ] Sean Owen commented on SPARK-9973: -- [~xukun] This would be more useful if you added a descriptive title and a more complete description. What code is affected? What buffer? You explained this in the PR to some degree, so a brief note about the problem to be solved is appropriate; right now someone reading this doesn't know what it refers to. > wrong buffer size > - > > Key: SPARK-9973 > URL: https://issues.apache.org/jira/browse/SPARK-9973 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: xukun > > The buffer is allocated with the wrong size (4 + size * columnType.defaultSize * > columnType.defaultSize); the correct size is 4 + size * > columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
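For concreteness, the arithmetic in the description works out as follows. This is a standalone sketch with hypothetical stand-in values ({{initialSize}}, {{defaultSize}}), not the affected Spark code; it only shows how much the extra {{defaultSize}} factor over-allocates:

{code}
object BufferSizeSketch {
  // Hypothetical stand-ins: a 1024-row builder for a column whose
  // ColumnType.defaultSize is 8 bytes (e.g. a long/double column).
  val initialSize = 1024
  val defaultSize = 8

  // Size as reported: 4-byte header plus size * defaultSize * defaultSize.
  val wrongSize = 4 + initialSize * defaultSize * defaultSize // 65540 bytes

  // Intended size: 4-byte header plus one defaultSize slot per row.
  val rightSize = 4 + initialSize * defaultSize               // 8196 bytes

  def main(args: Array[String]): Unit =
    println(s"wrong = $wrongSize bytes, right = $rightSize bytes")
}
{code}

In other words, the reported formula over-allocates the buffer by roughly a factor of {{defaultSize}}; it wastes memory rather than corrupting data.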
[jira] [Assigned] (SPARK-9871) Add expression functions into SparkR which have a variable parameter
[ https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9871: --- Assignee: (was: Apache Spark) > Add expression functions into SparkR which have a variable parameter > > > Key: SPARK-9871 > URL: https://issues.apache.org/jira/browse/SPARK-9871 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Yu Ishikawa > > Add expression functions into SparkR which has a variable parameter, like > {{concat}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9871) Add expression functions into SparkR which have a variable parameter
[ https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9871: --- Assignee: Apache Spark > Add expression functions into SparkR which have a variable parameter > > > Key: SPARK-9871 > URL: https://issues.apache.org/jira/browse/SPARK-9871 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Yu Ishikawa >Assignee: Apache Spark > > Add expression functions into SparkR which has a variable parameter, like > {{concat}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9871) Add expression functions into SparkR which have a variable parameter
[ https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696720#comment-14696720 ] Apache Spark commented on SPARK-9871: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/8194 > Add expression functions into SparkR which have a variable parameter > > > Key: SPARK-9871 > URL: https://issues.apache.org/jira/browse/SPARK-9871 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Yu Ishikawa > > Add expression functions into SparkR which has a variable parameter, like > {{concat}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696760#comment-14696760 ] Frank Rosner commented on SPARK-9971: - I would like to provide a patch to make the following unit tests in {{DataFrameFunctionsSuite}} succeed: {code} test("max function ignoring Double.NaN") { val df = Seq( (Double.NaN, Double.NaN), (-10d, Double.NaN), (10d, Double.NaN) ).toDF("col1", "col2") checkAnswer( df.select(max("col1")), Seq(Row(10d)) ) checkAnswer( df.select(max("col2")), Seq(Row(Double.NaN)) ) } test("min function ignoring Double.NaN") { val df = Seq( (Double.NaN, Double.NaN), (-10d, Double.NaN), (10d, Double.NaN) ).toDF("col1", "col2") checkAnswer( df.select(min("col1")), Seq(Row(-10d)) ) checkAnswer( df.select(min("col1")), Seq(Row(Double.NaN)) ) } {code} > MaxFunction not working correctly with columns containing Double.NaN > > > Key: SPARK-9971 > URL: https://issues.apache.org/jira/browse/SPARK-9971 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Frank Rosner >Priority: Minor > > h4. Problem Description > When using the {{max}} function on a {{DoubleType}} column that contains > {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. > This is because it compares all values with the running maximum. However, {{x > < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > > Double.NaN}}. > h4. How to Reproduce > {code} > import org.apache.spark.sql.{SQLContext, Row} > import org.apache.spark.sql.types._ > val sql = new SQLContext(sc) > val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) > val dataFrame = sql.createDataFrame(rdd, StructType(List( > StructField("col", DoubleType, false) > ))) > dataFrame.select(max("col")).first > // returns org.apache.spark.sql.Row = [NaN] > {code} > h4. Solution > The {{max}} and {{min}} functions should ignore NaN values, as they are not > numbers. If a column contains only NaN values, then the maximum and minimum > is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696762#comment-14696762 ] Sean Owen commented on SPARK-9971: -- Hey Frank as I say I don't think that's the correct answer, at least in that it's not consistent with Java/Scala APIs... well, the Scala APIs are kind of confused, but none of them consistently ignore NaN. > MaxFunction not working correctly with columns containing Double.NaN > > > Key: SPARK-9971 > URL: https://issues.apache.org/jira/browse/SPARK-9971 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Frank Rosner >Priority: Minor > > h4. Problem Description > When using the {{max}} function on a {{DoubleType}} column that contains > {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. > This is because it compares all values with the running maximum. However, {{x > < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > > Double.NaN}}. > h4. How to Reproduce > {code} > import org.apache.spark.sql.{SQLContext, Row} > import org.apache.spark.sql.types._ > val sql = new SQLContext(sc) > val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) > val dataFrame = sql.createDataFrame(rdd, StructType(List( > StructField("col", DoubleType, false) > ))) > dataFrame.select(max("col")).first > // returns org.apache.spark.sql.Row = [NaN] > {code} > h4. Solution > The {{max}} and {{min}} functions should ignore NaN values, as they are not > numbers. If a column contains only NaN values, then the maximum and minimum > is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696760#comment-14696760 ] Frank Rosner edited comment on SPARK-9971 at 8/14/15 9:34 AM: -- I would like to provide a patch to make the following unit tests in {{DataFrameFunctionsSuite}} succeed: {code} test("max function ignoring Double.NaN") { val df = Seq( (Double.NaN, Double.NaN), (-10d, Double.NaN), (10d, Double.NaN) ).toDF("col1", "col2") checkAnswer( df.select(max("col1")), Seq(Row(10d)) ) checkAnswer( df.select(max("col2")), Seq(Row(null)) ) } test("min function ignoring Double.NaN") { val df = Seq( (Double.NaN, Double.NaN), (-10d, Double.NaN), (10d, Double.NaN) ).toDF("col1", "col2") checkAnswer( df.select(min("col1")), Seq(Row(-10d)) ) checkAnswer( df.select(min("col1")), Seq(Row(null)) ) } {code} was (Author: frosner): I would like to provide a patch to make the following unit tests in {{DataFrameFunctionsSuite}} succeed: {code} test("max function ignoring Double.NaN") { val df = Seq( (Double.NaN, Double.NaN), (-10d, Double.NaN), (10d, Double.NaN) ).toDF("col1", "col2") checkAnswer( df.select(max("col1")), Seq(Row(10d)) ) checkAnswer( df.select(max("col2")), Seq(Row(Double.NaN)) ) } test("min function ignoring Double.NaN") { val df = Seq( (Double.NaN, Double.NaN), (-10d, Double.NaN), (10d, Double.NaN) ).toDF("col1", "col2") checkAnswer( df.select(min("col1")), Seq(Row(-10d)) ) checkAnswer( df.select(min("col1")), Seq(Row(Double.NaN)) ) } {code} > MaxFunction not working correctly with columns containing Double.NaN > > > Key: SPARK-9971 > URL: https://issues.apache.org/jira/browse/SPARK-9971 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Frank Rosner >Priority: Minor > > h4. Problem Description > When using the {{max}} function on a {{DoubleType}} column that contains > {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. > This is because it compares all values with the running maximum. However, {{x > < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > > Double.NaN}}. > h4. How to Reproduce > {code} > import org.apache.spark.sql.{SQLContext, Row} > import org.apache.spark.sql.types._ > val sql = new SQLContext(sc) > val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) > val dataFrame = sql.createDataFrame(rdd, StructType(List( > StructField("col", DoubleType, false) > ))) > dataFrame.select(max("col")).first > // returns org.apache.spark.sql.Row = [NaN] > {code} > h4. Solution > The {{max}} and {{min}} functions should ignore NaN values, as they are not > numbers. If a column contains only NaN values, then the maximum and minimum > is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696766#comment-14696766 ] Frank Rosner commented on SPARK-9971: - [~srowen] I get your point. But given your "weird" examples, I don't think that being consistent with the Scala collection library is desirable. When I work on a data set and there are values which are not a number (e.g. divided by zero) and I want to compute the maximum, I find it convenient not to give me NaN. NaN is not a number so it cannot be a maximum number by definition. > MaxFunction not working correctly with columns containing Double.NaN > > > Key: SPARK-9971 > URL: https://issues.apache.org/jira/browse/SPARK-9971 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Frank Rosner >Priority: Minor > > h4. Problem Description > When using the {{max}} function on a {{DoubleType}} column that contains > {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. > This is because it compares all values with the running maximum. However, {{x > < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > > Double.NaN}}. > h4. How to Reproduce > {code} > import org.apache.spark.sql.{SQLContext, Row} > import org.apache.spark.sql.types._ > val sql = new SQLContext(sc) > val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) > val dataFrame = sql.createDataFrame(rdd, StructType(List( > StructField("col", DoubleType, false) > ))) > dataFrame.select(max("col")).first > // returns org.apache.spark.sql.Row = [NaN] > {code} > h4. Solution > The {{max}} and {{min}} functions should ignore NaN values, as they are not > numbers. If a column contains only NaN values, then the maximum and minimum > is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696768#comment-14696768 ] Sean Owen commented on SPARK-9971: -- My argument is being consistent with Java really, and the not-weird parts of Scala. Within either language, NaNs are not ignored; products and sums of things with NaN become NaN. None of them appear to ignore NaN, so this behavior would not be consistent with anything else. I get that it's convenient, but it's also easy to filter. > MaxFunction not working correctly with columns containing Double.NaN > > > Key: SPARK-9971 > URL: https://issues.apache.org/jira/browse/SPARK-9971 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Frank Rosner >Priority: Minor > > h4. Problem Description > When using the {{max}} function on a {{DoubleType}} column that contains > {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. > This is because it compares all values with the running maximum. However, {{x > < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > > Double.NaN}}. > h4. How to Reproduce > {code} > import org.apache.spark.sql.{SQLContext, Row} > import org.apache.spark.sql.types._ > val sql = new SQLContext(sc) > val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) > val dataFrame = sql.createDataFrame(rdd, StructType(List( > StructField("col", DoubleType, false) > ))) > dataFrame.select(max("col")).first > // returns org.apache.spark.sql.Row = [NaN] > {code} > h4. Solution > The {{max}} and {{min}} functions should ignore NaN values, as they are not > numbers. If a column contains only NaN values, then the maximum and minimum > is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
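For reference, the filtering workaround mentioned above can be sketched against the reproduction code from the description. This assumes an existing {{SparkContext}} {{sc}} (as in a {{spark-shell}} session) and uses a small user-defined predicate rather than any built-in NaN function, since the availability of such a built-in depends on the Spark version:

{code}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.functions.{max, min, udf}
import org.apache.spark.sql.types._

val sql = new SQLContext(sc)
val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
val dataFrame = sql.createDataFrame(rdd, StructType(List(
  StructField("col", DoubleType, false)
)))

// Drop NaN rows explicitly before aggregating.
val notNaN = udf((d: Double) => !d.isNaN)
val cleaned = dataFrame.filter(notNaN(dataFrame("col")))

cleaned.select(max("col"), min("col")).first
// expected: [0.0,-10.0]
{code}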
[jira] [Resolved] (SPARK-9780) In case of invalid initialization of KafkaDirectStream, NPE is thrown
[ https://issues.apache.org/jira/browse/SPARK-9780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9780. -- Resolution: Fixed Assignee: Cody Koeninger Fix Version/s: 1.5.0 > In case of invalid initialization of KafkaDirectStream, NPE is thrown > - > > Key: SPARK-9780 > URL: https://issues.apache.org/jira/browse/SPARK-9780 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.1, 1.4.1 >Reporter: Grigory Turunov >Assignee: Cody Koeninger >Priority: Minor > Fix For: 1.5.0 > > > [o.a.s.streaming.kafka.KafkaRDD.scala#L143|https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala#L143] > In initialization of KafkaRDDIterator, there is an addition of > TaskCompletionListener to the context, which calls close() to the consumer, > which is not initialized yet (and will be initialized 12 lines after that). > If something happens in this 12 lines (in my case there was a private > constructor for valueDecoder), an Exception, which is thrown, triggers > context.markTaskCompleted() in > [o.a.s.scheduler.Task.scala#L90|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L90] > which throws NullPointerException, when tries to call close() for > non-initialized consumer. > This masks original exception - so it is very hard to understand, what is > happening. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
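The failure mode described above is a cleanup callback registered before the resource it closes exists, so an error during initialization surfaces as an NPE from the callback instead of the original exception. The standalone sketch below illustrates the pattern; names like {{Consumer}} and the callback buffer are hypothetical stand-ins, not the actual KafkaRDD/TaskContext code, and the null guard inside the callback is the essence of the fix:

{code}
import scala.collection.mutable.ArrayBuffer

object CompletionCallbackSketch {
  final class Consumer {
    def close(): Unit = println("consumer closed")
  }

  def run(failDuringInit: Boolean): Unit = {
    var consumer: Consumer = null
    val completionCallbacks = ArrayBuffer.empty[() => Unit]

    // Registered before the consumer exists, as in the report; without the
    // null check this callback throws NPE and hides the original error.
    completionCallbacks += (() => if (consumer != null) consumer.close())

    try {
      if (failDuringInit) throw new IllegalArgumentException("bad value decoder")
      consumer = new Consumer
    } finally {
      completionCallbacks.foreach(_.apply()) // analogous to markTaskCompleted()
    }
  }

  def main(args: Array[String]): Unit = {
    run(failDuringInit = false) // prints "consumer closed"
    try run(failDuringInit = true)
    catch { case e: Exception => println(s"original exception preserved: ${e.getMessage}") }
  }
}
{code}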
[jira] [Created] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar
Cheng Lian created SPARK-9974: - Summary: SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar Key: SPARK-9974 URL: https://issues.apache.org/jira/browse/SPARK-9974 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Priority: Blocker Git commit: [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd] Build with SBT and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 clean assembly/assembly ... $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep "parquet/hadoop/api" org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class {noformat} Only classes of {{org.apache.parquet:parquet-mr:1.7.0}}. Note that classes in {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the {{org.apache}} namespace. Build with Maven and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 -DskipTests clean package $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep "parquet/hadoop/api" org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class parquet/hadoop/api/ parquet/hadoop/api/DelegatingReadSupport.class parquet/hadoop/api/DelegatingWriteSupport.class parquet/hadoop/api/InitContext.class parquet/hadoop/api/ReadSupport$ReadContext.class parquet/hadoop/api/ReadSupport.class parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class parquet/hadoop/api/WriteSupport$WriteContext.class parquet/hadoop/api/WriteSupport.class {noformat} Expected classes are packaged properly. One of the consequence of this issue is that Parquet tables created in Hive are not accessible from Spark SQL built with SBT. 
To reproduce this issue, first create a Parquet table with Hive (say 0.13.1): {noformat} hive> CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1; {noformat} Build Spark assembly jar with the SBT command above, start {{spark-shell}}: {noformat} scala> sqlContext.table("parquet_test").show() 15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default tbl=parquet_test 15/08/14 17:52:50 INFO audit: ugi=lian ip=unknown-ip-addr cmd=get_table : db=default tbl=parquet_test java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303) at scala.Option.map(Option.scala:145) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248) at org.apache.spark.sql.hive.client.ClientWrapper.getTableOption(ClientWrapper.scala:298) at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:123) at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:397) at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spar
[jira] [Updated] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar
[ https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9974: -- Description: One of the consequence of this issue is that Parquet tables created in Hive are not accessible from Spark SQL built with SBT. Maven build is OK. Git commit: [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd] Build with SBT and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 clean assembly/assembly ... $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep "parquet/hadoop/api" org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class {noformat} Only classes of {{org.apache.parquet:parquet-mr:1.7.0}}. Note that classes in {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the {{org.apache}} namespace. Build with Maven and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 -DskipTests clean package $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep "parquet/hadoop/api" org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class parquet/hadoop/api/ parquet/hadoop/api/DelegatingReadSupport.class parquet/hadoop/api/DelegatingWriteSupport.class parquet/hadoop/api/InitContext.class parquet/hadoop/api/ReadSupport$ReadContext.class parquet/hadoop/api/ReadSupport.class parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class parquet/hadoop/api/WriteSupport$WriteContext.class parquet/hadoop/api/WriteSupport.class {noformat} Expected classes are packaged properly. 
To reproduce the Parquet table access issue, first create a Parquet table with Hive (say 0.13.1): {noformat} hive> CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1; {noformat} Build Spark assembly jar with the SBT command above, start {{spark-shell}}: {noformat} scala> sqlContext.table("parquet_test").show() 15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default tbl=parquet_test 15/08/14 17:52:50 INFO audit: ugi=lian ip=unknown-ip-addr cmd=get_table : db=default tbl=parquet_test java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303) at scala.Option.map(Option.scala:145) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248) at org.apache.spark.sql.hive.client.ClientWrapper.getTableOption(ClientWrapper.scala:298) at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:123) at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:397) at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:403) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:170) at
[jira] [Created] (SPARK-9975) Add Normalized Closeness Centrality to Spark GraphX
Kenny Bastani created SPARK-9975: Summary: Add Normalized Closeness Centrality to Spark GraphX Key: SPARK-9975 URL: https://issues.apache.org/jira/browse/SPARK-9975 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Kenny Bastani Priority: Minor “Closeness centrality” is also defined as a proportion. First, the distance of a vertex from all other vertices in the network is counted. Normalization is achieved by defining closeness centrality as the number of other vertices divided by this sum (De Nooy et al., 2005, p. 127). Because of this normalization, closeness centrality provides a global measure about the position of a vertex in the network, while betweenness centrality is defined with reference to the local position of a vertex. -- Cited from http://arxiv.org/pdf/0911.2719.pdf This request is to add normalized closeness centrality as a core graph algorithm in the GraphX library. I implemented this algorithm for a graph processing extension to Neo4j (https://github.com/kbastani/neo4j-mazerunner#supported-algorithms) and I would like to put it up for review for inclusion into Spark. This algorithm is very straight forward and builds on top of the included ShortestPaths (SSSP) algorithm already in the library. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
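As a rough illustration of the proposal (not the implementation under review in the pull request), the normalization can be expressed on top of GraphX's existing {{ShortestPaths}}. The example graph, the use of every vertex as a landmark, and the handling of unreachable vertices below are illustrative choices; running all-pairs shortest paths this way is only reasonable for small graphs:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.graphx.lib.ShortestPaths

object ClosenessSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("closeness-sketch").setMaster("local[2]"))

    // A tiny example graph; edge attributes are unused here.
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1), Edge(2L, 4L, 1)))
    val graph = Graph.fromEdges(edges, 0)

    // Use every vertex as a landmark (all-pairs; fine only for small graphs).
    val landmarks: Seq[VertexId] = graph.vertices.map(_._1).collect().toSeq
    val n = landmarks.size

    // ShortestPaths annotates each vertex with a map of landmark -> hop distance.
    val closeness = ShortestPaths.run(graph, landmarks).vertices.mapValues { spMap =>
      // Drop the distance to the vertex itself (0); unreachable landmarks are
      // simply absent from the map, a simplification of the normalization.
      val distances = spMap.values.filter(_ > 0)
      if (distances.isEmpty) 0.0 else (n - 1).toDouble / distances.sum
    }

    closeness.collect().sortBy(_._1).foreach { case (id, c) => println(s"vertex $id: $c") }
    sc.stop()
  }
}
{code}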
[jira] [Assigned] (SPARK-9975) Add Normalized Closeness Centrality to Spark GraphX
[ https://issues.apache.org/jira/browse/SPARK-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9975: --- Assignee: Apache Spark > Add Normalized Closeness Centrality to Spark GraphX > --- > > Key: SPARK-9975 > URL: https://issues.apache.org/jira/browse/SPARK-9975 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Kenny Bastani >Assignee: Apache Spark >Priority: Minor > Labels: features > > “Closeness centrality” is also defined as a proportion. First, the distance > of a vertex from all other vertices in the network is counted. Normalization > is achieved by defining closeness centrality as the number of other vertices > divided by this sum (De Nooy et al., 2005, p. 127). Because of this > normalization, closeness centrality provides a global measure about the > position of a vertex in the network, while betweenness centrality is defined > with reference to the local position of a vertex. -- Cited from > http://arxiv.org/pdf/0911.2719.pdf > This request is to add normalized closeness centrality as a core graph > algorithm in the GraphX library. I implemented this algorithm for a graph > processing extension to Neo4j > (https://github.com/kbastani/neo4j-mazerunner#supported-algorithms) and I > would like to put it up for review for inclusion into Spark. This algorithm > is very straight forward and builds on top of the included ShortestPaths > (SSSP) algorithm already in the library. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9975) Add Normalized Closeness Centrality to Spark GraphX
[ https://issues.apache.org/jira/browse/SPARK-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696799#comment-14696799 ] Apache Spark commented on SPARK-9975: - User 'kbastani' has created a pull request for this issue: https://github.com/apache/spark/pull/8195 > Add Normalized Closeness Centrality to Spark GraphX > --- > > Key: SPARK-9975 > URL: https://issues.apache.org/jira/browse/SPARK-9975 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Kenny Bastani >Priority: Minor > Labels: features > > “Closeness centrality” is also defined as a proportion. First, the distance > of a vertex from all other vertices in the network is counted. Normalization > is achieved by defining closeness centrality as the number of other vertices > divided by this sum (De Nooy et al., 2005, p. 127). Because of this > normalization, closeness centrality provides a global measure about the > position of a vertex in the network, while betweenness centrality is defined > with reference to the local position of a vertex. -- Cited from > http://arxiv.org/pdf/0911.2719.pdf > This request is to add normalized closeness centrality as a core graph > algorithm in the GraphX library. I implemented this algorithm for a graph > processing extension to Neo4j > (https://github.com/kbastani/neo4j-mazerunner#supported-algorithms) and I > would like to put it up for review for inclusion into Spark. This algorithm > is very straight forward and builds on top of the included ShortestPaths > (SSSP) algorithm already in the library. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9975) Add Normalized Closeness Centrality to Spark GraphX
[ https://issues.apache.org/jira/browse/SPARK-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9975: --- Assignee: (was: Apache Spark) > Add Normalized Closeness Centrality to Spark GraphX > --- > > Key: SPARK-9975 > URL: https://issues.apache.org/jira/browse/SPARK-9975 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Kenny Bastani >Priority: Minor > Labels: features > > “Closeness centrality” is also defined as a proportion. First, the distance > of a vertex from all other vertices in the network is counted. Normalization > is achieved by defining closeness centrality as the number of other vertices > divided by this sum (De Nooy et al., 2005, p. 127). Because of this > normalization, closeness centrality provides a global measure about the > position of a vertex in the network, while betweenness centrality is defined > with reference to the local position of a vertex. -- Cited from > http://arxiv.org/pdf/0911.2719.pdf > This request is to add normalized closeness centrality as a core graph > algorithm in the GraphX library. I implemented this algorithm for a graph > processing extension to Neo4j > (https://github.com/kbastani/neo4j-mazerunner#supported-algorithms) and I > would like to put it up for review for inclusion into Spark. This algorithm > is very straight forward and builds on top of the included ShortestPaths > (SSSP) algorithm already in the library. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8922) Add @since tags to mllib.evaluation
[ https://issues.apache.org/jira/browse/SPARK-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8922: - Assignee: Xiangrui Meng > Add @since tags to mllib.evaluation > --- > > Key: SPARK-8922 > URL: https://issues.apache.org/jira/browse/SPARK-8922 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > Labels: starter > Fix For: 1.5.0 > > Original Estimate: 1h > Remaining Estimate: 1h > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9726) PySpark Regression: DataFrame join no longer accepts None as join expression
[ https://issues.apache.org/jira/browse/SPARK-9726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9726: - Assignee: Brennan Ashton > PySpark Regression: DataFrame join no longer accepts None as join expression > > > Key: SPARK-9726 > URL: https://issues.apache.org/jira/browse/SPARK-9726 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.0 >Reporter: Brennan Ashton >Assignee: Brennan Ashton > Fix For: 1.5.0 > > > The patch to add methods to support equi-join broke joins where on is None. > Rather than ending the branch with jdf = self._jdf.join(other._jdf) it > continues to another branch where it fails because you cannot take the index > of None.if isinstance(on[0], basestring): > This was valid in 1.4.1: > df3 = df.join(df2,on=None) > This is a trivial fix in the attached PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9826) Cannot use custom classes in log4j.properties
[ https://issues.apache.org/jira/browse/SPARK-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9826: - Assignee: Michel Lemay > Cannot use custom classes in log4j.properties > - > > Key: SPARK-9826 > URL: https://issues.apache.org/jira/browse/SPARK-9826 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Michel Lemay >Assignee: Michel Lemay >Priority: Minor > Labels: regression > Fix For: 1.4.2, 1.5.0 > > > log4j is initialized before spark class loader is set on the thread context. > Therefore it cannot use classes embedded in fat-jars submitted to spark. > While parsing arguments, spark calls methods on Utils class and triggers > ShutdownHookManager static initialization. This then leads to log4j being > initialized before spark gets the chance to specify custom class > MutableURLClassLoader on the thread context. > See detailed explanation here: > http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-4-1-tt24159.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9936) decimal precision lost when loading DataFrame from RDD
[ https://issues.apache.org/jira/browse/SPARK-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9936: - Assignee: Liang-Chi Hsieh > decimal precision lost when loading DataFrame from RDD > -- > > Key: SPARK-9936 > URL: https://issues.apache.org/jira/browse/SPARK-9936 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Tzach Zohar >Assignee: Liang-Chi Hsieh > Fix For: 1.5.0 > > > It seems that when converting an RDD that contains BigDecimals into a > DataFrame (using SQLContext.createDataFrame without specifying schema), > precision info is lost, which means saving as Parquet file will fail (Parquet > tries to verify precision < 18, so fails if it's unset). > This seems to be similar to > [SPARK-7196|https://issues.apache.org/jira/browse/SPARK-7196], which fixed > the same issue for DataFrames created via JDBC. > To reproduce: > {code:none} > scala> val rdd: RDD[(String, BigDecimal)] = sc.parallelize(Seq(("a", > BigDecimal.valueOf(0.234 > rdd: org.apache.spark.rdd.RDD[(String, BigDecimal)] = > ParallelCollectionRDD[0] at parallelize at :23 > scala> val df: DataFrame = new SQLContext(rdd.context).createDataFrame(rdd) > df: org.apache.spark.sql.DataFrame = [_1: string, _2: decimal(10,0)] > scala> df.write.parquet("/data/parquet-file") > 15/08/13 10:30:07 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 > (TID 0) > java.lang.RuntimeException: Unsupported datatype DecimalType() > {code} > To verify this is indeed caused by the precision being lost, I've tried > manually changing the schema to include precision (by traversing the > StructFields and replacing the DecimalTypes with altered DecimalTypes), > creating a new DataFrame using this updated schema - and indeed it fixes the > problem. > I'm using Spark 1.4.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
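The manual-schema workaround described at the end of the report can be sketched as follows. It assumes an existing {{SparkContext}} {{sc}}; the precision and scale (10, 3) and the output path are illustrative choices, not values taken from the report:

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{DecimalType, StringType, StructField, StructType}

val sqlContext = new SQLContext(sc)

val rdd: RDD[(String, BigDecimal)] =
  sc.parallelize(Seq(("a", BigDecimal.valueOf(0.234))))

// Spell out precision/scale instead of relying on reflective inference,
// which drops the precision info.
val schema = StructType(Seq(
  StructField("_1", StringType, nullable = false),
  StructField("_2", DecimalType(10, 3), nullable = false)))

val df = sqlContext.createDataFrame(rdd.map { case (s, d) => Row(s, d) }, schema)
df.write.parquet("/tmp/decimal-parquet")
{code}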
[jira] [Updated] (SPARK-9931) Flaky test: mllib/tests.py StreamingLogisticRegressionWithSGDTests. test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-9931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9931: - Component/s: Tests PySpark > Flaky test: mllib/tests.py StreamingLogisticRegressionWithSGDTests. > test_training_and_prediction > > > Key: SPARK-9931 > URL: https://issues.apache.org/jira/browse/SPARK-9931 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Reporter: Davies Liu >Priority: Critical > > {code} > FAIL: test_training_and_prediction > (__main__.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/mllib/tests.py", > line 1250, in test_training_and_prediction > self.assertTrue(errors[1] - errors[-1] > 0.3) > AssertionError: False is not true > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9934) Deprecate NIO ConnectionManager
[ https://issues.apache.org/jira/browse/SPARK-9934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9934: - Component/s: Spark Core > Deprecate NIO ConnectionManager > --- > > Key: SPARK-9934 > URL: https://issues.apache.org/jira/browse/SPARK-9934 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > > We should deprecate ConnectionManager in 1.5 before removing it in 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9878) ReduceByKey + FullOuterJoin return 0 element if using an empty RDD
[ https://issues.apache.org/jira/browse/SPARK-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9878: - Component/s: Spark Core > ReduceByKey + FullOuterJoin return 0 element if using an empty RDD > --- > > Key: SPARK-9878 > URL: https://issues.apache.org/jira/browse/SPARK-9878 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 > Environment: linux ubuntu 64b spark-hadoop > launched with Local[2] >Reporter: durand remi >Priority: Minor > > code to reproduce: > println("ok > :"+sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])]).count) > println("ko: > "+sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])].reduceByKey((e1, > e2) => e1 ++ e2)).count) > what i expect: > ok: 2 > ko: 2 > but what i have: > ok: 2 > ko: 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-9637) Add interface for implementing scheduling algorithm for standalone deployment
[ https://issues.apache.org/jira/browse/SPARK-9637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-9637. -- Resolution: Duplicate > Add interface for implementing scheduling algorithm for standalone deployment > - > > Key: SPARK-9637 > URL: https://issues.apache.org/jira/browse/SPARK-9637 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Liang-Chi Hsieh > > We want to abstract the interface of scheduling algorithm for standalone > deployment mode. It can benefit for implementing different scheduling > algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9396) Spark yarn allocator does not call "removeContainerRequest" for allocated Container requests, resulting in bloated ask[] to Yarn RM.
[ https://issues.apache.org/jira/browse/SPARK-9396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9396. -- Resolution: Won't Fix This is fixed in 1.3+ by other changes; unless a decision is taken to make another 1.2.x release I think this is a WontFix. > Spark yarn allocator does not call "removeContainerRequest" for allocated > Container requests, resulting in bloated ask[] toYarn RM. > --- > > Key: SPARK-9396 > URL: https://issues.apache.org/jira/browse/SPARK-9396 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.2.1, 1.2.2 > Environment: Spark-1.2.1 on hadoop-yarn-2.4.0 cluster. All servers in > cluster running Linux version 2.6.32. >Reporter: prakhar jauhari >Priority: Minor > > Note : Attached logs contain logs that i added (spark yarn allocator side and > Yarn client side) for debugging purpose. > ! My spark job is configured for 2 executors, on killing 1 executor the > ask is of 3 !!! > On killing a executor - resource request logs : > *Killed container: ask for 3 containers, instead for 1*** > 15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: Will allocate 1 executor > containers, each with 2432 MB memory including 384 MB overhead > 15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: numExecutors: 1 > 15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: host preferences is empty > 15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: Container request (host: > Any, priority: 1, capability: > 15/07/15 10:49:01 INFO impl.AMRMClientImpl: prakhar : AMRMClientImpl : > allocate: this.ask = [{Priority: 1, Capability: , # > Containers: 3, Location: *, Relax Locality: true}] > 15/07/15 10:49:01 INFO impl.AMRMClientImpl: prakhar : AMRMClientImpl : > allocate: allocateRequest = ask { priority{ priority: 1 } resource_name: "*" > capability { memory: 2432 virtual_cores: 4 } num_containers: 3 > relax_locality: true } blacklist_request { } response_id: 354 progress: 0.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696849#comment-14696849 ] Apache Spark commented on SPARK-8118: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8196 > Turn off noisy log output produced by Parquet 1.7.0 > --- > > Key: SPARK-8118 > URL: https://issues.apache.org/jira/browse/SPARK-8118 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > Fix For: 1.5.0 > > > Parquet 1.7.0 renames package name to "org.apache.parquet", need to adjust > {{ParquetRelation.enableLogForwarding}} accordingly to avoid noisy log output. > A better approach than simply muting these log lines is to redirect Parquet > logs via SLF4J, so that we can handle them consistently. In general these > logs are very useful. Esp. when used to diagnosing Parquet memory issue and > filter push-down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-9826) Cannot use custom classes in log4j.properties
[ https://issues.apache.org/jira/browse/SPARK-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michel Lemay closed SPARK-9826. --- Verified working with a local build of branch-1.4 (1.4.2-snapshot) > Cannot use custom classes in log4j.properties > - > > Key: SPARK-9826 > URL: https://issues.apache.org/jira/browse/SPARK-9826 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Michel Lemay >Assignee: Michel Lemay >Priority: Minor > Labels: regression > Fix For: 1.4.2, 1.5.0 > > > log4j is initialized before spark class loader is set on the thread context. > Therefore it cannot use classes embedded in fat-jars submitted to spark. > While parsing arguments, spark calls methods on Utils class and triggers > ShutdownHookManager static initialization. This then leads to log4j being > initialized before spark gets the chance to specify custom class > MutableURLClassLoader on the thread context. > See detailed explanation here: > http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-4-1-tt24159.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696889#comment-14696889 ] Frank Rosner commented on SPARK-9971: - Ok so shall we close it as a wontfix? > MaxFunction not working correctly with columns containing Double.NaN > > > Key: SPARK-9971 > URL: https://issues.apache.org/jira/browse/SPARK-9971 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Frank Rosner >Priority: Minor > > h4. Problem Description > When using the {{max}} function on a {{DoubleType}} column that contains > {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. > This is because it compares all values with the running maximum. However, {{x > < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > > Double.NaN}}. > h4. How to Reproduce > {code} > import org.apache.spark.sql.{SQLContext, Row} > import org.apache.spark.sql.types._ > val sql = new SQLContext(sc) > val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) > val dataFrame = sql.createDataFrame(rdd, StructType(List( > StructField("col", DoubleType, false) > ))) > dataFrame.select(max("col")).first > // returns org.apache.spark.sql.Row = [NaN] > {code} > h4. Solution > The {{max}} and {{min}} functions should ignore NaN values, as they are not > numbers. If a column contains only NaN values, then the maximum and minimum > is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9906) User guide for LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9906: --- Assignee: Manoj Kumar (was: Apache Spark) > User guide for LogisticRegressionSummary > > > Key: SPARK-9906 > URL: https://issues.apache.org/jira/browse/SPARK-9906 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Feynman Liang >Assignee: Manoj Kumar > > SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model > statistics to ML pipeline logistic regression models. This feature is not > present in mllib and should be documented within {{ml-guide}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9906) User guide for LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9906: --- Assignee: Apache Spark (was: Manoj Kumar) > User guide for LogisticRegressionSummary > > > Key: SPARK-9906 > URL: https://issues.apache.org/jira/browse/SPARK-9906 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Feynman Liang >Assignee: Apache Spark > > SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model > statistics to ML pipeline logistic regression models. This feature is not > present in mllib and should be documented within {{ml-guide}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9906) User guide for LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696891#comment-14696891 ] Apache Spark commented on SPARK-9906: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/8197 > User guide for LogisticRegressionSummary > > > Key: SPARK-9906 > URL: https://issues.apache.org/jira/browse/SPARK-9906 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Feynman Liang >Assignee: Manoj Kumar > > SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model > statistics to ML pipeline logistic regression models. This feature is not > present in mllib and should be documented within {{ml-guide}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696894#comment-14696894 ] Sean Owen commented on SPARK-9971: -- I'd always prefer to leave it open at least ~1 week to see if anyone else has comments. If not, yes I personally would argue that the current behavior is the right one. > MaxFunction not working correctly with columns containing Double.NaN > > > Key: SPARK-9971 > URL: https://issues.apache.org/jira/browse/SPARK-9971 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Frank Rosner >Priority: Minor > > h4. Problem Description > When using the {{max}} function on a {{DoubleType}} column that contains > {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. > This is because it compares all values with the running maximum. However, {{x > < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > > Double.NaN}}. > h4. How to Reproduce > {code} > import org.apache.spark.sql.{SQLContext, Row} > import org.apache.spark.sql.types._ > val sql = new SQLContext(sc) > val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) > val dataFrame = sql.createDataFrame(rdd, StructType(List( > StructField("col", DoubleType, false) > ))) > dataFrame.select(max("col")).first > // returns org.apache.spark.sql.Row = [NaN] > {code} > h4. Solution > The {{max}} and {{min}} functions should ignore NaN values, as they are not > numbers. If a column contains only NaN values, then the maximum and minimum > is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar
[ https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9974: --- Assignee: Apache Spark > SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the > assembly jar > > > Key: SPARK-9974 > URL: https://issues.apache.org/jira/browse/SPARK-9974 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Apache Spark >Priority: Blocker > > One of the consequence of this issue is that Parquet tables created in Hive > are not accessible from Spark SQL built with SBT. Maven build is OK. > Git commit: > [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd] > Build with SBT and check the assembly jar for > {{parquet.hadoop.api.WriteSupport}}: > {noformat} > $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 > clean assembly/assembly > ... > $ jar tf > assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | > fgrep "parquet/hadoop/api" > org/apache/parquet/hadoop/api/ > org/apache/parquet/hadoop/api/DelegatingReadSupport.class > org/apache/parquet/hadoop/api/DelegatingWriteSupport.class > org/apache/parquet/hadoop/api/InitContext.class > org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class > org/apache/parquet/hadoop/api/ReadSupport.class > org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class > org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class > org/apache/parquet/hadoop/api/WriteSupport.class > {noformat} > Only classes of {{org.apache.parquet:parquet-mr:1.7.0}}. Note that classes in > {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the {{org.apache}} > namespace. > Build with Maven and check the assembly jar for > {{parquet.hadoop.api.WriteSupport}}: > {noformat} > $ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 > -DskipTests clean package > $ jar tf > assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | > fgrep "parquet/hadoop/api" > org/apache/parquet/hadoop/api/ > org/apache/parquet/hadoop/api/DelegatingReadSupport.class > org/apache/parquet/hadoop/api/DelegatingWriteSupport.class > org/apache/parquet/hadoop/api/InitContext.class > org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class > org/apache/parquet/hadoop/api/ReadSupport.class > org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class > org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class > org/apache/parquet/hadoop/api/WriteSupport.class > parquet/hadoop/api/ > parquet/hadoop/api/DelegatingReadSupport.class > parquet/hadoop/api/DelegatingWriteSupport.class > parquet/hadoop/api/InitContext.class > parquet/hadoop/api/ReadSupport$ReadContext.class > parquet/hadoop/api/ReadSupport.class > parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class > parquet/hadoop/api/WriteSupport$WriteContext.class > parquet/hadoop/api/WriteSupport.class > {noformat} > Expected classes are packaged properly. 
> To reproduce the Parquet table access issue, first create a Parquet table > with Hive (say 0.13.1): > {noformat} > hive> CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1; > {noformat} > Build Spark assembly jar with the SBT command above, start {{spark-shell}}: > {noformat} > scala> sqlContext.table("parquet_test").show() > 15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default > tbl=parquet_test > 15/08/14 17:52:50 INFO audit: ugi=lian ip=unknown-ip-addr cmd=get_table > : db=default tbl=parquet_test > java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:270) > at > org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) > at > org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) > at > org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientW
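One possible shape of a fix on the SBT side (a hypothetical sketch only, not necessarily what the pull request attached to this ticket does) is to declare the Hive-compatible Parquet bundle as an explicit dependency of the module that needs it, so classes under the old {{parquet.*}} namespace end up in the assembly jar. The coordinates below are the ones named in this ticket.
{code}
// Hypothetical SBT snippet: make the Hive-compatible Parquet bundle an explicit
// dependency so that the old parquet.* namespace is packaged into the assembly jar.
libraryDependencies += "com.twitter" % "parquet-hadoop-bundle" % "1.6.0"
{code}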
[jira] [Commented] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar
[ https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696919#comment-14696919 ] Apache Spark commented on SPARK-9974: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8198 > SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the > assembly jar > > > Key: SPARK-9974 > URL: https://issues.apache.org/jira/browse/SPARK-9974 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Priority: Blocker > > One of the consequence of this issue is that Parquet tables created in Hive > are not accessible from Spark SQL built with SBT. Maven build is OK. > Git commit: > [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd] > Build with SBT and check the assembly jar for > {{parquet.hadoop.api.WriteSupport}}: > {noformat} > $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 > clean assembly/assembly > ... > $ jar tf > assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | > fgrep "parquet/hadoop/api" > org/apache/parquet/hadoop/api/ > org/apache/parquet/hadoop/api/DelegatingReadSupport.class > org/apache/parquet/hadoop/api/DelegatingWriteSupport.class > org/apache/parquet/hadoop/api/InitContext.class > org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class > org/apache/parquet/hadoop/api/ReadSupport.class > org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class > org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class > org/apache/parquet/hadoop/api/WriteSupport.class > {noformat} > Only classes of {{org.apache.parquet:parquet-mr:1.7.0}}. Note that classes in > {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the {{org.apache}} > namespace. > Build with Maven and check the assembly jar for > {{parquet.hadoop.api.WriteSupport}}: > {noformat} > $ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 > -DskipTests clean package > $ jar tf > assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | > fgrep "parquet/hadoop/api" > org/apache/parquet/hadoop/api/ > org/apache/parquet/hadoop/api/DelegatingReadSupport.class > org/apache/parquet/hadoop/api/DelegatingWriteSupport.class > org/apache/parquet/hadoop/api/InitContext.class > org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class > org/apache/parquet/hadoop/api/ReadSupport.class > org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class > org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class > org/apache/parquet/hadoop/api/WriteSupport.class > parquet/hadoop/api/ > parquet/hadoop/api/DelegatingReadSupport.class > parquet/hadoop/api/DelegatingWriteSupport.class > parquet/hadoop/api/InitContext.class > parquet/hadoop/api/ReadSupport$ReadContext.class > parquet/hadoop/api/ReadSupport.class > parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class > parquet/hadoop/api/WriteSupport$WriteContext.class > parquet/hadoop/api/WriteSupport.class > {noformat} > Expected classes are packaged properly. 
> To reproduce the Parquet table access issue, first create a Parquet table > with Hive (say 0.13.1): > {noformat} > hive> CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1; > {noformat} > Build Spark assembly jar with the SBT command above, start {{spark-shell}}: > {noformat} > scala> sqlContext.table("parquet_test").show() > 15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default > tbl=parquet_test > 15/08/14 17:52:50 INFO audit: ugi=lian ip=unknown-ip-addr cmd=get_table > : db=default tbl=parquet_test > java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:270) > at > org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) > at > org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala
[jira] [Assigned] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar
[ https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9974: --- Assignee: (was: Apache Spark) > SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the > assembly jar > > > Key: SPARK-9974 > URL: https://issues.apache.org/jira/browse/SPARK-9974 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Priority: Blocker > > One of the consequence of this issue is that Parquet tables created in Hive > are not accessible from Spark SQL built with SBT. Maven build is OK. > Git commit: > [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd] > Build with SBT and check the assembly jar for > {{parquet.hadoop.api.WriteSupport}}: > {noformat} > $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 > clean assembly/assembly > ... > $ jar tf > assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | > fgrep "parquet/hadoop/api" > org/apache/parquet/hadoop/api/ > org/apache/parquet/hadoop/api/DelegatingReadSupport.class > org/apache/parquet/hadoop/api/DelegatingWriteSupport.class > org/apache/parquet/hadoop/api/InitContext.class > org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class > org/apache/parquet/hadoop/api/ReadSupport.class > org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class > org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class > org/apache/parquet/hadoop/api/WriteSupport.class > {noformat} > Only classes of {{org.apache.parquet:parquet-mr:1.7.0}}. Note that classes in > {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the {{org.apache}} > namespace. > Build with Maven and check the assembly jar for > {{parquet.hadoop.api.WriteSupport}}: > {noformat} > $ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 > -DskipTests clean package > $ jar tf > assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | > fgrep "parquet/hadoop/api" > org/apache/parquet/hadoop/api/ > org/apache/parquet/hadoop/api/DelegatingReadSupport.class > org/apache/parquet/hadoop/api/DelegatingWriteSupport.class > org/apache/parquet/hadoop/api/InitContext.class > org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class > org/apache/parquet/hadoop/api/ReadSupport.class > org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class > org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class > org/apache/parquet/hadoop/api/WriteSupport.class > parquet/hadoop/api/ > parquet/hadoop/api/DelegatingReadSupport.class > parquet/hadoop/api/DelegatingWriteSupport.class > parquet/hadoop/api/InitContext.class > parquet/hadoop/api/ReadSupport$ReadContext.class > parquet/hadoop/api/ReadSupport.class > parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class > parquet/hadoop/api/WriteSupport$WriteContext.class > parquet/hadoop/api/WriteSupport.class > {noformat} > Expected classes are packaged properly. 
> To reproduce the Parquet table access issue, first create a Parquet table > with Hive (say 0.13.1): > {noformat} > hive> CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1; > {noformat} > Build Spark assembly jar with the SBT command above, start {{spark-shell}}: > {noformat} > scala> sqlContext.table("parquet_test").show() > 15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default > tbl=parquet_test > 15/08/14 17:52:50 INFO audit: ugi=lian ip=unknown-ip-addr cmd=get_table > : db=default tbl=parquet_test > java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:270) > at > org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) > at > org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) > at > org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248) >
[jira] [Updated] (SPARK-6624) Convert filters into CNF for data sources
[ https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6624: -- Assignee: Yijie Shen > Convert filters into CNF for data sources > - > > Key: SPARK-6624 > URL: https://issues.apache.org/jira/browse/SPARK-6624 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Yijie Shen > > We should turn filters into conjunctive normal form (CNF) before we pass them > to data sources. Otherwise, filters are not very useful if there is a single > filter with a bunch of ORs. > Note that we already try to do some of these in BooleanSimplification, but I > think we should just formalize it to use CNF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
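For illustration, here is a minimal, self-contained sketch of the CNF rewrite on a toy boolean expression type (not Catalyst's {{Expression}} classes): it distributes OR over AND so that the result is a conjunction whose conjuncts can be pushed to data sources individually. The usual caveat applies: this rewrite can grow the expression exponentially in the worst case.
{code}
// Toy expression tree and CNF conversion (illustrative only).
sealed trait Expr
case class Pred(name: String)    extends Expr  // an opaque predicate, e.g. a = 1
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr)  extends Expr

def toCNF(e: Expr): Expr = e match {
  case And(l, r) => And(toCNF(l), toCNF(r))
  case Or(l, r) => (toCNF(l), toCNF(r)) match {
    case (And(a, b), c) => And(toCNF(Or(a, c)), toCNF(Or(b, c)))  // (a AND b) OR c
    case (a, And(b, c)) => And(toCNF(Or(a, b)), toCNF(Or(a, c)))  // a OR (b AND c)
    case (a, b)         => Or(a, b)
  }
  case leaf => leaf
}

// (a AND b) OR c  ==>  (a OR c) AND (b OR c)
toCNF(Or(And(Pred("a"), Pred("b")), Pred("c")))
{code}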
[jira] [Commented] (SPARK-9966) Handle a couple of corner cases in the PID rate estimator
[ https://issues.apache.org/jira/browse/SPARK-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696928#comment-14696928 ] Apache Spark commented on SPARK-9966: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/8199 > Handle a couple of corner cases in the PID rate estimator > - > > Key: SPARK-9966 > URL: https://issues.apache.org/jira/browse/SPARK-9966 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > > 1. The rate estimator should not estimate any rate when there are no records > in the batch, as there is no data to estimate the rate. In the current state, > it estimates and sets the rate to zero. That is incorrect. > 2. The rate estimator should never set the rate to zero under any > circumstances. Otherwise the system will stop receiving data, and stop > generating useful estimates (see reason 1). So the fix is to define a > parameter that sets a lower bound on the estimated rate, so that the system > always receives some data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
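A minimal sketch of the two guards described above (method and parameter names here are assumptions, not the actual {{PIDRateEstimator}} API, which also takes processing and scheduling delays into account): skip estimation when the batch was empty, and clamp the result to a configurable lower bound so the computed rate can never reach zero.
{code}
// Illustrative only.
def boundedRate(numElements: Long, newRate: Double, minRate: Double,
                previousRate: Double): Double = {
  if (numElements == 0) previousRate   // 1. no records in the batch: keep the previous rate
  else math.max(newRate, minRate)      // 2. never let the estimated rate drop to zero
}
{code}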
[jira] [Assigned] (SPARK-9966) Handle a couple of corner cases in the PID rate estimator
[ https://issues.apache.org/jira/browse/SPARK-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9966: --- Assignee: Tathagata Das (was: Apache Spark) > Handle a couple of corner cases in the PID rate estimator > - > > Key: SPARK-9966 > URL: https://issues.apache.org/jira/browse/SPARK-9966 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > > 1. The rate estimator should not estimate any rate when there are no records > in the batch, as there is no data to estimate the rate. In the current state, > it estimates and sets the rate to zero. That is incorrect. > 2. The rate estimator should never set the rate to zero under any > circumstances. Otherwise the system will stop receiving data, and stop > generating useful estimates (see reason 1). So the fix is to define a > parameter that sets a lower bound on the estimated rate, so that the system > always receives some data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9966) Handle a couple of corner cases in the PID rate estimator
[ https://issues.apache.org/jira/browse/SPARK-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9966: --- Assignee: Apache Spark (was: Tathagata Das) > Handle a couple of corner cases in the PID rate estimator > - > > Key: SPARK-9966 > URL: https://issues.apache.org/jira/browse/SPARK-9966 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Blocker > > 1. The rate estimator should not estimate any rate when there are no records > in the batch, as there is no data to estimate the rate. In the current state, > it estimates and sets the rate to zero. That is incorrect. > 2. The rate estimator should never set the rate to zero under any > circumstances. Otherwise the system will stop receiving data, and stop > generating useful estimates (see reason 1). So the fix is to define a > parameter that sets a lower bound on the estimated rate, so that the system > always receives some data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6624) Convert filters into CNF for data sources
[ https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696936#comment-14696936 ] Apache Spark commented on SPARK-6624: - User 'yjshen' has created a pull request for this issue: https://github.com/apache/spark/pull/8200 > Convert filters into CNF for data sources > - > > Key: SPARK-6624 > URL: https://issues.apache.org/jira/browse/SPARK-6624 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Yijie Shen > > We should turn filters into conjunctive normal form (CNF) before we pass them > to data sources. Otherwise, filters are not very useful if there is a single > filter with a bunch of ORs. > Note that we already try to do some of these in BooleanSimplification, but I > think we should just formalize it to use CNF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8887) Explicitly define which data types can be used as dynamic partition columns
[ https://issues.apache.org/jira/browse/SPARK-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696937#comment-14696937 ] Apache Spark commented on SPARK-8887: - User 'yjshen' has created a pull request for this issue: https://github.com/apache/spark/pull/8201 > Explicitly define which data types can be used as dynamic partition columns > --- > > Key: SPARK-8887 > URL: https://issues.apache.org/jira/browse/SPARK-8887 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.4.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > {{InsertIntoHadoopFsRelation}} implements Hive-compatible dynamic > partitioning insertion, which uses {{String.valueOf}} to encode > partition column values into dynamic partition directories. This actually > limits the data types that can be used as partition columns. For example, > string representation of {{StructType}} values is not well defined. However, > this limitation is not explicitly enforced. > There are several things we can improve: > # Enforce dynamic column data type requirements by adding analysis rules and > throwing {{AnalysisException}} when a violation occurs. > # Abstract away string representation of various data types, so that we don't > need to convert internal representation types (e.g. {{UTF8String}}) to > external types (e.g. {{String}}). A set of Hive-compatible implementations > should be provided to ensure compatibility with Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
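As a rough illustration of point 1 above (the function name and exact whitelist are assumptions, not the rule that was eventually merged), the analysis check would boil down to accepting only types with a well-defined string form and rejecting the rest:
{code}
import org.apache.spark.sql.types._

// Hypothetical whitelist check for dynamic partition column types.
def isAllowedPartitionType(dt: DataType): Boolean = dt match {
  case StringType | IntegerType | LongType | ShortType | ByteType |
       BooleanType | DateType | TimestampType | _: DecimalType => true
  case _ => false  // StructType, ArrayType, MapType, etc. would be rejected at analysis time
}
{code}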
[jira] [Commented] (SPARK-8670) Nested columns can't be referenced (but they can be selected)
[ https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696938#comment-14696938 ] Apache Spark commented on SPARK-8670: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/8202 > Nested columns can't be referenced (but they can be selected) > - > > Key: SPARK-8670 > URL: https://issues.apache.org/jira/browse/SPARK-8670 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.0 >Reporter: Nicholas Chammas >Assignee: Wenchen Fan >Priority: Blocker > > This is strange and looks like a regression from 1.3. > {code} > import json > daterz = [ > { > 'name': 'Nick', > 'stats': { > 'age': 28 > } > }, > { > 'name': 'George', > 'stats': { > 'age': 31 > } > } > ] > df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x))) > df.select('stats.age').show() > df['stats.age'] # 1.4 fails on this line > {code} > On 1.3 this works and yields: > {code} > age > 28 > 31 > Out[1]: Column > {code} > On 1.4, however, this gives an error on the last line: > {code} > +---+ > |age| > +---+ > | 28| > | 31| > +---+ > --- > IndexErrorTraceback (most recent call last) > in () > 19 > 20 df.select('stats.age').show() > ---> 21 df['stats.age'] > /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item) > 678 if isinstance(item, basestring): > 679 if item not in self.columns: > --> 680 raise IndexError("no such column: %s" % item) > 681 jc = self._jdf.apply(item) > 682 return Column(jc) > IndexError: no such column: stats.age > {code} > This means, among other things, that you can't join DataFrames on nested > columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names
[ https://issues.apache.org/jira/browse/SPARK-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-9970: -- Priority: Major (was: Minor) > SQLContext.createDataFrame failed to properly determine column names > > > Key: SPARK-9970 > URL: https://issues.apache.org/jira/browse/SPARK-9970 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0 >Reporter: Maciej Bryński > > Hi, > I'm trying to do "nested join" of tables. > After first join everything is ok, but second join made some of the column > names lost. > My code is following: > {code:java} > def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, > joinType = "left_outer"): > tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: > r.asDict()[columnRight])) > tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), > tmpTable._2.data.alias(columnNested)) > return tableLeft.join(tmpTable, tableLeft[columnLeft] == > tmpTable["joinColumn"], joinType).drop("joinColumn") > user = sqlContext.read.json(path + "user.json") > user.printSchema() > root > |-- id: long (nullable = true) > |-- name: string (nullable = true) > order = sqlContext.read.json(path + "order.json") > order.printSchema(); > root > |-- id: long (nullable = true) > |-- price: double (nullable = true) > |-- userid: long (nullable = true) > lines = sqlContext.read.json(path + "lines.json") > lines.printSchema(); > root > |-- id: long (nullable = true) > |-- orderid: long (nullable = true) > |-- product: string (nullable = true) > {code} > And joining code: > Please look for the result of 2nd join. There are columns _1,_2,_3. Should be > 'id', 'orderid', 'product' > {code:java} > orders = joinTable(order, lines, "id", "orderid", "lines") > orders.printSchema() > root > |-- id: long (nullable = true) > |-- price: double (nullable = true) > |-- userid: long (nullable = true) > |-- lines: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- id: long (nullable = true) > |||-- orderid: long (nullable = true) > |||-- product: string (nullable = true) > clients = joinTable(user, orders, "id", "userid", "orders") > clients.printSchema() > root > |-- id: long (nullable = true) > |-- name: string (nullable = true) > |-- orders: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- id: long (nullable = true) > |||-- price: double (nullable = true) > |||-- userid: long (nullable = true) > |||-- lines: array (nullable = true) > ||||-- element: struct (containsNull = true) > |||||-- _1: long (nullable = true) > |||||-- _2: long (nullable = true) > |||||-- _3: string (nullable = true) > {code} > I tried to check if groupBy isn't the cause of the problem but it looks right > {code:java} > grouped = orders.rdd.groupBy(lambda r: r.userid) > print grouped.map(lambda x: list(x[1])).collect() > [[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, > product=u'XXX'), Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, > price=343.99, userid=1, lines=[Row(id=3, orderid=2, product=u'XXX')])], > [Row(id=3, price=399.99, userid=2, lines=[Row(id=4, orderid=3, > product=u'XXX')])]] > {code} > So I assume that the problem is with createDataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9976) create function do not work
cen yuhai created SPARK-9976: Summary: create function do not work Key: SPARK-9976 URL: https://issues.apache.org/jira/browse/SPARK-9976 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1, 1.4.0, 1.5.0 Environment: spark 1.4.1 yarn 2.2.0 Reporter: cen yuhai Fix For: 1.4.2, 1.5.0 I use beeline to connect to ThriftServer, but add jar can not work, so I use create function , see the link below. http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_hive_udf.html I do as blew: create function gdecodeorder as 'com.hive.udf.GOrderDecode' USING JAR 'hdfs://mycluster/user/spark/lib/gorderdecode.jar'; It returns Ok, and I connect to the metastore, I see records in table FUNCS. select gdecodeorder(t1) from tableX limit 1; It returns error 'Couldn't find function default.gdecodeorder' 15/08/14 14:53:51 ERROR UserGroupInformation: PriviledgedActionException as:xiaoju (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException: Couldn't find function default.gdecodeorder 15/08/14 15:04:47 ERROR RetryingHMSHandler: MetaException(message:NoSuchObjectException(message:Function default.t_gdecodeorder does not exist)) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newMetaException(HiveMetaStore.java:4613) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_function(HiveMetaStore.java:4740) at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) at com.sun.proxy.$Proxy21.get_function(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunction(HiveMetaStoreClient.java:1721) at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy22.getFunction(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getFunction(Hive.java:2662) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfoFromMetastore(FunctionRegistry.java:546) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getQualifiedFunctionInfo(FunctionRegistry.java:579) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:645) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652) at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54) at org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:376) at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44) at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44) at org.apache.spark.sql.hive.HiveContext$$anon$3.lookupFunction(HiveContext.scala:376) at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:465) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:463) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.Arra
[jira] [Commented] (SPARK-9955) TPCDS Q8 failed in 1.5
[ https://issues.apache.org/jira/browse/SPARK-9955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696975#comment-14696975 ] Apache Spark commented on SPARK-9955: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/8203 > TPCDS Q8 failed in 1.5 > --- > > Key: SPARK-9955 > URL: https://issues.apache.org/jira/browse/SPARK-9955 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Wenchen Fan >Priority: Critical > > {code} > -- start query 1 in stream 0 using template query8.tpl > select > s_store_name, > sum(ss_net_profit) > from > store_sales > join store on (store_sales.ss_store_sk = store.s_store_sk) > -- join date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk) > join > (select > a.ca_zip > from > (select > substr(ca_zip, 1, 5) ca_zip, > count( *) cnt > from > customer_address > join customer on (customer_address.ca_address_sk = > customer.c_current_addr_sk) > where > c_preferred_cust_flag = 'Y' > group by > ca_zip > having > count( *) > 10 > ) a > left semi join > (select > substr(ca_zip, 1, 5) ca_zip > from > customer_address > where > substr(ca_zip, 1, 5) in ('89436', '30868', '65085', '22977', '83927', > '77557', '58429', '40697', '80614', '10502', '32779', > '91137', '61265', '98294', '17921', '18427', '21203', '59362', '87291', > '84093', '21505', '17184', '10866', '67898', '25797', > '28055', '18377', '80332', '74535', '21757', '29742', '90885', '29898', > '17819', '40811', '25990', '47513', '89531', '91068', > '10391', '18846', '99223', '82637', '41368', '83658', '86199', '81625', > '26696', '89338', '88425', '32200', '81427', '19053', > '77471', '36610', '99823', '43276', '41249', '48584', '83550', '82276', > '18842', '78890', '14090', '38123', '40936', '34425', > '19850', '43286', '80072', '79188', '54191', '11395', '50497', '84861', > '90733', '21068', '57666', '37119', '25004', '57835', > '70067', '62878', '95806', '19303', '18840', '19124', '29785', '16737', > '16022', '49613', '89977', '68310', '60069', '98360', > '48649', '39050', '41793', '25002', '27413', '39736', '47208', '16515', > '94808', '57648', '15009', '80015', '42961', '63982', > '21744', '71853', '81087', '67468', '34175', '64008', '20261', '11201', > '51799', '48043', '45645', '61163', '48375', '36447', > '57042', '21218', '41100', '89951', '22745', '35851', '83326', '61125', > '78298', '80752', '49858', '52940', '96976', '63792', > '11376', '53582', '18717', '90226', '50530', '94203', '99447', '27670', > '96577', '57856', '56372', '16165', '23427', '54561', > '28806', '44439', '22926', '30123', '61451', '92397', '56979', '92309', > '70873', '13355', '21801', '46346', '37562', '56458', > '28286', '47306', '99555', '69399', '26234', '47546', '49661', '88601', > '35943', '39936', '25632', '24611', '44166', '56648', > '30379', '59785', '0', '14329', '93815', '52226', '71381', '13842', > '25612', '63294', '14664', '21077', '82626', '18799', > '60915', '81020', '56447', '76619', '11433', '13414', '42548', '92713', > '70467', '30884', '47484', '16072', '38936', '13036', > '88376', '45539', '35901', '19506', '65690', '73957', '71850', '49231', > '14276', '20005', '18384', '76615', '11635', '38177', > '55607', '41369', '95447', '58581', '58149', '91946', '33790', '76232', > '75692', '95464', '22246', '51061', '56692', '53121', > '77209', '15482', '10688', '14868', '45907', '73520', '72666', '25734', > '17959', '24677', '66446', '94627', '53535', '15560', > '41967', '69297', '11929', 
'59403', '33283', '52232', '57350', '43933', > '40921', '36635', '10827', '71286', '19736', '80619', > '25251', '95042', '15526', '36496', '55854', '49124', '81980', '35375', > '49157', '63512', '28944', '14946', '36503', '54010', > '18767', '23969', '43905', '66979', '33113', '21286', '58471', '59080', > '13395', '79144', '70373', '67031', '38360', '26705', > '50906', '52406', '26066', '73146', '15884', '31897', '30045', '61068', > '45550', '92454', '13376', '14354', '19770', '22928', > '97790', '50723', '46081', '30202', '14410', '20223', '88500', '67298', > '13261', '14172', '81410', '93578', '83583', '46047', > '94167', '82564', '21156', '15799', '86709', '37931', '74703', '83103', > '23054', '70470', '72008', '49247', '91911', '69998', > '20961', '70070', '63197', '54853', '88191', '91830', '49521', '19454', > '81450', '89091', '62378', '25683', '61869', '51744', > '36580', '85778', '36871', '48121', '28810', '83712', '45486', '67393', > '26935', '42393', '20132', '
[jira] [Assigned] (SPARK-9955) TPCDS Q8 failed in 1.5
[ https://issues.apache.org/jira/browse/SPARK-9955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9955: --- Assignee: Apache Spark (was: Wenchen Fan) > TPCDS Q8 failed in 1.5 > --- > > Key: SPARK-9955 > URL: https://issues.apache.org/jira/browse/SPARK-9955 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark >Priority: Critical > > {code} > -- start query 1 in stream 0 using template query8.tpl > select > s_store_name, > sum(ss_net_profit) > from > store_sales > join store on (store_sales.ss_store_sk = store.s_store_sk) > -- join date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk) > join > (select > a.ca_zip > from > (select > substr(ca_zip, 1, 5) ca_zip, > count( *) cnt > from > customer_address > join customer on (customer_address.ca_address_sk = > customer.c_current_addr_sk) > where > c_preferred_cust_flag = 'Y' > group by > ca_zip > having > count( *) > 10 > ) a > left semi join > (select > substr(ca_zip, 1, 5) ca_zip > from > customer_address > where > substr(ca_zip, 1, 5) in ('89436', '30868', '65085', '22977', '83927', > '77557', '58429', '40697', '80614', '10502', '32779', > '91137', '61265', '98294', '17921', '18427', '21203', '59362', '87291', > '84093', '21505', '17184', '10866', '67898', '25797', > '28055', '18377', '80332', '74535', '21757', '29742', '90885', '29898', > '17819', '40811', '25990', '47513', '89531', '91068', > '10391', '18846', '99223', '82637', '41368', '83658', '86199', '81625', > '26696', '89338', '88425', '32200', '81427', '19053', > '77471', '36610', '99823', '43276', '41249', '48584', '83550', '82276', > '18842', '78890', '14090', '38123', '40936', '34425', > '19850', '43286', '80072', '79188', '54191', '11395', '50497', '84861', > '90733', '21068', '57666', '37119', '25004', '57835', > '70067', '62878', '95806', '19303', '18840', '19124', '29785', '16737', > '16022', '49613', '89977', '68310', '60069', '98360', > '48649', '39050', '41793', '25002', '27413', '39736', '47208', '16515', > '94808', '57648', '15009', '80015', '42961', '63982', > '21744', '71853', '81087', '67468', '34175', '64008', '20261', '11201', > '51799', '48043', '45645', '61163', '48375', '36447', > '57042', '21218', '41100', '89951', '22745', '35851', '83326', '61125', > '78298', '80752', '49858', '52940', '96976', '63792', > '11376', '53582', '18717', '90226', '50530', '94203', '99447', '27670', > '96577', '57856', '56372', '16165', '23427', '54561', > '28806', '44439', '22926', '30123', '61451', '92397', '56979', '92309', > '70873', '13355', '21801', '46346', '37562', '56458', > '28286', '47306', '99555', '69399', '26234', '47546', '49661', '88601', > '35943', '39936', '25632', '24611', '44166', '56648', > '30379', '59785', '0', '14329', '93815', '52226', '71381', '13842', > '25612', '63294', '14664', '21077', '82626', '18799', > '60915', '81020', '56447', '76619', '11433', '13414', '42548', '92713', > '70467', '30884', '47484', '16072', '38936', '13036', > '88376', '45539', '35901', '19506', '65690', '73957', '71850', '49231', > '14276', '20005', '18384', '76615', '11635', '38177', > '55607', '41369', '95447', '58581', '58149', '91946', '33790', '76232', > '75692', '95464', '22246', '51061', '56692', '53121', > '77209', '15482', '10688', '14868', '45907', '73520', '72666', '25734', > '17959', '24677', '66446', '94627', '53535', '15560', > '41967', '69297', '11929', '59403', '33283', '52232', '57350', '43933', > '40921', '36635', '10827', '71286', '19736', '80619', > '25251', 
'95042', '15526', '36496', '55854', '49124', '81980', '35375', > '49157', '63512', '28944', '14946', '36503', '54010', > '18767', '23969', '43905', '66979', '33113', '21286', '58471', '59080', > '13395', '79144', '70373', '67031', '38360', '26705', > '50906', '52406', '26066', '73146', '15884', '31897', '30045', '61068', > '45550', '92454', '13376', '14354', '19770', '22928', > '97790', '50723', '46081', '30202', '14410', '20223', '88500', '67298', > '13261', '14172', '81410', '93578', '83583', '46047', > '94167', '82564', '21156', '15799', '86709', '37931', '74703', '83103', > '23054', '70470', '72008', '49247', '91911', '69998', > '20961', '70070', '63197', '54853', '88191', '91830', '49521', '19454', > '81450', '89091', '62378', '25683', '61869', '51744', > '36580', '85778', '36871', '48121', '28810', '83712', '45486', '67393', > '26935', '42393', '20132', '55349', '86057', '21309', > '80218', '10094', '11357', '48819', '39734', '40758', '30432', '21204',
[jira] [Assigned] (SPARK-9955) TPCDS Q8 failed in 1.5
[ https://issues.apache.org/jira/browse/SPARK-9955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9955: --- Assignee: Wenchen Fan (was: Apache Spark) > TPCDS Q8 failed in 1.5 > --- > > Key: SPARK-9955 > URL: https://issues.apache.org/jira/browse/SPARK-9955 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Wenchen Fan >Priority: Critical > > {code} > -- start query 1 in stream 0 using template query8.tpl > select > s_store_name, > sum(ss_net_profit) > from > store_sales > join store on (store_sales.ss_store_sk = store.s_store_sk) > -- join date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk) > join > (select > a.ca_zip > from > (select > substr(ca_zip, 1, 5) ca_zip, > count( *) cnt > from > customer_address > join customer on (customer_address.ca_address_sk = > customer.c_current_addr_sk) > where > c_preferred_cust_flag = 'Y' > group by > ca_zip > having > count( *) > 10 > ) a > left semi join > (select > substr(ca_zip, 1, 5) ca_zip > from > customer_address > where > substr(ca_zip, 1, 5) in ('89436', '30868', '65085', '22977', '83927', > '77557', '58429', '40697', '80614', '10502', '32779', > '91137', '61265', '98294', '17921', '18427', '21203', '59362', '87291', > '84093', '21505', '17184', '10866', '67898', '25797', > '28055', '18377', '80332', '74535', '21757', '29742', '90885', '29898', > '17819', '40811', '25990', '47513', '89531', '91068', > '10391', '18846', '99223', '82637', '41368', '83658', '86199', '81625', > '26696', '89338', '88425', '32200', '81427', '19053', > '77471', '36610', '99823', '43276', '41249', '48584', '83550', '82276', > '18842', '78890', '14090', '38123', '40936', '34425', > '19850', '43286', '80072', '79188', '54191', '11395', '50497', '84861', > '90733', '21068', '57666', '37119', '25004', '57835', > '70067', '62878', '95806', '19303', '18840', '19124', '29785', '16737', > '16022', '49613', '89977', '68310', '60069', '98360', > '48649', '39050', '41793', '25002', '27413', '39736', '47208', '16515', > '94808', '57648', '15009', '80015', '42961', '63982', > '21744', '71853', '81087', '67468', '34175', '64008', '20261', '11201', > '51799', '48043', '45645', '61163', '48375', '36447', > '57042', '21218', '41100', '89951', '22745', '35851', '83326', '61125', > '78298', '80752', '49858', '52940', '96976', '63792', > '11376', '53582', '18717', '90226', '50530', '94203', '99447', '27670', > '96577', '57856', '56372', '16165', '23427', '54561', > '28806', '44439', '22926', '30123', '61451', '92397', '56979', '92309', > '70873', '13355', '21801', '46346', '37562', '56458', > '28286', '47306', '99555', '69399', '26234', '47546', '49661', '88601', > '35943', '39936', '25632', '24611', '44166', '56648', > '30379', '59785', '0', '14329', '93815', '52226', '71381', '13842', > '25612', '63294', '14664', '21077', '82626', '18799', > '60915', '81020', '56447', '76619', '11433', '13414', '42548', '92713', > '70467', '30884', '47484', '16072', '38936', '13036', > '88376', '45539', '35901', '19506', '65690', '73957', '71850', '49231', > '14276', '20005', '18384', '76615', '11635', '38177', > '55607', '41369', '95447', '58581', '58149', '91946', '33790', '76232', > '75692', '95464', '22246', '51061', '56692', '53121', > '77209', '15482', '10688', '14868', '45907', '73520', '72666', '25734', > '17959', '24677', '66446', '94627', '53535', '15560', > '41967', '69297', '11929', '59403', '33283', '52232', '57350', '43933', > '40921', '36635', '10827', '71286', '19736', '80619', > '25251', 
'95042', '15526', '36496', '55854', '49124', '81980', '35375', > '49157', '63512', '28944', '14946', '36503', '54010', > '18767', '23969', '43905', '66979', '33113', '21286', '58471', '59080', > '13395', '79144', '70373', '67031', '38360', '26705', > '50906', '52406', '26066', '73146', '15884', '31897', '30045', '61068', > '45550', '92454', '13376', '14354', '19770', '22928', > '97790', '50723', '46081', '30202', '14410', '20223', '88500', '67298', > '13261', '14172', '81410', '93578', '83583', '46047', > '94167', '82564', '21156', '15799', '86709', '37931', '74703', '83103', > '23054', '70470', '72008', '49247', '91911', '69998', > '20961', '70070', '63197', '54853', '88191', '91830', '49521', '19454', > '81450', '89091', '62378', '25683', '61869', '51744', > '36580', '85778', '36871', '48121', '28810', '83712', '45486', '67393', > '26935', '42393', '20132', '55349', '86057', '21309', > '80218', '10094', '11357', '48819', '39734', '40758', '30432', '21204',
[jira] [Assigned] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread
[ https://issues.apache.org/jira/browse/SPARK-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9968: --- Assignee: Apache Spark (was: Tathagata Das) > BlockGenerator lock structure can cause lock starvation of the block updating > thread > > > Key: SPARK-9968 > URL: https://issues.apache.org/jira/browse/SPARK-9968 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Apache Spark > > When the rate limiter is actually limiting the rate at which data is inserted > into the buffer, the synchronized block of BlockGenerator.addData stays > blocked for a long time. This causes the thread that switches the buffer and > generates blocks (synchronized with addData) to starve and not generate > blocks for seconds. The correct solution is to not block on the rate limiter > within the synchronized block for adding data to the buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread
[ https://issues.apache.org/jira/browse/SPARK-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9968: --- Assignee: Tathagata Das (was: Apache Spark) > BlockGenerator lock structure can cause lock starvation of the block updating > thread > > > Key: SPARK-9968 > URL: https://issues.apache.org/jira/browse/SPARK-9968 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > When the rate limiter is actually limiting the rate at which data is inserted > into the buffer, the synchronized block of BlockGenerator.addData stays > blocked for a long time. This causes the thread that switches the buffer and > generates blocks (synchronized with addData) to starve and not generate > blocks for seconds. The correct solution is to not block on the rate limiter > within the synchronized block for adding data to the buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread
[ https://issues.apache.org/jira/browse/SPARK-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696987#comment-14696987 ] Apache Spark commented on SPARK-9968: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/8204 > BlockGenerator lock structure can cause lock starvation of the block updating > thread > > > Key: SPARK-9968 > URL: https://issues.apache.org/jira/browse/SPARK-9968 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > When the rate limiter is actually limiting the rate at which data is inserted > into the buffer, the synchronized block of BlockGenerator.addData stays > blocked for a long time. This causes the thread that switches the buffer and > generates blocks (synchronized with addData) to starve and not generate > blocks for seconds. The correct solution is to not block on the rate limiter > within the synchronized block for adding data to the buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
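The shape of the fix described above can be sketched as follows (names are illustrative, not the actual {{BlockGenerator}} code): the potentially long wait on the rate limiter happens before the lock is taken, so the thread that swaps buffers and generates blocks only competes for a very short critical section.
{code}
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch of the locking structure, assuming a waitToPush callback
// that blocks until the rate limiter allows another record.
class BufferSketch(waitToPush: () => Unit) {
  private val currentBuffer = new ArrayBuffer[Any]

  def addData(data: Any): Unit = {
    waitToPush()               // may block for a long time; done outside the lock
    synchronized {
      currentBuffer += data    // only the cheap buffer append holds the lock
    }
  }
}
{code}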
[jira] [Commented] (SPARK-6624) Convert filters into CNF for data sources
[ https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696988#comment-14696988 ] Apache Spark commented on SPARK-6624: - User 'yjshen' has created a pull request for this issue: https://github.com/apache/spark/pull/8193 > Convert filters into CNF for data sources > - > > Key: SPARK-6624 > URL: https://issues.apache.org/jira/browse/SPARK-6624 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Yijie Shen > > We should turn filters into conjunctive normal form (CNF) before we pass them > to data sources. Otherwise, filters are not very useful if there is a single > filter with a bunch of ORs. > Note that we already try to do some of these in BooleanSimplification, but I > think we should just formalize it to use CNF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9734) java.lang.IllegalArgumentException: Don't know how to save StructField(sal,DecimalType(7,2),true) to JDBC
[ https://issues.apache.org/jira/browse/SPARK-9734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697003#comment-14697003 ] Rama Mullapudi commented on SPARK-9734: --- Current 1.5 still gives error when creating table using dataframe write.jdbc Create statement issued by spark looks as below CREATE TABLE foo (TKT_GID DECIMAL(10},0}) NOT NULL) There are closing braces } in the decimal format which causing database to throw error. I looked into the code on github and found in jdbcutils class schemaString function has the extra closing braces } which is causing the issue. /** * Compute the schema string for this RDD. */ def schemaString(df: DataFrame, url: String): String = { . case BooleanType => "BIT(1)" case StringType => "TEXT" case BinaryType => "BLOB" case TimestampType => "TIMESTAMP" case DateType => "DATE" case t: DecimalType => s"DECIMAL(${t.precision}},${t.scale}})" case _ => throw new IllegalArgumentException(s"Don't know how to save $field to JDBC") }) } > java.lang.IllegalArgumentException: Don't know how to save > StructField(sal,DecimalType(7,2),true) to JDBC > - > > Key: SPARK-9734 > URL: https://issues.apache.org/jira/browse/SPARK-9734 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Greg Rahn >Assignee: Davies Liu > Fix For: 1.5.0 > > > When using a basic example of reading the EMP table from Redshift via > spark-redshift, and writing the data back to Redshift, Spark fails with the > below error, related to Numeric/Decimal data types. > Redshift table: > {code} > testdb=# \d emp > Table "public.emp" > Column | Type | Modifiers > --+---+--- > empno| integer | > ename| character varying(10) | > job | character varying(9) | > mgr | integer | > hiredate | date | > sal | numeric(7,2) | > comm | numeric(7,2) | > deptno | integer | > testdb=# select * from emp; > empno | ename |job| mgr | hiredate | sal | comm | deptno > ---++---+--++-+-+ > 7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800.00 |NULL | 20 > 7521 | WARD | SALESMAN | 7698 | 1981-02-22 | 1250.00 | 500.00 | 30 > 7654 | MARTIN | SALESMAN | 7698 | 1981-09-28 | 1250.00 | 1400.00 | 30 > 7782 | CLARK | MANAGER | 7839 | 1981-06-09 | 2450.00 |NULL | 10 > 7839 | KING | PRESIDENT | NULL | 1981-11-17 | 5000.00 |NULL | 10 > 7876 | ADAMS | CLERK | 7788 | 1983-01-12 | 1100.00 |NULL | 20 > 7902 | FORD | ANALYST | 7566 | 1981-12-03 | 3000.00 |NULL | 20 > 7499 | ALLEN | SALESMAN | 7698 | 1981-02-20 | 1600.00 | 300.00 | 30 > 7566 | JONES | MANAGER | 7839 | 1981-04-02 | 2975.00 |NULL | 20 > 7698 | BLAKE | MANAGER | 7839 | 1981-05-01 | 2850.00 |NULL | 30 > 7788 | SCOTT | ANALYST | 7566 | 1982-12-09 | 3000.00 |NULL | 20 > 7844 | TURNER | SALESMAN | 7698 | 1981-09-08 | 1500.00 |0.00 | 30 > 7900 | JAMES | CLERK | 7698 | 1981-12-03 | 950.00 |NULL | 30 > 7934 | MILLER | CLERK | 7782 | 1982-01-23 | 1300.00 |NULL | 10 > (14 rows) > {code} > Spark Code: > {code} > val url = "jdbc:redshift://rshost:5439/testdb?user=xxx&password=xxx" > val driver = "com.amazon.redshift.jdbc41.Driver" > val t = > sqlContext.read.format("com.databricks.spark.redshift").option("jdbcdriver", > driver).option("url", url).option("dbtable", "emp").option("tempdir", > "s3n://spark-temp-dir").load() > t.registerTempTable("SparkTempTable") > val t1 = sqlContext.sql("select * from SparkTempTable") > t1.write.format("com.databricks.spark.redshift").option("driver", > driver).option("url", url).option("dbtable", "t1").option("tempdir", > "s3n://spark-temp-dir").option("avrocompression", > 
"snappy").mode("error").save() > {code} > Error Stack: > {code} > java.lang.IllegalArgumentException: Don't know how to save > StructField(sal,DecimalType(7,2),true) to JDBC > at > org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:149) > at > org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:136) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1.apply(jdbc.scala:135) > at > org.apache.spark.sql.jdbc.package$JDBC
[jira] [Created] (SPARK-9977) The usage of a label generated by StringIndexer
Kai Sasaki created SPARK-9977: - Summary: The usage of a label generated by StringIndexer Key: SPARK-9977 URL: https://issues.apache.org/jira/browse/SPARK-9977 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Kai Sasaki Priority: Trivial By using {{StringIndexer}}, we can obtain an indexed label in a new column. So a downstream estimator in the pipeline should use this new column if it wants the string-indexed label. I think it is better to make this explicit in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
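A short example of the pattern the ticket asks to document (column names here are examples): the downstream estimator must be pointed at the indexer's output column rather than the raw string column.
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")        // raw string label
  .setOutputCol("indexedLabel")   // numeric label produced by StringIndexer

val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("indexedLabel")    // must reference the indexed column, not "category"

val pipeline = new Pipeline().setStages(Array(indexer, lr))
{code}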
[jira] [Updated] (SPARK-9976) create function do not work
[ https://issues.apache.org/jira/browse/SPARK-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cen yuhai updated SPARK-9976: - Description: I use beeline to connect to ThriftServer, but add jar can not work, so I use create function , see the link below. http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_hive_udf.html I do as blow: create function gdecodeorder as 'com.hive.udf.GOrderDecode' USING JAR 'hdfs://mycluster/user/spark/lib/gorderdecode.jar'; It returns Ok, and I connect to the metastore, I see records in table FUNCS. select gdecodeorder(t1) from tableX limit 1; It returns error 'Couldn't find function default.gdecodeorder' This is the Exception 15/08/14 14:53:51 ERROR UserGroupInformation: PriviledgedActionException as:xiaoju (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException: Couldn't find function default.gdecodeorder 15/08/14 15:04:47 ERROR RetryingHMSHandler: MetaException(message:NoSuchObjectException(message:Function default.t_gdecodeorder does not exist)) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newMetaException(HiveMetaStore.java:4613) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_function(HiveMetaStore.java:4740) at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) at com.sun.proxy.$Proxy21.get_function(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunction(HiveMetaStoreClient.java:1721) at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy22.getFunction(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getFunction(Hive.java:2662) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfoFromMetastore(FunctionRegistry.java:546) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getQualifiedFunctionInfo(FunctionRegistry.java:579) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:645) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652) at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54) at org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:376) at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44) at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44) at org.apache.spark.sql.hive.HiveContext$$anon$3.lookupFunction(HiveContext.scala:376) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:465) at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:463) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.Trave
[jira] [Created] (SPARK-9978) Window functions require partitionBy to work as expected
Maciej Szymkiewicz created SPARK-9978: - Summary: Window functions require partitionBy to work as expected Key: SPARK-9978 URL: https://issues.apache.org/jira/browse/SPARK-9978 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.1 Reporter: Maciej Szymkiewicz I am trying to reproduce the following query:
{code}
df.registerTempTable("df")
sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show()
+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}
using the PySpark API. Unfortunately it doesn't work as expected:
{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber

df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()
+----+--+
|   x|rn|
+----+--+
| 0.5| 1|
|0.25| 1|
|0.75| 1|
+----+--+
{code}
As a workaround, it is possible to call partitionBy without additional arguments:
{code}
df.select(
    df["x"],
    rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
).show()
+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}
but as far as I can tell it is not documented and is rather counterintuitive considering the SQL syntax. Moreover, this problem doesn't affect the Scala API:
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

case class Record(x: Double)
val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75) :: Nil)
df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show
+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9978) Window functions require partitionBy to work as expected
[ https://issues.apache.org/jira/browse/SPARK-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-9978: -- Description: I am trying to reproduce the following SQL query:
{code}
df.registerTempTable("df")
sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show()
+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}
using the PySpark API. Unfortunately it doesn't work as expected:
{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber

df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()
+----+--+
|   x|rn|
+----+--+
| 0.5| 1|
|0.25| 1|
|0.75| 1|
+----+--+
{code}
As a workaround, it is possible to call partitionBy without additional arguments:
{code}
df.select(
    df["x"],
    rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
).show()
+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}
but as far as I can tell it is not documented and is rather counterintuitive considering the SQL syntax. Moreover, this problem doesn't affect the Scala API:
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

case class Record(x: Double)
val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75) :: Nil)
df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show
+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}

was: I am trying to reproduce the following query:
{code}
df.registerTempTable("df")
sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show()
+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}
using the PySpark API. Unfortunately it doesn't work as expected:
{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber

df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()
+----+--+
|   x|rn|
+----+--+
| 0.5| 1|
|0.25| 1|
|0.75| 1|
+----+--+
{code}
As a workaround, it is possible to call partitionBy without additional arguments:
{code}
df.select(
    df["x"],
    rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
).show()
+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}
but as far as I can tell it is not documented and is rather counterintuitive considering the SQL syntax. Moreover, this problem doesn't affect the Scala API:
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

case class Record(x: Double)
val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75) :: Nil)
df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show
+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}

> Window functions require partitionBy to work as expected > > > Key: SPARK-9978 > URL: https://issues.apache.org/jira/browse/SPARK-9978 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 1.4.1 > Reporter: Maciej Szymkiewicz > > I am trying to reproduce the following SQL query:
> {code}
> df.registerTempTable("df")
> sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show()
> +----+--+
> |   x|rn|
> +----+--+
> |0.25| 1|
> | 0.5| 2|
> |0.75| 3|
> +----+--+
> {code}
> using the PySpark API. Unfortunately it doesn't work as expected:
> {code}
> from pyspark.sql.window import Window
> from pyspark.sql.functions import rowNumber
>
> df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
> df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()
> +----+--+
> |   x|rn|
> +----+--+
> | 0.5| 1|
> |0.25| 1|
> |0.75| 1|
> +----+--+
> {code}
> As a workaround, it is possible to call partitionBy without additional arguments:
> {code}
> df.select(
>     df["x"],
>     rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
> ).show()
> +----+--+
> |   x|rn|
> +----+--+
> |0.25| 1|
> | 0.5| 2|
> |0.75| 3|
> +----+--+
> {code}
> but as far as I can tell it is not documented and is rather counterintuitive considering the SQL syntax. Moreover, this problem doesn't affect the Scala API:
> {code}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.rowNumber
>
> case class Record(x: Double)
> val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75) :: Nil)
> df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show
> +----+--+
> |   x|rn|
> +----+--+
> |0.25| 1|
> | 0.5| 2|
> |0.75| 3|
> +----+--+
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697098#comment-14697098 ] Cody Koeninger commented on SPARK-9947: --- You already have access to offsets and can save them however you want. You can provide those offsets on restart, regardless of whether checkpointing was enabled. > Separate Metadata and State Checkpoint Data > --- > > Key: SPARK-9947 > URL: https://issues.apache.org/jira/browse/SPARK-9947 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow > Original Estimate: 168h > Remaining Estimate: 168h > > Problem: When updating an application that has checkpointing enabled to > support the updateStateByKey and 24/7 operation functionality, you encounter > the problem where you might like to maintain state data between restarts but > delete the metadata containing execution state. > If checkpoint data exists between code redeployment, the program may not > execute properly or at all. My current workaround for this issue is to wrap > updateStateByKey with my own function that persists the state after every > update to my own separate directory. (That allows me to delete the checkpoint > with its metadata before redeploying) Then, when I restart the application, I > initialize the state with this persisted data. This incurs additional > overhead due to persisting of the same data twice: once in the checkpoint and > once in my persisted data folder. > If Kafka Direct API offsets could be stored in another separate checkpoint > directory, that would help address the problem of having to blow that away > between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
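For readers following the thread, a minimal Scala sketch of what is being described, using the Kafka direct API that ships with Spark 1.4 (spark-streaming-kafka). The saveOffsets/loadOffsets helpers are hypothetical placeholders for whatever external store the application chooses (HDFS, ZooKeeper, a database, ...):
{code}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

// Hypothetical helpers: persist/read offsets in a store of your choosing.
def saveOffsets(ranges: Array[OffsetRange]): Unit = ???
def loadOffsets(): Map[TopicAndPartition, Long] = ???

def createStream(ssc: StreamingContext, kafkaParams: Map[String, String]) = {
  // On restart, start each partition from the offsets saved by the previous run,
  // independently of Spark's own checkpointing.
  val fromOffsets = loadOffsets()
  val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc, kafkaParams, fromOffsets, messageHandler)

  stream.foreachRDD { rdd =>
    // Each RDD produced by the direct stream exposes the offset ranges it covers.
    saveOffsets(rdd.asInstanceOf[HasOffsetRanges].offsetRanges)
    // ... normal processing of rdd ...
  }
  stream
}
{code}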
[jira] [Assigned] (SPARK-9977) The usage of a label generated by StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9977: --- Assignee: Apache Spark > The usage of a label generated by StringIndexer > --- > > Key: SPARK-9977 > URL: https://issues.apache.org/jira/browse/SPARK-9977 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.4.1 >Reporter: Kai Sasaki >Assignee: Apache Spark >Priority: Trivial > Labels: documentaion > > By using {{StringIndexer}}, we can obtain an indexed label in a new column. A > downstream estimator in the pipeline should then use this new column if it > wants to use the string-indexed label. > I think it is better to make this explicit in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9977) The usage of a label generated by StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9977: --- Assignee: (was: Apache Spark) > The usage of a label generated by StringIndexer > --- > > Key: SPARK-9977 > URL: https://issues.apache.org/jira/browse/SPARK-9977 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.4.1 >Reporter: Kai Sasaki >Priority: Trivial > Labels: documentaion > > By using {{StringIndexer}}, we can obtain an indexed label in a new column. A > downstream estimator in the pipeline should then use this new column if it > wants to use the string-indexed label. > I think it is better to make this explicit in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9977) The usage of a label generated by StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697104#comment-14697104 ] Apache Spark commented on SPARK-9977: - User 'Lewuathe' has created a pull request for this issue: https://github.com/apache/spark/pull/8205 > The usage of a label generated by StringIndexer > --- > > Key: SPARK-9977 > URL: https://issues.apache.org/jira/browse/SPARK-9977 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.4.1 >Reporter: Kai Sasaki >Priority: Trivial > Labels: documentaion > > By using {{StringIndexer}}, we can obtain an indexed label in a new column. A > downstream estimator in the pipeline should then use this new column if it > wants to use the string-indexed label. > I think it is better to make this explicit in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
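To make the documentation suggestion concrete, a minimal Scala sketch of the intended usage, assuming Spark 1.4's spark.ml API and a hypothetical DataFrame with a string column "category" and a vector column "features":
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

// Hypothetical input DataFrame with columns "category" (string) and "features" (vector).
val training: DataFrame = ???

// StringIndexer writes the indexed label into a new column...
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("indexedLabel")

// ...and the downstream estimator must be pointed at that new column,
// not at the original string column.
val lr = new LogisticRegression()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")

val model = new Pipeline().setStages(Array(indexer, lr)).fit(training)
{code}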
[jira] [Created] (SPARK-9979) Unable to request addition to "powered by spark"
Jeff Palmucci created SPARK-9979: Summary: Unable to request addition to "powered by spark" Key: SPARK-9979 URL: https://issues.apache.org/jira/browse/SPARK-9979 Project: Spark Issue Type: Bug Components: Documentation Reporter: Jeff Palmucci Priority: Minor The "powered by spark" page asks to submit new listing requests to "u...@apache.spark.org". However, when I do that, I get: {quote} Hi. This is the qmail-send program at apache.org. I'm afraid I wasn't able to deliver your message to the following addresses. This is a permanent error; I've given up. Sorry it didn't work out. : Must be sent from an @apache.org address or a subscriber address or an address in LDAP. {quote} The project I wanted to list is here: http://engineering.tripadvisor.com/using-apache-spark-for-massively-parallel-nlp/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697157#comment-14697157 ] Dan Dutrow commented on SPARK-9947: --- The desire is to continue using checkpointing for everything but allow selective deleting of certain types of checkpoint data. Using an external database to duplicate that functionality is not desired. > Separate Metadata and State Checkpoint Data > --- > > Key: SPARK-9947 > URL: https://issues.apache.org/jira/browse/SPARK-9947 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow > Original Estimate: 168h > Remaining Estimate: 168h > > Problem: When updating an application that has checkpointing enabled to > support the updateStateByKey and 24/7 operation functionality, you encounter > the problem where you might like to maintain state data between restarts but > delete the metadata containing execution state. > If checkpoint data exists between code redeployment, the program may not > execute properly or at all. My current workaround for this issue is to wrap > updateStateByKey with my own function that persists the state after every > update to my own separate directory. (That allows me to delete the checkpoint > with its metadata before redeploying) Then, when I restart the application, I > initialize the state with this persisted data. This incurs additional > overhead due to persisting of the same data twice: once in the checkpoint and > once in my persisted data folder. > If Kafka Direct API offsets could be stored in another separate checkpoint > directory, that would help address the problem of having to blow that away > between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697166#comment-14697166 ] Cody Koeninger commented on SPARK-9947: --- You can't re-use checkpoint data across application upgrades anyway, so that honestly seems kind of pointless. > Separate Metadata and State Checkpoint Data > --- > > Key: SPARK-9947 > URL: https://issues.apache.org/jira/browse/SPARK-9947 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow > Original Estimate: 168h > Remaining Estimate: 168h > > Problem: When updating an application that has checkpointing enabled to > support the updateStateByKey and 24/7 operation functionality, you encounter > the problem where you might like to maintain state data between restarts but > delete the metadata containing execution state. > If checkpoint data exists between code redeployment, the program may not > execute properly or at all. My current workaround for this issue is to wrap > updateStateByKey with my own function that persists the state after every > update to my own separate directory. (That allows me to delete the checkpoint > with its metadata before redeploying) Then, when I restart the application, I > initialize the state with this persisted data. This incurs additional > overhead due to persisting of the same data twice: once in the checkpoint and > once in my persisted data folder. > If Kafka Direct API offsets could be stored in another separate checkpoint > directory, that would help address the problem of having to blow that away > between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697180#comment-14697180 ] Dan Dutrow commented on SPARK-9947: --- I want to maintain the data in updateStateByKey between upgrades. This data can be recovered between upgrades so long as objects don't change. > Separate Metadata and State Checkpoint Data > --- > > Key: SPARK-9947 > URL: https://issues.apache.org/jira/browse/SPARK-9947 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow > Original Estimate: 168h > Remaining Estimate: 168h > > Problem: When updating an application that has checkpointing enabled to > support the updateStateByKey and 24/7 operation functionality, you encounter > the problem where you might like to maintain state data between restarts but > delete the metadata containing execution state. > If checkpoint data exists between code redeployment, the program may not > execute properly or at all. My current workaround for this issue is to wrap > updateStateByKey with my own function that persists the state after every > update to my own separate directory. (That allows me to delete the checkpoint > with its metadata before redeploying) Then, when I restart the application, I > initialize the state with this persisted data. This incurs additional > overhead due to persisting of the same data twice: once in the checkpoint and > once in my persisted data folder. > If Kafka Direct API offsets could be stored in another separate checkpoint > directory, that would help address the problem of having to blow that away > between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
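A sketch of the pattern being discussed, assuming Spark 1.4 Streaming: seed updateStateByKey with externally persisted state through its initialRDD overload, so the state can survive deleting the checkpoint directory between redeployments. The loadSavedState helper is hypothetical and stands for however the previous run persisted its state:
{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper: read the state the previous deployment persisted outside the checkpoint.
def loadSavedState(ssc: StreamingContext): RDD[(String, Long)] = ???

def countsWithCarriedState(ssc: StreamingContext,
                           events: DStream[(String, Long)]): DStream[(String, Long)] = {
  val updateFunc = (values: Seq[Long], state: Option[Long]) =>
    Some(values.sum + state.getOrElse(0L))

  // The third argument seeds the state, so new state builds on what was
  // saved externally rather than on the (deleted) checkpoint.
  events.updateStateByKey[Long](
    updateFunc,
    new HashPartitioner(ssc.sparkContext.defaultParallelism),
    loadSavedState(ssc))
}
{code}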
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697193#comment-14697193 ] Cody Koeninger commented on SPARK-9947: --- Didn't you already say that you were saving updateStateByKey state yourself? > Separate Metadata and State Checkpoint Data > --- > > Key: SPARK-9947 > URL: https://issues.apache.org/jira/browse/SPARK-9947 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow > Original Estimate: 168h > Remaining Estimate: 168h > > Problem: When updating an application that has checkpointing enabled to > support the updateStateByKey and 24/7 operation functionality, you encounter > the problem where you might like to maintain state data between restarts but > delete the metadata containing execution state. > If checkpoint data exists between code redeployment, the program may not > execute properly or at all. My current workaround for this issue is to wrap > updateStateByKey with my own function that persists the state after every > update to my own separate directory. (That allows me to delete the checkpoint > with its metadata before redeploying) Then, when I restart the application, I > initialize the state with this persisted data. This incurs additional > overhead due to persisting of the same data twice: once in the checkpoint and > once in my persisted data folder. > If Kafka Direct API offsets could be stored in another separate checkpoint > directory, that would help address the problem of having to blow that away > between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org