[jira] [Commented] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420226#comment-15420226 ] Clark Fitzgerald commented on SPARK-16519: -- Thanks [~shivaram] for the heads up. Good thing I didn't use the private RDD functions! > Handle SparkR RDD generics that create warnings in R CMD check > -- > > Key: SPARK-16519 > URL: https://issues.apache.org/jira/browse/SPARK-16519 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > One of the warnings we get from R CMD check is that RDD implementations of > some of the generics are not documented. These generics are shared between > RDD, DataFrames in SparkR. The list includes > {quote} > WARNING > Undocumented S4 methods: > generic 'cache' and siglist 'RDD' > generic 'collect' and siglist 'RDD' > generic 'count' and siglist 'RDD' > generic 'distinct' and siglist 'RDD' > generic 'first' and siglist 'RDD' > generic 'join' and siglist 'RDD,RDD' > generic 'length' and siglist 'RDD' > generic 'partitionBy' and siglist 'RDD' > generic 'persist' and siglist 'RDD,character' > generic 'repartition' and siglist 'RDD' > generic 'show' and siglist 'RDD' > generic 'take' and siglist 'RDD,numeric' > generic 'unpersist' and siglist 'RDD' > {quote} > As described in > https://stat.ethz.ch/pipermail/r-devel/2003-September/027490.html this looks > like a limitation of R where exporting a generic from a package also exports > all the implementations of that generic. > One way to get around this is to remove the RDD API or rename the methods in > Spark 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17049) LAG function fails when selecting all columns
[ https://issues.apache.org/jira/browse/SPARK-17049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420207#comment-15420207 ] Dongjoon Hyun commented on SPARK-17049: --- Hi, [~gcivan]. Yes, it fails on Spark 2.0. Fortunately, the bug seems to be fixed already:
{code}
scala> sql("create table a as select 1 as col")
scala> sql("select *, lag(col) over (order by col) as prev from a")
scala> sql("select *, lag(col) over (order by col) as prev from a").show()
+---+----+
|col|prev|
+---+----+
|  1|null|
+---+----+

scala> spark.version
res3: String = 2.1.0-SNAPSHOT
{code}
> LAG function fails when selecting all columns > - > > Key: SPARK-17049 > URL: https://issues.apache.org/jira/browse/SPARK-17049 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0 > Reporter: Gokhan Civan > > In version 1.6.1, the queries > create table a as select 1 as col; > select *, lag(col) over (order by col) as prev from a; > successfully produce the table > col prev > 1 null > However, in version 2.0.0, this fails with the error > org.apache.spark.sql.AnalysisException: Window Frame RANGE BETWEEN UNBOUNDED > PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN 1 > PRECEDING AND 1 PRECEDING; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$29$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:1785) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$29$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:1781) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170) > ... > On the other hand, the query works if * is replaced with col as in > select col, lag(col) over (order by col) as prev from a; > It also works as follows: > select col, lag(col) over (order by col ROWS BETWEEN 1 PRECEDING AND 1 > PRECEDING) as prev from a;
[jira] [Created] (SPARK-17049) LAG function fails when selecting all columns
Gokhan Civan created SPARK-17049: Summary: LAG function fails when selecting all columns Key: SPARK-17049 URL: https://issues.apache.org/jira/browse/SPARK-17049 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Gokhan Civan In version 1.6.1, the queries create table a as select 1 as col; select *, lag(col) over (order by col) as prev from a; successfully produce the table col prev 1 null However, in version 2.0.0, this fails with the error org.apache.spark.sql.AnalysisException: Window Frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$29$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:1785) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$29$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:1781) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170) ... On the other hand, the query works if * is replaced with col as in select col, lag(col) over (order by col) as prev from a; It also works as follows: select col, lag(col) over (order by col ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) as prev from a;
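The behavior the reporter expects from LAG can be sanity-checked outside Spark. SQLite also implements the LAG window function (in SQLite 3.25+, so this sketch assumes a reasonably recent Python build), and the stdlib sqlite3 module can run the same query shape; this is an analogy for the expected semantics, not Spark itself:

```python
import sqlite3

# LAG over (ORDER BY col) next to SELECT * -- the query shape from the report.
# Requires the underlying SQLite library to be >= 3.25 (window functions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a AS SELECT 1 AS col")
rows = conn.execute(
    "SELECT *, LAG(col) OVER (ORDER BY col) AS prev FROM a"
).fetchall()
print(rows)  # the single row has no predecessor, so prev is NULL/None
```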
[jira] [Commented] (SPARK-16966) App Name is a randomUUID even when "spark.app.name" exists
[ https://issues.apache.org/jira/browse/SPARK-16966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420116#comment-15420116 ] Weiqing Yang commented on SPARK-16966: -- [~srowen] Thanks for the new PR and review. > App Name is a randomUUID even when "spark.app.name" exists > -- > > Key: SPARK-16966 > URL: https://issues.apache.org/jira/browse/SPARK-16966 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Weiqing Yang >Assignee: Sean Owen > Fix For: 2.0.1, 2.1.0 > > > When submitting an application with "--name": > ./bin/spark-submit --name myApplicationTest --verbose --executor-cores 3 > --num-executors 1 --master yarn --deploy-mode client --class > org.apache.spark.examples.SparkKMeans > examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar > hdfs://localhost:9000/lr_big.txt 2 5 > In the history server UI: > App ID: application_1470694797714_0016 > App Name: 70c06dc5-1b99-4b4a-a826-ea27497e977b > The App Name should not be a randomUUID > "70c06dc5-1b99-4b4a-a826-ea27497e977b" since the "spark.app.name" was > myApplicationTest. > The application "org.apache.spark.examples.SparkKMeans" above did not invoke > ".appName()".
[jira] [Resolved] (SPARK-16966) App Name is a randomUUID even when "spark.app.name" exists
[ https://issues.apache.org/jira/browse/SPARK-16966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16966. - Resolution: Fixed Assignee: Sean Owen Fix Version/s: 2.1.0 2.0.1 > App Name is a randomUUID even when "spark.app.name" exists > -- > > Key: SPARK-16966 > URL: https://issues.apache.org/jira/browse/SPARK-16966 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Weiqing Yang >Assignee: Sean Owen > Fix For: 2.0.1, 2.1.0 > > > When submitting an application with "--name": > ./bin/spark-submit --name myApplicationTest --verbose --executor-cores 3 > --num-executors 1 --master yarn --deploy-mode client --class > org.apache.spark.examples.SparkKMeans > examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar > hdfs://localhost:9000/lr_big.txt 2 5 > In the history server UI: > App ID: application_1470694797714_0016 > App Name: 70c06dc5-1b99-4b4a-a826-ea27497e977b > The App Name should not be a randomUUID > "70c06dc5-1b99-4b4a-a826-ea27497e977b" since the "spark.app.name" was > myApplicationTest. > The application "org.apache.spark.examples.SparkKMeans" above did not invoke > ".appName()".
[jira] [Comment Edited] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
[ https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420106#comment-15420106 ] Dongjoon Hyun edited comment on SPARK-17041 at 8/13/16 10:18 PM: - Since I don't have the exact script of your situation, it might be different. But Spark 2.0 does support case-sensitive column names via a SQL configuration; in the example above, `sql("set spark.sql.caseSensitive=true")`. Could you confirm on your side, [~barrybecker4]? was (Author: dongjoon): Since I don't have the exact script of your situation, it might be different. But, Spark 2.0 supports `case sensitive` of course with SQL configuration. In the above, `sql("set spark.sql.caseSensitive=true")`. Could you confirm on your site, [~barrybecker4]? > Columns in schema are no longer case sensitive when reading csv file > > > Key: SPARK-17041 > URL: https://issues.apache.org/jira/browse/SPARK-17041 > Project: Spark > Issue Type: Bug > Components: Input/Output > Affects Versions: 2.0.0 > Reporter: Barry Becker > > It used to be (in Spark 1.6.2) that I could read a csv file that had columns > with names that differed only by case. For example, one column may be > "output" and another called "Output". Now (with Spark 2.0.0) if I try to read > such a file, I get an error like this: > {code} > org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, > could be: Output#1263, Output#1295.; > {code} > The schema (dfSchema below) that I pass to the csv read looks like this: > {code} > StructType( StructField(Output,StringType,true), ... > StructField(output,StringType,true), ...) > {code} > The code that does the read is this > {code} > sqlContext.read > .format("csv") > .option("header", "false") // Use first line of all files as header > .option("inferSchema", "false") // Automatically infer data types > .schema(dfSchema) > .csv(dataFile) > {code}
[jira] [Commented] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
[ https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420106#comment-15420106 ] Dongjoon Hyun commented on SPARK-17041: --- Since I don't have the exact script of your situation, it might be different. But Spark 2.0 does support case-sensitive column names via a SQL configuration; in the example above, `sql("set spark.sql.caseSensitive=true")`. Could you confirm on your side, [~barrybecker4]? > Columns in schema are no longer case sensitive when reading csv file > > > Key: SPARK-17041 > URL: https://issues.apache.org/jira/browse/SPARK-17041 > Project: Spark > Issue Type: Bug > Components: Input/Output > Affects Versions: 2.0.0 > Reporter: Barry Becker > > It used to be (in Spark 1.6.2) that I could read a csv file that had columns > with names that differed only by case. For example, one column may be > "output" and another called "Output". Now (with Spark 2.0.0) if I try to read > such a file, I get an error like this: > {code} > org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, > could be: Output#1263, Output#1295.; > {code} > The schema (dfSchema below) that I pass to the csv read looks like this: > {code} > StructType( StructField(Output,StringType,true), ... > StructField(output,StringType,true), ...) > {code} > The code that does the read is this > {code} > sqlContext.read > .format("csv") > .option("header", "false") // Use first line of all files as header > .option("inferSchema", "false") // Automatically infer data types > .schema(dfSchema) > .csv(dataFile) > {code}
[jira] [Commented] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
[ https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420105#comment-15420105 ] Dongjoon Hyun commented on SPARK-17041: --- Hi, [~barrybecker4] I reproduced your problem and I think I can give you the solution.
{code}
scala> spark.read.format("csv").option("header", "false").option("inferSchema", "false").schema(StructType(Seq(StructField("c", StringType, false), StructField("C", StringType, false)))).csv("/tmp/csv_caseSensitive").show
org.apache.spark.sql.AnalysisException: Reference 'c' is ambiguous, could be: c#45, c#46.;
...

scala> sql("set spark.sql.caseSensitive=true")
res9: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.read.format("csv").option("header", "false").option("inferSchema", "false").schema(StructType(Seq(StructField("c", StringType, false), StructField("C", StringType, false)))).csv("/tmp/csv_caseSensitive").show
+---+---+
|  c|  C|
+---+---+
| c1| C1|
|  1|  2|
+---+---+
{code}
> Columns in schema are no longer case sensitive when reading csv file > > > Key: SPARK-17041 > URL: https://issues.apache.org/jira/browse/SPARK-17041 > Project: Spark > Issue Type: Bug > Components: Input/Output > Affects Versions: 2.0.0 > Reporter: Barry Becker > > It used to be (in Spark 1.6.2) that I could read a csv file that had columns > with names that differed only by case. For example, one column may be > "output" and another called "Output". Now (with Spark 2.0.0) if I try to read > such a file, I get an error like this: > {code} > org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, > could be: Output#1263, Output#1295.; > {code} > The schema (dfSchema below) that I pass to the csv read looks like this: > {code} > StructType( StructField(Output,StringType,true), ... > StructField(output,StringType,true), ...) > {code} > The code that does the read is this > {code} > sqlContext.read > .format("csv") > .option("header", "false") // Use first line of all files as header > .option("inferSchema", "false") // Automatically infer data types > .schema(dfSchema) > .csv(dataFile) > {code}
[jira] [Commented] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
[ https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420100#comment-15420100 ] Dongjoon Hyun commented on SPARK-17041: --- Hi, [~barrybecker4]. Could you give us a reproducible example? > Columns in schema are no longer case sensitive when reading csv file > > > Key: SPARK-17041 > URL: https://issues.apache.org/jira/browse/SPARK-17041 > Project: Spark > Issue Type: Bug > Components: Input/Output > Affects Versions: 2.0.0 > Reporter: Barry Becker > > It used to be (in Spark 1.6.2) that I could read a csv file that had columns > with names that differed only by case. For example, one column may be > "output" and another called "Output". Now (with Spark 2.0.0) if I try to read > such a file, I get an error like this: > {code} > org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, > could be: Output#1263, Output#1295.; > {code} > The schema (dfSchema below) that I pass to the csv read looks like this: > {code} > StructType( StructField(Output,StringType,true), ... > StructField(output,StringType,true), ...) > {code} > The code that does the read is this > {code} > sqlContext.read > .format("csv") > .option("header", "false") // Use first line of all files as header > .option("inferSchema", "false") // Automatically infer data types > .schema(dfSchema) > .csv(dataFile) > {code}
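Why "ambiguous"? With case-insensitive resolution, a reference to Output matches both schema columns at once. A toy Python sketch of that resolution rule (an illustration with an invented `resolve` helper, not Spark's analyzer code):

```python
def resolve(name, columns, case_sensitive=False):
    """Toy column resolver: shows how a case-insensitive lookup makes
    'Output' match both 'Output' and 'output'."""
    if case_sensitive:
        matches = [c for c in columns if c == name]
    else:
        matches = [c for c in columns if c.lower() == name.lower()]
    if len(matches) > 1:
        raise ValueError(
            f"Reference '{name}' is ambiguous, could be: {', '.join(matches)}")
    return matches[0] if matches else None

cols = ["Output", "output"]
print(resolve("Output", cols, case_sensitive=True))  # unique match: 'Output'
try:
    resolve("Output", cols)  # case-insensitive default: two matches
except ValueError as e:
    print(e)
```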
[jira] [Assigned] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value
[ https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17035: Assignee: Apache Spark > Conversion of datetime.max to microseconds produces incorrect value > --- > > Key: SPARK-17035 > URL: https://issues.apache.org/jira/browse/SPARK-17035 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Michael Styles >Assignee: Apache Spark >Priority: Minor > > Conversion of datetime.max to microseconds produces incorrect value. For > example, > {noformat} > from datetime import datetime > from pyspark.sql import Row > from pyspark.sql.types import StructType, StructField, TimestampType > schema = StructType([StructField("dt", TimestampType(), False)]) > data = [{"dt": datetime.max}] > # convert python objects to sql data > sql_data = [schema.toInternal(row) for row in data] > # Value is wrong. > sql_data > [(2.534023188e+17,)] > {noformat} > This value should be [(2534023187,)].
[jira] [Assigned] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value
[ https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17035: Assignee: (was: Apache Spark) > Conversion of datetime.max to microseconds produces incorrect value > --- > > Key: SPARK-17035 > URL: https://issues.apache.org/jira/browse/SPARK-17035 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Michael Styles >Priority: Minor > > Conversion of datetime.max to microseconds produces incorrect value. For > example, > {noformat} > from datetime import datetime > from pyspark.sql import Row > from pyspark.sql.types import StructType, StructField, TimestampType > schema = StructType([StructField("dt", TimestampType(), False)]) > data = [{"dt": datetime.max}] > # convert python objects to sql data > sql_data = [schema.toInternal(row) for row in data] > # Value is wrong. > sql_data > [(2.534023188e+17,)] > {noformat} > This value should be [(2534023187,)].
[jira] [Commented] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value
[ https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420090#comment-15420090 ] Apache Spark commented on SPARK-17035: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/14631 > Conversion of datetime.max to microseconds produces incorrect value > --- > > Key: SPARK-17035 > URL: https://issues.apache.org/jira/browse/SPARK-17035 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Michael Styles >Priority: Minor > > Conversion of datetime.max to microseconds produces incorrect value. For > example, > {noformat} > from datetime import datetime > from pyspark.sql import Row > from pyspark.sql.types import StructType, StructField, TimestampType > schema = StructType([StructField("dt", TimestampType(), False)]) > data = [{"dt": datetime.max}] > # convert python objects to sql data > sql_data = [schema.toInternal(row) for row in data] > # Value is wrong. > sql_data > [(2.534023188e+17,)] > {noformat} > This value should be [(2534023187,)].
[jira] [Commented] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value
[ https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420086#comment-15420086 ] Dongjoon Hyun commented on SPARK-17035: --- Hi, [~ptkool]. You're right. It seems the microsecond part of `Timestamp` type is lost. I'll make a PR for this issue soon. > Conversion of datetime.max to microseconds produces incorrect value > --- > > Key: SPARK-17035 > URL: https://issues.apache.org/jira/browse/SPARK-17035 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Michael Styles >Priority: Minor > > Conversion of datetime.max to microseconds produces incorrect value. For > example, > {noformat} > from datetime import datetime > from pyspark.sql import Row > from pyspark.sql.types import StructType, StructField, TimestampType > schema = StructType([StructField("dt", TimestampType(), False)]) > data = [{"dt": datetime.max}] > # convert python objects to sql data > sql_data = [schema.toInternal(row) for row in data] > # Value is wrong. > sql_data > [(2.534023188e+17,)] > {noformat} > This value should be [(2534023187,)].
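The precision loss the report describes can be demonstrated with nothing but the stdlib, no Spark involved: converting datetime.max to epoch microseconds through a float (as seconds-based arithmetic does) cannot keep the .999999 tail, because a float64 cannot represent every integer near 2.5e17, while pure integer arithmetic can. A minimal sketch (not PySpark's actual code path):

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)
delta = datetime.max - EPOCH   # up to 9999-12-31 23:59:59.999999

# Float path: total_seconds() returns a float64, which rounds away the
# microsecond tail at this magnitude (~2.5e17 microseconds).
via_float = int(delta.total_seconds() * 1e6)

# Integer path: exact microsecond count.
via_int = (delta.days * 86400 + delta.seconds) * 10**6 + delta.microseconds

print(via_int % 10**6)        # 999999 -- microseconds preserved
print(via_float == via_int)   # False -- the float path lost precision
```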
[jira] [Commented] (SPARK-6378) srcAttr in graph.triplets don't update when the size of graph is huge
[ https://issues.apache.org/jira/browse/SPARK-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420006#comment-15420006 ] Rabie Saidi commented on SPARK-6378: The discrepancy between vertices and triplet vertices appears to be more general than this task's title suggests. It also happens for small graphs, as in my case: I am using Pregel to send messages between vertices, but the value of srcAttr in the triplets is not updated. > srcAttr in graph.triplets don't update when the size of graph is huge > - > > Key: SPARK-6378 > URL: https://issues.apache.org/jira/browse/SPARK-6378 > Project: Spark > Issue Type: Bug > Components: GraphX > Affects Versions: 1.2.1 > Reporter: zhangzhenyue > Attachments: TripletsViewDonotUpdate.scala > > > When the size of the graph is huge (0.2 billion vertices, 6 billion edges), the > srcAttr and dstAttr in graph.triplets don't update when using > Graph.outerJoinVertices (when the data in a vertex is changed). > The code and the log are as follows: > {quote} > g = graph.outerJoinVertices()... > g.vertices.count() > g.edges.count() > println("example edge " + g.triplets.filter(e => e.srcId == > 51L).collect() > .map(e => (e.srcId + ":" + e.srcAttr + ", " + e.dstId + ":" + > e.dstAttr)).mkString("\n")) > println("example vertex " + g.vertices.filter(e => e._1 == > 51L).collect() > .map(e => (e._1 + "," + e._2)).mkString("\n")) > {quote} > the result: > {quote} > example edge 51:0, 2467451620:61 > 51:0, 1962741310:83 // attr of vertex 51 is 0 in > Graph.triplets > example vertex 51,2 // attr of vertex 51 is 2 in > Graph.vertices > {quote} > When the graph is smaller (10 million vertices), the code is OK; the triplets > update when the vertex is changed
[jira] [Commented] (SPARK-16966) App Name is a randomUUID even when "spark.app.name" exists
[ https://issues.apache.org/jira/browse/SPARK-16966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419968#comment-15419968 ] Apache Spark commented on SPARK-16966: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/14630 > App Name is a randomUUID even when "spark.app.name" exists > -- > > Key: SPARK-16966 > URL: https://issues.apache.org/jira/browse/SPARK-16966 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Weiqing Yang > > When submitting an application with "--name": > ./bin/spark-submit --name myApplicationTest --verbose --executor-cores 3 > --num-executors 1 --master yarn --deploy-mode client --class > org.apache.spark.examples.SparkKMeans > examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar > hdfs://localhost:9000/lr_big.txt 2 5 > In the history server UI: > App ID: application_1470694797714_0016 > App Name: 70c06dc5-1b99-4b4a-a826-ea27497e977b > The App Name should not be a randomUUID > "70c06dc5-1b99-4b4a-a826-ea27497e977b" since the "spark.app.name" was > myApplicationTest. > The application "org.apache.spark.examples.SparkKMeans" above did not invoke > ".appName()".
[jira] [Created] (SPARK-17048) ML model read for custom transformers in a pipeline does not work
Taras Matyashovskyy created SPARK-17048: --- Summary: ML model read for custom transformers in a pipeline does not work Key: SPARK-17048 URL: https://issues.apache.org/jira/browse/SPARK-17048 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.0.0 Environment: Spark 2.0.0 Java API Reporter: Taras Matyashovskyy 0. Use Java API :( 1. Create any custom ML transformer 2. Make it MLReadable and MLWritable 3. Add to pipeline 4. Evaluate model, e.g. CrossValidationModel, and save results to disk 5. For a custom transformer you can use DefaultParamsReader and DefaultParamsWriter, for instance 6. Load model from saved directory 7. All out-of-the-box objects are loaded successfully, e.g. Pipeline, Evaluator, etc. 8. Your custom transformer will fail with an NPE Reason: ReadWrite.scala:447 cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path) In Java this only works for static methods. As we are implementing MLReadable or MLWritable, this should be an instance method call.
[jira] [Updated] (SPARK-17048) ML model read for custom transformers in a pipeline does not work
[ https://issues.apache.org/jira/browse/SPARK-17048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Taras Matyashovskyy updated SPARK-17048: Description: 0. Use Java API :( 1. Create any custom ML transformer 2. Make it MLReadable and MLWritable 3. Add to pipeline 4. Evaluate model, e.g. CrossValidationModel, and save results to disk 5. For custom transformer you can use DefaultParamsReader and DefaultParamsWriter, for instance 6. Load model from saved directory 7. All out-of-the-box objects are loaded successfully, e.g. Pipeline, Evaluator, etc. 8. Your custom transformer will fail with NPE Reason: ReadWrite.scala:447 cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path) In Java this only works for static methods. As we are implementing MLReadable or MLWritable, then this call should be instance method call. was: 0. Use Java API :( 1. Create any custom ML transformer 2. Make it MLReadable and MLWritable 3. Add to pipeline 4. Evaluate model, e.g. CrossValidationModel, and save results to disk 5. For custom transformer you can DefaultParamsReader and DefaultParamsWriter, for instance 6. Load model from saved directory 7. All out-of-the-box objects are loaded successfully, e.g. Pipeline, Evaluator, etc. 8. Your custom transformer will fail with NPE Reason: ReadWrite.scala:447 cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path) In Java this only works for static methods. As we are implementing MLReadable or MLWritable, then this call should be instance method call. > ML model read for custom transformers in a pipeline does not work > -- > > Key: SPARK-17048 > URL: https://issues.apache.org/jira/browse/SPARK-17048 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.0 > Environment: Spark 2.0.0 > Java API >Reporter: Taras Matyashovskyy > Labels: easyfix, features > Original Estimate: 2h > Remaining Estimate: 2h > > 0. Use Java API :( > 1. Create any custom ML transformer > 2. 
Make it MLReadable and MLWritable > 3. Add to pipeline > 4. Evaluate model, e.g. CrossValidationModel, and save results to disk > 5. For custom transformer you can use DefaultParamsReader and > DefaultParamsWriter, for instance > 6. Load model from saved directory > 7. All out-of-the-box objects are loaded successfully, e.g. Pipeline, > Evaluator, etc. > 8. Your custom transformer will fail with NPE > Reason: > ReadWrite.scala:447 > cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path) > In Java this only works for static methods. > As we are implementing MLReadable or MLWritable, then this call should be > instance method call.
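The static-vs-instance pitfall behind cls.getMethod("read").invoke(null) can be mirrored in Python (a hypothetical analogy with invented class names, not the actual MLReadable API): fetching read from the class and calling it with no receiver only works when read does not need an instance.

```python
class StaticRead:
    """Analog of a class whose read() is static: callable with no receiver."""
    @staticmethod
    def read():
        return "loaded"

class InstanceRead:
    """Analog of implementing read() as an instance method."""
    def read(self):
        return "loaded"

print(getattr(StaticRead, "read")())   # works: no receiver needed

try:
    getattr(InstanceRead, "read")()    # like Method.invoke(null): no receiver
except TypeError:
    print("instance method requires a receiver")
```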
[jira] [Created] (SPARK-17047) Spark 2 cannot create ORC table when CLUSTERED.
Dr Mich Talebzadeh created SPARK-17047: -- Summary: Spark 2 cannot create ORC table when CLUSTERED. Key: SPARK-17047 URL: https://issues.apache.org/jira/browse/SPARK-17047 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Dr Mich Talebzadeh This no longer works with the CLUSTERED BY clause in Spark 2: CREATE TABLE test.dummy2 ( ID INT , CLUSTERED INT , SCATTERED INT , RANDOMISED INT , RANDOM_STRING VARCHAR(50) , SMALL_VC VARCHAR(10) , PADDING VARCHAR(10) ) CLUSTERED BY (ID) INTO 256 BUCKETS STORED AS ORC TBLPROPERTIES ( "orc.compress"="SNAPPY", "orc.create.index"="true", "orc.bloom.filter.columns"="ID", "orc.bloom.filter.fpp"="0.05", "orc.stripe.size"="268435456", "orc.row.index.stride"="1" ) scala> HiveContext.sql(sqltext) org.apache.spark.sql.catalyst.parser.ParseException: Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 2, pos 0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14880) Parallel Gradient Descent with less map-reduce shuffle overhead
[ https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Mahran updated SPARK-14880: - External issue URL: https://github.com/mashin-io/rich-spark/blob/master/main/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala (was: https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala) > Parallel Gradient Descent with less map-reduce shuffle overhead > --- > > Key: SPARK-14880 > URL: https://issues.apache.org/jira/browse/SPARK-14880 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Ahmed Mahran > Labels: performance > > The current implementation of (Stochastic) Gradient Descent performs one > map-reduce shuffle per iteration. Moreover, when the sampling fraction gets > smaller, the algorithm becomes shuffle-bound instead of CPU-bound. > {code} > (1 to numIterations or convergence) { > rdd > .sample(fraction) > .map(Gradient) > .reduce(Update) > } > {code} > A more performant variation requires only one map-reduce regardless of the > number of iterations. A local mini-batch SGD could be run on each partition, > then the results could be averaged. This is based on (Zinkevich, Martin, > Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic > gradient descent." In Advances in neural information processing systems, > 2010, > http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf). > {code} > rdd > .shuffle() > .mapPartitions((1 to numIterations or convergence) { >iter.sample(fraction).map(Gradient).reduce(Update) > }) > .reduce(Average) > {code} > A higher level iteration could enclose the above variation; shuffling the > data before the local mini-batches and feeding back the average weights from > the last iteration. This allows more variability in the sampling of the > mini-batches with the possibility to cover the whole dataset. 
Here is a Spark-based > implementation > https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala > {code} > (1 to numIterations1 or convergence) { > rdd > .shuffle() > .mapPartitions((1 to numIterations2 or convergence) { > iter.sample(fraction).map(Gradient).reduce(Update) > }) > .reduce(Average) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14880) Parallel Gradient Descent with less map-reduce shuffle overhead
[ https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Mahran updated SPARK-14880: - Description: The current implementation of (Stochastic) Gradient Descent performs one map-reduce shuffle per iteration. Moreover, when the sampling fraction gets smaller, the algorithm becomes shuffle-bound instead of CPU-bound. {code} (1 to numIterations or convergence) { rdd .sample(fraction) .map(Gradient) .reduce(Update) } {code} A more performant variation requires only one map-reduce regardless of the number of iterations. A local mini-batch SGD could be run on each partition, then the results could be averaged. This is based on (Zinkevich, Martin, Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic gradient descent." In Advances in neural information processing systems, 2010, http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf). {code} rdd .shuffle() .mapPartitions((1 to numIterations or convergence) { iter.sample(fraction).map(Gradient).reduce(Update) }) .reduce(Average) {code} A higher level iteration could enclose the above variation; shuffling the data before the local mini-batches and feeding back the average weights from the last iteration. This allows more variability in the sampling of the mini-batches with the possibility to cover the whole dataset. Here is a Spark-based implementation https://github.com/mashin-io/rich-spark/blob/master/main/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala {code} (1 to numIterations1 or convergence) { rdd .shuffle() .mapPartitions((1 to numIterations2 or convergence) { iter.sample(fraction).map(Gradient).reduce(Update) }) .reduce(Average) } {code} was: The current implementation of (Stochastic) Gradient Descent performs one map-reduce shuffle per iteration. Moreover, when the sampling fraction gets smaller, the algorithm becomes shuffle-bound instead of CPU-bound. 
{code} (1 to numIterations or convergence) { rdd .sample(fraction) .map(Gradient) .reduce(Update) } {code} A more performant variation requires only one map-reduce regardless of the number of iterations. A local mini-batch SGD could be run on each partition, then the results could be averaged. This is based on (Zinkevich, Martin, Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic gradient descent." In Advances in neural information processing systems, 2010, http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf). {code} rdd .shuffle() .mapPartitions((1 to numIterations or convergence) { iter.sample(fraction).map(Gradient).reduce(Update) }) .reduce(Average) {code} A higher level iteration could enclose the above variation; shuffling the data before the local mini-batches and feeding back the average weights from the last iteration. This allows more variability in the sampling of the mini-batches with the possibility to cover the whole dataset. Here is a Spark-based implementation https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala {code} (1 to numIterations1 or convergence) { rdd .shuffle() .mapPartitions((1 to numIterations2 or convergence) { iter.sample(fraction).map(Gradient).reduce(Update) }) .reduce(Average) } {code} > Parallel Gradient Descent with less map-reduce shuffle overhead > --- > > Key: SPARK-14880 > URL: https://issues.apache.org/jira/browse/SPARK-14880 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Ahmed Mahran > Labels: performance > > The current implementation of (Stochastic) Gradient Descent performs one > map-reduce shuffle per iteration. Moreover, when the sampling fraction gets > smaller, the algorithm becomes shuffle-bound instead of CPU-bound. 
> {code} > (1 to numIterations or convergence) { > rdd > .sample(fraction) > .map(Gradient) > .reduce(Update) > } > {code} > A more performant variation requires only one map-reduce regardless of the > number of iterations. A local mini-batch SGD could be run on each partition, > then the results could be averaged. This is based on (Zinkevich, Martin, > Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic > gradient descent." In Advances in neural information processing systems, > 2010, > http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf). > {code} > rdd > .shuffle() > .mapPartitions((1 to numIterations or convergence) { >iter.sample(fraction).map(Gradient).reduce(Update) > }) > .reduce(Average) > {code} > A higher level iteration could enclose the above variation; shuffling the > data before the local mini-batches and feeding back the average we
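The proposed single-shuffle scheme can be sketched without Spark by treating each element of an outer sequence as one partition: run local mini-batch SGD per "partition", then average the resulting weights. The following is a toy 1-D least-squares example; the data, learning rate, and iteration counts are illustrative and not from the ticket:

```scala
// Toy data: y = 3x, split into two "partitions".
val partitions: Seq[Seq[(Double, Double)]] = Seq(
  Seq((1.0, 3.0), (2.0, 6.0)),
  Seq((3.0, 9.0), (4.0, 12.0))
)

val lr = 0.01
val localIters = 200

// Local SGD over one partition, started from w0; this stands in for the
// body of mapPartitions in the proposal.
def localSgd(data: Seq[(Double, Double)], w0: Double): Double =
  (1 to localIters).foldLeft(w0) { (w, _) =>
    data.foldLeft(w) { case (wc, (x, y)) =>
      wc - lr * (wc * x - y) * x // gradient of 0.5 * (w*x - y)^2
    }
  }

// One outer iteration: local SGD per partition, then the reduce(Average) step.
def parallelSgdStep(w0: Double): Double = {
  val ws = partitions.map(p => localSgd(p, w0))
  ws.sum / ws.size
}

// The enclosing higher-level iteration feeds the averaged weights back in.
val w = (1 to 3).foldLeft(0.0)((w0, _) => parallelSgdStep(w0))
println(w) // converges close to 3.0
```

Only the outer loop would incur a shuffle in the Spark version; all the inner iterations stay partition-local, which is the source of the claimed savings.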
[jira] [Resolved] (SPARK-16893) Spark CSV Provider option is not documented
[ https://issues.apache.org/jira/browse/SPARK-16893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16893. --- Resolution: Not A Problem > Spark CSV Provider option is not documented > --- > > Key: SPARK-16893 > URL: https://issues.apache.org/jira/browse/SPARK-16893 > Project: Spark > Issue Type: Documentation >Affects Versions: 2.0.0 >Reporter: Aseem Bansal >Priority: Minor > > I was working with the Databricks spark-csv library and came across an error. I > have logged the issue in their GitHub but it would be good to document that > in Apache Spark's documentation as well. > I faced it with CSV. Someone else faced it with JSON > http://stackoverflow.com/questions/38761920/spark2-0-error-multiple-sources-found-for-json-when-read-json-file > Complete issue details here > https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17023) Update Kafka connector to use Kafka 0.10.0.1
[ https://issues.apache.org/jira/browse/SPARK-17023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17023: -- Assignee: Luciano Resende Priority: Trivial (was: Minor) > Update Kafka connector to use Kafka 0.10.0.1 > --- > > Key: SPARK-17023 > URL: https://issues.apache.org/jira/browse/SPARK-17023 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Luciano Resende >Assignee: Luciano Resende >Priority: Trivial > Fix For: 2.0.1, 2.1.0 > > > Update Kafka connector to use the latest version of the Kafka dependencies (0.10.0.1) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17023) Update Kafka connector to use Kafka 0.10.0.1
[ https://issues.apache.org/jira/browse/SPARK-17023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17023. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14606 [https://github.com/apache/spark/pull/14606] > Update Kafka connector to use Kafka 0.10.0.1 > --- > > Key: SPARK-17023 > URL: https://issues.apache.org/jira/browse/SPARK-17023 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Luciano Resende >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > Update Kafka connector to use the latest version of the Kafka dependencies (0.10.0.1) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16968) Allow to add additional options when creating a new table in DF's JDBC writer.
[ https://issues.apache.org/jira/browse/SPARK-16968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16968: -- Assignee: Jie Huang > Allow to add additional options when creating a new table in DF's JDBC > writer. > --- > > Key: SPARK-16968 > URL: https://issues.apache.org/jira/browse/SPARK-16968 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jie Huang >Assignee: Jie Huang >Priority: Minor > Fix For: 2.1.0 > > > We ran into a problem when trying to export a DataFrame to an external MySQL database > through the JDBC driver (if the table doesn't exist). In general, Spark will create a new > table automatically if it doesn't exist. However, it doesn't support adding > additional options when creating a new table. > For example, we need to set the default "CHARSET=utf-8" in some customer's > table. Otherwise, some UTF-8 columns cannot be exported to MySQL > successfully. An encoding exception is thrown and ultimately breaks the > job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16968) Allow to add additional options when creating a new table in DF's JDBC writer.
[ https://issues.apache.org/jira/browse/SPARK-16968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16968. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14559 [https://github.com/apache/spark/pull/14559] > Allow to add additional options when creating a new table in DF's JDBC > writer. > --- > > Key: SPARK-16968 > URL: https://issues.apache.org/jira/browse/SPARK-16968 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jie Huang >Priority: Minor > Fix For: 2.1.0 > > > We ran into a problem when trying to export a DataFrame to an external MySQL database > through the JDBC driver (if the table doesn't exist). In general, Spark will create a new > table automatically if it doesn't exist. However, it doesn't support adding > additional options when creating a new table. > For example, we need to set the default "CHARSET=utf-8" in some customer's > table. Otherwise, some UTF-8 columns cannot be exported to MySQL > successfully. An encoding exception is thrown and ultimately breaks the > job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
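Mechanically, the improvement amounts to appending a caller-supplied suffix to the CREATE TABLE statement Spark generates when the JDBC target table is missing. A minimal sketch of that DDL assembly (the helper below is illustrative, not Spark's actual internals; in the merged change the corresponding writer option is named createTableOptions):

```scala
// Build a CREATE TABLE statement from a generated column list plus a
// caller-supplied options suffix (e.g. a MySQL charset), appended verbatim.
def createTableDdl(table: String,
                   schema: Seq[(String, String)],
                   createTableOptions: String = ""): String = {
  val cols = schema.map { case (name, tpe) => s"$name $tpe" }.mkString(", ")
  s"CREATE TABLE $table ($cols) $createTableOptions".trim
}

val ddl = createTableDdl(
  "users",
  Seq("id" -> "INTEGER", "name" -> "TEXT"),
  "DEFAULT CHARSET=utf8")
println(ddl)
```

With the fix, a DataFrame writer can pass the suffix via `.option("createTableOptions", "DEFAULT CHARSET=utf8")` before `.jdbc(...)`, so the auto-created table gets the right encoding.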
[jira] [Resolved] (SPARK-12370) Documentation should link to examples from its own release version
[ https://issues.apache.org/jira/browse/SPARK-12370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12370. --- Resolution: Fixed Assignee: Jagadeesan A S Fix Version/s: 2.1.0 2.0.1 Resolved by https://github.com/apache/spark/pull/14596 > Documentation should link to examples from its own release version > -- > > Key: SPARK-12370 > URL: https://issues.apache.org/jira/browse/SPARK-12370 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Brian London >Assignee: Jagadeesan A S >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > When documentation is built, it should reference examples from the same build. > There are times when the docs have links that point to files in the GitHub > head which may not be valid for the current release. > As an example, the Spark Streaming page for 1.5.2 (currently at > http://spark.apache.org/docs/latest/streaming-programming-guide.html) links > to the stateful network word count example (at > https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala). > That example file utilizes a number of 1.6 features that are not available > in 1.5.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17046) prevent user using dataframe.select with empty param list
[ https://issues.apache.org/jira/browse/SPARK-17046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17046: Assignee: (was: Apache Spark) > prevent user using dataframe.select with empty param list > - > > Key: SPARK-17046 > URL: https://issues.apache.org/jira/browse/SPARK-17046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, we can use: > dataframe.select() > which selects nothing. > It is illegal and we should prevent it at the API level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17046) prevent user using dataframe.select with empty param list
[ https://issues.apache.org/jira/browse/SPARK-17046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17046: Assignee: Apache Spark > prevent user using dataframe.select with empty param list > - > > Key: SPARK-17046 > URL: https://issues.apache.org/jira/browse/SPARK-17046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, we can use: > dataframe.select() > which selects nothing. > It is illegal and we should abandon it at the API level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17046) prevent user using dataframe.select with empty param list
[ https://issues.apache.org/jira/browse/SPARK-17046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-17046: --- Description: Currently, we can use: dataframe.select() which selects nothing. It is illegal and we should prevent it at the API level. was: Currently, we can use: dataframe.select() which selects nothing. It is illegal and we should abandon it at the API level. > prevent user using dataframe.select with empty param list > - > > Key: SPARK-17046 > URL: https://issues.apache.org/jira/browse/SPARK-17046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, we can use: > dataframe.select() > which selects nothing. > It is illegal and we should prevent it at the API level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17046) prevent user using dataframe.select with empty param list
[ https://issues.apache.org/jira/browse/SPARK-17046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419855#comment-15419855 ] Apache Spark commented on SPARK-17046: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/14629 > prevent user using dataframe.select with empty param list > - > > Key: SPARK-17046 > URL: https://issues.apache.org/jira/browse/SPARK-17046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, we can use: > dataframe.select() > which selects nothing. > It is illegal and we should abandon it at the API level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17046) prevent user using dataframe.select with empty param list
Weichen Xu created SPARK-17046: -- Summary: prevent user using dataframe.select with empty param list Key: SPARK-17046 URL: https://issues.apache.org/jira/browse/SPARK-17046 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Weichen Xu Currently, we can use: dataframe.select() which selects nothing. It is illegal and we should abandon it at the API level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
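The guard being proposed amounts to rejecting an empty varargs list before building the projection. A minimal sketch, using an illustrative stand-in for Dataset.select rather than Spark's real signature:

```scala
// Stand-in for Dataset.select(cols: Column*): reject empty projections
// up front instead of silently producing a zero-column result.
def select(cols: String*): Seq[String] = {
  require(cols.nonEmpty, "select() requires at least one column")
  cols.toList
}

println(select("id", "name"))

// An empty argument list now fails fast with IllegalArgumentException.
val rejected =
  try { select(); false }
  catch { case _: IllegalArgumentException => true }
println(s"empty select rejected: $rejected")
```

Failing fast in the API surfaces the mistake at the call site rather than later, when a zero-column DataFrame produces confusing downstream errors.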
[jira] [Resolved] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17039. --- Resolution: Duplicate Oh I understand the issue now, my fault. I agree this is a duplicate, regardless of what the specific final fix is. > cannot read null dates from csv file > > > Key: SPARK-17039 > URL: https://issues.apache.org/jira/browse/SPARK-17039 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > I see this exact same bug as reported in this [stack overflow > post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column] > using Spark 2.0.0 (released version). > In scala, I read a csv using > sqlContext.read > .format("csv") > .option("header", "false") > .option("inferSchema", "false") > .option("nullValue", "?") > .option("dateFormat", "-MM-dd'T'HH:mm:ss") > .schema(dfSchema) > .csv(dataFile) > The data contains some null dates (represented with ?). > The error I get is: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 > (TID 10, localhost): java.text.ParseException: Unparseable date: "?" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
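The failure above is one of ordering: the parser attempts to parse "?" as a date before applying the nullValue option. A reader that honors nullValue must check the sentinel first; a minimal sketch of that ordering (not Spark's actual parser; the pattern below assumes the ticket's format string was yyyy-MM-dd'T'HH:mm:ss, whose year field appears to have been lost in transit):

```scala
import java.text.SimpleDateFormat
import java.util.Date

val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")

// Check the nullValue sentinel before attempting to parse -- the ordering
// the CSV reader needs in order to avoid "Unparseable date: ?".
def parseDateField(raw: String, nullValue: String): Option[Date] =
  if (raw == nullValue) None
  else Option(fmt.parse(raw))

println(parseDateField("?", "?"))                   // None
println(parseDateField("2016-08-12T10:30:00", "?")) // Some(<date>)
```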
[jira] [Commented] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance
[ https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419842#comment-15419842 ] Apache Spark commented on SPARK-17033: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/14628 > GaussianMixture should use treeAggregate to improve performance > --- > > Key: SPARK-17033 > URL: https://issues.apache.org/jira/browse/SPARK-17033 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Yanbo Liang >Priority: Minor > > {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to > improve performance and scalability. In my test on a dataset with 200 features > and 1M instances, I observed a 20% performance improvement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
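For context, treeAggregate differs from aggregate in that partial results are combined pairwise over several levels across executors instead of all being merged on the driver; the combine operation itself is unchanged. A plain-Scala sketch of the level-by-level combine (illustrative only; Spark's real implementation repartitions between levels):

```scala
// Combine partition results pairwise, level by level, instead of folding
// all of them on a single node -- the idea behind treeAggregate.
def treeCombine[T](partials: Seq[T])(combOp: (T, T) => T): T = {
  var level = partials
  while (level.size > 1) {
    level = level.grouped(2).map { g =>
      if (g.size == 2) combOp(g(0), g(1)) else g(0)
    }.toSeq
  }
  level.head
}

// E.g. summing per-partition sufficient statistics, as GaussianMixture
// does for its weight/mean/covariance accumulators.
val partitionSums = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
println(treeCombine(partitionSums)(_ + _)) // 15.0
```

With many partitions and large per-partition results (200-feature covariance accumulators here), spreading the combines over a tree avoids a single-node merge bottleneck, which is consistent with the reported 20% gain.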