[jira] [Commented] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check

2016-08-13 Thread Clark Fitzgerald (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420226#comment-15420226
 ] 

Clark Fitzgerald commented on SPARK-16519:
--

Thanks [~shivaram] for the heads up. Good thing I didn't use the private RDD 
functions!

> Handle SparkR RDD generics that create warnings in R CMD check
> --
>
> Key: SPARK-16519
> URL: https://issues.apache.org/jira/browse/SPARK-16519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> One of the warnings we get from R CMD check is that RDD implementations of 
> some of the generics are not documented. These generics are shared between 
> RDDs and DataFrames in SparkR. The list includes:
> {quote}
> WARNING
> Undocumented S4 methods:
>   generic 'cache' and siglist 'RDD'
>   generic 'collect' and siglist 'RDD'
>   generic 'count' and siglist 'RDD'
>   generic 'distinct' and siglist 'RDD'
>   generic 'first' and siglist 'RDD'
>   generic 'join' and siglist 'RDD,RDD'
>   generic 'length' and siglist 'RDD'
>   generic 'partitionBy' and siglist 'RDD'
>   generic 'persist' and siglist 'RDD,character'
>   generic 'repartition' and siglist 'RDD'
>   generic 'show' and siglist 'RDD'
>   generic 'take' and siglist 'RDD,numeric'
>   generic 'unpersist' and siglist 'RDD'
> {quote}
> As described in 
> https://stat.ethz.ch/pipermail/r-devel/2003-September/027490.html this looks 
> like a limitation of R where exporting a generic from a package also exports 
> all the implementations of that generic. 
> One way to get around this is to remove the RDD API or rename the methods in 
> Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17049) LAG function fails when selecting all columns

2016-08-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420207#comment-15420207
 ] 

Dongjoon Hyun commented on SPARK-17049:
---

Hi, [~gcivan].
Yes, it fails on Spark 2.0. Fortunately, the bug appears to be fixed already in master.
{code}
scala> sql("create table a as select 1 as col")
scala> sql("select *, lag(col) over (order by col) as prev from a")
scala> sql("select *, lag(col) over (order by col) as prev from a").show()
+---+----+
|col|prev|
+---+----+
|  1|null|
+---+----+

scala> spark.version
res3: String = 2.1.0-SNAPSHOT
{code}

> LAG function fails when selecting all columns
> -
>
> Key: SPARK-17049
> URL: https://issues.apache.org/jira/browse/SPARK-17049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Gokhan Civan
>
> In version 1.6.1, the queries
> create table a as select 1 as col;
> select *, lag(col) over (order by col) as prev from a;
> successfully produce the table
> col  prev
> 1    null
> However, in version 2.0.0, this fails with the error
> org.apache.spark.sql.AnalysisException: Window Frame RANGE BETWEEN UNBOUNDED 
> PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN 1 
> PRECEDING AND 1 PRECEDING;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$29$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:1785)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$29$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:1781)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
> ...
> On the other hand, the query works if * is replaced with col as in
> select col, lag(col) over (order by col) as prev from a;
> It also works as follows:
> select col, lag(col) over (order by col ROWS BETWEEN 1 PRECEDING AND 1 
> PRECEDING) as prev from a;






[jira] [Created] (SPARK-17049) LAG function fails when selecting all columns

2016-08-13 Thread Gokhan Civan (JIRA)
Gokhan Civan created SPARK-17049:


 Summary: LAG function fails when selecting all columns
 Key: SPARK-17049
 URL: https://issues.apache.org/jira/browse/SPARK-17049
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Gokhan Civan


In version 1.6.1, the queries

create table a as select 1 as col;
select *, lag(col) over (order by col) as prev from a;

successfully produce the table

col  prev
1    null

However, in version 2.0.0, this fails with the error

org.apache.spark.sql.AnalysisException: Window Frame RANGE BETWEEN UNBOUNDED 
PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN 1 
PRECEDING AND 1 PRECEDING;
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$29$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:1785)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$29$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:1781)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
...

On the other hand, the query works if * is replaced with col as in

select col, lag(col) over (order by col) as prev from a;

It also works as follows:

select col, lag(col) over (order by col ROWS BETWEEN 1 PRECEDING AND 1 
PRECEDING) as prev from a;
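
For readers on the DataFrame API, here is a minimal spark-shell sketch of the same 
explicit-frame workaround (it assumes the table {{a}} from above; the frame bounds 
mirror ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING):

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

// Explicit ROWS frame matching the frame lag() requires
val w = Window.orderBy("col").rowsBetween(-1, -1)

// Same result as the SQL workaround: col plus the previous row's value
spark.table("a").select(col("col"), lag(col("col"), 1).over(w).as("prev")).show()
{code}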






[jira] [Commented] (SPARK-16966) App Name is a randomUUID even when "spark.app.name" exists

2016-08-13 Thread Weiqing Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420116#comment-15420116
 ] 

Weiqing Yang commented on SPARK-16966:
--

[~srowen] Thanks for the new PR and review.

> App Name is a randomUUID even when "spark.app.name" exists
> --
>
> Key: SPARK-16966
> URL: https://issues.apache.org/jira/browse/SPARK-16966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Weiqing Yang
>Assignee: Sean Owen
> Fix For: 2.0.1, 2.1.0
>
>
> When submitting an application with "--name":
> ./bin/spark-submit --name myApplicationTest --verbose --executor-cores 3 
> --num-executors 1 --master yarn --deploy-mode client --class 
> org.apache.spark.examples.SparkKMeans 
> examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar 
> hdfs://localhost:9000/lr_big.txt 2 5
> In the history server UI:
> App ID: application_1470694797714_0016
> App Name: 70c06dc5-1b99-4b4a-a826-ea27497e977b
> The App Name should not be a randomUUID 
> "70c06dc5-1b99-4b4a-a826-ea27497e977b"  since the "spark.app.name" was 
> myApplicationTest.
> The application "org.apache.spark.examples.SparkKMeans" above did not invoke 
> ".appName()". 






[jira] [Resolved] (SPARK-16966) App Name is a randomUUID even when "spark.app.name" exists

2016-08-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16966.
-
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.1.0
   2.0.1

> App Name is a randomUUID even when "spark.app.name" exists
> --
>
> Key: SPARK-16966
> URL: https://issues.apache.org/jira/browse/SPARK-16966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Weiqing Yang
>Assignee: Sean Owen
> Fix For: 2.0.1, 2.1.0
>
>
> When submitting an application with "--name":
> ./bin/spark-submit --name myApplicationTest --verbose --executor-cores 3 
> --num-executors 1 --master yarn --deploy-mode client --class 
> org.apache.spark.examples.SparkKMeans 
> examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar 
> hdfs://localhost:9000/lr_big.txt 2 5
> In the history server UI:
> App ID: application_1470694797714_0016
> App Name: 70c06dc5-1b99-4b4a-a826-ea27497e977b
> The App Name should not be a randomUUID 
> "70c06dc5-1b99-4b4a-a826-ea27497e977b"  since the "spark.app.name" was 
> myApplicationTest.
> The application "org.apache.spark.examples.SparkKMeans" above did not invoke 
> ".appName()". 






[jira] [Comment Edited] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file

2016-08-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420106#comment-15420106
 ] 

Dongjoon Hyun edited comment on SPARK-17041 at 8/13/16 10:18 PM:
-

Since I don't have the exact script from your situation, my result might differ. 
However, Spark 2.0 does support case sensitivity, of course via SQL configuration. In 
the above, `sql("set spark.sql.caseSensitive=true")`.

Could you confirm on your side, [~barrybecker4]?


was (Author: dongjoon):
Since I don't have the exact script from your situation, my result might differ. 
However, Spark 2.0 does support case sensitivity, of course with SQL configuration. In 
the above, `sql("set spark.sql.caseSensitive=true")`.

Could you confirm on your side, [~barrybecker4]?

> Columns in schema are no longer case sensitive when reading csv file
> 
>
> Key: SPARK-17041
> URL: https://issues.apache.org/jira/browse/SPARK-17041
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> It used to be (in spark 1.6.2) that I could read a csv file that had columns 
> with  names that differed only by case. For example, one column may be 
> "output" and another called "Output". Now (with spark 2.0.0) if I try to read 
> such a file, I get an error like this:
> {code}
> org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, 
> could be: Output#1263, Output#1295.;
> {code}
> The schema (dfSchema below) that I pass to the csv read looks like this:
> {code}
> StructType( StructField(Output,StringType,true), ... 
> StructField(output,StringType,true), ...)
> {code}
> The code that does the read is this
> {code}
> sqlContext.read
>   .format("csv")
>   .option("header", "false") // Use first line of all files as header
>   .option("inferSchema", "false") // Automatically infer data types
>   .schema(dfSchema)
>   .csv(dataFile)
> {code}






[jira] [Commented] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file

2016-08-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420106#comment-15420106
 ] 

Dongjoon Hyun commented on SPARK-17041:
---

Since I don't have the exact script from your situation, my result might differ. 
However, Spark 2.0 does support case sensitivity, of course via SQL configuration. In 
the above, `sql("set spark.sql.caseSensitive=true")`.

Could you confirm on your side, [~barrybecker4]?

> Columns in schema are no longer case sensitive when reading csv file
> 
>
> Key: SPARK-17041
> URL: https://issues.apache.org/jira/browse/SPARK-17041
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> It used to be (in spark 1.6.2) that I could read a csv file that had columns 
> with  names that differed only by case. For example, one column may be 
> "output" and another called "Output". Now (with spark 2.0.0) if I try to read 
> such a file, I get an error like this:
> {code}
> org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, 
> could be: Output#1263, Output#1295.;
> {code}
> The schema (dfSchema below) that I pass to the csv read looks like this:
> {code}
> StructType( StructField(Output,StringType,true), ... 
> StructField(output,StringType,true), ...)
> {code}
> The code that does the read is this
> {code}
> sqlContext.read
>   .format("csv")
>   .option("header", "false") // Use first line of all files as header
>   .option("inferSchema", "false") // Automatically infer data types
>   .schema(dfSchema)
>   .csv(dataFile)
> {code}






[jira] [Commented] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file

2016-08-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420105#comment-15420105
 ] 

Dongjoon Hyun commented on SPARK-17041:
---

Hi, [~barrybecker4].
I reproduced your problem and I think I can give you the solution.
{code}
scala> spark.read.format("csv").option("header", "false").option("inferSchema", 
"false").schema(StructType(Seq(StructField("c", StringType, false), 
StructField("C", StringType, false)))).csv("/tmp/csv_caseSensitive").show
org.apache.spark.sql.AnalysisException: Reference 'c' is ambiguous, could be: 
c#45, c#46.;
...

scala> sql("set spark.sql.caseSensitive=true")
res9: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.read.format("csv").option("header", "false").option("inferSchema", 
"false").schema(StructType(Seq(StructField("c", StringType, false), 
StructField("C", StringType, false)))).csv("/tmp/csv_caseSensitive").show
+---+---+
|  c|  C|
+---+---+
| c1| C1|
|  1|  2|
+---+---+
{code}
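
The same switch can also be flipped through the session's runtime configuration rather 
than a SQL statement; a minimal sketch (same {{spark.sql.caseSensitive}} key as above):

{code}
// Equivalent to sql("set spark.sql.caseSensitive=true")
spark.conf.set("spark.sql.caseSensitive", "true")
{code}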

> Columns in schema are no longer case sensitive when reading csv file
> 
>
> Key: SPARK-17041
> URL: https://issues.apache.org/jira/browse/SPARK-17041
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> It used to be (in spark 1.6.2) that I could read a csv file that had columns 
> with  names that differed only by case. For example, one column may be 
> "output" and another called "Output". Now (with spark 2.0.0) if I try to read 
> such a file, I get an error like this:
> {code}
> org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, 
> could be: Output#1263, Output#1295.;
> {code}
> The schema (dfSchema below) that I pass to the csv read looks like this:
> {code}
> StructType( StructField(Output,StringType,true), ... 
> StructField(output,StringType,true), ...)
> {code}
> The code that does the read is this
> {code}
> sqlContext.read
>   .format("csv")
>   .option("header", "false") // Use first line of all files as header
>   .option("inferSchema", "false") // Automatically infer data types
>   .schema(dfSchema)
>   .csv(dataFile)
> {code}






[jira] [Commented] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file

2016-08-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420100#comment-15420100
 ] 

Dongjoon Hyun commented on SPARK-17041:
---

Hi, [~barrybecker4].
Could you give us a more reproducible example?

> Columns in schema are no longer case sensitive when reading csv file
> 
>
> Key: SPARK-17041
> URL: https://issues.apache.org/jira/browse/SPARK-17041
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> It used to be (in spark 1.6.2) that I could read a csv file that had columns 
> with  names that differed only by case. For example, one column may be 
> "output" and another called "Output". Now (with spark 2.0.0) if I try to read 
> such a file, I get an error like this:
> {code}
> org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, 
> could be: Output#1263, Output#1295.;
> {code}
> The schema (dfSchema below) that I pass to the csv read looks like this:
> {code}
> StructType( StructField(Output,StringType,true), ... 
> StructField(output,StringType,true), ...)
> {code}
> The code that does the read is this
> {code}
> sqlContext.read
>   .format("csv")
>   .option("header", "false") // Use first line of all files as header
>   .option("inferSchema", "false") // Automatically infer data types
>   .schema(dfSchema)
>   .csv(dataFile)
> {code}






[jira] [Assigned] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value

2016-08-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17035:


Assignee: Apache Spark

> Conversion of datetime.max to microseconds produces incorrect value
> ---
>
> Key: SPARK-17035
> URL: https://issues.apache.org/jira/browse/SPARK-17035
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Michael Styles
>Assignee: Apache Spark
>Priority: Minor
>
> Conversion of datetime.max to microseconds produces incorrect value. For 
> example,
> {noformat}
> from datetime import datetime
> from pyspark.sql import Row
> from pyspark.sql.types import StructType, StructField, TimestampType
> schema = StructType([StructField("dt", TimestampType(), False)])
> data = [{"dt": datetime.max}]
> # convert python objects to sql data
> sql_data = [schema.toInternal(row) for row in data]
> # Value is wrong.
> sql_data
> [(2.534023188e+17,)]
> {noformat}
> This value should be [(253402318799999999,)].






[jira] [Assigned] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value

2016-08-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17035:


Assignee: (was: Apache Spark)

> Conversion of datetime.max to microseconds produces incorrect value
> ---
>
> Key: SPARK-17035
> URL: https://issues.apache.org/jira/browse/SPARK-17035
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Michael Styles
>Priority: Minor
>
> Conversion of datetime.max to microseconds produces incorrect value. For 
> example,
> {noformat}
> from datetime import datetime
> from pyspark.sql import Row
> from pyspark.sql.types import StructType, StructField, TimestampType
> schema = StructType([StructField("dt", TimestampType(), False)])
> data = [{"dt": datetime.max}]
> # convert python objects to sql data
> sql_data = [schema.toInternal(row) for row in data]
> # Value is wrong.
> sql_data
> [(2.534023188e+17,)]
> {noformat}
> This value should be [(253402318799999999,)].






[jira] [Commented] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value

2016-08-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420090#comment-15420090
 ] 

Apache Spark commented on SPARK-17035:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/14631

> Conversion of datetime.max to microseconds produces incorrect value
> ---
>
> Key: SPARK-17035
> URL: https://issues.apache.org/jira/browse/SPARK-17035
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Michael Styles
>Priority: Minor
>
> Conversion of datetime.max to microseconds produces incorrect value. For 
> example,
> {noformat}
> from datetime import datetime
> from pyspark.sql import Row
> from pyspark.sql.types import StructType, StructField, TimestampType
> schema = StructType([StructField("dt", TimestampType(), False)])
> data = [{"dt": datetime.max}]
> # convert python objects to sql data
> sql_data = [schema.toInternal(row) for row in data]
> # Value is wrong.
> sql_data
> [(2.534023188e+17,)]
> {noformat}
> This value should be [(253402318799999999,)].






[jira] [Commented] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value

2016-08-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420086#comment-15420086
 ] 

Dongjoon Hyun commented on SPARK-17035:
---

Hi, [~ptkool].
You're right. It seems the microsecond part of `Timestamp` type is lost.
I'll make a PR for this issue soon.
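
For anyone curious why the value comes out as a float, here is a hypothetical Scala 
sketch of the precision issue (the literal assumes the reporter's timezone; the point 
is only that a double cannot hold epoch microseconds of this magnitude exactly):

{code}
// datetime.max as microseconds since the epoch (in the reporter's timezone)
val micros = 253402318799999999L

// A Double carries ~15-16 significant decimal digits, so the tail is rounded away
val asDouble = micros.toDouble
println(asDouble)          // 2.534023188E17  -- the value shown in the report
println(asDouble.toLong)   // 253402318800000000, no longer equal to micros
{code}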

> Conversion of datetime.max to microseconds produces incorrect value
> ---
>
> Key: SPARK-17035
> URL: https://issues.apache.org/jira/browse/SPARK-17035
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Michael Styles
>Priority: Minor
>
> Conversion of datetime.max to microseconds produces incorrect value. For 
> example,
> {noformat}
> from datetime import datetime
> from pyspark.sql import Row
> from pyspark.sql.types import StructType, StructField, TimestampType
> schema = StructType([StructField("dt", TimestampType(), False)])
> data = [{"dt": datetime.max}]
> # convert python objects to sql data
> sql_data = [schema.toInternal(row) for row in data]
> # Value is wrong.
> sql_data
> [(2.534023188e+17,)]
> {noformat}
> This value should be [(253402318799999999,)].






[jira] [Commented] (SPARK-6378) srcAttr in graph.triplets don't update when the size of graph is huge

2016-08-13 Thread Rabie Saidi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420006#comment-15420006
 ] 

Rabie Saidi commented on SPARK-6378:


The discrepancy between the vertices and the triplets' vertex attributes seems to be 
more general than this task's title suggests. It happens for small graphs as well, as 
in my case, where I'm trying to use Pregel to send messages between vertices, but the 
value of srcAttr in the triplets is not updated. 
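
For anyone who wants to check this on a small graph, a minimal, hypothetical 
spark-shell sketch of the pattern under discussion (update every vertex attribute with 
outerJoinVertices, then compare the vertex view with the triplet view; the graph 
contents are illustrative):

{code}
import org.apache.spark.graphx.{Edge, Graph}

val edges = sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// Change every vertex attribute from 0 to 2, as in the reported scenario
val updated = graph.outerJoinVertices(graph.vertices.mapValues(_ => 2)) {
  (_, _, newAttr) => newAttr.getOrElse(0)
}

// The two views should agree; per the report they diverge on very large graphs
updated.vertices.collect().foreach(println)
updated.triplets.map(t => (t.srcId, t.srcAttr)).collect().foreach(println)
{code}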

> srcAttr in graph.triplets don't update when the size of graph is huge
> -
>
> Key: SPARK-6378
> URL: https://issues.apache.org/jira/browse/SPARK-6378
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.2.1
>Reporter: zhangzhenyue
> Attachments: TripletsViewDonotUpdate.scala
>
>
> when the size of the graph is huge (0.2 billion vertices, 6 billion edges), the 
> srcAttr and dstAttr in graph.triplets don't update when using 
> Graph.outerJoinVertices (i.e., when the vertex data is changed).
> the code and the log are as follows:
> {quote}
> g = graph.outerJoinVertices()...
> g.vertices.count()
> g.edges.count()
> println("example edge " + g.triplets.filter(e => e.srcId == 
> 51L).collect()
>   .map(e =>(e.srcId + ":" + e.srcAttr + ", " + e.dstId + ":" + 
> e.dstAttr)).mkString("\n"))
> println("example vertex " + g.vertices.filter(e => e._1 == 
> 51L).collect()
>   .map(e => (e._1 + "," + e._2)).mkString("\n"))
> {quote}
> the result:
> {quote}
> example edge 51:0, 2467451620:61
> 51:0, 1962741310:83 // attr of vertex 51 is 0 in 
> Graph.triplets
> example vertex 51,2 // attr of vertex 51 is 2 in 
> Graph.vertices
> {quote}
> when the graph is smaller (10 million vertices), the code is OK: the triplets 
> update when the vertices are changed






[jira] [Commented] (SPARK-16966) App Name is a randomUUID even when "spark.app.name" exists

2016-08-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419968#comment-15419968
 ] 

Apache Spark commented on SPARK-16966:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14630

> App Name is a randomUUID even when "spark.app.name" exists
> --
>
> Key: SPARK-16966
> URL: https://issues.apache.org/jira/browse/SPARK-16966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Weiqing Yang
>
> When submitting an application with "--name":
> ./bin/spark-submit --name myApplicationTest --verbose --executor-cores 3 
> --num-executors 1 --master yarn --deploy-mode client --class 
> org.apache.spark.examples.SparkKMeans 
> examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar 
> hdfs://localhost:9000/lr_big.txt 2 5
> In the history server UI:
> App ID: application_1470694797714_0016
> App Name: 70c06dc5-1b99-4b4a-a826-ea27497e977b
> The App Name should not be a randomUUID 
> "70c06dc5-1b99-4b4a-a826-ea27497e977b"  since the "spark.app.name" was 
> myApplicationTest.
> The application "org.apache.spark.examples.SparkKMeans" above did not invoke 
> ".appName()". 






[jira] [Created] (SPARK-17048) ML model read for custom transformers in a pipeline does not work

2016-08-13 Thread Taras Matyashovskyy (JIRA)
Taras Matyashovskyy created SPARK-17048:
---

 Summary: ML model read for custom transformers in a pipeline does 
not work 
 Key: SPARK-17048
 URL: https://issues.apache.org/jira/browse/SPARK-17048
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.0.0
 Environment: Spark 2.0.0
Java API

Reporter: Taras Matyashovskyy


0. Use Java API :( 
1. Create any custom ML transformer
2. Make it MLReadable and MLWritable
3. Add to pipeline
4. Evaluate model, e.g. CrossValidationModel, and save results to disk
5. For a custom transformer you can use DefaultParamsReader and DefaultParamsWriter, 
for instance 
6. Load model from saved directory
7. All out-of-the-box objects are loaded successfully, e.g. Pipeline, 
Evaluator, etc.
8. Your custom transformer will fail with NPE

Reason:
ReadWrite.scala:447
cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path)

In Java this only works for static methods.
As we are implementing MLReadable or MLWritable, this call should be an instance 
method call. 
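
A hypothetical plain-Scala sketch of why that reflective call ends in an NPE (the 
class below is illustrative, not Spark code):

{code}
// Mirrors the call in ReadWrite.scala: Method.invoke(null) only resolves for a
// static method; for an instance method it throws NullPointerException.
class JavaStyleTransformer {
  def read(): String = "instance read"   // instance method, as in the Java case
}

val m = classOf[JavaStyleTransformer].getMethod("read")
try m.invoke(null)
catch { case _: NullPointerException => println("NPE: read() is not static") }
{code}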









[jira] [Updated] (SPARK-17048) ML model read for custom transformers in a pipeline does not work

2016-08-13 Thread Taras Matyashovskyy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taras Matyashovskyy updated SPARK-17048:

Description: 
0. Use Java API :( 
1. Create any custom ML transformer
2. Make it MLReadable and MLWritable
3. Add to pipeline
4. Evaluate model, e.g. CrossValidationModel, and save results to disk
5. For custom transformer you can use DefaultParamsReader and 
DefaultParamsWriter, for instance 
6. Load model from saved directory
7. All out-of-the-box objects are loaded successfully, e.g. Pipeline, 
Evaluator, etc.
8. Your custom transformer will fail with NPE

Reason:
ReadWrite.scala:447
cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path)

In Java this only works for static methods.
As we are implementing MLReadable or MLWritable, this call should be an instance 
method call. 




  was:
0. Use Java API :( 
1. Create any custom ML transformer
2. Make it MLReadable and MLWritable
3. Add to pipeline
4. Evaluate model, e.g. CrossValidationModel, and save results to disk
5. For a custom transformer you can use DefaultParamsReader and DefaultParamsWriter, 
for instance 
6. Load model from saved directory
7. All out-of-the-box objects are loaded successfully, e.g. Pipeline, 
Evaluator, etc.
8. Your custom transformer will fail with NPE

Reason:
ReadWrite.scala:447
cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path)

In Java this only works for static methods.
As we are implementing MLReadable or MLWritable, this call should be an instance 
method call. 





> ML model read for custom transformers in a pipeline does not work 
> --
>
> Key: SPARK-17048
> URL: https://issues.apache.org/jira/browse/SPARK-17048
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
> Java API
>Reporter: Taras Matyashovskyy
>  Labels: easyfix, features
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> 0. Use Java API :( 
> 1. Create any custom ML transformer
> 2. Make it MLReadable and MLWritable
> 3. Add to pipeline
> 4. Evaluate model, e.g. CrossValidationModel, and save results to disk
> 5. For custom transformer you can use DefaultParamsReader and 
> DefaultParamsWriter, for instance 
> 6. Load model from saved directory
> 7. All out-of-the-box objects are loaded successfully, e.g. Pipeline, 
> Evaluator, etc.
> 8. Your custom transformer will fail with NPE
> Reason:
> ReadWrite.scala:447
> cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path)
> In Java this only works for static methods.
> As we are implementing MLReadable or MLWritable, this call should be an instance 
> method call. 






[jira] [Created] (SPARK-17047) Spark 2 cannot create ORC table when CLUSTERED.

2016-08-13 Thread Dr Mich Talebzadeh (JIRA)
Dr Mich Talebzadeh created SPARK-17047:
--

 Summary: Spark 2 cannot create ORC table when CLUSTERED.
 Key: SPARK-17047
 URL: https://issues.apache.org/jira/browse/SPARK-17047
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Dr Mich Talebzadeh


This no longer works with the CLUSTERED BY clause in Spark 2:




CREATE TABLE test.dummy2
(
     ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING VARCHAR(10)
)
CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="SNAPPY",
                "orc.create.index"="true",
                "orc.bloom.filter.columns"="ID",
                "orc.bloom.filter.fpp"="0.05",
                "orc.stripe.size"="268435456",
                "orc.row.index.stride"="1" )

scala> HiveContext.sql(sqltext)
org.apache.spark.sql.catalyst.parser.ParseException:
Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 2, pos 0)






[jira] [Updated] (SPARK-14880) Parallel Gradient Descent with less map-reduce shuffle overhead

2016-08-13 Thread Ahmed Mahran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Mahran updated SPARK-14880:
-
External issue URL: 
https://github.com/mashin-io/rich-spark/blob/master/main/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala
  (was: 
https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala)

> Parallel Gradient Descent with less map-reduce shuffle overhead
> ---
>
> Key: SPARK-14880
> URL: https://issues.apache.org/jira/browse/SPARK-14880
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Ahmed Mahran
>  Labels: performance
>
> The current implementation of (Stochastic) Gradient Descent performs one 
> map-reduce shuffle per iteration. Moreover, when the sampling fraction gets 
> smaller, the algorithm becomes shuffle-bound instead of CPU-bound.
> {code}
> (1 to numIterations or convergence) {
>  rdd
>   .sample(fraction)
>   .map(Gradient)
>   .reduce(Update)
> }
> {code}
> A more performant variation requires only one map-reduce regardless of the 
> number of iterations. A local mini-batch SGD could be run on each partition, 
> then the results could be averaged. This is based on (Zinkevich, Martin, 
> Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic 
> gradient descent." In Advances in neural information processing systems, 
> 2010, 
> http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf).
> {code}
> rdd
>  .shuffle()
>  .mapPartitions((1 to numIterations or convergence) {
>iter.sample(fraction).map(Gradient).reduce(Update)
>  })
>  .reduce(Average)
> {code}
> A higher level iteration could enclose the above variation; shuffling the 
> data before the local mini-batches and feeding back the average weights from 
> the last iteration. This allows more variability in the sampling of the 
> mini-batches with the possibility to cover the whole dataset. Here is a Spark 
> based implementation 
> https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala
> {code}
> (1 to numIterations1 or convergence) {
>  rdd
>   .shuffle()
>   .mapPartitions((1 to numIterations2 or convergence) {
> iter.sample(fraction).map(Gradient).reduce(Update)
>   })
>   .reduce(Average)
> }
> {code}
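
To make the pattern above concrete, here is a toy, self-contained spark-shell sketch 
of "local mini-batch SGD per partition, then average" (1-D quadratic loss and made-up 
constants; this is not the linked implementation):

{code}
// Estimate the mean of the data by minimizing 0.5 * (w - x)^2 with local SGD on each
// partition, then averaging the per-partition solutions -- a single shuffle-free pass.
val data = sc.parallelize(Seq.fill(100000)(scala.util.Random.nextGaussian() + 3.0), 4)

val localIters = 100
val stepSize   = 0.1
val fraction   = 0.1

val localW = data.mapPartitions { iter =>
  val points = iter.toArray
  val rng = new scala.util.Random(42)
  var w = 0.0
  for (_ <- 1 to localIters) {
    val batch = points.filter(_ => rng.nextDouble() < fraction)   // local mini-batch
    if (batch.nonEmpty) {
      val grad = batch.map(x => w - x).sum / batch.length         // average gradient
      w -= stepSize * grad                                        // update step
    }
  }
  Iterator(w)
}

val wAvg = localW.sum() / localW.count()   // average of the local solutions
println(wAvg)                              // should land near 3.0
{code}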






[jira] [Updated] (SPARK-14880) Parallel Gradient Descent with less map-reduce shuffle overhead

2016-08-13 Thread Ahmed Mahran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Mahran updated SPARK-14880:
-
Description: 
The current implementation of (Stochastic) Gradient Descent performs one 
map-reduce shuffle per iteration. Moreover, when the sampling fraction gets 
smaller, the algorithm becomes shuffle-bound instead of CPU-bound.

{code}
(1 to numIterations or convergence) {
 rdd
  .sample(fraction)
  .map(Gradient)
  .reduce(Update)
}
{code}

A more performant variation requires only one map-reduce regardless of the 
number of iterations. A local mini-batch SGD could be run on each partition, 
then the results could be averaged. This is based on (Zinkevich, Martin, Markus 
Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic gradient 
descent." In Advances in neural information processing systems, 2010, 
http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf).

{code}
rdd
 .shuffle()
 .mapPartitions((1 to numIterations or convergence) {
   iter.sample(fraction).map(Gradient).reduce(Update)
 })
 .reduce(Average)
{code}

A higher level iteration could enclose the above variation; shuffling the data 
before the local mini-batches and feeding back the average weights from the 
last iteration. This allows more variability in the sampling of the 
mini-batches with the possibility to cover the whole dataset. Here is a Spark 
based implementation 
https://github.com/mashin-io/rich-spark/blob/master/main/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala

{code}
(1 to numIterations1 or convergence) {
 rdd
  .shuffle()
  .mapPartitions((1 to numIterations2 or convergence) {
iter.sample(fraction).map(Gradient).reduce(Update)
  })
  .reduce(Average)
}
{code}

  was:
The current implementation of (Stochastic) Gradient Descent performs one 
map-reduce shuffle per iteration. Moreover, when the sampling fraction gets 
smaller, the algorithm becomes shuffle-bound instead of CPU-bound.

{code}
(1 to numIterations or convergence) {
 rdd
  .sample(fraction)
  .map(Gradient)
  .reduce(Update)
}
{code}

A more performant variation requires only one map-reduce regardless of the 
number of iterations. A local mini-batch SGD could be run on each partition, 
then the results could be averaged. This is based on (Zinkevich, Martin, Markus 
Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic gradient 
descent." In Advances in neural information processing systems, 2010, 
http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf).

{code}
rdd
 .shuffle()
 .mapPartitions((1 to numIterations or convergence) {
   iter.sample(fraction).map(Gradient).reduce(Update)
 })
 .reduce(Average)
{code}

A higher level iteration could enclose the above variation; shuffling the data 
before the local mini-batches and feeding back the average weights from the 
last iteration. This allows more variability in the sampling of the 
mini-batches with the possibility to cover the whole dataset. Here is a Spark 
based implementation 
https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala

{code}
(1 to numIterations1 or convergence) {
 rdd
  .shuffle()
  .mapPartitions((1 to numIterations2 or convergence) {
iter.sample(fraction).map(Gradient).reduce(Update)
  })
  .reduce(Average)
}
{code}


> Parallel Gradient Descent with less map-reduce shuffle overhead
> ---
>
> Key: SPARK-14880
> URL: https://issues.apache.org/jira/browse/SPARK-14880
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Ahmed Mahran
>  Labels: performance
>
> The current implementation of (Stochastic) Gradient Descent performs one 
> map-reduce shuffle per iteration. Moreover, when the sampling fraction gets 
> smaller, the algorithm becomes shuffle-bound instead of CPU-bound.
> {code}
> (1 to numIterations or convergence) {
>  rdd
>   .sample(fraction)
>   .map(Gradient)
>   .reduce(Update)
> }
> {code}
> A more performant variation requires only one map-reduce regardless of the 
> number of iterations. A local mini-batch SGD could be run on each partition, 
> then the results could be averaged. This is based on (Zinkevich, Martin, 
> Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic 
> gradient descent." In Advances in neural information processing systems, 
> 2010, 
> http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf).
> {code}
> rdd
>  .shuffle()
>  .mapPartitions((1 to numIterations or convergence) {
>iter.sample(fraction).map(Gradient).reduce(Update)
>  })
>  .reduce(Average)
> {code}
> A higher level iteration could enclose the above variation; shuffling the 
> data before the local mini-batches and feeding back the average weights from 
> the last iteration.

[jira] [Resolved] (SPARK-16893) Spark CSV Provider option is not documented

2016-08-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16893.
---
Resolution: Not A Problem

> Spark CSV Provider option is not documented
> ---
>
> Key: SPARK-16893
> URL: https://issues.apache.org/jira/browse/SPARK-16893
> Project: Spark
>  Issue Type: Documentation
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>Priority: Minor
>
> I was working with the Databricks spark-csv library and came across an error. I 
> have logged the issue in their GitHub, but it would be good to document it 
> in Apache Spark's documentation as well.
> I faced it with CSV; someone else faced it with JSON: 
> http://stackoverflow.com/questions/38761920/spark2-0-error-multiple-sources-found-for-json-when-read-json-file
> Complete issue details are here: 
> https://github.com/databricks/spark-csv/issues/367






[jira] [Updated] (SPARK-17023) Update Kafka connector to use Kafka 0.10.0.1

2016-08-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17023:
--
Assignee: Luciano Resende
Priority: Trivial  (was: Minor)

> Update Kafka connector to use Kafka 0.10.0.1
> ---
>
> Key: SPARK-17023
> URL: https://issues.apache.org/jira/browse/SPARK-17023
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Luciano Resende
>Assignee: Luciano Resende
>Priority: Trivial
> Fix For: 2.0.1, 2.1.0
>
>
> Update Kafka connector to use the latest version of the Kafka dependencies (0.10.0.1)






[jira] [Resolved] (SPARK-17023) Update Kafka connector to use Kafka 0.10.0.1

2016-08-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17023.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14606
[https://github.com/apache/spark/pull/14606]

> Update Kafka connector to use Kafka 0.10.0.1
> ---
>
> Key: SPARK-17023
> URL: https://issues.apache.org/jira/browse/SPARK-17023
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Luciano Resende
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> Update Kafka connector to use the latest version of the Kafka dependencies (0.10.0.1)






[jira] [Updated] (SPARK-16968) Allow to add additional options when creating a new table in DF's JDBC writer.

2016-08-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16968:
--
Assignee: Jie Huang

> Allow to add additional options when creating a new table in DF's JDBC 
> writer. 
> ---
>
> Key: SPARK-16968
> URL: https://issues.apache.org/jira/browse/SPARK-16968
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jie Huang
>Assignee: Jie Huang
>Priority: Minor
> Fix For: 2.1.0
>
>
> We hit a problem when trying to export a DataFrame to an external MySQL database 
> through the JDBC driver (when the table doesn't exist). In general, Spark will 
> create the table automatically if it doesn't exist. However, it doesn't support 
> adding additional options when creating the new table. 
> For example, we need to set the default "CHARSET=utf-8" on some customers' 
> tables. Otherwise, some UTF-8 columns cannot be exported to MySQL 
> successfully; an encoding exception will be thrown and will eventually break the 
> job.  






[jira] [Resolved] (SPARK-16968) Allow to add additional options when creating a new table in DF's JDBC writer.

2016-08-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16968.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14559
[https://github.com/apache/spark/pull/14559]

> Allow to add additional options when creating a new table in DF's JDBC 
> writer. 
> ---
>
> Key: SPARK-16968
> URL: https://issues.apache.org/jira/browse/SPARK-16968
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jie Huang
>Priority: Minor
> Fix For: 2.1.0
>
>
> We hit a problem when trying to export a DataFrame to an external MySQL database 
> through the JDBC driver (when the table doesn't exist). In general, Spark will 
> create the table automatically if it doesn't exist. However, it doesn't support 
> adding additional options when creating the new table. 
> For example, we need to set the default "CHARSET=utf-8" on some customers' 
> tables. Otherwise, some UTF-8 columns cannot be exported to MySQL 
> successfully; an encoding exception will be thrown and will eventually break the 
> job.  
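
For reference, a hedged sketch of what the new knob looks like from the writer side 
for an existing DataFrame {{df}} (assuming the option added by pull request 14559 is 
named {{createTableOptions}}; the connection details are placeholders):

{code}
// The extra clause is appended to the generated CREATE TABLE statement, so the
// table is created with the desired charset.
df.write
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/mydb")
  .option("dbtable", "customer_table")
  .option("user", "user")
  .option("password", "secret")
  .option("createTableOptions", "DEFAULT CHARSET=utf8")
  .save()
{code}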






[jira] [Resolved] (SPARK-12370) Documentation should link to examples from its own release version

2016-08-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12370.
---
   Resolution: Fixed
 Assignee: Jagadeesan A S
Fix Version/s: 2.1.0
   2.0.1

Resolved by https://github.com/apache/spark/pull/14596

> Documentation should link to examples from its own release version
> --
>
> Key: SPARK-12370
> URL: https://issues.apache.org/jira/browse/SPARK-12370
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Brian London
>Assignee: Jagadeesan A S
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> When documentation is built it should reference examples from the same build. 
>  There are times when the docs have links that point to files in the github 
> head which may not be valid on the current release.
> As an example the spark streaming page for 1.5.2 (currently at 
> http://spark.apache.org/docs/latest/streaming-programming-guide.html) links 
> to the stateful network word count example (at 
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala).
>   That example file utilizes a number of 1.6 features that are not available 
> in 1.5.2.






[jira] [Assigned] (SPARK-17046) prevent user using dataframe.select with empty param list

2016-08-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17046:


Assignee: (was: Apache Spark)

> prevent user using dataframe.select with empty param list
> -
>
> Key: SPARK-17046
> URL: https://issues.apache.org/jira/browse/SPARK-17046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> currently, we can use:
> dataframe.select() 
> which selects nothing.
> It is illegal and we should prevent it at the API level.






[jira] [Assigned] (SPARK-17046) prevent user using dataframe.select with empty param list

2016-08-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17046:


Assignee: Apache Spark

> prevent user using dataframe.select with empty param list
> -
>
> Key: SPARK-17046
> URL: https://issues.apache.org/jira/browse/SPARK-17046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> currently, we can use:
> dataframe.select() 
> which selects nothing.
> It is illegal and we should disallow it at the API level.






[jira] [Updated] (SPARK-17046) prevent user using dataframe.select with empty param list

2016-08-13 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-17046:
---
Description: 
currently, we can use:

dataframe.select() 

which selects nothing.

It is illegal and we should prevent it at the API level.

  was:
currently, we can use:

dataframe.select() 

which selects nothing.

It is illegal and we should disallow it at the API level.


> prevent user using dataframe.select with empty param list
> -
>
> Key: SPARK-17046
> URL: https://issues.apache.org/jira/browse/SPARK-17046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> currently, we can use:
> dataframe.select() 
> which selects nothing.
> It is illegal and we should prevent it at the API level.






[jira] [Commented] (SPARK-17046) prevent user using dataframe.select with empty param list

2016-08-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419855#comment-15419855
 ] 

Apache Spark commented on SPARK-17046:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/14629

> prevent user using dataframe.select with empty param list
> -
>
> Key: SPARK-17046
> URL: https://issues.apache.org/jira/browse/SPARK-17046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> currently, we can use:
> dataframe.select() 
> which selects nothing.
> It is illegal and we should disallow it at the API level.






[jira] [Created] (SPARK-17046) prevent user using dataframe.select with empty param list

2016-08-13 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-17046:
--

 Summary: prevent user using dataframe.select with empty param list
 Key: SPARK-17046
 URL: https://issues.apache.org/jira/browse/SPARK-17046
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Weichen Xu


currently, we can use:

dataframe.select() 

which selects nothing.

It is illegal and we should disallow it at the API level.
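
A minimal spark-shell sketch of the behaviour being discussed (illustrative data):

{code}
val df = spark.range(3).toDF("id")

df.select().printSchema()   // "root" with no fields
df.select().count()         // 3 rows, zero columns -- accepted today, arguably meaningless
{code}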






[jira] [Resolved] (SPARK-17039) cannot read null dates from csv file

2016-08-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17039.
---
Resolution: Duplicate

Oh I understand the issue now, my fault. I agree this is a duplicate, 
regardless of what the specific final fix is.

> cannot read null dates from csv file
> 
>
> Key: SPARK-17039
> URL: https://issues.apache.org/jira/browse/SPARK-17039
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> I see this exact same bug as reported in this [stack overflow 
> post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
>   using Spark 2.0.0 (released version).
> In Scala, I read a CSV using: 
> sqlContext.read
>   .format("csv")
>   .option("header", "false")
>   .option("inferSchema", "false") 
>   .option("nullValue", "?")
>   .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
>   .schema(dfSchema)
>   .csv(dataFile)
> The data contains some null dates (represented with ?).
> The error I get is:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 
> (TID 10, localhost): java.text.ParseException: Unparseable date: "?"
> {code}
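
A hedged workaround sketch, not from the report: declare the affected date column as 
a string, then convert it explicitly while mapping "?" to null. The schema, file and 
column names below are placeholders:

{code}
import org.apache.spark.sql.functions.{col, unix_timestamp, when}

// stringSchema: same as dfSchema, but with the date column declared as StringType
val raw = sqlContext.read
  .format("csv")
  .option("header", "false")
  .schema(stringSchema)
  .csv(dataFile)

val parsed = raw.withColumn("date_col",
  when(col("date_col") === "?", null)
    .otherwise(unix_timestamp(col("date_col"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp")))
{code}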






[jira] [Commented] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance

2016-08-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419842#comment-15419842
 ] 

Apache Spark commented on SPARK-17033:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/14628

> GaussianMixture should use treeAggregate to improve performance
> ---
>
> Key: SPARK-17033
> URL: https://issues.apache.org/jira/browse/SPARK-17033
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Priority: Minor
>
> {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to 
> improve performance and scalability. In my test on a dataset with 200 features 
> and 1M instances, I found a 20% performance improvement.
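
For readers unfamiliar with the difference, a generic sketch of the {{treeAggregate}} 
pattern being suggested (a plain sum, not GaussianMixture's actual aggregation): both 
calls take the same zero value, seqOp and combOp, but {{treeAggregate}} combines 
partial results in a tree instead of sending every partition's result straight to the 
driver.

{code}
val rdd = sc.parallelize(1 to 1000000, 100)

// Same semantics, different combine topology
val flatSum = rdd.aggregate(0L)((acc, x) => acc + x, _ + _)
val treeSum = rdd.treeAggregate(0L)((acc, x) => acc + x, _ + _, depth = 2)

assert(flatSum == treeSum)   // both equal 500000500000
{code}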


